Structured Data

Vol. 13 No. 9 – November-December 2015

Structured Data

Code Hoarding:
Committing to commits, and the beauty of summarizing graphs

Dear KV, Why are so many useful features of open-source projects hidden under obscure configuration options that mean they’ll get little or no use? Is this just typically poor documentation and promotion, or is there something that makes these developers hide their code? It’s not as if the code seems broken. When I turned these features on in some recent code I came across, the system remained stable under test and in production. I feel that code should either be used or removed from the system. If the code is in a source-code repository, then it’s not really lost, but it’s also not cluttering the rest of the system. Use It or Lose It

by George Neville-Neil

Accountability in Algorithmic Decision-making:
A view from computational journalism

Every fiscal quarter automated writing algorithms churn out thousands of corporate earnings articles for the AP (Associated Press) based on little more than structured data. Companies such as Automated Insights, which produces the articles for AP, and Narrative Science can now write straight news articles in almost any domain that has clean and well-structured data: finance, sure, but also sports, weather, and education, among others. The articles aren’t cardboard either; they have variability, tone, and style, and in some cases readers even have difficulty distinguishing the machine-produced articles from human-written ones.

by Nicholas Diakopoulos

Non-volatile Storage:
Implications of the Datacenter’s Shifting Center

For the entire careers of most practicing computer scientists, a fundamental observation has consistently held true: CPUs are significantly more performant and more expensive than I/O devices. The fact that CPUs can process data at extremely high rates, while simultaneously servicing multiple I/O devices, has had a sweeping impact on the design of both hardware and software for systems of all sizes, for pretty much as long as we’ve been building them.

by Mihir Nanavati, Malte Schwarzkopf, Jake Wires, Andrew Warfield

The Verification of a Distributed System:
A practitioner’s guide to increasing confidence in system correctness

Leslie Lamport, known for his seminal work in distributed systems, famously said, "A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable." Given this bleak outlook and the large set of possible failures, how do you even begin to verify and validate that the distributed systems you build are doing the right thing?

by Caitie McCaffrey

Time is an Illusion.:
Lunchtime doubly so. - Ford Prefect to Arthur Dent in "The Hitchhiker’s Guide to the Galaxy", by Douglas Adams

One of the more surprising things about digital systems - and, in particular, modern computers - is how poorly they keep time. When most programs ran on a single system this was not a significant issue for the majority of software developers, but once software moved into the distributed-systems realm this inaccuracy became a significant challenge. Few programmers have read the most important paper in this area, Leslie Lamport’s "Time, Clocks, and the Ordering of Events in a Distributed System" (1978), and only a few more have come to appreciate the problems they face once they move into the world of distributed systems.

by George Neville-Neil

How Sysadmins Devalue Themselves:
And how to track on-call coverage

Q: Dear Tom, How can I devalue my work? Lately I’ve felt like everyone appreciates me, and, in fact, I’m overpaid and underutilized. Could you help me devalue myself at work? A: Dear Reader, Absolutely! I know what a pain it is to lug home those big paychecks. It’s so distracting to have people constantly patting you on the back. Ouch! Plus, popularity leads to dates with famous musicians and movie stars. (Just ask someone like Taylor Swift or Leonardo DiCaprio.) Who wants that kind of distraction when there’s a perfectly good video game to be played?

by Thomas A. Limoncelli

Schema.org: Evolution of Structured Data on the Web:
Big data makes common schemas even more necessary.

Separation between content and presentation has always been one of the important design aspects of the Web. Historically, however, even though most Web sites were driven off structured databases, they published their content purely in HTML. Services such as Web search, price comparison, reservation engines, etc. that operated on this content had access only to HTML. Applications requiring access to the structured data underlying these Web pages had to build custom extractors to convert plain HTML into structured data. These efforts were often laborious and the scrapers were fragile and error-prone, breaking every time a site changed its layout.

by R. V. Guha, Dan Brickley, Steve MacBeth

Immutability Changes Everything:
We need it, we can afford it, and the time is now.

There is an inexorable trend toward storing and sending immutable data. We need immutability to coordinate at a distance, and we can afford immutability as storage gets cheaper. This article is an amuse-bouche sampling the repeated patterns of computing that leverage immutability. Climbing up and down the compute stack really does yield a sense of déjà vu all over again.

by Pat Helland

The Paradox of Autonomy and Recognition:
Thoughts on trust and merit in software team culture

Who doesn’t want recognition for their hard work and contributions? Early in my career I wanted to believe that if you worked hard, and added value, you would be rewarded. I wanted to believe in the utopian ideal that hard work, discipline, and contributions were the fuel that propelled you up the corporate ladder. Boy, was I wrong.

by Kate Matsudaira