Don't Get Stuck in the "Con" Game:
Consistency, convergence, and confluence are not the same! Eventual consistency and eventual convergence aren't the same as confluence, either.

"Eventual consistency" is a popular phrase with a fuzzy definition. People are even inconsistent in their use of consistency. But two other terms, "convergence" and "confluence", have crisper definitions and are more easily understood.

by Pat Helland | August 5, 2021


Real-world String Comparison:
How to handle Unicode sequences correctly

In many languages a string comparison is a pitfall for beginners. With any Unicode string as input, a comparison often causes problems even for advanced users. The semantic equivalence of different characters in Unicode requires a normalization of the strings before comparing them. This article shows how to handle Unicode sequences correctly. The comparison of two strings for equality often raises questions concerning the difference between comparison by value, comparison of object references, strict equality, and loose equality. The most important aspect is semantic equivalence.

by Torsten Ullrich | July 29, 2021
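
The normalization step mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration using Python's standard `unicodedata` module, not code taken from the article; the helper name `semantically_equal` and the sample strings are assumptions chosen for the example.

```python
import unicodedata


def semantically_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same Unicode normalization form.

    `semantically_equal` is a hypothetical helper name, not an API from the article.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# "é" as one precomposed code point versus "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# A plain comparison looks at code points, so the two spellings differ.
plain_result = composed == decomposed

# After canonical normalization (NFC) they compare equal.
normalized_result = semantically_equal(composed, decomposed)

# Compatibility forms (NFKC) additionally fold characters such as the "fi" ligature.
ligature_nfc = semantically_equal("\ufb01", "fi", form="NFC")
ligature_nfkc = semantically_equal("\ufb01", "fi", form="NFKC")
```

Whether canonical (NFC/NFD) or compatibility (NFKC/NFKD) normalization is appropriate depends on the application; compatibility forms discard distinctions that some uses may want to keep.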


Digging into Big Provenance (with SPADE):
A user interface for querying provenance

Several interfaces exist for querying provenance. Many are not flexible in allowing users to select a database type of their choice. Some provide query functionality in a data model that is different from the graph-oriented one that is natural for provenance. Others have intuitive constructs for finding results but have limited support for efficiently chaining responses, as needed for faceted search. This article presents a user interface for querying provenance that addresses these concerns and is agnostic to the underlying database being used.

by Ashish Gehani, Raza Ahmad, Hassan Irshad, Jianqiao Zhu, Jignesh Patel | July 19, 2021


Baleen Analytics:
Large-scale filtering of data provides serendipitous surprises.

Data analytics hoovers up anything it can find, and we are finding patterns and insights that weren't available before, with implications both for data analytics and for messaging between services and microservices. It seems that a pretty good understanding among many different sources allows more flexibility and interconnectivity. Increasingly, flexibility dominates perfection.

by Pat Helland | January 7, 2021


Data on the Outside vs. Data on the Inside:
Data kept outside SQL has different characteristics from data kept inside.

This article describes the impact of services and trust on the treatment of data. It introduces the notions of inside data as distinct from outside data. After discussing the temporal implications of not sharing transactions across the boundaries of services, the article considers the need for immutability and stability in outside data. This leads to a depiction of outside data as a DAG of data items being independently generated by disparate services.

by Pat Helland | August 2, 2020


The Way We Think About Data:
Human inspection of black-box ML models; reclaiming ownership of data

The two papers I’ve chosen for this issue of acmqueue both challenge the way we think about and use data, though in very different ways. In "Stop Explaining Black-box Machine-learning Models for High-stakes Decisions and Use Interpretable Models Instead," Cynthia Rudin makes the case for models that can be inspected and interpreted by human experts. The second paper, "Local-first Software: You Own Your Data, in Spite of the Cloud," describes how to retain sovereignty over your data.

by Adrian Colyer | February 18, 2020


Space Time Discontinuum:
Combining data from many sources may cause painful delays.

Back when you had only one database for an application to worry about, you didn’t have to think about partial results. You also didn’t have to think about data arriving after some other data. It was all simply there. Now, you can do so much more with big distributed systems, but you have to be more sophisticated in the tradeoff between timely answers and complete answers.

by Pat Helland | November 18, 2019


The Singular Success of SQL:
SQL has a brilliant future as a major figure in the pantheon of data representations.

SQL has a brilliant past and a brilliant future. That future is not as the singular and ubiquitous holder of data but rather as a major figure in the pantheon of data representations. What the heck happens when data is not kept in SQL?

by Pat Helland | August 2, 2016


The Science of Managing Data Science:
Lessons learned managing a data science research team

What are they doing all day? When I first took over as VP of Engineering at a startup doing data mining and machine learning research, this was what the other executives wanted to know. They knew the team was super smart, and they seemed like they were working really hard, but the executives had lots of questions about the work itself. How did they know that the work they were doing was the "right" work? Were there other projects they could be doing instead? And how could we get this research into the hands of our customers faster?

by Kate Matsudaira | April 29, 2015


A Primer on Provenance:
Better understanding of data requires tracking its history and context.

Assessing the quality or validity of a piece of data is not usually done in isolation. You typically examine the context in which the data appears and try to determine its original sources or review the process through which it was created. This is not so straightforward when dealing with digital data, however: the result of a computation might have been derived from numerous sources and by applying complex successive transformations, possibly over long periods of time.

by Lucian Carata, Sherif Akoush, Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, Margo Seltzer, Andy Hopper | April 10, 2014


Provenance in Sensor Data Management:
A cohesive, independent solution for bringing provenance to scientific research

In today’s information-driven workplaces, data is constantly being moved around and undergoing transformation. The typical business-as-usual approach is to use e-mail attachments, shared network locations, databases, and more recently, the cloud. More often than not, there are multiple versions of the data sitting in different locations, and users of this data are confounded by the lack of metadata describing its provenance, or in other words, its lineage. The ProvDMS project at the Oak Ridge National Laboratory (ORNL), described in this article, aims to solve this issue in the context of sensor data.

by Zachary Hensley, Jibonananda Sanyal, Joshua New | January 23, 2014


Hazy: Making it Easier to Build and Maintain Big-data Analytics:
Racing to unleash the full potential of big data with the latest statistical and machine-learning techniques.

The rise of big data presents both big opportunities and big challenges in domains ranging from enterprises to sciences. The opportunities include better-informed business decisions, more efficient supply-chain management and resource allocation, more effective targeting of products and advertisements, better ways to "organize the world’s information," faster turnaround of scientific discoveries, etc.

by Arun Kumar, Feng Niu, Christopher Ré | January 23, 2013


The World According to LINQ:
Big data is about more than size, and LINQ is more than up to the task.

Programmers building Web- and cloud-based applications wire together data from many different sources such as sensors, social networks, user interfaces, spreadsheets, and stock tickers. Most of this data does not fit in the closed and clean world of traditional relational databases. It is too big, unstructured, denormalized, and streaming in realtime. Presenting a unified programming model across all these disparate data models and query languages seems impossible at first. By focusing on the commonalities instead of the differences, however, most data sources will accept some form of computation to filter and transform collections of data.

by Erik Meijer | August 30, 2011


Storage Strife:
Beware keeping data in binary format

Where I work we are very serious about storing all of our data, not just our source code, in our source-code control system. When we started the company we made the decision to store as much as possible in one place. The problem is that over time we have moved from a pure programming environment to one where there are other people - the kind of people who send e-mails using Outlook and who keep their data in binary and proprietary formats.

by George V. Neville-Neil | May 5, 2011


The Case Against Data Lock-in:
Want to keep your users? Just make it easy for them to leave.

Engineers employ many different tactics to focus on the user when writing software: for example, listening to user feedback, fixing bugs, and adding features that their users are clamoring for. Since Web-based services have made it easier for users to move to new applications, it’s becoming even more important to focus on building and retaining user trust. We’ve found that an incredibly effective way to earn and maintain user trust is to make it easy for users to leave your product with their data in tow. This not only prevents lock-in and engenders trust, but also forces your team to innovate and compete on technical merit.

by Brian W Fitzpatrick, JJ Lueck | October 8, 2010


Other People’s Data:
Companies have access to more types of external data than ever before. How can they integrate it most effectively?

Every organization bases some of its critical decisions on external data sources. In addition to traditional flat file data feeds, Web services and Web pages are playing an increasingly important role in data warehousing. The growth of Web services has made data feeds easily consumable at the departmental and even end-user levels. There are now more than 1,500 publicly available Web services and thousands of data mashups ranging from retail sales data to weather information to United States census data. These mashups are evidence that when users need information, they will find a way to get it.

by Stephen Petschulat | November 13, 2009


Latency and Livelocks:
Sometimes data just doesn’t travel as fast as it should.

Dear KV: My company has a very large database with all of our customer information. The database is replicated to several locations around the world to improve performance locally, so that when customers in Asia want to look at their data, they don’t have to wait for it to come from the United States, where my company is based...

by George Neville-Neil | April 28, 2008


It Isn’t Your Father’s Realtime Anymore:
The misuse and abuse of a noble term

Isn’t it a shame the way the term realtime has become so misused? I’ve noticed a slow devolution since 1982, when realtime systems became the main focus of my research, teaching, and consulting. Over these past 20-plus years, I have watched my beloved realtime become one of the most overloaded, overused, and overrated terms in the lexicon of computing. Worse, it has been purloined by users outside of the computing community and has been shamelessly exploited by marketing opportunists.

by Phillip Laplante | February 23, 2006


The Cost of Data:
Semi-structured data is the result of economics.

In the past few years people have convinced themselves that they have discovered an overlooked form of data. This new form of data is semi-structured. Bosh! There is no new form of data. What folks have discovered is really the effect of economics on data typing—but if you characterize the problem as one of economics, it isn’t nearly as exciting. It is, however, much more accurate and valuable. Seeing the reality of semi-structured data clearly can actually lead to improving data processing. As long as we look at this through the fogged vision of a “new type of data,” however, we will continue to misunderstand the problem and develop misguided solutions to address it.

by Chris Suver | December 8, 2005


Beyond Relational Databases:
There is more to data access than SQL.

The number and variety of computing devices in the environment are increasing rapidly. Real computers are no longer tethered to desktops or locked in server rooms. PDAs, highly mobile tablet and laptop devices, palmtop computers, and mobile telephony handsets now offer powerful platforms for the delivery of new applications and services. These devices are, however, only the tip of the iceberg. Hidden from sight are the many computing and network elements required to support the infrastructure that makes ubiquitous computing possible.

by Margo Seltzer | April 21, 2005


Would You Like Some Data with That?:
You know wireless technology has arrived when the Golden Arches announce they’ll be equipping franchises with wireless hotspots.

Just a few months ago, McDonald’s Corporation unveiled its plan for a pilot wireless access program at 10 restaurants in Manhattan. Several hundred restaurants at various metropolitan centers are to follow later in the year. Combine this with Intel’s recent announcement of built-in wireless (802.11) support as part of its new Centrino chipset, and you can reasonably conclude that ubiquitous wireless access may soon be upon us.

July 30, 2003