Pointers in Far Memory:
A rethink of how data and computations should be organized
Effectively exploiting emerging far-memory technology requires consideration of how to operate on richly connected data outside the context of the parent process. Operating-system technology in development offers help by exposing abstractions such as memory objects and globally invariant pointers that can be traversed by devices and newly instantiated compute. Such ideas will allow applications running on future heterogeneous distributed systems with disaggregated memory nodes to exploit near-memory processing for higher performance and to independently scale their memory and compute resources for lower cost.
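To make the idea of a globally invariant pointer concrete, here is a minimal Python sketch, not taken from the article: a pointer is encoded as an (object ID, offset) pair and resolved through a local object table, so any node or near-memory device that maps the same memory object can traverse it. All names (InvariantPtr, OBJECT_TABLE, map_object, deref) are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical far-memory object table: object ID -> locally mapped buffer.
# In a real system the OS or far-memory runtime would maintain this mapping.
OBJECT_TABLE = {}

@dataclass(frozen=True)
class InvariantPtr:
    """A pointer that stays meaningful across nodes: (object ID, offset)."""
    obj_id: int
    offset: int

def map_object(obj_id: int, backing: bytearray) -> None:
    """Register a memory object locally (stand-in for an OS mapping call)."""
    OBJECT_TABLE[obj_id] = backing

def deref(ptr: InvariantPtr, length: int) -> bytes:
    """Resolve an invariant pointer through the local object table."""
    backing = OBJECT_TABLE[ptr.obj_id]
    return bytes(backing[ptr.offset:ptr.offset + length])

# Any node or device holding the same object mapping can traverse the
# pointer without involving the parent process.
map_object(42, bytearray(b"linked-data..."))
print(deref(InvariantPtr(obj_id=42, offset=7), 4))  # b'data'
```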
Sharpening Your Tools:
Updating bulk_extractor for the 2020s
This article presents our experience updating the high-performance digital forensics (DF) tool BE (bulk_extractor) a decade after its initial release. Between 2018 and 2022, we updated the program from C++98 to C++17. We also performed a complete code refactoring and adopted a unit-test framework. DF tools must be frequently updated to keep up with changes in the ways they are used. A description of updates to the bulk_extractor tool serves as an example of what can and should be done.
More Than Just Algorithms:
A discussion with Alfred Spector, Peter Norvig, Chris Wiggins, Jeannette Wing, Ben Fried, and Michael Tingley
Dramatic advances in the ability to gather, store, and process data have led to the rapid growth of data science and its mushrooming impact on nearly all aspects of the economy and society. Data science has also had a huge effect on academic disciplines, with new research agendas, new degrees, and new organizational entities. The authors of a new textbook, Data Science in Context: Foundations, Challenges, Opportunities, share their perspectives on the field and its far-reaching impact.
Literate Executables
Literate executables redefine the relationship between compiled binaries and source code to be that of chicken and egg, so it's easy to derive either from the other. This episode of Drill Bits provides a general-purpose literacy tool and showcases the advantages of literacy by retrofitting it onto everyone's favorite command-line utility.
Crash Consistency:
Keeping data safe in the presence of crashes is a fundamental problem.
Keeping data safe in the presence of crashes is a fundamental problem in storage systems. Although the high-level ideas for crash consistency are relatively well understood, realizing them in practice is surprisingly complex and full of challenges. The systems research community is actively working on solving this challenge, and the papers examined here offer three solutions.
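As a concrete illustration of one of those well-understood high-level ideas, here is a minimal Python sketch, not drawn from the papers discussed, of crash-consistent file replacement: write to a temporary file, force it to stable storage, atomically rename it over the old version, then persist the directory entry.

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Crash-consistent update: write a temp file, fsync it, then
    atomically rename it over the old version and fsync the directory."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # force the new contents to stable storage
        os.rename(tmp, path)         # atomic replacement on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
    dir_fd = os.open(dirname, os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)             # persist the new directory entry
    finally:
        os.close(dir_fd)
```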
FHIR: Reducing Friction in the Exchange of Healthcare Data:
A discussion with James Agnew, Pat Helland, and Adam Cole
With the full clout of the Centers for Medicare and Medicaid Services currently being brought to bear on healthcare providers to meet high standards for patient data interoperability and accessibility, it would be easy to assume the only reason this goal wasn't accomplished long ago is simply a lack of will. Interoperable data? How hard can that be? Much harder than you think, it turns out. To dig into why this is the case, we asked Pat Helland, a principal architect at Salesforce, to speak with James Agnew (CTO) and Adam Cole (senior solutions architect) of Smile CDR, a Toronto, Ontario-based provider of a leading platform used by healthcare organizations to achieve FHIR (Fast Healthcare Interoperability Resources) compliance.
Persistent Memory Allocation:
Leverage to move a world of software
A lever multiplies the force of a light touch, and the right software interfaces provide formidable leverage in multiple layers of code: A familiar interface enables a new persistent memory allocator to breathe new life into an enormous installed base of software and hardware. Compatibility allows a persistent heap to slide easily beneath a widely used scripting-language interpreter, thereby endowing all scripts with effortless on-demand persistence.
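To illustrate the kind of leverage described, here is a toy Python sketch, assuming nothing about the actual allocator in the article: a bump allocator over a memory-mapped file whose offsets act as stable "pointers" across runs. Names such as PersistentArena and alloc are hypothetical, and real persistent allocators must also handle free lists, crash consistency, and pointer fixup.

```python
import mmap
import os

class PersistentArena:
    """Toy persistent 'heap': a bump allocator over a memory-mapped file."""

    HEADER = 8  # the first 8 bytes store the current bump offset

    def __init__(self, path: str, size: int = 1 << 20):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, size)
        self.mem = mmap.mmap(self.fd, size)
        if self._top() < self.HEADER:       # fresh (all-zero) arena: initialize
            self._set_top(self.HEADER)

    def _top(self) -> int:
        return int.from_bytes(self.mem[:self.HEADER], "little")

    def _set_top(self, value: int) -> None:
        self.mem[:self.HEADER] = value.to_bytes(self.HEADER, "little")

    def alloc(self, data: bytes) -> int:
        """Allocate and persist a block; return its offset, a stable 'pointer'."""
        off = self._top()
        self.mem[off:off + len(data)] = data
        self._set_top(off + len(data))
        self.mem.flush()                    # push the update to the backing file
        return off

    def read(self, off: int, length: int) -> bytes:
        return bytes(self.mem[off:off + length])
```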
Autonomous Computing:
We frequently compute across autonomous boundaries, but the patterns that preserve that independence, and their implications, are rarely discussed.
Autonomous computing is a pattern for business work that uses collaborations to connect fiefdoms and their emissaries. This pattern, based on paper forms, has been used for centuries. Here, we explain fiefdoms, collaborations, and emissaries. We examine how emissaries work outside the autonomous boundary and provide convenience while remaining outsiders. And we examine how work across different fiefdoms can be initiated, run for long periods of time, and eventually be completed.
The Planning and Care of Data:
Rearranging buckets for no good reason
Questions such as, "How do we secure this data?" work only if you ask them at the start, and not when some lawyers or government officials are sitting in a conference room, rooting through your data and logs, and making threatening noises under their breath. All the things we care about with our data require forethought, but it seems in our rush to create "stakeholder value" we are willing to sacrifice these important attributes and just act like data gourmands, until, like Mr.
Persistence Programming:
Are we doing this right?
A few years ago, my team was working on a commercial Java development project for Enhanced 911 (E911) emergency call centers. We were frustrated by trying to meet the data-storage requirements of this project using the traditional model of Java over an SQL database. After some reflection about the particular requirements (and nonrequirements) of the project, we took a deep breath and decided to create our own custom persistence layer from scratch.
Don't Get Stuck in the "Con" Game:
Consistency, convergence, and confluence are not the same! Eventual consistency and eventual convergence aren't the same as confluence, either.
"Eventual consistency" is a popular phrase with a fuzzy definition. People are even inconsistent in their use of consistency. But two other terms, "convergence" and "confluence", that have crisper definitions and are more easily understood.
Real-world String Comparison:
How to handle Unicode sequences correctly
In many languages, string comparison is a pitfall for beginners. With any Unicode string as input, a comparison often causes problems even for advanced users. The semantic equivalence of different characters in Unicode requires normalizing the strings before comparing them. This article shows how to handle Unicode sequences correctly. Comparing two strings for equality often raises questions about the difference between comparison by value, comparison of object references, strict equality, and loose equality. The most important aspect is semantic equivalence.
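A minimal sketch of the core point, shown here in Python although the article is language-agnostic: a precomposed character and its decomposed equivalent compare unequal code point by code point but equal after Unicode normalization.

```python
import unicodedata

a = "caf\u00e9"     # 'é' as a single precomposed code point
b = "cafe\u0301"    # 'e' followed by a combining acute accent

print(a == b)                                   # False: the code points differ
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))          # True: semantically equivalent
```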
Digging into Big Provenance (with SPADE):
A user interface for querying provenance
Several interfaces exist for querying provenance. Many are not flexible in allowing users to select a database type of their choice. Some provide query functionality in a data model that is different from the graph-oriented one that is natural for provenance. Others have intuitive constructs for finding results but have limited support for efficiently chaining responses, as needed for faceted search. This article presents a user interface for querying provenance that addresses these concerns and is agnostic to the underlying database being used.
Baleen Analytics:
Large-scale filtering of data provides serendipitous surprises.
Data analytics hoovers up anything it can find, and we are finding patterns and insights that weren't available before, with implications both for data analytics and for messaging between services and microservices. It seems that a pretty good shared understanding among many different sources allows more flexibility and interconnectivity. Increasingly, flexibility dominates perfection.
Data on the Outside vs. Data on the Inside:
Data kept outside SQL has different characteristics from data kept inside.
This article describes the impact of services and trust on the treatment of data. It introduces the notion of inside data as distinct from outside data. After discussing the temporal implications of not sharing transactions across the boundaries of services, the article considers the need for immutability and stability in outside data. This leads to a depiction of outside data as a DAG of data items being independently generated by disparate services.
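A small sketch of what such a DAG of outside data can look like, using hypothetical names rather than code from the article: each item is immutable, is identified by a hash of its content, and references earlier items by their identifiers, so references can only point backward.

```python
import hashlib
import json

def make_item(payload: dict, parents: list[str]) -> tuple[str, str]:
    """Create an immutable outside-data item. Its identifier is a hash of
    its content, so the item can never change, and its references to earlier
    items can only point backward, forming a DAG."""
    body = json.dumps({"payload": payload, "parents": parents}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest(), body

order_id, order = make_item({"order": 17, "qty": 3}, parents=[])
ship_id, ship = make_item({"shipped": True}, parents=[order_id])
print(ship_id, "references", order_id)
```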
The Way We Think About Data:
Human inspection of black-box ML models; reclaiming ownership of data
The two papers I’ve chosen for this issue of acmqueue both challenge the way we think about and use data, though in very different ways. In "Stop Explaining Black-box Machine-learning Models for High-stakes Decisions and Use Interpretable Models Instead," Cynthia Rudin makes the case for models that can be inspected and interpreted by human experts. The second paper, "Local-first Software: You Own Your Data, in Spite of the Cloud," describes how to retain sovereignty over your data.
Space Time Discontinuum:
Combining data from many sources may cause painful delays.
Back when you had only one database for an application to worry about, you didn’t have to think about partial results. You also didn’t have to think about data arriving after some other data. It was all simply there. Now, you can do so much more with big distributed systems, but you have to be more sophisticated in the tradeoff between timely answers and complete answers.
The Singular Success of SQL:
SQL has a brilliant future as a major figure in the pantheon of data representations.
SQL has a brilliant past and a brilliant future. That future is not as the singular and ubiquitous holder of data but rather as a major figure in the pantheon of data representations. What the heck happens when data is not kept in SQL?
The Science of Managing Data Science:
Lessons learned managing a data science research team
What are they doing all day? When I first took over as VP of Engineering at a startup doing data mining and machine learning research, this was what the other executives wanted to know. They knew the team was super smart, and they seemed like they were working really hard, but the executives had lots of questions about the work itself. How did they know that the work they were doing was the "right" work? Were there other projects they could be doing instead? And how could we get this research into the hands of our customers faster?
A Primer on Provenance:
Better understanding of data requires tracking its history and context.
Assessing the quality or validity of a piece of data is not usually done in isolation. You typically examine the context in which the data appears and try to determine its original sources or review the process through which it was created. This is not so straightforward when dealing with digital data, however: the result of a computation might have been derived from numerous sources and by applying complex successive transformations, possibly over long periods of time.
Provenance in Sensor Data Management:
A cohesive, independent solution for bringing provenance to scientific research
In today’s information-driven workplaces, data is constantly being moved around and undergoing transformation. The typical business-as-usual approach is to use e-mail attachments, shared network locations, databases, and more recently, the cloud. More often than not, there are multiple versions of the data sitting in different locations, and users of this data are confounded by the lack of metadata describing its provenance or, in other words, its lineage. The ProvDMS project at the Oak Ridge National Laboratory (ORNL) described in this article aims to solve this issue in the context of sensor data.
Hazy: Making it Easier to Build and Maintain Big-data Analytics:
Racing to unleash the full potential of big data with the latest statistical and machine-learning techniques.
The rise of big data presents both big opportunities and big challenges in domains ranging from enterprises to sciences. The opportunities include better-informed business decisions, more efficient supply-chain management and resource allocation, more effective targeting of products and advertisements, better ways to "organize the world’s information," faster turnaround of scientific discoveries, etc.
The World According to LINQ:
Big data is about more than size, and LINQ is more than up to the task.
Programmers building Web- and cloud-based applications wire together data from many different sources such as sensors, social networks, user interfaces, spreadsheets, and stock tickers. Most of this data does not fit in the closed and clean world of traditional relational databases. It is too big, unstructured, denormalized, and streaming in realtime. Presenting a unified programming model across all these disparate data models and query languages seems impossible at first. By focusing on the commonalities instead of the differences, however, most data sources will accept some form of computation to filter and transform collections of data.
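LINQ itself lives in .NET; as a rough analogy only, the following Python sketch shows the commonality the article points to: once each source is exposed as a collection, a single filter-and-transform pipeline applies regardless of where the rows came from. The sources and field names are made up.

```python
import csv
import io
import json

# Two very different "data sources" exposed as plain Python iterables.
csv_rows = csv.DictReader(io.StringIO("symbol,price\nACME,12.5\nGLOBEX,99.0"))
json_rows = json.loads('[{"symbol": "INITECH", "price": 3.5}]')

def quotes(rows):
    """One filter/transform pipeline, regardless of the underlying source."""
    return [r["symbol"] for r in rows if float(r["price"]) > 10.0]

print(quotes(csv_rows) + quotes(json_rows))   # ['ACME', 'GLOBEX']
```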
Storage Strife:
Beware keeping data in binary format
Where I work we are very serious about storing all of our data, not just our source code, in our source-code control system. When we started the company we made the decision to store as much as possible in one place. The problem is that over time we have moved from a pure programming environment to one where there are other people - the kind of people who send e-mails using Outlook and who keep their data in binary and proprietary formats.
The Case Against Data Lock-in:
Want to keep your users? Just make it easy for them to leave.
Engineers employ many different tactics to focus on the user when writing software: for example, listening to user feedback, fixing bugs, and adding features that their users are clamoring for. Since Web-based services have made it easier for users to move to new applications, it’s becoming even more important to focus on building and retaining user trust. We’ve found that an incredibly effective way to earn and maintain user trust is to make it easy for users to leave your product with their data in tow. This not only prevents lock-in and engenders trust, but also forces your team to innovate and compete on technical merit.
Other People’s Data:
Companies have access to more types of external data than ever before. How can they integrate it most effectively?
Every organization bases some of its critical decisions on external data sources. In addition to traditional flat file data feeds, Web services and Web pages are playing an increasingly important role in data warehousing. The growth of Web services has made data feeds easily consumable at the departmental and even end-user levels. There are now more than 1,500 publicly available Web services and thousands of data mashups ranging from retail sales data to weather information to United States census data. These mashups are evidence that when users need information, they will find a way to get it.
Latency and Livelocks:
Sometimes data just doesn’t travel as fast as it should.
Dear KV: My company has a very large database with all of our customer information. The database is replicated to several locations around the world to improve performance locally, so that when customers in Asia want to look at their data, they don’t have to wait for it to come from the United States, where my company is based...
It Isn’t Your Father’s Realtime Anymore:
The misuse and abuse of a noble term
Isn’t it a shame the way the term realtime has become so misused? I’ve noticed a slow devolution since 1982, when realtime systems became the main focus of my research, teaching, and consulting. Over these past 20-plus years, I have watched my beloved realtime become one of the most overloaded, overused, and overrated terms in the lexicon of computing. Worse, it has been purloined by users outside of the computing community and has been shamelessly exploited by marketing opportunists.
The Cost of Data:
Semi-structured data is the result of economics.
In the past few years people have convinced themselves that they have discovered an overlooked form of data. This new form of data is semi-structured. Bosh! There is no new form of data. What folks have discovered is really the effect of economics on data typing—but if you characterize the problem as one of economics, it isn’t nearly as exciting. It is, however, much more accurate and valuable. Seeing the reality of semi-structured data clearly can actually lead to improving data processing. As long as we look at this through the fogged vision of a “new type of data,” however, we will continue to misunderstand the problem and develop misguided solutions to address it.
Beyond Relational Databases:
There is more to data access than SQL.
The number and variety of computing devices in the environment are increasing rapidly. Real computers are no longer tethered to desktops or locked in server rooms. PDAs, highly mobile tablet and laptop devices, palmtop computers, and mobile telephony handsets now offer powerful platforms for the delivery of new applications and services. These devices are, however, only the tip of the iceberg. Hidden from sight are the many computing and network elements required to support the infrastructure that makes ubiquitous computing possible.
Would You Like Some Data with That?:
You know wireless technology has arrived when the Golden Arches announce they’ll be equipping franchises with wireless hotspots.
Just a few months ago, McDonald’s Corporation unveiled its plan for a pilot wireless access program at 10 restaurants in Manhattan. Several hundred restaurants at various metropolitan centers are to follow later in the year. Combine this with Intel’s recent announcement of built-in wireless (802.11) support as part of its new Centrino chipset, and you can reasonably conclude that ubiquitous wireless access may soon be upon us.