XML and JSON Are Like Cardboard:
Cardboard surrounds and protects stuff as it crosses boundaries.
With cardboard, the safety and care of the stuff inside is the whole reason for its existence. Similarly, with XML and JSON, the safety and care of the data, both in transit and in storage, are why we bother.
Write Amplification Versus Read Perspiration:
The tradeoffs between write and read
In computing, there’s an interesting trend where writing creates a need to do more work. You need to reorganize, merge, reindex, and more to make the stuff you wrote more useful. If you don’t, you must search or do other work to support future reads.
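To make that tradeoff concrete, here is a minimal Python sketch (the class names and structure are invented for illustration) contrasting the two extremes: a store that does no extra work on writes and therefore scans on every read, and a store that pays a reorganization cost on every write so reads stay cheap. Real systems, such as B-trees and LSM-trees, live between these poles.

```python
import bisect

# Write-optimized: appends are O(1), but every read must scan the log.
class AppendOnlyStore:
    def __init__(self):
        self.log = []                       # unsorted (key, value) pairs

    def put(self, key, value):
        self.log.append((key, value))       # cheap write, no amplification

    def get(self, key):
        # "read perspiration": scan backwards for the latest match
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None

# Read-optimized: every write maintains sorted order, reads are O(log n).
class SortedStore:
    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key, value):
        # "write amplification": pay the reorganization cost up front
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```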
Transactions and Serverless are Made for Each Other:
If serverless platforms could wrap functions in database transactions, they would be a good fit for database-backed applications.
Database-backed applications are an exciting new frontier for serverless computation. By tightly integrating application execution and data management, a transactional serverless platform enables many new features not possible in either existing serverless platforms or server-based deployments.
Too Big to Fail:
Visibility leads to debuggability.
Our project has been rolling out a well-known distributed key/value store onto our infrastructure, and we've been surprised, more than once, when a simple increase in the number of clients has not only slowed things down but brought them to a complete halt. Each time, this has resulted in a rollback while several of us scoured the online forums to figure out whether anyone else had seen the same problem. The entire reason for using this project's software is to increase the scale of a large system, so I have been surprised at how many times a small increase in load has led to a complete failure.
The Power of Babble:
Expect to be constantly and pleasantly befuddled
Metadata defines the shape, the form, and how to understand our data. It is following the trend taken by natural languages in our increasingly interconnected world. While many concepts can be communicated using shared metadata, no one can keep up with the number of disparate new concepts needed to have a common understanding.
The Pathologies of Big Data:
Scale up your datasets enough and all your apps will come undone. What are the typical problems and where do the bottlenecks generally surface?
What is "big data" anyway? Gigabytes? Terabytes? Petabytes? A brief personal memory may provide some perspective. In the late 1980s at Columbia University I had the chance to play around with what at the time was a truly enormous "disk": the IBM 3850 MSS (Mass Storage System). The MSS was actually a fully automatic robotic tape library and associated staging disks to make random access, if not exactly instantaneous, at least fully transparent. In Columbia’s configuration, it stored a total of around 100 GB. It was already on its way out by the time I got my hands on it, but in its heyday, the early to mid-1980s, it had been used to support access by social scientists to what was unquestionably "big data" at the time: the entire 1980 U.S.
Statistics for Engineers:
Applying statistical techniques to operations data
Modern IT systems collect an increasing wealth of data from network gear, operating systems, applications, and other components. This data needs to be analyzed to derive vital information about the user experience and business performance. For instance, faults need to be detected, service quality needs to be measured, and resource usage for the coming days and months needs to be forecast.
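As a small taste of the techniques involved, here is a sketch (assuming Python 3.8 or later, with invented sample latencies) that summarizes operations data with medians and tail percentiles rather than a single mean:

```python
import statistics

# Hypothetical request latencies (milliseconds) collected from a service.
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 900, 17, 15, 14]

mean = statistics.mean(latencies_ms)
median = statistics.median(latencies_ms)
# quantiles(..., n=100) returns the 99 cut points between percentiles;
# index 94 is the 95th percentile, index 98 is the 99th.
pcts = statistics.quantiles(latencies_ms, n=100)
p95, p99 = pcts[94], pcts[98]

print(f"mean={mean:.1f}ms median={median:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
# A handful of slow outliers barely move the median but dominate the mean and
# the high percentiles, which is why tail latency is reported separately.
```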
Schema.org: Evolution of Structured Data on the Web:
Big data makes common schemas even more necessary.
Separation between content and presentation has always been one of the important design aspects of the Web. Historically, however, even though most Web sites were driven off structured databases, they published their content purely in HTML. Services such as Web search, price comparison, reservation engines, etc. that operated on this content had access only to HTML. Applications requiring access to the structured data underlying these Web pages had to build custom extractors to convert plain HTML into structured data. These efforts were often laborious and the scrapers were fragile and error-prone, breaking every time a site changed its layout.
Scalable SQL:
How do large-scale sites and applications remain SQL-based?
One of the leading motivators for NoSQL innovation is the desire to achieve very high scalability to handle the vagaries of Internet-size workloads. Yet many big social Web sites and many other Web sites and distributed tier 1 applications that require high scalability reportedly remain SQL-based for their core data stores and services. The question is, how do they do it?
Research for Practice: Distributed Consensus and Implications of NVM on Database Management Systems:
Expert-curated Guides to the Best of CS Research
First, how do large-scale distributed systems mediate access to shared resources, coordinate updates to mutable state, and reliably make decisions in the presence of failures? Second, while consensus concerns distributed shared state, the other selection examines the impact of a hardware trend, nonvolatile memory, on single-node shared state.
ORM in Dynamic Languages:
O/R mapping frameworks for dynamic languages such as Groovy provide a different flavor of ORM that can greatly simplify application code.
A major component of most enterprise applications is the code that transfers objects in and out of a relational database. The easiest solution is often to use an ORM (object-relational mapping) framework, which allows the developer to declaratively define the mapping between the object model and database schema and express database-access operations in terms of objects. This high-level approach significantly reduces the amount of database-access code that needs to be written and boosts developer productivity.
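To illustrate the declarative flavor described above, here is a toy, hand-rolled mapper sketched in Python with sqlite3; it is not any particular ORM framework, but it shows the core idea: the class declaration doubles as the mapping, and the SQL is derived from it rather than written by hand for every class.

```python
import sqlite3
from dataclasses import dataclass, fields, astuple

@dataclass
class Customer:            # the class declaration is also the mapping
    id: int
    name: str
    email: str

def save(conn, obj, table):
    cols = [f.name for f in fields(obj)]
    placeholders = ", ".join("?" for _ in cols)
    sql = f"INSERT OR REPLACE INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    conn.execute(sql, astuple(obj))

def load(conn, cls, table, obj_id):
    cols = [f.name for f in fields(cls)]
    row = conn.execute(
        f"SELECT {', '.join(cols)} FROM {table} WHERE id = ?", (obj_id,)
    ).fetchone()
    return cls(*row) if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
save(conn, Customer(1, "Ada", "ada@example.com"), "customer")
print(load(conn, Customer, "customer", 1))   # Customer(id=1, name='Ada', ...)
```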
Numbers Are for Computers, Strings Are for Humans:
How and where software should translate data into a human-readable form
Unless what you are processing, storing, or transmitting are, quite literally, strings that come from and are meant to be shown to humans, you should avoid processing, storing, or transmitting that data as strings. Remember, numbers are for computers, strings are for humans. Let the computer do the work of presenting your data to the humans in a form they might find palatable. That’s where those extra bytes and instructions should be spent, not doing the inverse.
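A minimal sketch of the principle, using hypothetical measurement data in Python: keep the data numeric in storage and on the wire, and turn it into a string only at the presentation layer.

```python
import struct
import time

# Store and transmit the measurement as numbers...
timestamp = int(time.time())          # seconds since the epoch, an integer
temperature_c = 21.5

packet = struct.pack("!qd", timestamp, temperature_c)   # 16 bytes, fixed width
# ...versus a human-oriented string carrying the same information:
text = f"{time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(timestamp))} {temperature_c} C"

print(len(packet), "bytes as numbers;", len(text), "bytes as a string")

# Only at the presentation layer do we turn the numbers back into a string:
ts, temp = struct.unpack("!qd", packet)
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts)), f"{temp:.1f} C")
```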
Immutability Changes Everything:
We need it, we can afford it, and the time is now.
There is an inexorable trend toward storing and sending immutable data. We need immutability to coordinate at a distance, and we can afford immutability as storage gets cheaper. This article is an amuse-bouche sampling the repeated patterns of computing that leverage immutability. Climbing up and down the compute stack really does yield a sense of déjà vu all over again.
If You Have Too Much Data, then "Good Enough" Is Good Enough:
In today’s humongous database systems, clarity may be relaxed, but business needs can still be met.
Classic database systems offer crisp answers for a relatively small amount of data. These systems hold their data in one or a relatively small number of computers. With a tightly defined schema and transactional consistency, the results returned from queries are crisp and accurate. Newer systems have humongous amounts of data, high change rates, and high query rates, and they require lots of computers to hold and process it all. The data quality and meaning are fuzzy. The schema, if present, is likely to vary across the data. The origin of the data may be suspect, and its staleness may vary.
Identity by Any Other Name:
The complex cacophony of intertwined systems
New emerging systems and protocols both tighten and loosen our notions of identity, and that’s good! They make it easier to get stuff done. REST, IoT, big data, and machine learning all revolve around notions of identity that are deliberately kept flexible and sometimes ambiguous. Notions of identity underlie our basic mechanisms of distributed systems, including interchangeability, idempotence, and immutability.
How Will Astronomy Archives Survive the Data Tsunami?:
Astronomers are collecting more data than ever. What practices can keep them ahead of the flood?
Astronomy is already awash with data: currently 1 PB of public data is electronically accessible, and this volume is growing at 0.5 PB per year. The availability of this data has already transformed research in astronomy, and the Space Telescope Science Institute (STScI) now reports that more papers are published with archived data sets than with newly acquired data. This growth in data size and anticipated usage will accelerate in the coming few years as new projects such as the LSST (Large Synoptic Survey Telescope), ALMA (Atacama Large Millimeter/submillimeter Array), and SKA (Square Kilometre Array) move into operation. These new projects will use much larger arrays of telescopes and detectors or much higher data acquisition rates than are now used.
Extract, Shoehorn, and Load:
Data doesn’t always fit nicely into a new home.
It turns out that the business value of ill-fitting data is extremely high. The process of taking the input data, discarding what doesn’t fit, adding default or null values for missing stuff, and generally shoehorning it to the prescribed shape is important. The prescribed shape is usually one that is amenable to analysis for deeper meaning.
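As an illustration only (the target schema and field names are invented), here is a tiny Python sketch of the shoehorning step: discard what doesn't fit, default what's missing, and coerce the rest into the prescribed shape.

```python
# A toy "shoehorn" step against an invented target schema.
TARGET_SCHEMA = {            # field -> (type, default)
    "customer_id": (int, None),
    "country":     (str, "unknown"),
    "amount_usd":  (float, 0.0),
}

def shoehorn(raw: dict) -> dict:
    shaped = {}
    for field, (ftype, default) in TARGET_SCHEMA.items():
        if field in raw and raw[field] is not None:
            try:
                shaped[field] = ftype(raw[field])    # coerce to the prescribed type
            except (TypeError, ValueError):
                shaped[field] = default              # ill-fitting value: use default
        else:
            shaped[field] = default                  # missing value: use default
    return shaped                                    # extra input fields are discarded

raw = {"customer_id": "42", "amount_usd": "19.99", "color": "teal"}
print(shoehorn(raw))
# {'customer_id': 42, 'country': 'unknown', 'amount_usd': 19.99}
```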
Exposing the ORM Cache:
Familiarity with ORM caching issues can help prevent performance problems and bugs.
In the early 1990s, when object-oriented languages emerged into the mainstream of software development, a noticeable surge in productivity occurred as developers saw new and better ways to create software programs. Although the new and efficient object programming paradigm was hailed and accepted by a growing number of organizations, relational database management systems remained the preferred technology for managing enterprise data. Thus was born ORM (object-relational mapping), out of necessity, and the complex challenge of saving the persistent state of an object environment in a relational database subsequently became known as the object-relational impedance mismatch.
Eventually Consistent: Not What You Were Expecting?:
Methods of quantifying consistency (or lack thereof) in eventually consistent storage systems
Storage systems continue to lay the foundation for modern Internet services such as Web search, e-commerce, and social networking. Pressures caused by rapidly growing user bases and data sets have driven system designs away from conventional centralized databases and toward more scalable distributed solutions, including simple NoSQL key-value storage systems, as well as more elaborate NewSQL databases that support transactions at scale.
Eventual Consistency Today: Limitations, Extensions, and Beyond:
How can applications be built on eventually consistent infrastructure given no guarantee of safety?
In a July 2000 conference keynote, Eric Brewer, now VP of engineering at Google and a professor at the University of California, Berkeley, publicly postulated the CAP (consistency, availability, and partition tolerance) theorem, which would change the landscape of how distributed storage systems were architected. Brewer’s conjecture--based on his experiences building infrastructure for some of the first Internet search engines at Inktomi--states that distributed systems requiring always-on, highly available operation cannot guarantee the illusion of coherent, consistent single-system operation in the presence of network partitions, which cut communication between active servers.
Edge Computing:
Scaling resources within multiple administrative domains
Creating edge computing infrastructures and applications encompasses quite a breadth of systems research. Let’s take a look at the academic view of edge computing and a sample of existing research that will be relevant in the coming years.
Don’t Settle for Eventual Consistency:
Stronger properties for low-latency geo-replicated storage
Geo-replicated storage provides copies of the same data at multiple, geographically distinct locations. Facebook, for example, geo-replicates its data (profiles, friends lists, likes, etc.) to data centers on the east and west coasts of the United States, and in Europe. In each data center, a tier of separate Web servers accepts browser requests and then handles those requests by reading and writing data from the storage system.
Disambiguating Databases:
Use the database built for your access model.
The topic of data storage is one that doesn’t need to be well understood until something goes wrong (data disappears) or something goes really right (too many customers). Because databases can be treated as black boxes with an API, their inner workings are often overlooked. They’re often treated as magic things that just take data when offered and supply it when asked. Since these two operations are the only understood activities of the technology, they are often the only features presented when comparing different technologies.
Deduplicating Devices Considered Harmful:
A good idea, but it can be taken too far
During the research for their interesting paper, "Reliably Erasing Data From Flash-based Solid State Drives," delivered at the FAST (File and Storage Technologies) conference in San Jose in February 2011, Michael Wei and his co-authors from the University of California, San Diego, discovered that at least one flash controller, the SandForce SF-1200, was by default doing block-level deduplication of the data written to it. The SF-1200 is used in SSDs (solid-state drives) from, among others, Corsair, ADATA, and Mushkin.
Databases of Discovery:
Open-ended database ecosystems promote new discoveries in biotech. Can they help your organization, too?
The National Center for Biotechnology Information is responsible for massive amounts of data. A partial list includes the largest public bibliographic database in biomedicine; the U.S. national DNA sequence database; an online free full text research article database; assembly, annotation, and distribution of a reference set of genes, genomes, and chromosomes; online text search and retrieval systems; and specialized molecular biology data search engines. At this writing, NCBI receives about 50 million Web hits per day, at peak rates of about 1,900 hits per second, and about 400,000 BLAST searches per day from about 2.5 million users.
Data in Flight:
How streaming SQL technology can help solve the Web 2.0 data crunch.
Web applications produce data at colossal rates, and those rates compound every year as the Web becomes more central to our lives. Other data sources such as environmental monitoring and location-based services are a rapidly expanding part of our day-to-day experience. Even as throughput is increasing, users and business owners expect to see their data with ever-decreasing latency. Advances in computer hardware (cheaper memory, cheaper disks, and more processing cores) are helping somewhat, but not enough to keep pace with the twin demands of rising throughput and decreasing latency.
Data Sketching:
The approximate approach is often faster and more efficient.
Do you ever feel overwhelmed by an unending stream of information? It can seem like a barrage of new email and text messages demands constant attention, and there are also phone calls to pick up, articles to read, and knocks on the door to answer. Putting these pieces together to keep track of what’s important can be a real challenge. In response to this challenge, the model of streaming data processing has grown in popularity. The aim is no longer to capture, store, and index every minute event, but rather to process each observation quickly in order to create a summary of the current state.
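For a concrete taste of such a summary, here is a minimal Count-Min sketch in Python (the width and depth parameters are chosen arbitrarily): a fixed-size structure that trades exact counts for fast, approximate ones over a stream.

```python
import hashlib

# A tiny Count-Min sketch. It answers "roughly how many times have I seen x?"
# in fixed memory, over-counting slightly but never under-counting.
class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent-ish hash per row, derived by salting blake2b.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._buckets(item))

sketch = CountMinSketch()
for event in ["get", "put", "get", "scan", "get"]:   # a stream of observations
    sketch.add(event)
print(sketch.estimate("get"))    # ~3 (may be slightly high, never low)
```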
DAML: The Contract Language of Distributed Ledgers:
A discussion between Shaul Kfir and Camille Fournier
We’ll see the same kind of Cambrian explosion we witnessed in the web world once we started using mutualized infrastructure in public clouds and frameworks. It took only three weeks to learn enough Ruby on Rails and Heroku to push out the first version of a management system for that brokerage. And that’s because I had to think only about the models, the views, and the controllers. The hardest part, of course, had to do with building a secure wallet.
Crashproofing the Original NoSQL Key-Value Store
Fortifying software to protect persistent data from crashes can be remarkably easy if a modern file system handles the heavy lifting. This episode of Drill Bits unveils a new crash-tolerance mechanism that vaults the venerable gdbm database into the league of transactional NoSQL data stores. We'll motivate this upgrade by tracing gdbm's history. We'll survey the subtle science of crashproofing, navigating a minefield of traps for the unwary. We'll arrive at a compact and rugged design that leverages modern file-system features, and we'll tour the production-ready implementation of this design and its ergonomic interface.
Consistently Eventual:
For many data items, the work never settles on a value.
Applications are no longer islands. Not only do they frequently run distributed and replicated over many cloud-based computers, but they also run over many hand-held computers. This makes it challenging to talk about a single truth at a single place or time. In addition, most modern applications interact with other applications. These interactions settle out to impact understanding. Over time, a shared opinion emerges just as new interactions add increasing uncertainty. Many business, personal, and computational "facts" are, in fact, uncertain. As some changes settle, others meander from place to place. With all the regular, irregular, and uncleared checks, my understanding of our personal joint checking account is a bit hazy.
Cluster Scheduling for Data Centers:
Expert-curated Guides to the Best of CS Research: Distributed Cluster Scheduling
This installment of Research for Practice features a curated selection from Malte Schwarzkopf, who takes us on a tour of distributed cluster scheduling, from research to practice, and back again. With the rise of elastic compute resources, cluster management has become an increasingly hot topic in systems R&D, and a number of competing cluster managers, including Kubernetes, Mesos, and Docker Swarm, are currently jockeying for the crown in this space.
Business Process Minded:
Transcript of interview with Edwin Khodabakchian, vice president of product development at Oracle
A new paradigm created to empower business system analysts by giving them access to metadata that they can directly control to drive business process management is about to sweep the enterprise application arena. In an interview with ACM Queuecast host Michael Vizard, Oracle vice president of product development Edwin Khodabakchian explains how the standardization of service-oriented architectures and the evolution of the business process execution language are coming together to finally create flexible software architectures that can adapt to the business rather than making the business adapt to the software.
Bringing Arbitrary Compute to Authoritative Data:
Many disparate use cases can be satisfied with a single storage system.
While the term "big data" is vague enough to have lost much of its meaning, today’s storage systems are growing more quickly and managing more data than ever before. Consumer devices generate large numbers of photos, videos, and other large digital assets. Machines are rapidly catching up to humans in data generation through extensive recording of system logs and metrics, as well as applications such as video capture and genome sequencing. Large data sets are now commonplace, and people increasingly want to run sophisticated analyses on the data.
Back under a SQL Umbrella:
Unifying serving and analytical data; using a database for distributed machine learning
Procella is the latest in a long line of data processing systems at Google. What’s unique about it is that it’s a single store handling reporting, embedded statistics, time series, and ad-hoc analysis workloads under one roof. It’s SQL on top, cloud-native underneath, and it’s serving billions of queries per day over tens of petabytes of data. There’s one big data use case that Procella isn’t handling today though, and that’s machine learning. But in "Declarative recursive computation on an RDBMS... or, why you should use a database for distributed machine learning," Jankov et al. make the case that distributed machine learning, too, belongs back under a SQL umbrella.
Automatically Testing Database Systems:
DBMS testing with test oracles, transaction history, and fuzzing
The automated testing of DBMSs is an exciting, interdisciplinary effort that has seen many innovations in recent years. The examples addressed here represent different perspectives on this topic, reflecting strands of research from software engineering, (database) systems, and security angles. They give only a glimpse into these research strands, as many additional interesting and effective works have been proposed. Various approaches generate pairs of related tests to find both logic bugs and performance issues in a DBMS. In addition, approaches for testing isolation levels have been proposed.
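For a flavor of the "pairs of related tests" idea, here is a simplified Python/SQLite sketch in the spirit of partition-based test oracles; the table and predicate are invented, and real tools generate such pairs automatically and at scale.

```python
import sqlite3

# Oracle: the result of a query must equal the union of three derived queries
# that partition its rows on a predicate p, NOT p, and (p IS NULL).
# A mismatch signals a logic bug in the DBMS under test.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,), (5,)])

def rows(sql):
    return sorted(conn.execute(sql).fetchall(), key=repr)

predicate = "a > 2"                      # stands in for a randomly generated predicate
original  = rows("SELECT a FROM t")
partition = rows(
    f"SELECT a FROM t WHERE {predicate} "
    f"UNION ALL SELECT a FROM t WHERE NOT ({predicate}) "
    f"UNION ALL SELECT a FROM t WHERE ({predicate}) IS NULL"
)
assert original == partition, "possible logic bug in the DBMS"
print("partitioned query matches the original:", original)
```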
Always-on Time-series Database: Keeping Up Where There's No Way to Catch Up:
A discussion with Theo Schlossnagle, Justin Sheehy, and Chris McCubbin
What if you found you needed to provide for the capture of data from disconnected operations, such that updates might be made by different parties at the same time without conflicts? And what if your service called for you to receive massive volumes of data almost continuously throughout the day, such that you couldn't really afford to interrupt data ingest at any point for fear of finding yourself so far behind present state that there would be almost no way to catch up?
All Your Database Are Belong to Us:
In the big open world of the cloud, highly available distributed objects will rule.
In the database world, the raw physical data model is at the center of the universe, and queries freely assume intimate details of the data representation (indexes, statistics, metadata). This closed-world assumption and the resulting lack of abstraction have the pleasant effect of allowing the data to outlive the application. On the other hand, this makes it hard to evolve the underlying model independently from the queries over the model.
Achieving Digital Permanence:
The many challenges to maintaining stored information and ways to overcome them
Today’s Information Age is creating new uses for and new ways to steward the data that the world depends on. The world is moving away from familiar, physical artifacts to new means of representation that are closer to information in its essence. We need processes to ensure both the integrity and accessibility of knowledge in order to guarantee that history will be known and true.
A co-Relational Model of Data for Large Shared Data Banks:
Contrary to popular belief, SQL and noSQL are really just two sides of the same coin.
Fueled by their promise to solve the problem of distilling valuable information and business insight from big data in a scalable and programmer-friendly way, noSQL databases have been one of the hottest topics in our field recently. With a plethora of open source and commercial offerings and a surrounding cacophony of technical terms, however, it is hard for businesses and practitioners to see the forest for the trees.
A Tribute to Jim Gray
Computer science attracts many very smart people, but a few stand out above the others, somehow blessed with a kind of creativity that most of us are denied. Names such as Alan Turing, Edsger Dijkstra, and John Backus come to mind. Jim Gray is another.
A Conversation with Steve Hagan:
At Oracle, distributed development is a way of life.
Oracle Corporation, which bills itself as the world’s largest enterprise software company, with $10 billion in revenues, some 40,000 employees, and operations in 60 countries, has ample opportunity to put distributed development to the test. Among those on the front lines of Oracle’s distributed effort is Steve Hagan, the engineering vice president of the Server Technologies division, based at Oracle’s New England Development Center in Nashua, New Hampshire, located clear across the country from Oracle’s Redwood Shores, California, headquarters.
A Conversation with Pat Selinger:
Leading the way to manage the world’s information
Take Pat Selinger of IBM and James Hamilton of Microsoft and put them in a conversation together, and you may hear everything you wanted to know about database technology and weren’t afraid to ask. Selinger, IBM Fellow and vice president of area strategy, information, and interaction for IBM Research, drives the strategy for IBM’s research work spanning the range from classic database systems through text, speech, and multimodal interactions. Since graduating from Harvard with a Ph.D. in applied mathematics, she has spent almost 30 years at IBM, hopscotching between research and development of IBM’s database products.
A Conversation with Michael Stonebraker and Margo Seltzer:
Relating to databases
Over the past 30 years Michael Stonebraker has left an indelible mark on the database technology world. Stonebraker’s legacy began with Ingres, an early relational database initially developed in the 1970s at UC Berkeley, where he taught for 25 years. The Ingres technology lives on today in both the Ingres Corporation’s commercial products and the open source PostgreSQL software. A prolific entrepreneur, Stonebraker also started successful companies focused on the federated database and stream-processing markets. He was elected to the National Academy of Engineering in 1998 and currently is adjunct professor of computer science at MIT. Interviewing Stonebraker is Margo Seltzer, one of the founders of Sleepycat Software, makers of Berkeley DB, a popular embedded database engine now owned by Oracle.
A Conversation with Margo Seltzer and Mike Olson:
The history of Berkeley DB
Kirk McKusick sat down with Margo Seltzer and Mike Olson to discuss the history of Berkeley DB, for which they won the ACM Software System Award in 2021. Kirk McKusick has spent his career as a BSD and FreeBSD developer. Margo Seltzer has spent her career as a professor of computer science and as an entrepreneur of database software companies. Mike Olson started his career as a software developer and later started and managed several open-source software companies. Berkeley DB is a production-quality, scalable, NoSQL, Open Source platform for embedded transactional data management.
A Conversation with Bruce Lindsay:
Designing for failure may be the key to success.
A Call to Arms:
Long anticipated, the arrival of radically restructured database architectures is now finally at hand.
We live in a time of extreme change, much of it precipitated by an avalanche of information that otherwise threatens to swallow us whole. Under the mounting onslaught, our traditional relational database constructs—always cumbersome at best—are now clearly at risk of collapsing altogether. In fact, rarely do you find a DBMS anymore that doesn’t make provisions for online analytic processing. Decision trees, Bayes nets, clustering, and time-series analysis have also become part of the standard package, with allowances for additional algorithms yet to come. Also, text, temporal, and spatial data access methods have been added—along with associated probabilistic logic, since a growing number of applications call for approximated results.