Distributed Computing

RSS

Sort By:

Date
Title

Systems Correctness Practices at AWS:
Leveraging Formal and Semi-formal Methods

Building reliable and secure software requires a range of approaches to reason about systems correctness. Alongside industry-standard testing methods (such as unit and integration testing), AWS has adopted model checking, fuzzing, property-based testing, fault-injection testing, deterministic simulation, event-based simulation, and runtime validation of execution traces. Formal methods have been an important part of the development process - perhaps most importantly, formal specifications as test oracles that provide the correct answers for many of AWS's testing practices. Correctness testing and formal methods remain key areas of investment at AWS, accelerated by the excellent returns seen on investments in these areas already.

by Marc Brooker, Ankush Desai | February 4, 2025

Intermediate Representations for the Datacenter Computer:
Lowering the Burden of Robust and Performant Distributed Systems

We have reached a point where distributed computing is ubiquitous. In-memory application data size is outstripping the capacity of individual machines, necessitating its partitioning over clusters of them; online services have high availability requirements, which can be met only by deploying systems as collections of multiple redundant components; high durability requirements can be satisfied only through data replication, sometimes across vast geographical distances.

by Achilles Benetopoulos | February 3, 2025

Simulation: An Underutilized Tool in Distributed Systems

Simulation has a huge role to play in the advent of AI systems: We need an efficient, fast, and cost-effective way to train AI agents to operate in our infrastructure, and simulation absolutely provides that capability.

by David R. Morrison | January 27, 2025

Convergence:
Research for Practice reboot

It is with great pride and no small amount of excitement that I announce the reboot of acmqueue's Research for Practice column. For three years, beginning at its inception in 2016, Research for Practice brought both seminal and cutting-edge research - via careful curation by experts in academia - within easy reach for practitioners who are too busy building things to manage the deluge of scholarly publications. We believe the series succeeded in its stated goal of sharing "the joy and utility of reading computer science research" between academics and their counterparts in industry. We know our readers have missed it, and we are delighted to rekindle the flame after a three-year hiatus.

by Martin Kleppmann | July 15, 2022

Decentralized Computing

Feeding all relevant inputs to a central solver is the obvious way to tackle a problem, but it's not always the only way. Decentralized methods that make do with only local communication and local computation are sometimes the best way. This episode of Drill Bits reviews an elegant protocol for self-organizing wireless networks that can also solve a seemingly impossible social networking problem. The protocol preserves privacy among participants and is so simple that it can be implemented with pencil, paper, and postcards. Example software implements both the decentralized protocol and a centralized solver.

by Terence Kelly | November 16, 2020

Corp to Cloud: Google’s Virtual Desktops:
How Google moved its virtual desktops to the cloud

Over one-fourth of Googlers use internal, data-center-hosted virtual desktops. This on-premises offering sits in the corporate network and allows users to develop code, access internal resources, and use GUI tools remotely from anywhere in the world. Among its most notable features, a virtual desktop instance can be sized according to the task at hand, has persistent user storage, and can be moved between corporate data centers to follow traveling Googlers. Until recently, our virtual desktops were hosted on commercially available hardware on Google’s corporate network using a homegrown open-source virtual cluster-management system called Ganeti. Today, this substantial and Google-critical workload runs on GCP (Google Compute Platform).

by Matt Fata, Philippe-Joseph Arida, Patrick Hahn, Betsy Beyer | August 1, 2018

Every Silver Lining Has a Cloud:
Cache is king. And if your cache is cut, you’re going to feel it.

Clearly, your management has never heard the phrase, "You get what you pay for." Or perhaps they heard it and didn’t realize it applied to them. The savings in cloud computing comes at the expense of a loss of control over your systems, which is summed up best in the popular nerd sticker that says, "The Cloud is Just Other People’s Computers." Some providers now have something called Metal-as-a-Service, which I really think ought to mean that an ’80s metal band shows up at your office, plays a gig, smashes the furniture, and urinates on the carpet, but alas, it’s just the cloud providers’ way of finally admitting that cloud computing isn’t really the right answer for all applications.

by George Neville-Neil | May 7, 2018

Watchdogs vs. Snowflakes:
Taking wild-ass guesses

That a system can randomly jam doesn’t just indicate a serious bug in the system; it is also a major source of risk. You don’t say what your distributed job-control system controls, but let’s just say I hope it’s not something with significant, real-world side effects, like a power station, jet aircraft, or financial trading system. The risk, of course, is that the system will jam, not when it’s convenient for someone to add a dummy job to clear the jam, but during some operation that could cause data loss or return incorrect results.

by George Neville-Neil | April 10, 2018

Life Beyond Distributed Transactions:
An apostate’s opinion

This article explores and names some of the practical approaches used in the implementation of large-scale mission-critical applications in a world that rejects distributed transactions. Topics include the management of fine-grained pieces of application data that may be repartitioned over time as the application grows. Design patterns support sending messages between these repartitionable pieces of data.

by Pat Helland | December 12, 2016

Standing on Distributed Shoulders of Giants:
Farsighted Physicists of Yore Were Danged Smart!

If you squint hard enough, many of the challenges of distributed computing appear similar to the work done by the great physicists. Dang, those fellows were smart! Here, we examine some of the most important physics breakthroughs and draw some whimsical parallels to phenomena in the world of computing... just for fun.

by Pat Helland | June 7, 2016

Debugging Distributed Systems:
Challenges and options for validation and debugging

Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system’s communication topology can be difficult. A standard approach to gaining insight into system activity is to analyze system logs. Unfortunately, this can be a tedious and complex process. This article looks at several key features and debugging challenges that differentiate distributed systems from other kinds of software. The article presents several promising tools and ongoing research to help resolve these challenges.

by Ivan Beschastnikh, Patty Wang, Yuriy Brun, Michael D, Ernst | May 18, 2016

Should You Upload or Ship Big Data to the Cloud?:
The accepted wisdom does not always hold true.

It is accepted wisdom that when the data you wish to move into the cloud is at terabyte scale and beyond, you are better off shipping it to the cloud provider, rather than uploading it. This article takes an analytical look at how shipping and uploading strategies compare, the various factors on which they depend, and under what circumstances you are better off shipping rather than uploading data, and vice versa. Such an analytical determination is important to make, given the increasing availability of gigabit-speed Internet connections, along with the explosive growth in data-transfer speeds supported by newer editions of drive interfaces such as SAS and PCI Express.

by Sachin Date | May 3, 2016

Time is an Illusion.:
Lunchtime doubly so. - Ford Prefect to Arthur Dent in "The Hitchhiker’s Guide to the Galaxy", by Douglas Adams

One of the more surprising things about digital systems - and, in particular, modern computers - is how poorly they keep time. When most programs ran on a single system this was not a significant issue for the majority of software developers, but once software moved into the distributed-systems realm this inaccuracy became a significant challenge.

by George Neville-Neil | January 12, 2016

Evolution and Practice: Low-latency Distributed Applications in Finance:
The finance industry has unique demands for low-latency distributed systems.

Virtually all systems have some requirements for latency, defined here as the time required for a system to respond to input. Latency requirements appear in problem domains as diverse as aircraft flight controls, voice communications, multiplayer gaming, online advertising, and scientific experiments. Distributed systems present special latency considerations. In recent years the automation of financial trading has driven requirements for distributed systems with challenging latency requirements and global geographic distribution. Automated trading provides a window into the engineering challenges of ever-shrinking latency requirements, which may be useful to software engineers in other fields.

by Andrew Brook | May 4, 2015

From the EDVAC to WEBVACs:
Cloud computing for computer scientists

By now everyone has heard of cloud computing and realized that it is changing how both traditional enterprise IT and emerging startups are building solutions for the future. Is this trend toward the cloud just a shift in the complicated economics of the hardware and software industry, or is it a fundamentally different way of thinking about computing? Having worked in the industry, I can confidently say it is both.

by Daniel C. Wang | April 9, 2015

Reliable Cron across the Planet:
...or How I stopped worrying and learned to love time

This article describes Google’s implementation of a distributed Cron service, serving the vast majority of internal teams that need periodic scheduling of compute jobs. During its existence, we have learned many lessons on how to design and implement what might seem like a basic service. Here, we discuss the problems that distributed Crons face and outline some potential solutions.

by Štěpán Davidovi, Kavita Guliani | March 12, 2015

There is No Now:
Problems with simultaneity in distributed systems

Now. The time elapsed between when I wrote that word and when you read it was at least a couple of weeks. That kind of delay is one that we take for granted and don’t even think about in written media. "Now." If we were in the same room and instead I spoke aloud, you might have a greater sense of immediacy. You might intuitively feel as if you were hearing the word at exactly the same time that I spoke it. That intuition would be wrong. If, instead of trusting your intuition, you thought about the physics of sound, you would know that time must have elapsed between my speaking and your hearing.

by Justin Sheehy | March 10, 2015

Unikernels: Rise of the Virtual Library Operating System:
What if all the software layers in a virtual appliance were compiled within the same safe, high-level language framework?

Cloud computing has been pioneering the business of renting computing resources in large data centers to multiple (and possibly competing) tenants. The basic enabling technology for the cloud is operating-system virtualization such as Xen1 or VMWare, which allows customers to multiplex VMs (virtual machines) on a shared cluster of physical machines. Each VM presents as a self-contained computer, booting a standard operating-system kernel and running unmodified applications just as if it were executing on a physical machine.

by Anil Madhavapeddy, David J. Scott | January 12, 2014

Toward Software-defined SLAs:
Enterprise computing in the public cloud

The public cloud has introduced new technology and architectures that could reshape enterprise computing. In particular, the public cloud is a new design center for enterprise applications, platform software, and services. API-driven orchestration of large-scale, on-demand resources is an important new design attribute, which differentiates public-cloud from conventional enterprise data-center infrastructure. Enterprise applications must adapt to the new public-cloud design center, but at the same time new software and system design patterns can add enterprise attributes and service levels to public-cloud services.

by Jason Lango | January 6, 2014

There’s Just No Getting around It: You’re Building a Distributed System:
Building a distributed system requires a methodical approach to requirements.

Distributed systems are difficult to understand, design, build, and operate. They introduce exponentially more variables into a design than a single machine does, making the root cause of an application problem much harder to discover. It should be said that if an application does not have meaningful SLAs (service-level agreements) and can tolerate extended downtime and/or performance degradation, then the barrier to entry is greatly reduced. Most modern applications, however, have an expectation of resiliency from their users, and SLAs are typically measured by "the number of nines" (e.g., 99.9 or 99.99 percent availability per month).

by Mark Cavage | May 3, 2013

Condos and Clouds:
Constraints in an environment empower the services.

Living in a condominium has its constraints and its services. By defining the lifestyle and limits on usage patterns, it is possible to pack many homes close together and to provide the residents with many conveniences. Condo living can offer a great value to those interested and willing to live within its constraints and enjoy the sharing of common services.

by Pat Helland | November 14, 2012

Securing Elasticity in the Cloud:
Elastic computing has great potential, but many security challenges remain.

As somewhat of a technology-hype curmudgeon, I was until very recently in the camp that believed cloud computing was not much more than the latest marketing-driven hysteria for an idea that has been around for years. Outsourced IT infrastructure services, aka IaaS (Infrastructure as a Service), has been around since at least the 1980s, delivered by the telecommunication companies and major IT outsourcers. Hosted applications, aka PaaS (Platform as a Service) and SaaS (Software as a Service), were in vogue in the 1990s in the form of ASPs (application service providers).

by Dustin Owens | May 6, 2010

Why Cloud Computing Will Never Be Free:
The competition among cloud providers may drive prices downward, but at what cost?

The last time the IT industry delivered outsourced shared-resource computing to the enterprise was with timesharing in the 1980s, when it evolved to a high art, delivering the reliability, performance, and service the enterprise demanded. Today, cloud computing is poised to address the needs of the same market, based on a revolution of new technologies, significant unused computing capacity in corporate data centers, and the development of a highly capable Internet data communications infrastructure. The economies of scale of delivering computing from a centralized, shared infrastructure have set the expectation among customers that cloud-computing costs will be significantly lower than those incurred from providing their own computing.

by Dave Durkee | April 16, 2010

Monitoring and Control of Large Systems with MonALISA:
MonALISA developers describe how it works, the key design principles behind it, and the biggest technical challenges in building it.

The HEP (high energy physics) group at the California Institute of Technology started developing the MonALISA (Monitoring Agents using a Large Integrated Services Architecture) framework in 2002, aiming to provide a distributed service system capable of controlling and optimizing large-scale, data-intensive applications. Its initial target field of applications is the grid systems and the networks supporting data processing and analysis for HEP collaborations. Our strategy in trying to satisfy the demands of data-intensive applications was to move to more synergetic relationships between the applications, computing, and storage facilities and the network infrastructure.

by Iosif Legrand, Ramiro Voicu, Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan | July 30, 2009

Cloud Computing: An Overview:
A summary of important cloud-computing issues distilled from ACM CTO Roundtables

Probably more than anything we’ve seen in IT since the invention of timesharing or the introduction of the PC, cloud computing represents a paradigm shift in the delivery architecture of information services. This overview presents some of the key topics discussed during the ACM Cloud Computing and Virtualization CTO Roundtables of 2008. While not intended to replace the in-depth roundtable discussions, the overview summarizes the fundamental issues generally agreed upon by the panels and should help readers to assess the applicability of cloud computing to their application areas.

by Mache Creeger | June 12, 2009

CTO Roundtable: Cloud Computing:
Our panel of experts discuss cloud computing and how companies can make the best use of it.

Many people reading about cloud computing in the trade journals will think it’s a panacea for all their IT problems. It is not. In this CTO Roundtable discussion we hope to give practitioners useful advice on how to evaluate cloud computing for their organizations. Our focus will be on the SMB (small- to medium-size business) IT managers who are underfunded, overworked, and have lots of assets tied up in out-of-date hardware and software. To what extent can cloud computing solve their problems? With the help of five current thought leaders in this quickly evolving field, we offer some answers to that question.

by Mache Creeger | June 2, 2009

Distributed Computing Economics:
Computing economics are changing. Today there is rough price parity between: (1) one database access; (2) 10 bytes of network traffic; (3) 100,000 instructions; (4) 10 bytes of disk storage; and (5) a megabyte of disk bandwidth. This has implications for how one structures Internet-scale distributed computing: one puts computing as close to the data as possible in order to avoid expensive network traffic.

Computing is free. The world’s most powerful computer is free (SETI@Home is a 54-teraflop machine). Google freely provides a trillion searches per year to the world’s largest online database (two petabytes). Hotmail freely carries a trillion e-mail messages per year. Amazon.com offers a free book-search tool. Many sites offer free news and other free content. Movies, sports events, concerts, and entertainment are freely available via television.

by Jim Gray | July 28, 2008

Beyond Beowulf Clusters:
As clusters grow in size and complexity, it becomes harder and harder to manage their configurations.

In the early ’90s, the Berkeley NOW Project under David Culler posited that groups of less capable machines could be used to solve scientific and other computing problems at a fraction of the cost of larger computers. In 1994, Donald Becker and Thomas Sterling worked to drive the costs even lower by adopting the then-fledgling Linux operating system to build Beowulf clusters at NASA’s Goddard Space Flight Center. By tying desktop machines together with open source tools such as PVM, MPI, and PBS, early clusters—which were often PC towers stacked on metal shelves with a nest of wires interconnecting them—fundamentally altered the balance of scientific computing.

by Philip Papadopoulos, Greg Bruno, Mason Katz | May 4, 2007

Enterprise Grid Computing:
Grid computing holds great promise for the enterprise data center, but many technical and operational hurdles remain.

I have to admit a great measure of sympathy for the IT populace at large, when it is confronted by the barrage of hype around grid technology, particularly within the enterprise. Individual vendors have attempted to plant their flags in the notionally virgin technological territory and proclaim it as their own, using terms such as grid, autonomic, self-healing, self-managing, adaptive, utility, and so forth. Analysts, well, analyze and try to make sense of it all, and in the process each independently creates his or her own map of this terra incognita, naming it policy-based computing, organic computing, and so on. Unfortunately, this serves only to further muddy the waters for most people.

by Paul Strong | August 18, 2005

Web Services and IT Management:
Web services aren’t just for application integration anymore.

Platform and programming language independence, coupled with industry momentum, has made Web services the technology of choice for most enterprise integration projects. Their close relationship with SOA (service-oriented architecture) has also helped them gain mindshare. Consider this definition of SOA: "An architectural style whose goal is to achieve loose coupling among interacting software agents. A service is a unit of work done by a service provider to achieve desired end results for a service consumer.

by Pankaj Kumar | August 18, 2005

Enterprise Software as Service:
Online services are changing the nature of software.

While the practice of outsourcing business functions such as payroll has been around for decades, its realization as online software services has only recently become popular. In the online service model, a provider develops an application and operates the servers that host it. Customers access the application over the Internet using industry-standard browsers or Web services clients. A wide range of online applications, including e-mail, human resources, business analytics, CRM (customer relationship management), and ERP (enterprise resource planning), are available.

by Dean Jacobs | August 18, 2005

Describing the Elephant: The Different Faces of IT as Service:
Terms such as grid, on-demand, and service-oriented architecture are mired in confusion, but there is an overarching trend behind them all.

In a well-known fable, a group of blind men are asked to describe an elephant. Each encounters a different part of the animal and, not surprisingly, provides a different description. We see a similar degree of confusion in the IT industry today, as terms such as service-oriented architecture, grid, utility computing, on-demand, adaptive enterprise, data center automation, and virtualization are bandied about. As when listening to the blind men, it can be difficult to know what reality lies behind the words, whether and how the different pieces fit together, and what we should be doing about the animal(s) that are being described.

by Ian Foster, Steven Tuecke | August 18, 2005