
Retrofitting: Principles and Practice
Retrofitting radically new functionality onto production software tests every skill of the programmer's craft. A practical case study illuminates principles for bolting new tricks onto old dogs.
The Price of Intelligence:
Three risks inherent in LLMs
The vulnerability of LLMs to hallucination, prompt injection, and jailbreaks poses a significant but surmountable challenge to their widespread adoption and responsible use. We have argued that these problems are inherent, certainly in the present generation of models and likely in LLMs per se, and so our approach can never be based on eliminating them; rather, we should apply strategies of "defense in depth" to mitigate them, and when building and using these systems, do so on the assumption that they will sometimes fail in these ways.
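To make "defense in depth" concrete, here is a minimal sketch in Python. The helpers (detect_injection, moderation_filter, validate_schema) are hypothetical stand-ins for whatever independent checks a real deployment would layer around a model call, not any particular library's API; the point is that no single layer is trusted to catch everything, and any layer can reject.

```python
# A minimal sketch of "defense in depth" around an LLM call.
# All helper names here are hypothetical stand-ins for illustration.

def detect_injection(text: str) -> bool:
    # Layer 1: crude heuristic screen of the *input* for injection attempts.
    suspicious = ["ignore previous instructions", "system prompt"]
    return any(phrase in text.lower() for phrase in suspicious)

def moderation_filter(text: str) -> bool:
    # Layer 2: screen the *output*; in practice a separate classifier.
    banned = ["example-banned-term"]
    return not any(term in text.lower() for term in banned)

def validate_schema(text: str) -> bool:
    # Layer 3: accept only output in the narrow format we expect, so
    # hallucinated or injected content cannot flow downstream unchecked.
    return text.startswith("ANSWER:")

def guarded_completion(user_input: str, call_model) -> str:
    if detect_injection(user_input):
        return "REFUSED: input failed injection screen"
    output = call_model(user_input)  # the untrusted step
    if not moderation_filter(output) or not validate_schema(output):
        return "REFUSED: output failed validation"
    return output

# Usage with a stand-in model:
print(guarded_completion("What is 2+2?", lambda q: "ANSWER: 4"))
```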
The Drunken Plagiarists:
Working with Co-pilots
Before trying to use these tools, you need to understand what they do, at least on the surface, since even their creators freely admit they do not understand how they work deep down in the bowels of all the statistics and text that have been scraped from the current Internet. The trick of an LLM is to use a little randomness and a lot of text to guess the next word in a sentence. Seems kind of trivial, really, and certainly not a measure of intelligence that anyone who understands the term might use. But it's a clever trick and does have some applications.
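That trick is small enough to show in miniature. The sketch below, assuming a toy vocabulary with made-up scores in place of a real model's learned statistics, samples the next word from a softmax distribution, with a temperature knob supplying the "little randomness."

```python
import math
import random

# Toy next-word prediction: made-up scores for a tiny vocabulary stand
# in for the statistics a real LLM learns from scraped text.
scores = {"dog": 2.0, "cat": 1.5, "teapot": 0.1}

def sample_next_word(scores, temperature=0.8):
    # Softmax over scores; lower temperature means less randomness.
    exps = {w: math.exp(s / temperature) for w, s in scores.items()}
    total = sum(exps.values())
    r = random.uniform(0, total)
    for word, weight in exps.items():
        r -= weight
        if r <= 0:
            return word
    return word  # floating-point fallback

print(sample_next_word(scores))  # usually "dog", occasionally not
```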
Simulation: An Underutilized Tool in Distributed Systems
Simulation has a huge role to play in the advent of AI systems: We need an efficient, fast, and cost-effective way to train AI agents to operate in our infrastructure, and simulation absolutely provides that capability.
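As a rough illustration of the idea, here is a toy sketch: a simulated service stands in for real infrastructure, and an "agent" searches for a good replica count over thousands of simulated steps, at zero operational risk and negligible cost. Everything here (SimulatedService, the reward shape, the search loop) is invented for illustration rather than drawn from any particular framework.

```python
import random

class SimulatedService:
    """Toy stand-in for a piece of infrastructure: load fluctuates, the
    agent chooses a replica count, and the reward penalizes both
    dropped requests and over-provisioning."""
    def __init__(self):
        self.load = 50

    def step(self, replicas: int) -> float:
        self.load = max(0, self.load + random.randint(-10, 10))
        capacity = replicas * 20
        dropped = max(0, self.load - capacity)
        return -(dropped + 0.5 * replicas)  # reward: cheap and reliable

def train(episodes: int = 1000) -> int:
    # Trivial "agent": try each replica count, keep the best on average.
    best, best_reward = 1, float("-inf")
    for replicas in range(1, 10):
        env = SimulatedService()
        total = sum(env.step(replicas) for _ in range(episodes))
        if total > best_reward:
            best, best_reward = replicas, total
    return best

print("best replica count found in simulation:", train())
```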
Give Engineers Problems, Not Solutions:
A simple strategy to improve solutions and boost morale
This technique is about providing the "why" instead of the "how." Instead of dictating specific solutions, present the problem and desired outcome, and let your team figure out how to solve it. This fosters creativity, shared ownership, and collaborative problem-solving. It also empowers the team to strive for the best solution.
Systems Correctness Practices at AWS:
Leveraging Formal and Semi-formal Methods
Building reliable and secure software requires a range of approaches to reason about systems correctness. Alongside industry-standard testing methods (such as unit and integration testing), AWS has adopted model checking, fuzzing, property-based testing, fault-injection testing, deterministic simulation, event-based simulation, and runtime validation of execution traces. Formal methods have been an important part of the development process; perhaps most importantly, formal specifications serve as test oracles that provide the correct answers for many of AWS's testing practices. Correctness testing and formal methods remain key areas of investment at AWS, accelerated by the excellent returns already seen on these investments.
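One of those practices is easy to demonstrate in a few lines: property-based testing with a specification as the test oracle. The sketch below uses the Python Hypothesis library; the functions under test are invented for illustration and are not AWS code. A slow but obviously correct specification provides the "correct answers" against which an optimized implementation is checked on every generated input.

```python
# Run with pytest; requires the Hypothesis library.
from hypothesis import given, strategies as st

def spec_dedupe(items):
    # Specification: obviously correct, order-preserving dedupe.
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def fast_dedupe(items):
    # "Optimized" implementation under test (hypothetical example).
    return list(dict.fromkeys(items))

@given(st.lists(st.integers()))
def test_dedupe_matches_spec(items):
    # The specification serves as the oracle: for every generated
    # input, the optimized code must agree with it exactly.
    assert fast_dedupe(items) == spec_dedupe(items)
```

The pattern generalizes: once a specification exists as executable code, it can serve as the oracle for many of the testing methods listed above.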
My Career-limiting Communication:
Be thoughtful about your content. You've got a lot riding on it.
Be thoughtful about how you present your content. Whether in email, documents, or slides, use punchy visuals to make content easier to digest with your most important points clearly highlighted. Make sure that data, charts, and photos are unambiguously labeled, with any caveats noted. In general, steer away from pie charts, averages, and percentages. That's because, as popular as these devices might be, they often manage to tell only part of the story and miss opportunities to highlight the relative size of datasets, outliers, or trends over time.
Intermediate Representations for the Datacenter Computer:
Lowering the Burden of Robust and Performant Distributed Systems
We have reached a point where distributed computing is ubiquitous. In-memory application data size is outstripping the capacity of individual machines, necessitating that it be partitioned across clusters of them; online services have high availability requirements, which can be met only by deploying systems as collections of multiple redundant components; high durability requirements can be satisfied only through data replication, sometimes across vast geographical distances. While it has arguably never been easier to procure the necessary hardware and to deploy distributed applications with a variety of tools, from cluster orchestrators such as Kubernetes to newer paradigms such as functions-as-a-service, building correct and efficient distributed solutions largely remains an individual exercise.
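To see how much of that exercise is repeated by hand, consider just the first step, partitioning data across machines. The consistent-hashing sketch below is written from scratch for illustration, not taken from any particular system; it is exactly the kind of building block that teams re-implement today and that shared, higher-level representations could capture once.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing: keys map to the nearest node
    clockwise on a hash ring, so adding or removing a node moves
    only a small fraction of the keys."""
    def __init__(self, nodes, vnodes=8):
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # deterministic placement
```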