A Conversation with Jarod Jenson:
Pinpointing performance problems
One of the industry’s go-to guys in performance improvement for business systems is Jarod Jenson, the chief systems architect for a consulting company he founded called Aeysis. He received a B.S. degree in computer science from Texas A&M University in 1995, then went to work for Baylor College of Medicine as a system administrator. From there he moved to Enron, where he played a major role in developing EnronOnline. After the collapse of Enron, Jenson worked briefly for UBS Warburg Energy before setting up his own consulting company. His focus since then has been on performance and scalability with applications at numerous companies where he has earned a reputation for quickly delivering substantial performance gains.
A High-Performance Team:
From design to production, performance should be part of the process.
You work in the product development group of a software company, where the product is often compared with the competition on performance grounds. Performance is an important part of your business; but so is adding new functionality, fixing bugs, and working on new projects. So how do you lead your team to develop high-performance software, as well as doing everything else? And how do you keep that performance high throughout cycles of maintenance and enhancement?
All Sliders to the Right:
Hardware Overkill
There are many reasons why this year's model isn't any better than last year's, and many reasons why performance fails to scale, some of which KV has covered in these pages. It is true that the days of upgrading every year and getting a free performance boost are long gone, as we're not really getting single cores that are faster than about 4GHz. One thing that many software developers fail to understand is the hardware on which their software runs at a sufficiently deep level.
Extending the Semantics of Scheduling Priorities:
Increasing parallelism demands new paradigms.
Application performance is directly affected by the hardware resources that the application requires, the degree to which such resources are available, and how the operating system addresses its requirements with regard to the other processes in the system. Ideally, an application would have access to all the resources it could use and be allowed to complete its work without competing with any other activity in the system. In a world of highly shared hardware resources and generalpurpose, time-share-based operating systems, however, no guarantees can be made as to how well resourced an application will be.
FPGAs in Data Centers:
FPGAs are slowly leaving the niche space they have occupied for decades.
This installment of Research for Practice features a curated selection from Gustavo Alonso, who provides an overview of recent developments utilizing FPGAs (field-programmable gate arrays) in datacenters. As Moore’s Law has slowed and the computational overheads of datacenter workloads such as model serving and data processing have continued to rise, FPGAs offer an increasingly attractive point in the trade-off between power and performance. Gustavo’s selections highlight early successes and practical deployment considerations that inform the ongoing, high-stakes debate about the future of datacenter- and cloud-based computation substrates.
Hadoop Superlinear Scalability:
The perpetual motion of parallel performance
We often see more than 100 percent speedup efficiency! came the rejoinder to the innocent reminder that you can’t have more than 100 percent of anything. But this was just the first volley from software engineers during a presentation on how to quantify computer system scalability in terms of the speedup metric. In different venues, on subsequent occasions, that retort seemed to grow into a veritable chorus that not only was superlinear speedup commonly observed, but also the model used to quantify scalability for the past 20 years failed when applied to superlinear speedup data.
Hidden in Plain Sight:
Improvements in the observability of software can help you diagnose your most crippling performance problems.
In December 1997, Sun Microsystems had just announced its new flagship machine: a 64-processor symmetric multiprocessor supporting up to 64 gigabytes of memory and thousands of I/O devices. As with any new machine launch, Sun was working feverishly on benchmarks to prove the machine’s performance. While the benchmarks were generally impressive, there was one in particular that was exhibiting unexpectedly low performance. The benchmark machine would occasionally become mysteriously distracted: Benchmark activity would practically cease, but the operating system kernel remained furiously busy. After some number of minutes spent on unknown work, the operating system would suddenly right itself: Benchmark activity would resume at full throttle and run to completion.
How Fast is Your Web Site?:
Web site performance data has never been more readily available.
The overwhelming evidence indicates that a Web site’s performance (speed) correlates directly to its success, across industries and business metrics. With such a clear correlation (and even proven causation), it is important to monitor how your Web site performs. So, how fast is your Web site?
Idle-Time Garbage-Collection Scheduling:
Taking advantage of idleness to reduce dropped frames and memory consumption
Google’s Chrome web browser strives to deliver a smooth user experience. An animation will update the screen at 60 FPS (frames per second), giving Chrome around 16.6 milliseconds to perform the update. Within these 16.6 ms, all input events have to be processed, all animations have to be performed, and finally the frame has to be rendered. A missed deadline will result in dropped frames. These are visible to the user and degrade the user experience. Such sporadic animation artifacts are referred to here as jank. This article describes an approach implemented in the JavaScript engine V8, used by Chrome, to schedule garbage-collection pauses during times when Chrome is idle.
Lessons from the Floor:
The manufacturing industry can teach us a lot about measuring performance in large-scale Internet services.
The January monthly service quality meeting started normally. Around the table were representatives from development, operations, marketing, and product management, and the agenda focused on the prior month’s performance. As usual, customer-impacting incidents and quality of service were key topics, and I was armed with the numbers showing the average uptime for the part of the service that I represent: MSN, the Microsoft family of services that includes e-mail, Instant Messenger, news, weather and sports, etc.
Modern Performance Monitoring:
Today’s diverse and decentralized computer world demands new thinking about performance monitoring and analysis.
The modern Unix server floor can be a diverse universe of hardware from several vendors and software from several sources. Often, the personnel needed to resolve server floor performance issues are not available or, for security reasons, not allowed to be present at the very moment of occurrence. Even when, as luck might have it, the right personnel are actually present to witness a performance "event," the tools to measure and analyze the performance of the hardware and software have traditionally been sparse and vendor-specific.
Monitoring in a DevOps World:
Perfect should never be the enemy of better.
Monitoring can seem quite overwhelming. The most important thing to remember is that perfect should never be the enemy of better. DevOps enables highly iterative improvement within organizations. If you have no monitoring, get something; get anything. Something is better than nothing, and if you’ve embraced DevOps, you’ve already signed up for making it better over time.
Multicore CPUs for the Masses:
Will increased CPU bandwidth translate into usable desktop performance?
Multicore is the new hot topic in the latest round of CPUs from Intel, AMD, Sun, etc. With clock speed increases becoming more and more difficult to achieve, vendors have turned to multicore CPUs as the best way to gain additional performance. Customers are excited about the promise of more performance through parallel processors for the same real estate investment.
Performance Anti-Patterns:
Want your apps to run faster? Here’s what not to do.
Performance pathologies can be found in almost any software, from user to kernel, applications, drivers, etc. At Sun we’ve spent the last several years applying state-of-the-art tools to a Unix kernel, system libraries, and user applications, and have found that many apparently disparate performance problems in fact have the same underlying causes. Since software patterns are considered abstractions of positive experience, we can talk about the various approaches that led to these performance problems as anti-patterns: something to be avoided rather than emulated.
Reinventing Backend Subsetting at Google:
Designing an algorithm with reduced connection churn that could replace deterministic subsetting
Backend subsetting is useful for reducing costs and may even be necessary for operating within the system limits. For more than a decade, Google used deterministic subsetting as its default backend subsetting algorithm, but although this algorithm balances the number of connections per backend task, deterministic subsetting has a high level of connection churn. Our goal at Google was to design an algorithm with reduced connection churn that could replace deterministic subsetting as the default backend subsetting algorithm.
The API Performance Contract:
How can the expected interactions between caller and implementation be guaranteed?
When you call functions in an API, you expect them to work correctly; sometimes this expectation is called a contract between the caller and the implementation. Callers also have performance expectations about these functions, and often the success of a software system depends on the API meeting these expectations. So there’s a performance contract as well as a correctness contract. The performance contract is usually implicit, often vague, and sometimes breached (by caller or implementation). How can this aspect of API design and documentation be improved?
The Time I Stole $10,000 from Bell Labs:
Or why DevOps encourages us to celebrate outages
If IT workers fear they will be punished for outages, they will adopt behavior that leads to even larger outages. Instead, we should celebrate our outages: Document them blamelessly, discuss what we've learned from them openly, and spread that knowledge generously. An outage is not an expense. It is an investment in the people who have learned from it. We can maximize that investment through management practices that maximize learning for those involved and by spreading that knowledge across the organization. Managed correctly, every outage makes the organization smarter.
Thinking Clearly about Performance:
Improving the performance of complex software is difficult, but understanding some fundamental principles can make it easier.
When I joined Oracle Corporation in 1989, performance was difficult. Only a few people claimed they could do it very well, and those people commanded high consulting rates. When circumstances thrust me into the "Oracle tuning" arena, I was quite unprepared. Recently, I’ve been introduced to the world of "MySQL tuning," and the situation seems very similar to what I saw in Oracle more than 20 years ago.
Thinking Methodically about Performance:
The USE method addresses shortcomings in other commonly used methodologies.
Performance issues can be complex and mysterious, providing little or no clue to their origin. In the absence of a starting point, performance issues are often analyzed randomly: guessing where the problem may be and then changing things until it goes away. While this can deliver results it can also be time-consuming, disruptive, and may ultimately overlook certain issues. This article describes system-performance issues and the methodologies in use today for analyzing them, and it proposes a new methodology for approaching and solving a class of issues.
Workload Frequency Scaling Law - Derivation and Verification:
Workload scalability has a cascade relation via the scale factor.
This article presents equations that relate to workload utilization scaling at a per-DVFS subsystem level. A relation between frequency, utilization, and scale factor (which itself varies with frequency) is established. The verification of these equations turns out to be tricky, since inherent to workload, the utilization also varies seemingly in an unspecified manner at the granularity of governance samples. Thus, a novel approach called histogram ridge trace is applied. Quantifying the scaling impact is critical when treating DVFS as a building block. Typical application includes DVFS governors and or other layers that influence utilization, power, and performance of the system.
You Don't know Jack about Application Performance:
Knowing whether you're doomed to fail is important when starting a project.
You don't need to do a full-scale benchmark any time you have a performance or capacity planning problem. A simple measurement will provide the bottleneck point of your system: This example program will get significantly slower after eight requests per second per CPU. That's often enough to tell you the most important thing: if you're going to fail.
You’re Doing It Wrong:
Think you’ve mastered the art of server performance? Think again.
Would you believe me if I claimed that an algorithm that has been on the books as "optimal" for 46 years, which has been analyzed in excruciating detail by geniuses like Knuth and taught in all computer science courses in the world, can be optimized to run 10 times faster? A couple of years ago, I fell into some interesting company and became the author of an open source HTTP accelerator called Varnish, basically an HTTP cache to put in front of slow Web servers.