The Kollected Kode Vicious

Kode Vicious - @kode_vicious

  Download PDF version of this article PDF

What Are You Trying to Pull?

A single cache miss is more expensive than many instructions.

Dear KV,

I've been reading some pull requests from a developer who has recently been working in code that I also have to look at from time to time. The code he has been submitting is full of strange changes that he claims are optimizations. Instead of simply returning a value such as 1, 0, or -1 for error conditions, he allocates a variable and then increments or decrements it, and then jumps to the return statement. I haven't bothered to check whether or not this would save instructions, because I know from benchmarking the code that those instructions are not where the majority of the function spends its time. He has argued that any instruction we don't execute saves us time, and my point is that his code is confusing and hard to read. If he could show a five or ten percent increase in speed, it might be worth considering, but he has not been able to show that in any type of test. I've blocked several of his commits, but I would prefer to have a usable argument against this type of optimization.

Pull the Other One

 

Dear Pull,

Saving instructions—how very 1990s of him. It's always nice when people pay attention to details, but sometimes they simply don't pay attention to the right ones. While KV would never encourage developers to waste instructions, given the state of modern software, it does seem like someone already has. KV would, as you did, come out on the side of legibility over the saving of a few instructions.

It seems that no matter what advances are made in languages and compilers, there are always programmers who think they're smarter than their tools, and sometimes they're right about that, but mostly they're not. Reading the output of the assembler and counting the instructions may be satisfying for some, but there had better be a lot more proof than that to justify obfuscating code. I can only imagine a module full of code that looks like this:


if (some condition) {
   retval++;
   goto out:
} else {
   retval--;
   goto out:
}
...
out:
   return(retval)

and, honestly, I don't really want to. Modern compilers, or even not so modern ones, play all the tricks that programmers used to have to play by hand—inlining, loop unrolling, and many others—and yet there are still some programmers who insist on fighting their own tools.

When the choice is between code clarity and minor optimizations, clarity must, nearly always, win. A lack of clarity is the source of bugs, and it's no good having code that's fast and wrong. First the code must be right, then the code must perform; that is the priority that any sane programmer must obey. Insane programmers, well, they're best to be avoided. Eventually they wind up moving to a Central American nation, mixing their own drugs in bathtubs, and claiming they can unlock iPhones.

The other significant problem with the suggested code is that it violates a common coding idiom. All languages, including computer languages, have idioms, as pointed out at length in The Practice of Programming by Brian W. Kernighan and Rob Pike (Addison-Wesley Professional, 1999), which I recommended to readers more than a decade ago. Let's not think about the fact that that book is still relevant, and that I've been repeating myself every decade. No matter what you think of a computer language, you ought to respect its idioms for the same reason that one has to know idioms in a human language—they facilitate communication, which is the true purpose of all languages, programming or otherwise. A language idiom grows organically from the use of a language. Most C programmers, though not all of course, will write an infinite loop in this way:


for (;;) {
}

or as


while (1) {
}

with an appropriate break statement somewhere inside to handle exiting the loop when there is an error. In fact, checking the Practice of Programming book, I find that this is mentioned early on (in section 1.3). For the return case, you mention it is common to return using a value such as 1, 0, or -1 unless the return encodes more than true, false, or error. Allocating a stack variable and incrementing or decrementing and adding a goto is not an idiom I've ever seen in code, anywhere—and now that you're on the case, I hope that I never have to.

Moving from this concrete bit of code to the abstract question of when it makes sense to allow some forms of code trickery into the mix really depends on several factors, but mostly on how much speedup can be derived from twisting the code a bit to match the underlying machine a bit more closely. After all, most of the hand optimizations you see in low-level code, in particular C and its bloated cousin C++, exist because the compiler cannot recognize a good way to map what the programmer wants to do onto the way the underlying machine actually works. Leaving aside the fact that most software engineers really don't know how a computer works, and leaving aside that what most of them were taught—if they were taught—about computers, hails from the 1970s and 1980s before superscalar processors and deep pipelines were a standard feature of CPUs, it is still possible to find ways to speed up by playing tricks on the compiler.

The tricks themselves aren't that important to this conversation; what's important is knowing how to measure their effects on the software. This is a difficult and complicated task. It turns out that simply counting instructions as your co-worker has done doesn't tell you very much about the runtime of the underlying code. In a modern CPU the most precious resource is no longer instructions, except in a very small number of compute-bound workloads. Modern systems don't choke on instructions; they drown in data. The cache effects of processing data far outweigh the overhead of an extra instruction or two, or ten. A single cache miss is a 32-nanosecond penalty, or about 100 cycles on a 3-GHz processor. A simple MOV instruction, which puts a single, constant number into a CPU's register, takes one-quarter of a cycle, according to Agner Fog at the Technical University of Denmark (http://www.agner.org/optimize/instruction_tables.pdf).

That someone has gone so far as to document this for quite a large number of processors is staggering, and those interested in the performance of their optimizations might well lose themselves in that site generally (http://www.agner.org).

The point of the matter is that a single cache miss is more expensive than many instructions, so optimizing away a few instructions isn't really going to win your software any speed tests. To win speed tests you have to measure the system, see where the bottlenecks are, and clear them if you can. That, though, is a subject for another time.

KV

Kode Vicious, known to mere mortals as George V. Neville-Neil, works on networking and operating-system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and rewriting your bad code (OK, maybe not that last one). He earned his bachelor's degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. Neville-Neil is the co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System (second edition). He is an avid bicyclist and traveler who currently lives in New York City.

Copyright © 2016 held by owner/author. Publication rights licensed to ACM.

acmqueue

Originally published in Queue vol. 14, no. 2
Comment on this article in the ACM Digital Library





More related articles:

Nicole Forsgren, Eirini Kalliamvakou, Abi Noda, Michaela Greiler, Brian Houck, Margaret-Anne Storey - DevEx in Action
DevEx (developer experience) is garnering increased attention at many software organizations as leaders seek to optimize software delivery amid the backdrop of fiscal tightening and transformational technologies such as AI. Intuitively, there is acceptance among technical leaders that good developer experience enables more effective software delivery and developer happiness. Yet, at many organizations, proposed initiatives and investments to improve DevEx struggle to get buy-in as business stakeholders question the value proposition of improvements.


João Varajão, António Trigo, Miguel Almeida - Low-code Development Productivity
This article aims to provide new insights on the subject by presenting the results of laboratory experiments carried out with code-based, low-code, and extreme low-code technologies to study differences in productivity. Low-code technologies have clearly shown higher levels of productivity, providing strong arguments for low-code to dominate the software development mainstream in the short/medium term. The article reports the procedure and protocols, results, limitations, and opportunities for future research.


Ivar Jacobson, Alistair Cockburn - Use Cases are Essential
While the software industry is a fast-paced and exciting world in which new tools, technologies, and techniques are constantly being developed to serve business and society, it is also forgetful. In its haste for fast-forward motion, it is subject to the whims of fashion and can forget or ignore proven solutions to some of the eternal problems that it faces. Use cases, first introduced in 1986 and popularized later, are one of those proven solutions.


Jorge A. Navas, Ashish Gehani - OCCAM-v2: Combining Static and Dynamic Analysis for Effective and Efficient Whole-program Specialization
OCCAM-v2 leverages scalable pointer analysis, value analysis, and dynamic analysis to create an effective and efficient tool for specializing LLVM bitcode. The extent of the code-size reduction achieved depends on the specific deployment configuration. Each application that is to be specialized is accompanied by a manifest that specifies concrete arguments that are known a priori, as well as a count of residual arguments that will be provided at runtime. The best case for partial evaluation occurs when the arguments are completely concretely specified. OCCAM-v2 uses a pointer analysis to devirtualize calls, allowing it to eliminate the entire body of functions that are not reachable by any direct calls.





© ACM, Inc. All Rights Reserved.