(newest first)

  • Loki Verloren | Wed, 25 Apr 2018 10:30:09 UTC

I really enjoyed reading this article, and I am in the process of attempting to implement a binary search tree based on a similar model. In particular, the remapping between the array store and the tree structure answered a question I had intuitively grasped but had no mental image for: how to avoid causing a page fault with every row below the first 4 KB of the tree. I don't 100% get how the mapping translates to my intended application of a left-right sorted tree, exactly, but I will come up with it a lot more easily now I have seen this.
    My whole direction in my current project was all about finding a way to exploit the far greater speed and far lower latency of CPU caches, and this is exactly demonstrating that my line of inquiry was correct.
  • Neel | Sun, 18 Mar 2018 06:51:31 UTC

Skipped my bath and read this instead. This is interesting.
  • BlueRaja | Thu, 21 Apr 2016 21:20:30 UTC

    So you reinvented the T-tree, plus some Straw Man graphs and some comments on how you're smarter than the entire CS field.  Congrats?
  • Steve | Tue, 11 Aug 2015 03:39:02 UTC

    Google "cache oblivious algorithms" for good theoretical treatment. 
  • zeroday1 | Mon, 22 Dec 2014 21:32:45 UTC

I often encounter the following error when visiting the famous shopping network HSN.
    Error 503 Service Unavailable
    Service Unavailable
    Guru Meditation:
    XID: 1376938831
    Varnish cache server
There really is no reason to suspect that my experience encountering this error is related to a problem on my system, because I do not maintain any conflicting configurations on my PC. No matter what browser I use, I still get this error. I've visited their site before with no problem; however, it almost seems like it is now a regular occurrence to come across this error.
I have ample reason to believe this has to do with a problem on their website, specifically with regard to website maintenance, or lack thereof...
Try telling HSN that, and you might just get a variety of responses depending upon whom you speak with, but you can rest assured that no matter who it is at HSN, they're all experts... (Pun intended, of course)...
  • A. Bit Miffed | Sun, 14 Sep 2014 23:15:38 UTC

    It's rare to see this kind of hubris in a computer publication.  But, oh well.  When quantum computers become the norm, we'll see an article that Poul-Henning Kamp was doing it wrong.
  • asd | Tue, 22 May 2012 04:53:47 UTC

    buy more ram bros
  • iitg | Mon, 12 Dec 2011 09:44:22 UTC

I think someone said "buy more ram" in a comment above; that won't be of much help, as we are talking about an algorithm which minimises disk access as well as cache misses! Nice article.
  • Dutch Uncle | Thu, 30 Jun 2011 16:02:25 UTC

    Lots of people have been paying attention to memory hierarchy for a long time.  I worked for SyncSort on IBM mainframe data sorting and manipulation utilities.  I can assure you that every single execution of the SyncSort utility started with an analysis pass that included comparing the effective size of data key indices to the available cache reported by the processor, as well as comparing the available RAM to the total data set size, in order to tune the operation into the best number of iterations of the best sized chunks of data.   In the universe, things do happen outside of one's personal knowledge.
  • Sepp | Fri, 25 Mar 2011 13:34:37 UTC

    An interesting comparison of Varnish, Apache Traffic Server (another proxy cache) and G-WAN (an application server with C scripts):
    Apache Traffic Server is an HTTP proxy and cache server created by Inktomi, and distributed as a commercial product before Inktomi was acquired by Yahoo.
Yahoo says that it uses TS to serve more than 30 billion objects per day. They also say that TS is a "product of literally hundreds of developer-years".
  • Tom | Wed, 05 Jan 2011 19:56:36 UTC

You have rediscovered the principles of spatial and temporal locality, one of the key things taught in any basic algorithms class. Thinking that it has anything to do with something as specific as virtual memory or hard disks shows your ignorance of computer architecture, modern or otherwise. Imagine swapping tapes back in the day.
  • Gene Ressler | Tue, 28 Dec 2010 17:44:03 UTC

This work is good, correct, and clever. But the criticism of CS as a discipline is uninformed and self-serving ("look at me; I'm smarter than Knuth"). In the CS education that I was privileged to receive, first at the undergrad level and then again in graduate courses spread across 15 years or so (not contiguous), I was exposed to the importance of memory hierarchies in algorithm design several times in different courses. As has been mentioned, the CS literature is full of both theoretical and practical approaches to hierarchy-aware algorithm design; a quick Google turns up whole collections. If no one happened to focus specifically on priority queues, okay, it's terrific that someone got this out of the way. The approach and order-of-magnitude speedup are commonplace. The last one I recall reading about was in garbage collection. There are probably many others. (The cite above talks about graph search, for example.)
    If there is a valid criticism here, it's that CS reference works and textbooks don't keep up very well with details of the art such as this. Rank-and-file teachers who know only the texts tend to deliver less than optimal instruction for this reason. No surprise there. Writing references is an incredible amount of work, not much valued as scholarly activity for academic types because it doesn't generate new knowledge or pull in research dollars, and it pays very poor royalties. I've known people who accuse Knuth of "wasting his talent" by producing his volumes, not to mention a typesetting system to publish them! 
    This author could begin to right the wrong he perceives by devoting himself to writing a reference rather than stopping with a snarky one-off blog post.
  • James | Fri, 29 Oct 2010 10:57:47 UTC

    @Vladimir G. Ivanovic
    You wrote:
"Since G-WAN is closed source, you are exposing yourself or your company to unbounded liability if it turns out that there are security issues."
Oh. You mean, like MICROSOFT IIS? How many times has it been screwed in the courts for publishing security holes daily?
G-WAN is the first server (in history) that is faster than all other servers (in user mode or kernel mode) BOTH on Windows AND Linux.
But I guess that spreading FUD was not your intent...
  • Kashyap | Wed, 04 Aug 2010 07:50:41 UTC

Could someone please send me the PDF version, with images?
The HTML page does not seem to show the figures.
    My email id is ckkashyap [AT]
  • Wayne | Sat, 10 Jul 2010 19:58:23 UTC

    I enjoyed the article.  But I have to wonder how much cache-oblivious or cache-aware algorithms matter when using virtual machines such as with Java, or even when written in C on FreeBSD but that is running under some virtual host such as VMware?  I would think the CPU architecture might matter too, for example how many rings of memory are supported.  I suppose using virtual machines when performance is critical may not be the smartest choice.
  • Cellar | Sun, 27 Jun 2010 17:49:32 UTC

Yes, I read the full article and with care, and no, I didn't say what you think I said. I said the graphs were pretty, and I also said you made an error writing about "Solid State Disk disks". Read carefully now. I do notice that you're trying to defend subtlety in what you wrote (qv "entirely Slashdots[sic] fault" -- ignoring slashdot still leaves my comment with a valid complaint about tone), then gloss over important detail when you're counter-criticising, and I see you do that repeatedly, including in the "two camps" comment. As you imply is a point at the heart of your article's argument, one should never overgeneralise. In computing, that leaves significant performance wasted, and that point is well taken. Between humans, it makes for poor discussion. As I implied referring to earlier work of yours, you really ought to know better than that. Physician, heal thyself.
  • Michael Eager | Mon, 21 Jun 2010 04:41:19 UTC

It's good to see someone investigating a real problem and implementing a solution. On the other hand, it's not uncommon for someone to reinvent a previously known method, especially if they are unversed in prior art. My question: where are the editors and reviewers? They are the ones who should have noticed that the article seems not to have considered prior art later than the 1960s. They are the ones who should have suggested looking at the articles mentioned by other posters. The hubris represented in the tone of this article detracts from its value as an example of performance analysis and practical application.
  • Robert | Sat, 19 Jun 2010 12:25:07 UTC

One can believe that the author actually realised this on a train, being offline as he was. But then he should have checked whether someone had already had this idea. He would have found out that he had, in fact, missed the entire literature on the topic. Maybe that would have instilled some humility.
The fact that this is presented as novel work (in ACM, nothing less!) is hideous.
  • Pinco Pallino | Sat, 19 Jun 2010 09:41:48 UTC

Fair enough, let's reason as practitioners (and leave in peace the theorists who taught us so much about how a mass-less, torsion-less, friction-less rope behaves, preparing us to tackle more practical problems, with all due respect).
I had the same problem (in a different kind of program) 10 years ago. The first attack was replacing the heap with a b* tree (100x speedup). Then, since I frequently had to re-add the lowest-key object "close to the low part" of the index, the second step was splitting the index in two parts: a "head", being a pure heap with up to 1024 entries (always in RAM and in cache), and a modified b* tree with "the rest" (on disk, up to 512 entries per node due to the nature of my data), plus the obvious strategies (when the head is full, detach half of it and push it into the tree; when it's empty, pull an entire node from the tree and put it on the heap): another 10x speedup.
Why did I tell this? Just to show that there is always space for improvement, and that this has nothing to do with others being wrong. I completely subscribe to the issue of people who learn at school that "this algorithm is the best" and apply it without the slightest idea of what they are doing, but the article should have started with "There is a plethora of algorithms for each problem at this point; each fits a specific real problem better, and sometimes it is even advisable to modify one to have it suit perfectly", not with "the people who analyzed it in excruciating detail and taught it in all computer science courses in the world were wrong". This is an attitude that is often welcome in dick-contest style mailing lists, but is never welcome in the scientific world.
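A toy, self-contained sketch of the two-level scheme described in the comment above: a small in-RAM min-heap "head" backed by a sorted spill array standing in for the commenter's b* tree. All names and sizes are invented for illustration, and the O(n) sorted insert replaces what would really be a tree operation.

```c
#include <string.h>

#define HEAD_MAX 8                      /* the commenter used 1024 */

static int head[HEAD_MAX], head_n;
static int rest[1024], rest_n;          /* kept sorted ascending; stands in for the b* tree */

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void sift_up(int i)
{
    while (i > 0 && head[(i - 1) / 2] > head[i]) {
        swap(&head[i], &head[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
}

static void sift_down(int i)
{
    for (;;) {
        int c = 2 * i + 1;
        if (c >= head_n)
            return;
        if (c + 1 < head_n && head[c + 1] < head[c])
            c++;
        if (head[i] <= head[c])
            return;
        swap(&head[i], &head[c]);
        i = c;
    }
}

static void rest_insert(int v)          /* O(n) here; a real tree would be O(log n) */
{
    int i = rest_n++;
    while (i > 0 && rest[i - 1] > v) {
        rest[i] = rest[i - 1];
        i--;
    }
    rest[i] = v;
}

void pq_insert(int v)
{
    if (rest_n > 0 && v >= rest[0]) {   /* key belongs in the bulk structure */
        rest_insert(v);
        return;
    }
    if (head_n == HEAD_MAX) {           /* head full: spill its largest key */
        int m = 0;
        for (int i = 1; i < head_n; i++)
            if (head[i] > head[m])
                m = i;
        rest_insert(head[m]);
        head[m] = head[--head_n];
        if (m < head_n) { sift_up(m); sift_down(m); }
    }
    head[head_n++] = v;
    sift_up(head_n - 1);
}

int pq_extract_min(void)                /* caller must not call on an empty queue */
{
    if (head_n == 0) {                  /* head empty: pull keys back from rest */
        int k = rest_n < HEAD_MAX ? rest_n : HEAD_MAX;
        memcpy(head, rest, k * sizeof(int));
        memmove(rest, rest + k, (rest_n - k) * sizeof(int));
        rest_n -= k;
        head_n = k;                     /* a sorted array is already a valid heap */
    }
    int v = head[0];
    head[0] = head[--head_n];
    sift_down(0);
    return v;
}
```

The invariant is that every key in the head is no larger than the smallest key in the spill area, so extract-min never has to look at the bulk structure until the head runs dry.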
  • James Cook | Sat, 19 Jun 2010 07:08:56 UTC

    Wow, these comments would be a lot easier to read if the paragraph breaks were kept.  Maybe ACM Queue is taking "Doing It Wrong" very seriously. :-)
  • James Cook | Sat, 19 Jun 2010 07:06:27 UTC

    So the practical problem here for Varnish is that the large data objects you want to cache consume all your physical memory, the binary-tree lifetime index is rarely used, and so the index gets swapped out?
    What about just preventing the index from getting swapped out?  Maybe by reading every 4096th element once a millisecond or so?  Or somehow telling the OS not to swap out those pages?
    It should be simple to implement, would allow you to keep the naive B-tree implementation everyone is familiar with, and would cost a small amount of physical memory on the class of machines it sounds like people use with Varnish (~80MB for 10M objects on 64-bit).  Does Varnish have other infrequently used data structures that are larger, and hence more important to allow to be paged out between uses?
    (I'm strictly a user-space guy, so this article opens my eyes to virtual memory issues.  I'm more used to worrying about processor cache lines than VM.  Thanks for writing it!)
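The "telling the OS not to swap out those pages" idea from the comment above exists on POSIX systems as mlock(2); a minimal sketch (the function name and parameters are invented, and this is not Varnish code):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Ask the kernel to keep the index's pages resident, so touching a
   rarely-used entry never page-faults.  mlock(2) may fail if the caller
   lacks privilege or exceeds RLIMIT_MEMLOCK, so check the return value. */
int pin_index(void *idx, size_t nbytes)
{
    return mlock(idx, nbytes);          /* 0 on success, -1 on error */
}
```

A caller would typically fall back gracefully when pinning fails rather than treating it as fatal.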
  • xxx | Fri, 18 Jun 2010 12:02:59 UTC

You're totally mixing up the "badness" of a heap with disk storage.
The heap as taught in CS courses resides in RAM, and thus you don't have page faults.
The only thing you really discovered is an efficient way of persisting a heap on disc, and I am fairly sure that you'll find quite a few publications on Google Scholar about that. BTW: ever wondered why B*-trees were invented? Hint: because you're able to read multiple keys with just one disc access.
It's much the same here for the heap.
  • Poul-Henning Kamp | Fri, 18 Jun 2010 11:57:16 UTC

Queue has always been intended for "the practitioner", people who actually write programs that do actual computations on real data, etc. On the other side of the fence, so to speak, we have the "pure CS" research, where people "write algorithms in nothing less portable than pencil #2", as Peter Naur would quip.
The problem I tried to point out in the article is that CS graduates get employed as practitioners, and they apply what they learned, not always conscious of it, and seldom successfully able to bridge the gap from pencil #2 to "8-core, 3-level cache, virtual memory and storage on a networked SAN" on their own.
In physics, we learn about Newton's laws using weight-less ropes, friction-less pulleys and point-shaped masses, all operating in a vacuum. In civil engineering, the ropes have mass, inertia, torsion, elasticity and a number of other unpleasant properties, and the masses are nothing but complex trouble, suffering from concepts like "Elasticity Modulus" and other weird stuff.
The response to my article broadly falls in two camps, which can be paraphrased as: physicists arguing that bridges and buildings are simply not their job vs. people who try to build bridges and buildings, asking "whose job is it then?"
The obvious answer would be "the Software Engineers".
Maybe we ought to educate some of those...
  • xxx | Fri, 18 Jun 2010 11:47:17 UTC

BTW: usually it's more readable to first show the method (what is a B-heap) and THEN show the results.
  • Ashwin Jayaprakash | Fri, 18 Jun 2010 06:38:29 UTC

    For other readers - this complementary article will help understand Varnish better -
  • Pinco | Thu, 17 Jun 2010 11:05:50 UTC

    Well,  the incipit of this article is terribly wrong.
    I am sorry but I summarize for you some points that are in the first algorithms class for undergraduate students:
    a) Each algorithm has a mathematically analyzable asymptotic performance, expressed in terms of computational steps
    b) Each computational step is something that can be executed with a constant number of "elementary instructions", no matter what this number is
    c) Each physical computer has the capability to execute an instruction in some time and, on average, the time required to perform a given number of instructions is proportional to that number.
As a programmer you should choose the algorithm that produces the best time for the actual data and numbers you deal with on the actual machine you run on. Believe me about this breakthrough: if you have to sort 5 integers, a bubblesort is faster than quicksort, heapsort, mergesort and anything else. So what?
You simply choose a different algorithm/data-structure (existing since the sixties) that better fits the architecture/data/size of your current problem, which can easily make you go even a million times faster, with or without asymptotic complexity changes. That's called being a good programmer. But please leave "improvement on algorithms" alone, at least until you have studied a bit of what you are talking about.
The claim that "an algorithm that has been on the books as "optimal" for 46 years, which has been analyzed in excruciating detail by geniuses like Knuth and taught in all computer science courses in the world, can be optimized to run 10 times faster" is half wrong and half obvious.
It is half wrong because you cannot talk about optimizing an algorithm in terms of "time"; you can optimize its asymptotic computational complexity in number of steps, which you did not. It is half obvious because the fact that heaps do not match well with virtual memory, with caches and with multiple execution pipelines in modern CPUs has been known for decades.
Whoever decided to use a pure heap in that code was not a very good programmer; you proved to be a better one. You did not change the history of algorithms one little bit, a good CS class may help you with the fact that you think and claim you did, and there is a long way to go before you can even name Knuth.
Knowing practical implementation tactics is useful for obtaining improvements in many cases, but knowing a little of what is behind them is mandatory to understand why, if you are programming a realtime control system that controls the flaps of a plane, replacing a heapsort with a quicksort "because it is faster" might be a very bad idea (feel free to change the example to any case where worst-case performance is more critical than average).
    Blame on ACM for having accepted this article.
  • Neil Conway | Thu, 17 Jun 2010 07:46:58 UTC

K-heaps to reduce VM faults were proposed in 1991 by Naor et al.: "Performance of priority queue structures in a virtual memory environment", The Computer Journal 34:5. From a quick skim, Naor et al.'s proposed data structure is more or less identical to the one discussed in this article.
  • Ger Hobbelt | Thu, 17 Jun 2010 02:15:21 UTC

    Sigh. I regret this will be a partial or, for those who are able to read between lines, complete rehash of what has been said above, but I'll say it anyway.
First off, kudos for developing some good thoughts on the night train. Chapeau.
Second, a bit of criticism of which many seem to need reminding: education /prepares/ you for the rest of your life (by offering you a plethora of more or less basic concepts to use during the remainder of your (professional) life). Education does /not/ provide you with all the knowledge you're going to need then.
In other words: [higher] education should achieve three things: (a) teach you the basics (in CS101 and onward), (b) train you to grok and maybe even develop new concepts, most of which build on those basic blocks you were taught at school, and (c) the awareness that your education does not stop once you receive that diploma or title, but that you should /seek out/ new information and attempt to comprehend it so you can employ it in your own work.
I'm 41, and my personal impression is that education achieves (a) completely (or you'd fail the tests), more or less achieves (b), and your personal attitude is a strong determinant of the rate of success regarding (c). /Lack/ of sustained curiosity kills the cat, in the long term.
The point has been made before, more or less eloquently, that you, as happens with so many CS/IT professionals, seem to have neglected to stay abreast of current research in your field. For those who moan here about information accessibility: CiteSeer and other sites have a very large number of freely downloadable papers which report on cache-conscious, cache-oblivious and other hardware-architecture-aware algorithms that expand on the work of Donald Knuth and others. This may read as harsh criticism; if you experience it like that, I'm sorry, but so be it. I like the direct approach you took in your text; I hope I got this response down in kind, as it isn't much of a criticism but rather a 'hear! hear!' with a few side notes. I am glad this was written, in the way it was done. Nothing wrong with that.
The criticism is this: the basic architectural model in fig. 7 is not outdated and is still a good training model for /CS101/. /CS201/ (or whichever comes after CS101) would then discuss the shortcomings of said model and expand on it, ending up with a model where CPU caches (L1, L2, L3, ...), [D]RAM and HDD storage, their use and their characteristics become part of the picture, leading to a reconsideration of the previously taught algorithms in light of the 'new' model and its impact on scale (N) vs. performance (O() and C). That's where the cache-conscious, 'cache-oblivious', etc. algorithms (like your interesting reinvention of the B-tree-like binary heap) show up on the radar. A simple googling for cache-conscious binary heaps will deliver several papers more or less on subject; the key here is that you need to know the phrase 'cache conscious' in order to quickly find them. That's what the 'continued interest', a.k.a. education goal (c), should have delivered to you at this or an earlier time. Staying abreast doesn't mean leafing through your favorite IT magazines is good enough. That's the whole bloody difference between education and higher education: you're expected to be able to find and read published papers. That ability will help you to consider new technology for your projects as it appears [publicly], or at least make the choice to use or /not/ use it, given development time & cost restraints vs. performance requirements.
  • Ger Hobbelt | Thu, 17 Jun 2010 02:14:33 UTC

Calling the fig. 7 model 'outdated' simply shows you got it wrong. It was never outdated; all you did was recall a very basic model which was taught early on, and it had its usefulness then, as it still has today, in CS101. Yes, it's probably not a good model once you enter CS201, but heck, weren't classes meant to increase in difficulty, yet persist in memory beyond the test? ;-)
It's comparable to arithmetic as taught in primary school: the first model you learn is simple and only considers positive integers. I was angry at my teachers for 'lying' to me regarding the question '7-5=2. Right. Now what's 5-7?': "Son, that's impossible." Turned out it wasn't. Same with Knuth, fig. 7 and all the rest: they haven't been lying to us; they are correct, within the bounds of the model they defined up front. O(f(n)) only turns out to be a white lie/fallacy when you adjust/enhance your model to /more closely/ mimic actual reality. That's what this is all about. It has been generally accessible knowledge since the 1990s at least that, very broadly speaking, B-tree-like organisations/layouts significantly benefit algorithm performance under a multi-level memory access model. That's what you just realized and applied to your problem. That shows insight. There have always been register banks, CPU caches, and other 'memory levels', since the dawn of computer time. If you seek the 'ultimate' in performance, or at least something that surpasses your current system, you need to model these aspects and [re]consider what you're doing. A lot of researchers have been doing that, all the time. All you and I have to do, as a minimum, is make sure we find access to their work and read up. Being a member of the ACM gives you an edge. It's up to you to use that edge.
For those who seek to learn even after leaving school: sites like CiteSeer (and Google Scholar) provide fast and easy access to ongoing research, including articles which translate fundamental research into practice. It takes time and some perseverance to keep up, but wasn't that part of the teachings of your education? It is for me.
    All in all, nice work. A thorough article which satisfies both technically and (thanks to slashdot headline editors?) as a wake-up call for some as well. Thanks for writing it, in the style it was written. Direct and sure hitting a nerve. If anything, it's thought-provoking on several levels, which is a highly meritable achievement in itself.
    PS: I am not an educator (have sometimes been a part-time one, though). My only regrettable bias is that I tend to meet quite a few colleagues who seem to have forgotten that the ability to learn algorithms (or anything math related) is not just a useful nerve-ending to help get you your degree in higher ed.
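For readers hunting down the 'cache conscious' literature mentioned above, the simplest cache-aware relative of a binary heap is the d-ary heap, which packs all of a node's children side by side so one cache line serves a whole comparison round. This sketch shows only the index arithmetic, not the article's exact b-heap layout:

```c
#include <stddef.h>

/* 8-ary heap arithmetic: with 8-byte elements, a node's eight children
   share one 64-byte cache line, so finding the smallest child touches a
   single line instead of scattered addresses. */
#define D 8

static inline size_t heap_parent(size_t i)              /* valid for i > 0 */
{
    return (i - 1) / D;
}

static inline size_t heap_child(size_t i, unsigned k)   /* k in 0..D-1 */
{
    return D * i + 1 + k;
}
```

The trade-off versus a binary heap is more comparisons per level against far fewer levels and cache misses, which is usually a win on modern hardware.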
  • G Roelofs | Wed, 16 Jun 2010 21:51:36 UTC

    Minor note:  as far as big-O notation goes, log2 = log10 = ln = log-whatever.  (Sole difference is a multiplicative constant.)  So feel free to drop the "2"; it's just noise and may even get misinterpreted as log-squared.
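The change-of-base identity behind that remark, spelled out:

```latex
\log_2 n \;=\; \frac{\log_b n}{\log_b 2} \;=\; \underbrace{\frac{1}{\log_b 2}}_{\text{constant}}\cdot \log_b n
\qquad\Longrightarrow\qquad O(\log_2 n) = O(\log_b n) = O(\log n)
```

Since big-O notation absorbs multiplicative constants, the base of the logarithm is irrelevant.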
  • GSK | Wed, 16 Jun 2010 21:06:39 UTC

    Sadly, few top computer scientists are good programmers and vice versa. Apparently you're not an exception, as your ignorance about this new-fangled thing called "cache oblivious algorithms" shows.
  • Fini Alring | Wed, 16 Jun 2010 09:28:37 UTC

    Very good article Poul-Henning, it is nice to see that some people still care more about actual performance than theoretical.
  • pduey | Wed, 16 Jun 2010 03:26:35 UTC

What is your algorithm's performance on a Turing machine? Computer science is about studying and presenting computational theory. So, to blast educational CS programs because they don't update their curriculum every 15 minutes when a hardware optimization is announced is to misunderstand science. Varnish is awesome, but it's an example of excellent engineering, not to be confused with science. To accuse modern programmers of laziness because they would prefer to write more productive code that solves a problem now, in reasonable time, with a small expenditure on more memory (or other resources), is to lack a grasp of why we write software in the first place, which is to solve problems. I appreciate the artistic programmer and engineer, but criticizing anyone who endeavors to solve problems with software, even if their solution isn't the "best" or most "elegant", doesn't serve anyone. A little perspective is always nice. Sure, Varnish beats Squid in certain tests, but Squid was there first and was instrumental in the growth of the internet.
  • Daemeon Reiydelle | Wed, 16 Jun 2010 00:09:14 UTC

Microkernel/kernel/MP operating systems as a discipline, programming as a discipline, and database design as a discipline each make simplifying assumptions about the underlying and overlying layers. Those assumptions facilitate normal operation but cause extreme issues at the boundaries. University has only limited time at the undergraduate level to teach the impact of assuming those assumptions are broadly and generally applicable.
This excellent example of cross-discipline analysis and optimization only occurs in the odd moments of outside-the-box introspection. This introspection is both learned and requires time to develop and perfect. Neither the learning nor the time for introspection is cost-justified by the (potential) practitioner; one does this for one's own pleasure. It is one of the aspects that tends to differentiate open source solutions from for-profit ones: the former encourages personal ego and self-gratification to drive participation, whereas the latter drives 'get it out the door good enough to work' so one can move on to the next problem.
I would encourage the reader to see the value of both perspectives: inexpensive and highly functional computers & apps from Intel/Microsoft, and Knuth, Dijkstra, and a plethora of others.
I am not troubled by this fast-paced world, and I do not believe that time pressure is holding us back; a few elite men formed the Royal Society, while a cast of millions thinking part time (or not so part time) formed the Open Source world.
  • Donald Haase | Tue, 15 Jun 2010 23:44:45 UTC

    "The past 30 or 40 years of hardware and operating-systems development seems to have only marginally impinged on the agenda in CS departments' algorithmic analysis sections, and as far as my anecdotal evidence, it has totally failed to register in the education they provide."
    At CMU, one of the requirements for a BS in CS was to take a course called Introduction to Computer Systems. The course was developed at the school and is taught at a number of others now (I believe somewhere around 10-15, though that figure is about 2 years out of date). The entire course was about having students recognize that the notion of a computer as your figure 7 demonstrates is simply an easy theoretical model, while real world systems were much more segmented/complex. 
    About 15% of the course was even specifically spent deconstructing a common algorithm, and showing its failings due to its simplified architectural approach (by making a faster version via various hardware-aware optimization methods). All that is to say: not all schools leave their CS students unaware of this sort of issue. Hell, if it didn't sink in by the end of that course, about 50% of students also take "Operating System Design and Implementation" which among other things has you write your own paging system.
    Anyways, this is a fantastic realization, and I'm sure that many people are facepalming about it right now.
  • Jon-Pierre Gentil | Tue, 15 Jun 2010 22:11:58 UTC

To everyone who says "can't you just buy more ram?": That's what's wrong with so many things these days. Programmers (as a generalization) have become immensely lazy in regard to resource management because someone will always say "just put more ram in it." If we actually stopped and looked at what we do, and were scientific about it instead of code monkeys just churning out corporate source code, we could probably do ten times as much with the hardware that we have nowadays. Thank you, sir, for writing this article up; it's inspiring and necessary. Please continue to do the great work that you do, and thank you for sharing it with the world.
  • Vladimir G. Ivanovic | Tue, 15 Jun 2010 20:12:13 UTC

    @James Since G-WAN is closed source, you are exposing yourself or your company to unbounded liability if it turns out that there are security issues.
  • lizardb0y | Tue, 15 Jun 2010 20:10:41 UTC

  • Chad Brewbaker | Tue, 15 Jun 2010 19:51:47 UTC

    The Varnish source code sounded elegant so I tracked it down.
    Nice to see the author also posted a link to his source code for this article.
  • john slimick | Tue, 15 Jun 2010 19:10:07 UTC

    I enjoyed this article. As the last man standing in a small computer science program, I declined the opportunity to convert CS into how to write nifty web apps in VB. I will be doing our last database course as one concerned with the underpinnings -- hash functions, hash tables, B+ trees, concurrency, security, protection plus the required intro to SQL.  I am depressed that more isn't immediately accessible on the cost of all the virtual memory, layers, virtual IO, hotsy-totsy graphic interfaces, etc. Recently I bought a new car. The credit manager moaned and complained about his DOS-based app. I replied that, first, it works, and, second, it is more secure than the usual quick-and-dirty web application.
  • Bill | Tue, 15 Jun 2010 18:37:10 UTC

    Wow academia back off. The guy was just trying to relate his own experience optimizing a particular algorithm. Since some readers may simply be practitioners of the art, the article is useful. Not all readers are interested in recounting all the prior and existing research ad nauseam. He was just trying to give a concrete example of how blind application of optimal algorithms can result in suboptimal performance. Get a grip.
  • dave wade | Tue, 15 Jun 2010 18:33:56 UTC

    I think this is more a comment on the fact that folks don't believe, or even understand, that they need to write efficient code any more. It's very like the old Fortran IV matrix and paged-memory issue, where a program that varied the last subscript fastest, accessing the elements of each "row" of the matrix in turn, caused horrendous paging, because the data was stored with the elements of each column in consecutive storage locations.
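As an editorial aside, the Fortran IV locality effect described in that comment is easy to simulate: with column-major storage, a sweep that varies the last subscript fastest lands on a different page at nearly every access, while the opposite loop order stays on each page until it is exhausted. The page size and matrix dimensions below are invented for illustration, not measurements of any real machine.

```python
# Page size in array elements -- an invented figure for illustration only.
PAGE = 1024

def pages_touched(indices, page=PAGE):
    """Count how often a walk over flat array indices crosses onto a
    different page (a crude stand-in for page faults)."""
    transitions, current = 0, None
    for i in indices:
        p = i // page
        if p != current:
            transitions += 1
            current = p
    return transitions

ROWS, COLS = 1024, 1024

# Column-major storage (Fortran style): element (r, c) sits at c * ROWS + r.
row_order = (c * ROWS + r for r in range(ROWS) for c in range(COLS))  # varies c fastest
col_order = (c * ROWS + r for c in range(COLS) for r in range(ROWS))  # varies r fastest

print(pages_touched(row_order))  # 1048576: a new page on every single access
print(pages_touched(col_order))  # 1024: one page per column
```

Swapping the two loop orders changes the page-transition count by three orders of magnitude, which is exactly the thrashing the comment describes.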
  • Joshua Drake | Tue, 15 Jun 2010 18:23:53 UTC

    This is the web, please link your footnotes.
  • Nick Murphy | Tue, 15 Jun 2010 18:22:11 UTC

    Sadly, like so many others, this article takes a valid and useful technical point ("being oblivious to the underlying VM architecture can sometimes cost you dearly in performance") and sullies it in utterly tangential social commentary delivered in an unhelpfully condescending tone (in this case an unjustified and unjustifiable indictment of CS education).  Any good OS course (and most of them are, in fact, good) teaches students to think about these kinds of issues.  But students have to first learn a) how to program at all, and b) how to do algorithmic analysis at all, and condemning those prerequisites for making simplifying assumptions like uniform memory access is like condemning a parent trying to teach their child to ride a bike for not including a rudimentary discussion of gear differentials.  The world would be grateful if PHK would stick to things he knows, like kernel hacking and server performance analysis, rather than making ignorant comments about things he doesn't know, like pedagogy and university curricula.
  • Darron | Tue, 15 Jun 2010 16:16:33 UTC

    Interesting...  but it's quite apparent from the comments that this isn't new (as anyone who's worked with writing code for small embedded platforms could tell you).  It's somewhat arrogant, but it (or the discussion of it) brings up one very good point.  If the current work on such algorithms, and more importantly the mental process involved in considering the running environment of your algorithms, is only taught at graduate+ level, something is seriously wrong.  Academically, the state of the art may improve.  However, the main point of college is to train the practitioners of the day.  Tell you what.  Try just teaching a little two-week section where students play with writing programs on PICs or Arduinos.  Demonstrate how the target environment matters to various basic algorithms.  The students would love it, and so would their future employers.  The primary concepts here are NOT that difficult.
  • Angelo Schneider | Tue, 15 Jun 2010 15:23:41 UTC

    I did something similar in the early 90s when I started playing with operator new in C++. My idea was that the nodes of a tree/list structure and the nodes of the data objects managed in the node should be close together in terms of memory addresses. 
    Simply speaking, whenever a data object was created with operator new, a few tree/list nodes were also allocated. The effect was that the next_ptr, or left_son_ptr, right_son_ptr, and data_ptr pointers were close to the "this" pointer. 
  • Brian Krupp | Tue, 15 Jun 2010 13:32:22 UTC

    I really enjoyed reading this article. Sometimes we have to step back to what we have as "standards" in the CS world and make sure they are not outdated, like you did here.
  • Christoph Bartoschek | Tue, 15 Jun 2010 12:54:07 UTC

    The last time we looked at cache-oblivious heaps, they were hard to implement and, if implemented, slower than binary heaps for our datasets that fit entirely into RAM. 
    Can anyone point me to a cache-oblivious heap algorithm, or an implementation, that provides a speedup if only the caches between the CPU and RAM are involved and not any disk backend?
  • Leonard Norrgård | Tue, 15 Jun 2010 12:18:34 UTC

    Readers may also find the author's architect notes on Varnish interesting, available here:
  • Colin Guthrie | Tue, 15 Jun 2010 12:04:53 UTC

    So, will Memristors make both the old, non-vm way of thinking AND your new analysis wrong? :p
  • myne | Tue, 15 Jun 2010 11:14:56 UTC

    Y'know, not being a DBA, or even a lowly windows admin any more, I look at this and think "Nice, but couldn't you just buy more RAM?"
    It seems to be a general rule with computers. If in doubt, feed it more RAM.
    Hell, it's cheap enough these days. 16gb of RAM is only 4x what I have in my desktop. 
    It was ~$100. 128gb shouldn't be out of the question if your database really needs it.
  • Poul-Henning Kamp | Tue, 15 Jun 2010 10:30:28 UTC

    I would like to point out that I never wrote that Knuth was wrong; that headline is entirely Slashdot's fault.  The closest I get is to point out that if you read Knuth's books for advice on performance, they are likely to be wildly misleading.  We really need to get MIX and MMIX upgraded with some seriously bad memory hardware.
    And the point of the article is: Why isn't that stuff in the CS courses already ?
    Please read and understand the article before commenting.  The graphs _are_ based on SSD performance.
  • Edward Cullen | Tue, 15 Jun 2010 10:21:10 UTC

    While it *may* have been better not to say "Knuth got it wrong" and not forgetting Knuth's immortal words "Premature optimisation is the root of all evil", the point remains that we don't live in a world where memory access is always O(1). In the world of high-performance servers, where even a 5% performance saving equates to BIG money, this article serves as a reminder that theoretical models are only a *starting* point for designs.
  • none | Tue, 15 Jun 2010 10:20:41 UTC

    Pardon the pun, but...
    you obliviously did it wrong!
  • Donny Viszneki | Tue, 15 Jun 2010 10:18:36 UTC

    Being very interested in this subject myself, I've often wondered if we'll ever see a multitasking model based on I/O, and a notion of I/O that includes paging.
  • James | Tue, 15 Jun 2010 10:13:01 UTC

    G-WAN is doing better:
  • apurv | Tue, 15 Jun 2010 10:03:19 UTC

    You seem to pick a book from the 1960s, and ignore all the recent work on cache-conscious data structures. 
    Why do you have to use a heap anyway? Strict LRU isn't required - there exist so many approximations to it. In particular, might prove useful to you.
  • Cellar | Tue, 15 Jun 2010 09:07:41 UTC

    Good story, pretty graphs... and still doing it wrong. I'm not talking about minor failure to see the full picture like in "Solid State Disk disk", but on the incessant ad-hominem talk that successfully paints the bike shed an "outrageous" colour, but gets old within mere sentences.
    You, sir, desperately need a course in polite writing, in the "hard on the issue, soft on the person" sense taught in negotiation, to avoid continuing to paint yourself an arrogant pompous ass. We already have more than our fair share in "IT", and there's no need to stick to it especially if you can do better. And since you can do other stuff, maybe you can do this too. I say this in the nicest way possible, of course.
  • bob | Tue, 15 Jun 2010 09:02:46 UTC

    An article about data structures citing only 2 papers from the 60's?
    If you go back 5 years more, maybe you can reinvent Quicksort as well.
  • Anand Vidwansa | Tue, 15 Jun 2010 08:07:21 UTC

    While there is a strict need to demonstrate the effectiveness of the approach used with some rigid statistics, I like the notion of making applications aware of the virtual memory underlying them. I understand the cost of a page fault is very high, and if it can be avoided, it would be a tremendous performance win. And when operations scale to millions of objects, it would definitely make a difference.
  • Mark Rustad | Tue, 15 Jun 2010 05:53:55 UTC

    Although I was a CS major and was a member of ACM for many years, I eventually switched to IEEE because I identified better with the realities reflected in the work there than with imaginary models. This result reminds me of letters to SIGPLAN way back. A paper was published that came from the University of Illinois that proposed that a proportional search was better than a binary search, and they had data to show that. Unfortunately their results were not reproduced by others, as reported in letters responding to the article. I immediately recognized why the original results were not reproduced. The University of Illinois had a Control Data Cyber 170-750 (which incorporated essentially a 7600 CPU). Such a CPU has a divide instruction that is much faster, relatively, than other operations compared with other CPUs. Other, lesser machines of the era, like a VAX even with a floating-point accelerator, have divide operations that are much slower relative to their other operations. So the specifics even trip up the algorithm analysts. Hmm. I wonder how the VM characteristics of the VAX would compare with the non-VM characteristics of the Cyber? Oh well, both of those machines are puny by today's standards.
    The point is that specifics matter. Over the course of 40 years of programming I have learned that lesson more than once.
  • BigJeff | Tue, 15 Jun 2010 03:39:42 UTC

    If you re-read his simulations, he assumed SSD level speeds (1ms).  Multiply everything by another factor of 10 to get spinning disk disparity.
    In other words: no, the problem doesn't go away, because SSD is still an order of magnitude slower than RAM.  Failing to take that into account will leave you with real performance that is significantly worse than theoretical performance.
  • Alan Tam | Tue, 15 Jun 2010 03:27:00 UTC

    I am horrified that ACM Queue accepts articles of this quality. While there is a genuine need to consider various levels of caching in modern systems (author calls it VM), mentioning it without talking about IO-efficiency, external memory data structures and algorithms, and cache-obliviousness is undermining the enormous amount of effort many basic researchers have spent in the past decade in the area.
    I can appreciate the need for the eye-catching title or the criticism against the CS community for not putting enough emphasis on this topic in undergraduate education, but the mere promotion of one particular new work without mentioning all the relevant works others have contributed to the same field is just ignorance and heroism.
  • Antek | Tue, 15 Jun 2010 02:41:57 UTC

    As someone who works with university and PhD students a lot, I can only say that I have to concur with you: CS education today hasn't evolved a lot from where it was when I was in the benches.
    I would actually even go as far as saying that most CS programs have gone back in quality as languages like Java etc. have hidden all the 'nasty' internals from students. And this in turn has a detrimental effect on students' knowledge of what goes on under the hood, especially when things are no longer in memory.
    Recently I saw a great example of how little the average student actually knows of lower-level system activity, in a database design where the status field on a table was defined as an NVARCHAR(255) and the content of that field was... 'enabled' / 'disabled' as a string :D , instead of using a bit field which would have done the trick just fine.
  • Totti | Tue, 15 Jun 2010 02:38:32 UTC

    Nice article, with a nice inside view spiced with historical background. You should think about publishing it in a CS journal.
    Nevertheless, which program did you use to create the artwork (graphs)? It's really nice, and I never saw this layout and format before.
  • evanh | Tue, 15 Jun 2010 02:19:18 UTC

    You've been Slashdotted ...
    "Varnish does not ignore the fact that memory is virtual; it actively exploits it."   No, Varnish actively exploits paging to disc ... which is not virtual memory.
    And, well, duh: software that ignores such discrepancies as the lag and speed difference between HDD and RAM, and is then abused to the levels in the above article, is doomed.  Just proves that coding for the hardware is still the right way, no matter how much the IT dimwits think they can ignore it.
  • Gary | Tue, 15 Jun 2010 02:16:44 UTC

    You had me until you said the ZX81, C64, and TRS-80 were toys. That was totally unnecessary. Good post nonetheless.
  • Chris Hamilton | Tue, 15 Jun 2010 01:20:50 UTC

    As pointed out above by several people, this is indeed similar to existing concepts, if not identical.  In fact, there is a whole branch of theoretical computer science dedicated to finding algorithms that are optimal once you consider memory paging.  You are considering two levels of memory (main RAM, and swapped to disk), much like the I/O model of computation (google "I/O efficient algorithm for XXX" and you'll find lots of hits).  Such models require you to tune the parameter B (8 in your example) for best performance on a particular piece of hardware.  There are also cache-oblivious algorithms, which are actually optimal on any piece of hardware, across all levels of cache (L1, L2, RAM, disk, network storage, ...) regardless of their relative sizes.  As suggested by Paulo Silveira, these models analyze algorithms not by the number of arithmetic operations, but rather by the number of incurred I/O operations, which is much more important when data gets cached.  Note that most I/O-efficient and cache-oblivious algorithms are simultaneously optimal in the classic RAM model.
    A lot of these ideas are relatively young (10-15 years) and not very well known, with little penetration in the real world except at universities and companies dealing with extremely large datasets (as you noted, I don't think there are any CS101 textbooks covering these ideas; usually people don't see them until grad school).  There do exist some libraries (STXXL and TPIE come immediately to mind) that have implemented many common algorithms and data structures to try to get some traction for these ideas (which as you have noted, provide dramatic real-world gains).  Not to diminish what you've done, but I'd suggest a look at the literature and the existing tools to see what other people have done as well.
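To make the two-level idea from that comment concrete, here is a small editorial sketch that packs each complete depth-3 subtree of a 1-based binary heap into its own 8-slot page and counts how many distinct pages a root-to-leaf walk touches, versus the classic contiguous array layout. This is a simplified stand-in for the article's B-heap, not Varnish's actual mapping, and the page size is chosen purely for illustration.

```python
from collections import deque

def paged_layout(levels, k=3):
    """Assign each 1-based heap index to a page by packing every complete
    depth-k subtree into its own page. A simplified B-heap-style layout
    for illustration; Varnish's actual mapping differs in detail."""
    limit = 2 ** levels              # valid indices are 1 .. limit - 1
    page_of, next_page = {}, 0
    queue = deque([1])               # subtree roots, breadth-first
    while queue:
        frontier = [queue.popleft()]
        for _ in range(k):           # sweep k levels below this root
            nxt = []
            for node in frontier:
                if node < limit:
                    page_of[node] = next_page
                    nxt += [2 * node, 2 * node + 1]
            frontier = nxt
        next_page += 1
        queue.extend(n for n in frontier if n < limit)
    return page_of

LEVELS, PAGE = 12, 8                       # 8 entries per page, for illustration
path = [2 ** d for d in range(LEVELS)]     # leftmost root-to-leaf walk

classic = {i // PAGE for i in path}        # classic contiguous array layout
layout = paged_layout(LEVELS)
paged = {layout[i] for i in path}          # subtree-per-page layout

print(len(classic), len(paged))  # 10 pages touched vs only 4
```

The classic layout shares a page only across the top few levels, so a walk touches nearly one page per level; the subtree-per-page layout crosses a page boundary only every k levels, which is the whole point of the B-heap.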
  • Noah Mendelsohn | Tue, 15 Jun 2010 01:09:04 UTC

    I'm very sorry about the formatting of that last comment.  When I typed it, paragraphs were carefully separated by spaces!
  • Noah Mendelsohn | Tue, 15 Jun 2010 01:08:22 UTC

    This is really lovely work, congratulations.  Just to prove that some of us were thinking along somewhat related lines some time ago, perhaps you'll be interested that the VM/Passthrough system written in the mid-1970s used a page-sensitive buffer allocation scheme, for more or less the same reasons you describe.  In a nutshell, Passthrough provided a telnet-like service for IBM VM/370 mainframes, and the server side ran in a virtual machine, and thus with paged virtual memory underneath.  Getting the buffer management in the virtual machine to perform well was tricky:  we eventually supported several hundred interactive users, on hardware that had just a few megabytes, ran at just a few MIPS, and most importantly, was primarily busy with other tasks while Passthrough ran in the background.
    The pertinent article is Mendelsohn, N., Linehan, M.H., and Anzick, W.J., Reflections on VM/Passthrough: A facility for interactive networking, IBM Systems Journal, Vol. 22, Nos. 1 and 2. 1983.  Unfortunately, the full text is now behind an IEEE paywall at , but I'll quote a few brief sections that may be of interest:
    ========begin quote=========
    Holding down the working set for Pass-Through has the following benefits related to scheduling and performance:
    * If a working set is small enough, the host operating system may be able to keep the entire working set resident for long periods. Overhead for the paging of the working set is thus reduced.
    *If paging is necessary, delays are minimized for small working sets. These delays result from disk transfers necessary to free space in memory, as well as from those that bring in the Pass-Through working set.
    *The path length of the operating system logic for paging tends to dominate the CPU overhead for Pass-Through. Excessive use of memory thus translates into additional CPU time billed to Pass-Through.
    For all of these reasons, more effort has been spent on reducing the Pass-Through working set than on directly minimizing path length.
    To minimize the working set for data buffers, a page-sensitive buffer allocation scheme has been developed, whereby the memory allocators in Pass-Through always attempt to allocate new buffers within pages that already contain active buffers
    ========end quote=========
    Algorithmically, your Varnish/B-heap work is much more interesting.  The specific algorithm implemented by us was somewhat ad-hoc, and in any case unlikely to be of interest for other applications.  
    It is interesting to see that these issues do keep coming up after all these years, and I especially agree that it's unfortunate to see how little of the early work on virtual machines and virtual memories is quoted or taught in the recent literature or CS curricula.
    (And in one more bit of small world trivia, I'm pleased to say that I had the privilege in 1970 of writing just a bit of FORTRAN code that for some years was run on the Ferranti Atlas at the University of London, though the code was in no way optimized for the virtual memory system, which indeed, I was unaware of at the time.)
  • Craig | Tue, 15 Jun 2010 00:57:33 UTC

    So, when exabyte SSD drives become commonplace, such that common "swap" is non-existent, will this argument go away, or just switch to getting around I/O-bottleneck/parallel-bottleneck-related issues?
  • Martin T | Tue, 15 Jun 2010 00:31:35 UTC

    Let me see if I get this right: if your heap is so big that it can't fit in memory, then using your modifications will make the swap behaviour much more effective. 
    They do teach this kind of algorithm in university. They are called "I/O-efficient" or "cache-oblivious" algorithms, and they are rather interesting.
  • ransak | Tue, 15 Jun 2010 00:28:31 UTC

    Great article, I love seeing breakthroughs that have been overlooked for years. It's almost a guilty pleasure. A minor quibble, but you had a small typo. It's "Ponte facto, Cæsar transit", not Ponto.
  • me | Mon, 14 Jun 2010 22:51:16 UTC

    Correct me if I am wrong here, but I was always taught it was C*O(f(n)), where C was some function of the 'overhead'.  In this case you have discovered the overhead of the memory subsystem.
    So in the case you have, the C was bigger than the actual algorithm itself.  In your particular case you have mapped 300 gig to 16 gig.  So yeah, there will be some overhead, especially if you do not have locality of data (which is what you did in your new alg).  In cases where there is no backing store, could you end up with a slower thing?  From your graphs it doesn't look like it.
    I do agree that CS programs seem to 'gloss over' the C part of those equations.  The sort of thing you are going on about would be very good for everyone to know about.  Constants (at least when I was in college) were usually 'factored out' or just tacked on with a +c.  The downside is that the '+c' part gets in the way of teaching how to think about the different algorithms in the first place.  Unfortunately the side effect is we forget about it!
    But you have also come across what I tell others: 'put it in memory and it will be faster than disk, put it in cache and it will be faster still, put your data on the same cache page and it will smoke'.  That is just the way the architecture is.  I also like what you did with putting 'like' items near each other.  Seen that trick a few times before in the old VGA days in DOS.  Neat to see it applied to other things.
  • Poul-Henning Kamp | Mon, 14 Jun 2010 20:33:03 UTC

    @Shriphani:  You're right, I should have caught that brain-o.
    @Joseph: Wikileaks didn't want to touch it :-)  I'll probably get back on that topic later.
    @Hatem:  I can absolutely guarantee you that I invented the B-heap as presented: I did it in a train with no Internet connection.  However, I don't claim to have been the first to have done so, and I would be seriously disappointed if that was the case.
    @Paulo:  Not updating that assumption about memory access delays is the same as ignoring all computers from the VAX-11 onward in CS education.
    @Itai:  Nice presentation, now write a CS101 datastructure textbook where it is covered :-)  (PS: please post pdf's, powerpoint renders badly for the Windows-emancipated)
  • Itai Lahan | Mon, 14 Jun 2010 19:28:04 UTC

    Couldn't help but link to a presentation I prepared on cache oblivious B-trees -
  • Paulo Silveira | Sun, 13 Jun 2010 04:45:18 UTC

    Varnish is just an incredible piece of code, but saying "Knuth might be terribly misleading on the performance of the binary heap, possibly even by an order of magnitude. On the way home, also on the train, I wrote a simulation that proved my hunch right" is just nonsense. 
    His analysis is done considering a RAM model where "Random refers to the idea that any piece of data can be returned in a constant time, regardless of its physical location and whether or not it is related to the previous piece of data." (Wikipedia). This means that any memory access costs O(1). In other words, he has already allowed for this, since it is defined that he is not considering that memory access can cost more in some cases.
    You should have started with your seventh diagram, redefining the costs of memory access (i.e., not O(1) in all cases); then you could present a new analysis for almost every single algorithm (not only heapsort) based on the new time consumption for different memory-region accesses.
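The reanalysis that comment asks for can be sketched with a toy cost model: instead of charging O(1) per access, charge 1 unit for an access that hits the most recently used page and a large penalty for any other access. The page size and miss penalty below are invented parameters for illustration only.

```python
# Invented parameters: 64 entries per page, 100x penalty for a page miss.
PAGE, MISS = 64, 100

def access_cost(indices):
    """Charge 1 per access that hits the most recently used page,
    MISS per access that lands on any other page."""
    cost, cached = 0, None
    for i in indices:
        p = i // PAGE
        cost += 1 if p == cached else MISS
        cached = p
    return cost

N = 4096
sequential = access_cost(range(N))                       # good locality
strided = access_cost((i * PAGE) % N for i in range(N))  # a fresh page each time

print(sequential, strided)  # 10432 vs 409600: same O(n), wildly different cost
```

Both traces make exactly N accesses, so they are identical in the uniform RAM model; the non-uniform model separates them by a factor of about 40, which is the kind of gap the article's graphs show.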
  • Hatem Nassrat | Sun, 13 Jun 2010 00:54:14 UTC

    Stop thinking/saying you invented this; this is none other than the cache-oblivious binary heap.
    Can anyone just post an article on this queue? Where is the QA?
  • sac | Sun, 13 Jun 2010 00:50:05 UTC

    Figure 5 was mis-mapped to a non-existent memory space. Besides, the article is faster without it; with figure 5, 12 servers were using 100% of their CPU -- without it, only three servers are required and they sit 90% idle.
  • Joseph Kern | Sun, 13 Jun 2010 00:02:10 UTC

    So where's figure 8? If the computer model is outdated can you post an updated one? Great research.
  • Tom Limoncelli | Sat, 12 Jun 2010 23:18:35 UTC

    How dare you pay attention to how computers actually work!  Back into your cage, Mr. I-live-in-the-real-world!  There's no place for you here!
  • Shriphani Palakodety | Sat, 12 Jun 2010 22:25:37 UTC

    This is a very interesting article, and the FreeBSD programmers are obviously some of the top programmers in the community. I wanted to point out that the traditional binary heap doesn't go n -> {2n, 2n + 1}. It goes n -> {2n + 1, 2n + 2}. If n were 0 (the root of the tree), its children should be at positions 1 and 2 in the array used to represent the heap, not 0 and 1. Other than that, this is a very nice article and thanks for sharing it with us.
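Editorially, both index conventions in that exchange are correct for their respective array origins: n -> {2n, 2n + 1} assumes the root sits at index 1 (Knuth's convention), while n -> {2n + 1, 2n + 2} assumes it sits at index 0. A tiny sanity check, purely illustrative:

```python
def children_1based(n):
    """Children in Knuth's convention: the root is stored at index 1."""
    return 2 * n, 2 * n + 1

def children_0based(n):
    """Children in the C-style convention: the root is stored at index 0."""
    return 2 * n + 1, 2 * n + 2

# The two layouts describe the same tree, offset by one slot:
for n in range(1, 1024):
    assert children_0based(n - 1) == tuple(c - 1 for c in children_1based(n))
print("both conventions agree up to an index shift")
```

The 1-based form is slightly cheaper on some hardware (child and parent are a shift away with no add), which is presumably why Knuth and the article use it.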
  • Bruno Martinez | Sat, 12 Jun 2010 20:17:16 UTC

    The author should take a break from his 15 years of bit twiddling to read on cache-oblivious algorithms.
  • dave | Sat, 12 Jun 2010 17:48:04 UTC

    Thank you for using your genius to make the world a better place.
    "The past 30 or 40 years of hardware and operating-systems development seems to have only marginally impinged on the agenda in CS departments' algorithmic analysis sections," 
    // and, as far as my anecdotal evidence goes,
    "it has totally failed to register in the education they provide."
    -- your article.