(newest first)

  • Nick Black | Sat, 27 Jun 2009 15:33:34 UTC

    I'm dismayed that this article didn't mention Ulrich Drepper's "The Need for Asynchronous, Zero-Copy Network I/O", especially given the recent history of APIs going where Drepper takes them (see NGPT vs. NPTL, although that was admittedly implementation rather than interface).
  • William Ward | Tue, 26 May 2009 21:12:37 UTC

    I do not believe the multihoming argument was clear enough for some of the commenters.  Multihoming != NIC bonding.  Your mobile phone has interfaces on multiple disjoint networks.  Having the option of advertising your mobile carrier based IP as well as a traditional LAN IP would be advantageous for both roaming as well as redundant accessibility.
  • mh | Fri, 22 May 2009 09:09:08 UTC

    jhb is dead right on. What kills high throughput is the difficulty of receiving. If you look at every high-performance alternative to the socket API, you will see they ALL have some kind of "preparation" step where the sender warns the receiver about the (big) data it is going to send, so the receiver stack can make arrangements in advance. This is the main issue preventing the socket API (and its underlying stack) from achieving high throughput; the socket API provides nothing more than a bare stream with absolutely zero advance notice and zero vision past the next byte, so the receiving side cannot come prepared.
    Concerning reliable multicast there indeed will never be any silver bullet, just ad-hoc solutions. The problem is fundamentally too hard.
  • SimpleSimon | Wed, 20 May 2009 07:39:24 UTC

    Author says:
    "it is quite impressive for an API to have remained in use and largely unchanged for 27 years".
    Like so many others, he doesn't seem to have a clue where general purpose computing REALLY started - the IBM 360 series of mainframes, a full 20 years before "sockets". There are APIs from then that are still in common use today.
    Heck - there are even PC-based interfaces older than that.
    This guy is probably one of those who think Intel invented "virtual storage" when they introduced the 80386. Of course, the fact is, Intel joined that game 20 years after the big boys.
  • Jason P Sage | Wed, 20 May 2009 06:25:49 UTC

    Well, I think IPv4 and IPv6 are fine. Sometimes the tendency for everyone to keep "fixing what isn't broken" in the name of new technology just doesn't gain anything... and instead antiquates software that was fine, well thought out, and had its bugs squashed... just to start fresh... with new bugs.
    I think it's important to consider the "weight" on the infrastructure which is already pretty tanked due to spam and other useless data transmissions... like "STREAMING" constantly.
    Protocol variations - regardless of the proposed "function calls" and definitions of what these would do, if modified to "help streaming" they would likely be utilized as a throttling mechanism. In turn, people would make video game servers and web pages that piggybacked the "Streaming Mechanism", as it would likely be given "more speed" for the simple fact that it was the streaming "pipe".
    The fact is, when solid APIs are maintained, as with the UNIX file system and sockets, software matures... and you get a solid foundation. When the foundation is rock solid, you can TRULY stand on the shoulders of the developers before you and innovate!
    Reinventing the wheel all the time and calling it new and awesome gets us nowhere and creates more IT chaos/incompatibilities/complexity.
    I wish I could write a program that wouldn't be outdated as soon as I wrote it because IT is more about trend than substance these days.
    Sockets, UNIX file I/O... 27 years plus? grep? WOW! That's decent software... that runs on practically any OS... That's how it should be done :)
  • jhb | Thu, 14 May 2009 13:43:38 UTC

    The comments about using mmap'd buffers for socket buffers are close, but there are some problems.  Sending data is the simplest.  You can sort of do this now with sendfile(), but the real gain would be doing it with freshly generated data rather than with data in a pre-existing file.  You can still do that by creating scratch files and doing your data generation into an mmap'd buffer of that file.  However, then you run into the problem that the kernel does not provide notification when a sendfile() request is completely done with the data in the file.  You could perhaps work around this by creating a new scratch file (which you unlink() after creating it and munmap() after calling sendfile(), relying on the various reference counts to free it when sendfile() is done), but that is a good bit of overhead.  With notification of completed sendfile()s, you could perhaps amortize that overhead across many write()s by creating a single large, sparse scratch file for send buffers.
    Receiving is a good bit harder, however.  The problem is that to really do zero copy, you want to use userland's mmap'd buffer as a RX buffer in the NIC's RX ring.  However, you can't predict ahead of time which entries in the RX rings will be receiving packets for a given socket.  You could perhaps make NICs smarter such that they have separate RX rings for specific sockets, but then you can run into issues with either 1) consuming more wired physical RAM for per-socket RX buffers or 2) dropping incoming packets because even though you have the same amount of RX buffer space, not all of it is able to receive every packet as they do currently.  The other approach for the RX case would be to map the RX buffers into userspace after the data was received.  However, to avoid security concerns, multiple RX buffers could never share a physical page (since VM operates on physical pages) (*BSD systems typically use 2048 byte buffers for RX which means on x86 you have 2 buffers sharing a physical page).  You would also need a way for userland to "release" the buffer back to the OS.  A problem with this approach is that the VM operations to map/unmap buffers can be slow, especially if you are dealing with small buffers.  If it wasn't, we would all be using IO-Lite by now.
    The one thing I will say about multicast UDP is that while it is indeed useful, many folks actually want multicast data that has TCP-like properties in terms of reliability and in-order delivery.  Thus, you have N different flavors of solutions to deal with this (and none of them standardized that I am aware of).  In some cases folks fall back to retransmitting lost packets via sideband channels.  In the case of a one-way or very high latency link, the sender may simply choose to send all the data multiple times (or use something like forward error correction (FEC)) and hope that the receiver gets at least one copy of each datum such that it can reassemble the stream.  That said, if one came up with a silver bullet for reliable multicast (I'm not holding my breath), I don't think it would require a change to the existing socket API to support it.
    Regarding aio(4), the API is not great, and the implementations I am aware of aren't a performance gain.  I know of two classes of implementations: 1) implement AIO in userland by spawning threads that basically do select() / read/write() loops (or even synchronous I/O instead of select()), or 2) implement AIO in the kernel by spawning kernel threads that do synchronous I/O in the kernel.  Both cases aren't really any different performance-wise than simply using select/poll/kevent/epoll with non-blocking sockets in userland.  One possible benefit is that perhaps the programmer can write less code, but that may be debatable as AIO has its own complexity.
  • Herbie Robinson | Thu, 14 May 2009 09:59:48 UTC

    Whether one uses a wait/read call or gets a callback when data comes in is a red herring.  In order to provide proper protection, the OS is going to have to establish a protected address space, do CPU accounting, cross the kernel boundary and do some sort of scheduling whether its a read or a callback.  The only real difference between the models is which end of the stack you start executing at.
    The socket API has had multi-homing for decades if one uses one socket per interface -- you select the interface by specifying an IP address when you bind.  I guess if one could establish a single connection to/from multiple IP addresses, the driver could multiplex the data over all the interfaces (as opposed to having the application do it).  That would be somewhat helpful.  Of course there is no well established standardized way to get a list of interfaces -- that would seem more like the real obstacle to multi-homing.  
    Note that bonding is not multi-homing because it only uses a single IP address and the second link is a hot standby (the definition of multi-homing is that there is more than one IP address on a computer).  There is also something called link aggregation that will allow one to have multiple NICs for a single IP and use the NICs simultaneously for increased traffic.  This is also not multi-homing.
    Event loops (i.e., using select) can be sped up by putting the sockets in NDELAY mode and issuing the I/O calls until one gets EAGAIN (this will reduce the number of calls to select).
    Memory copying is not a big issue with current CPUs: cache misses and locking are the big issues.  I've been working for a while with a TCP stack that copies twice on output and twice on input, and the input copies are likely to be on different CPUs.  All that copying is less expensive than the once-per-interrupt read of the hardware register that says what caused the interrupt!  The other big hits are masking interrupts and locking.  The data copies are a distant fourth in terms of cost.
    What's really missing from the API is a way to read and write multiple packets in a single system call.  Doesn't matter with TCP, but it does for any non-stream protocol.
  • Howard Chu | Thu, 14 May 2009 00:47:32 UTC

    The author comes really close to making some good points, but misses the mark each time.
    The socket API is not inherently 1:1 connection-oriented; it clearly has supported UDP and broadcast since its inception, and multicast since shortly thereafter. It's clear that multicast is the right answer for most of these net-wide streaming applications, but apparently there aren't many programmers who understand multicast; it seems George isn't even aware of it...
    As others have already pointed out, VMS has an excellent async I/O API and has had it for over 30 years. So one doesn't need to look too hard to find alternatives that work well. Granted, async I/O in POSIX is pathetic... It would be pretty trivial to extend a POSIX kernel to allow a user to create mmap'd buffers for use with sockets. The whole issue of "the kernel losing its mapped memory if the user process goes away" is a total red herring - you don't give a region of kernel memory to the user process, you give a region of mapped user memory to the kernel. I proposed an API based on this notion a couple of years ago; unfortunately, I haven't had the time to implement it yet. At any rate, select() is not part of the sockets API, so identifying select() as a weakness of the sockets API is pretty ridiculous.
    re: multihoming - that's obviously a protocol issue more than an API issue; again, when the majority of your applications are built on TCP which requires 1:1 endpoints, you don't have any other choice. And as others have pointed out, when you want to take advantage of multihoming you can just use bond interfaces and forget about it.
    Sure, the landscape has evolved and there may be areas in which some APIs could be improved. But saying the days of sockets are over is pretty far-fetched, and blaming the API for narrowing programmers' mindsets is over the top.
    There are far more significant/important bottlenecks in software performance today than at the socket layer. Given the prevalence of massively bloated ultra-high-level-language applications out there, even a perfect zero-copy network stack implementation will yield zero measurable benefit to the end user.
  • Jaro | Thu, 14 May 2009 00:25:40 UTC

    Interesting paper--it makes very valid points and is thought provoking.
    But wasn't there an industry working group that was trying to create an async sockets definition some years ago?  I think it was called the Open group's Extended Socket API, or something like that.  I would have liked for this paper to have addressed that work, to explain if (or why not) it addressed the concerns here.
    Also I seem to recall other things which were industry attempts to address some of these issues.  (Event Ports may fit into this also--not sure.)  Again, it would have been nice for the paper to more comprehensively address those attempts.
    Sorry I'm vague in this comment, I'm a program manager working on storage now, and it's been years since I worked in the HP-UX kernel on async I/O interfaces.
  • Jeffrey | Wed, 13 May 2009 20:34:07 UTC

    Windows has had overlapped IO for sockets since Windows 95/NT.  Problem solved.
  • jklowden | Wed, 13 May 2009 19:43:17 UTC

    The sockets API doesn't even mention Ethernet or NICs.    It doesn't associate the socket with the underlying hardware.  Ergo, no redesign is required for multihoming.  The kernel can provide a single multihomed "virtual" address, and no user-space program need be the wiser.  
  • dm | Wed, 13 May 2009 19:02:07 UTC

    The original Berkeley sockets API actually did use an open() call.  One opened one of a variety of network "devices", I think.  This was replaced with the existing one.
    select() is not there to see if data is available, it is there to see which of the many files have data available, so the program can choose (or "select") which to serve.  If one wishes, one can have one's program just do a read(), if one is willing to wait for data to become available when it isn't.  It is also possible to do a non-blocking read, of course (as pointed out by the person referring to asynchronous I/O).
    The original sockets API also has sendmsg() and recvmsg(), which are little used with the SOCK_STREAM (TCP) protocol, but are essential when using datagram protocols like UDP.  TCP, by its nature, has two endpoints, so a TCP socket binds and connects.  UDP may have a single endpoint talking through one socket to multiple peers --- it still binds, but does not use connect, passing the address information for each message in the arguments to the sendmsg() call, and reading the identity of the peer through the arguments of the recvmsg() call.  sendmsg() and recvmsg() probably work for multi-homing, as well.
    It would be nice to be able to obtain a piece of memory to serve as an IO buffer, eliminating the copy.  
  • spinLock | Wed, 13 May 2009 19:01:16 UTC

    OpenVMS has native asynchronous I/O completion - and has had it for decades.  At least part of this problem could be solved by looking at this excellent implementation.
  • Scott | Wed, 13 May 2009 18:27:01 UTC

    I would have liked the author to have proposed a new API.   Just looking at the API is a bit short-sighted;  the protocols themselves are impediments to both latency and bandwidth, including the overhead of handling the TCP and IP protocols in the kernel.   Van Jacobson 's network channels could lead to a user level API that would address several of the author's concerns.
  • David Lee Lambert | Wed, 13 May 2009 18:13:33 UTC

    What about aio?
    ("Asynchronous I/O") - it doesn't solve the multiple-copies problem,  but it delivers available I/O to a signal handler instead of requiring the application to issue a select() or read() call.
  • Marsh Ray | Wed, 13 May 2009 17:00:10 UTC

    > Vivek | Wed, 13 May 2009 15:53:23 UTC
    A memory benchmark giving 1GB/s is probably copying large blocks so the initial latency in the copy becomes insignificant.
    Block copy operations tend to flush out the CPU data caches and cause cache misses, often in unrelated code. A single cache miss can result in a penalty of 400-or-so clock cycles.
    Network packets tend to be sizes like 1500 bytes, too big to ignore, but not big enough to amortize a cache miss penalty over.
    In any case, a system that copies a block of data is, in general, not going to beat a system that just copies a pointer.
  • Mark | Wed, 13 May 2009 16:46:02 UTC

    Vivek, you can't simply compare network throughput vs memory throughput and claim that memory throughput makes redundant memory copies irrelevant.  There are a ridiculous number of things that factor into memory throughput, for example: CPU cache lines, CPU cache size, bus contention with other devices, CPU translation tables, OS paging translations, OS virtual memory paging... just to name a few.  Memory copies are so fast primarily because of CPU/OS cache tricks.  Redundant copies fill the cache and slow down the overall memory throughput.  It may not matter for low capacity systems, but in the high capacity servers I have written, every cache-miss is critical, every user mode to kernel mode transition is critical, every memory allocation is critical, and every duplication of network data is critical.  Don't believe me?  Write a server capable of handling 100,000 simultaneous connections each pushing 4k of data every second (and that's the low end, and yes it's possible--you just need multiple NICs).  If you have redundant memory copies, you'll run out of memory long before you hit the potential network throughput limits.
  • Joshua | Wed, 13 May 2009 16:12:55 UTC

    You are going to need an sctp_select() for the ability to monitor many sockets from one thread anyway.
  • Joe Anonymous | Wed, 13 May 2009 16:00:30 UTC

    Article fails to reference the following:
    Bandwidth:  "select(), read(), process(), select().... If data were available to the caller when it invoked select(), then all of the work that went into crossing the user/kernel boundary would be wasted."
    If there are multiple sockets to poll from select(), and there is time wasted marshaling the arguments for select() and its system call overhead, doesn't this simply imply you should just decrease #sockets/thread?
    If #sockets/thread reaches 1 and you still find that there is time wasted marshaling the arguments for select() and its system call overhead, then this just means you shouldn't be using select and should just call read() directly.
    Low latency: "The problem for low-latency applications is that kevents() do not deliver data; they deliver only a signal that data is ready, just as the select() call did. The next logical step would be to have an event-based API that also delivers data."
    If the goal is to have data available when an event notification takes place without requiring another system call to fetch it, then why isn't the aio(4) family of functions sufficient?
    Multihoming:  Why is NIC bonding not a sufficient solution?  I can bond my cell, Wi-Fi, and Bluetooth interfaces into a single logical interface and thus avoid rewriting socket() API based applications.
  • Vivek | Wed, 13 May 2009 15:53:23 UTC

    The points are quite valid; however, talking about memory copies being expensive is a little behind the times...
    Modern systems have memory copy benchmarks ranging above 1 gigabyte/second, still growing with Moore's law...
    Network data rarely approaches even 1 megabyte/second.
    So that's about 0.1% of resource usage... even a tenfold increase in bandwidth will take it only up to a percent.
    So that's really not a significant point to raise!