


Networks


Originally published in Queue vol. 7, no. 4





Related:

Yonatan Sompolinsky, Aviv Zohar - Bitcoin's Underlying Incentives
The unseen economic forces that govern the Bitcoin protocol


Antony Alappatt - Network Applications Are Interactive
The network era requires new models, with interactions instead of algorithms.


Jacob Loveless - Cache Me If You Can
Building a decentralized web-delivery model


Theo Schlossnagle - Time, but Faster
A computing adventure about time through the looking glass



Comments

(newest first)

Displaying 10 most recent comments. Read the full list here

Nick Black | Sat, 27 Jun 2009 15:33:34 UTC

I'm dismayed that this article didn't mention Ulrich Drepper's "The Need for Asynchronous, Zero-Copy Network I/O" (http://people.redhat.com/drepper/newni-slides.pdf), especially given the recent history of APIs going where Drepper takes them (see NGPT vs. NPTL, although that was admittedly a matter of implementation rather than interface).


William Ward | Tue, 26 May 2009 21:12:37 UTC

I do not believe the multihoming argument was clear enough for some of the commenters. Multihoming != NIC bonding. Your mobile phone has interfaces on multiple disjoint networks. Having the option of advertising your mobile carrier based IP as well as a traditional LAN IP would be advantageous for both roaming as well as redundant accessibility.


mh | Fri, 22 May 2009 09:09:08 UTC

jhb is dead right. What kills high throughput is the difficulty of receiving. If you look at every high-performance alternative to the socket API, you will see they ALL have some kind of "preparation" step where the sender warns the receiver about the (big) data it is going to send, so the receiver stack can make arrangements in advance. This is the main issue preventing the socket API (and its underlying stack) from achieving high throughput; the socket API provides nothing more than a bare stream with absolutely zero advance notice and zero vision past the next byte, so the receiving side cannot come prepared.
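
The "preparation" step can be imitated at the application level with a length announcement. The sketch below is only an analogue of what the high-performance APIs do inside the stack: the 8-byte header format and the names `send_announced`, `recv_exact`, and `recv_announced` are assumptions made for illustration, and the kernel still copies every byte.

```python
import socket
import struct

# Hypothetical wire format: an 8-byte big-endian length, then the payload.
HEADER = struct.Struct("!Q")


def send_announced(sock, payload):
    """Announce the payload size, then send the payload itself."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, view):
    """Fill the writable memoryview completely, or raise on early EOF."""
    got = 0
    while got < len(view):
        n = sock.recv_into(view[got:])
        if n == 0:
            raise ConnectionError("peer closed before the announced size arrived")
        got += n


def recv_announced(sock):
    """Read the announcement, allocate once, then receive straight into it."""
    header = bytearray(HEADER.size)
    recv_exact(sock, memoryview(header))
    (size,) = HEADER.unpack(header)
    buf = bytearray(size)  # sized in advance because the sender warned us
    recv_exact(sock, memoryview(buf))
    return buf
```

With the size known up front, the receiver allocates a single buffer instead of growing one as bytes trickle in, which is the small-scale version of the receiver "coming prepared".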

Concerning reliable multicast there indeed will never be any silver bullet, just ad-hoc solutions. The problem is fundamentally too hard.


SimpleSimon | Wed, 20 May 2009 07:39:24 UTC

Author says: "it is quite impressive for an API to have remained in use and largely unchanged for 27 years".

Like so many others, he doesn't seem to have a clue where general purpose computing REALLY started - the IBM 360 series of mainframes, a full 20 years before "sockets". There are APIs from then that are still in common use today.

Heck - there are even PC-based interfaces older than that.

This guy is probably one of those who think Intel invented "virtual storage" when they introduced the 80386. Of course, the fact is that Intel joined that game 20 years after the big boys.


Jason P Sage | Wed, 20 May 2009 06:25:49 UTC

Well, I think IPv4 and IPv6 are fine. Sometimes the tendency for everyone to keep "fixing what isn't broken" in the name of new technology just doesn't gain anything... and instead antiquates software that was fine, well thought out, and had its bugs squashed... just to start fresh... with new bugs.

I think it's important to consider the "weight" on the infrastructure which is already pretty tanked due to spam and other useless data transmissions... like "STREAMING" constantly.

Protocol variations, regardless of the proposed "function calls" and the definitions of what they would do, would likely end up being used as a throttling mechanism if they were modified to "help streaming". In turn, people would make video game servers and web pages that piggybacked on the "streaming mechanism", as it would likely be given "more speed" for the simple fact that it was the streaming "pipe".

The fact is, when solid APIs are maintained, as with the UNIX file system and sockets, software matures... and you get a solid foundation. When the foundation is rock solid, you can TRULY stand on the shoulders of the developers before you and innovate!

Reinventing the wheel all the time and calling it new and awesome gets us nowhere and creates more IT chaos/incompatibilities/complexity.

I wish I could write a program that wouldn't be outdated as soon as I wrote it because IT is more about trend than substance these days.

Sockets, Unix file I/O... 27 years plus? Grep? WOW! That's decent software... that runs on practically any OS... That's how it should be done :)


jhb | Thu, 14 May 2009 13:43:38 UTC

The comments about using mmap'd buffers for socket buffers are close, but there are some problems. Sending data is the simplest. You can sort of do this now with sendfile(), but the real gain would be doing it with generated data rather than data in a pre-existing file. You can still do that by creating scratch files and doing your data generation into an mmap'd buffer of that file. However, then you run into the problem that the kernel does not provide notification when a sendfile() request is completely done with the data in the file. You could perhaps work around this by creating a new scratch file for each send (which you unlink() after creating it and munmap() after calling sendfile(), relying on the various reference counts to free it when sendfile() is done), but that is a good bit of overhead. With notification of completed sendfile() calls, you could perhaps amortize that overhead across many write()s by creating a single large, sparse scratch file for send buffers.
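
The per-send scratch-file workaround might look roughly like the sketch below. It assumes a POSIX system, where `tempfile.TemporaryFile()` creates a file that is already unlinked, and it uses Python's `socket.sendfile()` as a stand-in for the raw sendfile() call (Python falls back to ordinary sends where `os.sendfile` is unavailable). The name `send_generated` is hypothetical, and the sketch does nothing about the missing completion notification: it simply pays the create/map/unmap cost on every call.

```python
import mmap
import socket
import tempfile


def send_generated(sock, payload):
    """Generate data into an mmap'd scratch file, then hand it to sendfile.

    Returns the number of bytes sent.
    """
    if not payload:
        return 0  # mmap cannot map a zero-length file
    # On POSIX the temporary file has no directory entry, so the kernel
    # frees it once the last reference (ours or the kernel's) is gone.
    with tempfile.TemporaryFile() as scratch:
        scratch.truncate(len(payload))
        with mmap.mmap(scratch.fileno(), len(payload)) as buf:
            buf[:] = payload  # "data generation" into the mapped buffer
            buf.flush()
        # The mapping is closed (munmap) before the send; the file's pages
        # still back the transfer.
        return sock.sendfile(scratch, 0, len(payload))
```

The single large, sparse scratch file described above would avoid most of this per-call overhead, but only if the kernel reported when each region could be reused.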

Receiving is a good bit harder, however. The problem is that to really do zero copy, you want to use userland's mmap'd buffer as an RX buffer in the NIC's RX ring. However, you can't predict ahead of time which entries in the RX rings will be receiving packets for a given socket. You could perhaps make NICs smarter such that they have separate RX rings for specific sockets, but then you can run into issues with either 1) consuming more wired physical RAM for per-socket RX buffers or 2) dropping incoming packets because, even though you have the same amount of RX buffer space, not all of it is able to receive every packet as it is currently. The other approach for the RX case would be to map the RX buffers into userspace after the data was received. However, to avoid security concerns, multiple RX buffers could never share a physical page, since VM operates on physical pages (*BSD systems typically use 2048-byte buffers for RX, which means on x86 you have 2 buffers sharing a physical page). You would also need a way for userland to "release" the buffer back to the OS. A problem with this approach is that the VM operations to map/unmap buffers can be slow, especially if you are dealing with small buffers. If they weren't, we would all be using IO-Lite by now.

The one thing I will say about multicast UDP is that while it is indeed useful, many folks actually want multicast data that has TCP-like properties in terms of reliability and in-order delivery. Thus, you have N different flavors of solutions to deal with this (and none of them standardized that I am aware of). In some cases folks fall back to retransmitting lost packets via sideband channels. In the case of a one-way or very high latency link, the sender may simply choose to send all the data multiple times (using something like forward error correction (FEC)) and hope that the receiver gets at least one copy of each datum such that it can reassemble the stream. That said, if one came up with a silver bullet for reliable multicast (I'm not holding my breath), I don't think it would require a change to the existing socket API to support it.
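
The "send everything several times and hope one copy of each datum arrives" approach can be sketched on the receiving side as follows. This is plain repetition, not real forward error correction, and the `(sequence, chunk)` datagram layout and the name `reassemble` are assumptions for illustration only.

```python
def reassemble(datagrams, total):
    """Rebuild a stream from (sequence, chunk) pairs.

    The pairs may be duplicated, reordered, or partly missing. Returns the
    reassembled bytes, or None if any sequence number in range(total) never
    arrived.
    """
    chunks = {}
    for seq, chunk in datagrams:
        chunks.setdefault(seq, chunk)  # keep the first copy, ignore repeats
    if any(seq not in chunks for seq in range(total)):
        return None
    return b"".join(chunks[seq] for seq in range(total))
```

Everything here lives above recvfrom(), which fits the point that a reliable-multicast scheme would not need a change to the socket API itself.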

Regarding aio(4), the API is not great, and the implementations I am aware of aren't a performance gain. I know of two classes of implementations: 1) implement AIO in userland by spawning threads that basically do select() / read()/write() loops (or even synchronous I/O instead of select()), or 2) implement AIO in the kernel by spawning kernel threads that do synchronous I/O in the kernel. Neither case is really any different performance-wise from simply using select/poll/kevent/epoll with non-blocking sockets in userland. One possible benefit is that perhaps the programmer can write less code, but that may be debatable as AIO has its own complexity.


Herbie Robinson | Thu, 14 May 2009 09:59:48 UTC

Whether one uses a wait/read call or gets a callback when data comes in is a red herring. In order to provide proper protection, the OS is going to have to establish a protected address space, do CPU accounting, cross the kernel boundary and do some sort of scheduling whether it's a read or a callback. The only real difference between the models is which end of the stack you start executing at.

The socket API has had multi-homing for decades if one uses one socket per interface -- you select the interface by specifying an IP address when you bind. I guess if one could establish a single connection to/from multiple IP addresses, the driver could multiplex the data over all the interfaces (as opposed to having the application do it). That would be somewhat helpful. Of course there is no well established standardized way to get a list of interfaces -- that would seem more like the real obstacle to multi-homing.
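
Selecting an interface by binding to one of its addresses looks like this in a minimal sketch. The helper name `udp_socket_on` is hypothetical, and the caller has to supply the local addresses, since (as noted above) there is no well-established standard call for listing them. Strictly speaking, the bind fixes the local address; which link outgoing packets actually use is still up to the host's routing.

```python
import socket


def udp_socket_on(local_ip):
    """Create a UDP socket bound to one local address.

    Port 0 asks the OS to pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_ip, 0))
    return sock
```

A multi-homed application would call this once per local address and then multiplex across the resulting sockets itself, which is exactly the work the comment suggests a driver could take over.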

Note that bonding is not multi-homing because it only uses a single IP address and the second link is a hot standby (the definition of multi-homing is that there is more than one IP address on a computer). There is also something called link aggregation that will allow one to have multiple NICs for a single IP and use the NICs simultaneously for increased traffic. This is also not multi-homing.

Event loops (i.e., using select) can be sped up by putting the sockets in NDELAY mode and issuing the I/O calls until one gets EAGAIN (this will reduce the number of calls to select).
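
A minimal sketch of that pattern: the sockets are assumed to have been put in non-blocking mode already (`setblocking(False)`, the equivalent of NDELAY), and the helper names `drain` and `event_loop_once` are invented for the example.

```python
import select
import socket


def drain(sock, chunk_size=4096):
    """Read from a non-blocking socket until it would block or hits EOF.

    Returns (data, eof).
    """
    parts = []
    while True:
        try:
            chunk = sock.recv(chunk_size)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK: nothing left for now
            return b"".join(parts), False
        if not chunk:  # orderly shutdown by the peer
            return b"".join(parts), True
        parts.append(chunk)


def event_loop_once(socks, timeout=1.0):
    """One select() pass; every ready socket is drained before returning."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return {s: drain(s) for s in readable}
```

Draining each ready socket costs one extra recv() that fails with EAGAIN, but it replaces a select() round trip per chunk with a single one per burst of data.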

Memory copying is not a big issue with current CPUs: cache misses and locking are the big issues. I've been working for a while with a TCP stack that copies twice on output and twice on input, and the input copies are likely to be on different CPUs. All that copying is less expensive than the once-per-interrupt read of the hardware register that says what caused the interrupt! The other big hits are masking interrupts and locking. The data copies are a distant fourth in terms of cost.

What's really missing from the API is a way to read and write multiple packets in a single system call. Doesn't matter with TCP, but it does for any non-stream protocol.


Howard Chu | Thu, 14 May 2009 00:47:32 UTC

The author comes really close to making some good points, but misses the mark each time.

The socket API is not inherently 1:1 connection-oriented; it clearly has supported UDP and broadcast since its inception, and multicast since shortly thereafter. It's clear that multicast is the right answer for most of these net-wide streaming applications, but apparently there aren't many programmers who understand multicast; it seems George isn't even aware of it...

As others have already pointed out, VMS has an excellent async I/O API and it has had it for over 30 years. So one doesn't need to look too hard to find alternatives that work well. Granted, async I/O in POSIX is pathetic... It would be pretty trivial to extend a POSIX kernel to allow a user to create mmap'd buffers for use with sockets. The whole issue of "the kernel losing its mapped memory if the user process goes away" is a total red herring - you don't give a region of kernel memory to the user process, you give a region of mapped user memory to the kernel. I proposed an API based on this notion a couple of years ago; unfortunately, I haven't had the time to implement it yet. At any rate, select() is not part of the sockets API, so identifying select() as a weakness of the sockets API is pretty ridiculous.

http://www.openldap.org/lists/openldap-devel/200411/msg00088.html

re: multihoming - that's obviously a protocol issue more than an API issue; again, when the majority of your applications are built on TCP which requires 1:1 endpoints, you don't have any other choice. And as others have pointed out, when you want to take advantage of multihoming you can just use bond interfaces and forget about it.

Sure, the landscape has evolved and there may be areas in which some APIs could be improved. But saying the days of sockets are over is pretty far-fetched, and blaming the API for narrowing programmers' mindsets is over the top.

There are far more significant/important bottlenecks in software performance today than at the socket layer. Given the prevalence of massively bloated ultra-high-level-language applications out there, even a perfect zero-copy network stack implementation will yield zero measurable benefit to the end user.


Jaro | Thu, 14 May 2009 00:25:40 UTC

Interesting paper--it makes very valid points and is thought provoking.

But wasn't there an industry working group that was trying to create an async sockets definition some years ago? I think it was called the Open Group's Extended Socket API, or something like that. I would have liked this paper to address that work and explain whether (or why not) it dealt with the concerns raised here.

Also I seem to recall other things which were industry attempts to address some of these issues. (Event Ports may fit into this also--not sure.) Again, it would have been nice for the paper to more comprehensively address those attempts.

Sorry I'm vague in this comment, I'm a program manager working on storage now, and it's been years since I worked in the HP-UX kernel on async I/O interfaces.


Jeffrey | Wed, 13 May 2009 20:34:07 UTC

Windows has had overlapped IO for sockets since Windows 95/NT. Problem solved.









© 2018 ACM, Inc. All Rights Reserved.