One of the most pervasive and longest-lasting interfaces in software is the sockets API. Developed by the Computer Systems Research Group at the University of California at Berkeley, the sockets API was first released as part of the 4.1c BSD operating system in 1982. While there are longer-lived APIs—for example, those dealing with Unix file I/O—it is quite impressive for an API to have remained in use and largely unchanged for 27 years. The only major update to the sockets API has been the extension of ancillary routines to accommodate the larger addresses used by IPv6.2
The Internet and the networking world in general have changed in very significant ways since the sockets API was first developed, but in many ways the API has had the effect of narrowing the way in which developers think about and write networked applications. This article begins by briefly examining some of the conditions present when the sockets API was developed and considers how those conditions shaped the way in which networking code was written. The remainder of the article looks at ways in which developers have tried to get around some of the inherent limitations in the API and talks about the future of sockets in a changing networked world.
The two biggest differences between the networks of 1982 and 2009 are topology and speed. For the most part it is the increase in speed rather than the changes in topology that people notice. The maximum bandwidth of a commercially available long-haul network link in 1982 was 1.5 Mbps. The Ethernet LAN, which was being deployed at the same time, had a speed of 10 Mbps. A home user—and there were very few of these—was lucky to have a 300-bps connection over a phone line to any computing facility. The round-trip time between two machines on a local area network was measured in tens of milliseconds, and between systems over the Internet in hundreds of milliseconds, depending of course on location and the number of hops a packet would be subjected to when being routed between machines. (For a graphic of how the early Internet looked, visit personalpages.manchester.ac.uk/.../m.dodge/.../arpanet4.gif.)
The topology of networks at the time was relatively simple. Most computers had a single connection to a local area network; the LAN was connected to a primitive router that might have a few connections to other LANs and a single connection to the Internet. For one application to another application, the connection was either across a LAN or transiting one or more routers, called IMPs (Internet message passing).
The model of distributed programming that came to be most popularized by the sockets API was the client/server model, in which there is a server and a set of clients. The clients send messages to the server to ask it to do work on their behalf, wait for the server to do the work requested, and at some later point receive an answer. This model of computing is now so ubiquitous it is often the only model with which many software engineers are familiar. At the time it was designed, however, it was seen as a way of extending the Unix file I/O model over a computer network. One other factor that focused the sockets API down to the client/server model was that the most popular protocol it supported was TCP, which has an inherently 1:1 communication model.
The sockets API made the client/server model easy to implement because of the small number of extra system calls that programmers would need to add to their non-networked code so it could take advantage of other computing resources. While other models are possible, with the sockets API the client/server model is the one that has come to dominate networked computing.
Although the sockets API has more entry points than those shown here, it is these five that are central to the API and that differentiate it from regular file I/O:
|socket()||Create a communication endpoint|
|bind()||Bind the endpoint to some set of network layer parameters|
|listen()||Set a limit on the number of outstanding work requests|
|accept()||Accept one or more work requests from a client|
|connect()||Contact a server to submit a work request|
In reality the socket() call could have been dropped and replaced with a variant of open(), but this was not done at the time. The socket() and open() calls actually return the same thing to a program: a process-unique file descriptor that is used in all subsequent operations with the API. It is the simplicity of the API that has led to its ubiquity, but that ubiquity has held back the development of alternative or enhanced APIs that could help programmers develop other types of distributed programs.
Client/server computing had many advantages at the time that it was developed. It allowed many users to share resources, such as large storage arrays and expensive printing facilities, while keeping these facilities within the control of the same departments that had once run mainframe computing facilities. With this model of sharing, it was possible to increase the utilization of what at the time were expensive resources.
Three disparate areas of networking are not well served by the sockets API: low-latency or realtime applications; high-bandwidth applications; and multihomed systems—that is, those with multiple network interfaces. Many people confuse increasing network bandwidth with higher performance, but increasing bandwidth does not necessarily reduce latency. The challenge for the sockets API is giving the application faster access to network data.
The way in which any program using the sockets API sends and receives data is via calls to the operating system. All of these calls have one thing in common: the calling program must repeatedly ask for data to be delivered. In the world of client/server computing these constant requests make perfect sense, because the server cannot do anything without a request from the client. It makes little sense for a print server to call a client unless the client has something it wishes to print. What, however, if the service being provided is music or video distribution? In a media distribution service there may be one or more sources of data and many listeners. For as long as the user is listening to or viewing the media, the most likely case is that the application will want whatever data has arrived. Specifically requesting new data is a waste of time and resources for the application. The sockets API does not provide the programmer a way in which to say, "Whenever there is data for me, call me to process it directly."
Sockets programs are instead written from the viewpoint of a dearth of, rather than a wealth of, data. Network programs are so used to waiting on data that they use a separate system call, select(), so that they can listen to multiple sources of data without blocking on a single request. The typical processing loop of a sockets-based program isn't simply read(), process(), read(), but instead select(), read(), process(), select(). Although the addition of a single system call to a loop would not seem to add much of a burden, this is not the case. Each system call requires arguments to be marshaled and copied into the kernel, as well as causing the system to block the calling process and schedule another. If data were available to the caller when it invoked select(), then all of the work that went into crossing the user/kernel boundary would be wasted because a read() would have returned data immediately. The constant check/read/check is wasteful unless the time between successive requests is quite long.
Solving this problem requires inverting the communication model between an application and the operating system. Various attempts to provide an API that allows the kernel to call directly into a program have been proposed but none has gained wide acceptance—for a few reasons. The operating systems that existed at the time the sockets API was developed were, except in very esoteric circumstances, single threaded and executed on single-processor computers. If the kernel had been fitted with an up-call API, there would have been the problem of which context the call could have executed in. Having all other work on a system pause because the kernel was executing an up-call into an application would have been unacceptable, particularly in timesharing systems with tens to hundreds of users. The only place in which such a software architecture did gain currency was in embedded systems and networked routers where there were no users and no virtual memory.
The issue of virtual memory compounds the problems of implementing a kernel up-call mechanism. The memory allocated to a user process is virtual memory, but the memory used by devices such as network interfaces is physical. Having the kernel map physical memory from a device into a user-space program breaks one of the fundamental protections provided by a virtual memory system.
A couple of different mechanisms have been proposed and sometimes implemented on various operating systems to overcome the performance issues present in the sockets API. One such mechanism is zero-copy sockets. Anyone who has worked on a network stack knows that copying data kills the performance of networking protocols. Therefore, to improve the speed of networked applications that are more interested in high bandwidth than in low latency, the operating system is modified to remove as many data copies as possible.
Traditionally, an operating system performs two copies for each packet received by the system. The first copy is performed by the network driver from the network device's memory into the kernel's memory, and the second is performed by the sockets layer in the kernel when the data is read by the user program. Each of these copy operations is expensive because it must occur for each message that the system receives. Similarly, when the program wants to send a message, data must be copied from the user's program into the kernel for each message sent; then that data will be copied into the buffers used by the device to transmit it on the network.
Most operating-system designers and developers know that data copying is anathema to system performance and work to minimize such copies within the kernel. The easiest way for the kernel to avoid a data copy is to have device drivers copy data directly into and out of kernel memory. On modern network devices this is a result of how they structure their memory. The driver and kernel share two rings of packet descriptors—one for transmit and one for receive—where each descriptor has a single pointer to memory. The network device driver initially fills these rings with memory from the kernel. When data is received, the device sets a flag in the correct receive descriptor and tells the kernel, usually via an interrupt, that there is data waiting for it. The kernel then removes the filled buffer from the receive descriptor ring and replaces it with a fresh buffer for the device to fill. The packet, in the form of the buffer, then moves through the network stack until it reaches the socket layer, where it is copied out of the kernel when the user's program calls read(). Data sent by the program is handled in a similar way by the kernel, in that kernel buffers are eventually added to the transmit descriptor ring and a flag is then set to tell the device that it can place the data in the buffer on the network.
All of this work in the kernel leaves the last copy problem unsolved; several attempts have been made to extend the sockets API to remove this copy operation.3,1 The problem remains as to how memory can safely be shared across the user/kernel boundary. The kernel cannot give its memory to the user program, because at that point it loses control over the memory. A user program that crashes may leave the kernel without a significant chunk of usable memory, leading to system performance degradation. There are also security issues inherent in sharing memory buffers across the kernel/user boundary. At this point there is no single answer to how a user program might achieve higher bandwidth using the sockets API.
For programmers who are more concerned with latency than with bandwidth, even less has been done. The only significant improvement for programs that are waiting for a network event has been the addition of a set of kernel events that a program can wait on. Kernel events, or kevents(), are an extension of the select() mechanism to encompass any possible event that the kernel might be able to tell the program about. Before the advent of kevents(), a user program could call select() on any file descriptor, which would let the program know when any of a set of file descriptors was readable, writable, or had an error. When programs were written to sit in a loop and wait on a set of file descriptors—for example, reading from the network and writing to disk—the select() call was sufficient, but once a program wanted to check for other events, such as timers and signals, select() no longer served. The problem for low-latency applications is that kevents() do not deliver data; they deliver only a signal that data is ready, just as the select() call did. The next logical step would be to have an event-based API that also delivers data. There is no good reason to have the application cross the user/kernel boundary twice simply to get the data that the kernel knows the application wants.
The sockets API not only presents performance problems to the application writer, but also narrows the type of communication that can take place. The client/server paradigm is inherently a 1:1 type of communication. Although a server may handle requests from a diverse group of clients, each client has only one connection to a single server for a request or set of requests. In a world in which each computer had only one network interface, that paradigm made perfect sense. A connection between a client and server is identified by a quad of <Source IP, Source Port, Destination IP, Destination Port>. Since services generally have a well-known destination port (e.g., 80 for HTTP), the only value that can easily vary is the source port, since the IP addresses are fixed.
In the Internet of 1982 each machine that was not a router had only a single network interface, meaning that to identify a service, such as a remote printer, the client computer needed a single destination address and port and had, itself, only a single source address and port to work with. The idea that a computer might have multiple ways of reaching a service was too complicated and far too expensive to implement. Given these constraints, there was no reason for the sockets API to expose to the programmer the ability to write a multihomed program—one that could manage which interfaces or connections mattered to it. Such features, when they were implemented, were a part of the routing software within the operating system. The only way programs could, eventually, get access to them was through an obscure set of nonstandard kernel APIs called a routing socket.
On a system with multiple network interfaces it is not possible, using the standard sockets API, to write an application that can easily be multihomed—that is, take advantage of both interfaces so that if one were to fail, or if the primary route over which the packets were flowing were to break, the application would not lose its connection to the server.
The recently developed SCTP (Stream Control Transport Protocol)4 incorporates support for multihoming at the protocol level, but it is impossible to export this support through the sockets API. Several ad-hoc system calls were initially provided and are the only way to access this functionality. At the moment this is the only protocol that has both the capacity and user demand for this feature, so the API has not been standardized across more than a few operating systems. The table here lists the APIs that SCTP added.
|sctp_bindx()||Bind or unbind an SCTP socket to a list of addresses|
|sctp_connectx()||Connect an SCTP socket with multiple destination addresses|
|sctp_generic_recvmsg()||Receive data from a peer|
|sctp_generic_sendmsg(), sctp_generic_sendmsg_iov()||Send data to a peer|
|sctp_getaddrlen()||Return the address length of an address family|
|sctp_getassocid()||Return an association ID for a specified socket address|
|sctp_getpaddrs(), sctp_getladdrs()||Return a list of addresses to the caller|
|sctp_peeloff()||Detach an association from a one-to-many socket to a separate file descriptor|
|sctp_sendx()||Send a message from an SCTP socket|
|sctp_sendmsgx()||Send a message from an SCTP socke|
While this list of functions contains more APIs than are strictly necessary, it is important to note that many are derivatives of preexisting APIs, such as send(), which need to be extended to work in a multihoming world. The set of APIs needs to be harmonized to make multihoming a first-class citizen in the sockets world. The problem now is that sockets are so successful and ubiquitous that it is very hard to change the existing API set for fear of confusing its users or the preexisting programs that use it.
As systems come to have more network interfaces built in, the ability to write applications that take advantage of multihoming will be an absolute necessity. One can easily imagine the use of such technology in a smartphone, which already has three network interfaces: its primary connection via the cellular network, a WiFi interface, and often a Bluetooth interface as well. There is no reason for an application to lose connectivity if even one of these network interfaces is working properly. The problem for application designers is that they want their code to work, with few or no changes, across a plethora of devices, from cellphones, to laptops, to desktops, etc. With properly defined APIs we would remove the artificial barrier that prevents this. It is only because of the history of the sockets API and the fact that it has been "good enough" to date that this need has not yet been addressed.
High bandwidth, low latency, and multihoming are driving the development of alternatives to the sockets API. With LANs now reaching 10 Gbps, for many applications client/server-style communication is far too inefficient to use the available bandwidth. The communication paradigms supported by the sockets API must be expanded to allow for memory sharing across the kernel boundary, as well as for lower-latency mechanisms to deliver data to applications. Multihoming must become a first-class feature of the sockets API because devices with multiple active interfaces are becoming the norm for networked systems.
LOVE IT, HATE IT? LET US KNOW
GEORGE V. NEVILLE-NEIL (email@example.com) is a columnist for ACM Queue and Communications of the ACM, as well as a member of the Queue Editorial Board. He works on networking and operating-system code and teaches courses on various subjects related to programming.
© 2009 ACM 1542-7730 /09/0200 $5.00
Originally published in Queue vol. 7, no. 4—
see this item in the ACM Digital Library
Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, Van Jacobson - BBR: Congestion-Based Congestion Control
Measuring bottleneck bandwidth and round-trip propagation time
Josh Bailey, Stephen Stuart - Faucet: Deploying SDN in the Enterprise
Using OpenFlow and DevOps for rapid development
Amin Vahdat, David Clark, Jennifer Rexford - A Purpose-built Global Network: Google's Move to SDN
A discussion with Amin Vahdat, David Clark, and Jennifer Rexford
Harlan Stenn - Securing the Network Time Protocol
Crackers discover how to use NTP as a weapon for abuse.
Displaying 10 most recent comments. Read the full list hereI'm dismayed that this article didn't mention Ulrich Drepper's "The Need for Asynchronous, Zero-Copy Network I/O" (http://people.redhat.com/drepper/newni-slides.pdf), especially given a recent history of API's going where Drepper takes them (see NGPT vs NPTL (although this was admittedly implementation rather than interface)).
Concerning reliable multicast there indeed will never be any silver bullet, just ad-hoc solutions. The problem is fundamentally too hard.
Like so many others, he doesn't seem to have a clue where general purpose computing REALLY started - the IBM 360 series of mainframes, a full 20 years before "sockets". There are APIs from then that are still in common use today.
Heck - there are even PC-based interfaces older than that.
THis guy is probably one of those that thinks Intel invented "virtual storage" when they introduced the 80386. Of course, fact is, Intel joined that game 20 years after the big boys.
I think it's important to consider the "weight" on the infrastructure which is already pretty tanked due to spam and other useless data transmissions... like "STREAMING" constantly.
Protocol variations - regardless of the proposed "function calls" and definations of what these would do, if modified to "help streaming" would likely be utilized as a throttling mechanism. In turn, people would make video game servers and web pages that piggy backed the "Streaming Mechanism" as it would likely be given "more speed" for the simple fact it was the streaming "pipe".
The fact is, when solid API's are maintained as with UNIX file system and Sockets, software matures.. and you get a solid foundation. When the foundation is rock solid, you can TRULY stand on the shoulders of the developers before you and innovate!
Rewriting the wheel all the time and calling it new and awesome gets us no where and creates more IT chaos/incompatibilities/complexity.
I wish I could write a program that wouldn't be outdated as soon as I wrote it because IT is more about trend than substance these days.
Sockets, Unix Fileio... 27 years plus? Grep? WOW! That's decent software.. that runs on any OS practically... That's how it should be done :)
Receiving is a good bit harder, however. The problem is that to really do zero copy, you want to use userland's mmap'd buffer as a RX buffer in the NIC's RX ring. However, you can't predict ahead of time which entries in the RX rings will be receiving packets for a given socket. You could perhaps make NICs smarter such that they have separate RX rings for specific sockets, but then you can run into issues with either 1) consuming more wired physical RAM for per-socket RX buffers or 2) dropping incoming packets because even though you have the same amount of RX buffer space, not all of it is able to receive every packet as they do currently. The other approach for the RX case would be to map the RX buffers into userspace after the data was received. However, to avoid security concerns, multiple RX buffers could never share a physical page (since VM operates on physical pages) (*BSD systems typically use 2048 byte buffers for RX which means on x86 you have 2 buffers sharing a physical page). You would also need a way for userland to "release" the buffer back to the OS. A problem with this approach is that the VM operations to map/unmap buffers can be slow, especially if you are dealing with small buffers. If it wasn't, we would all be using IO-Lite by now.
The one thing I will say about multicast UDP is that while it is indeed useful, many folks actually want multicast data that has TCP-like properties in terms of reliability and in-order delivery. Thus, you have N different flavors of solutions to deal with this (and none of them standardized that I am aware of). In some cases folks fall back to retransmitting lost packets via sideband channels. In the case of a one-way or very high latency link, the sender may simply choose to send all the data multiple times (using something like forward error correction (FEC)) and hope that the receiver gets at least one copy of each datum such that it can reassemble the stream. That said, if one came up with a silver bullet for reliable multicast (I'm not holding my breath), I don't think it would require a change to the existing socket API to support it.
Regarding aio(4), the API is not great, and the implementations I am aware of aren't a performance gain. I know of two classes of implementations: 1) implement AIO in userland by spawning threads that basically do select() / read/write() loops (or even synchronous I/O instead of select()), or 2) implement AIO in the kernel by spawning kernel threads that do synchronous I/O in the kernel. Both cases aren't really any different performance-wise than simply using select/poll/kevent/epoll with non-blocking sockets in userland. One possible benefit is that perhaps the programmer can write less code, but that may be debatable as AIO has its own complexity.
The socket API has had multi-homing for decades if one uses one socket per interface -- you select the interface by specifying an IP address when you bind. I guess if one could establish a single connection to/from multiple IP addresses, the driver could multiplex the data over all the interfaces (as opposed to having the application do it). That would be somewhat helpful. Of course there is no well established standardized way to get a list of interfaces -- that would seem more like the real obstacle to multi-homing.
Note that bonding is not multi-homing because it only uses a single IP address and the second link is a hot standby (the definition of multi-homing is that there is more than one IP address on a computer). There is also something called link aggregation that will allow one to have multiple NICs for a single IP and use the NICs simultaneously for increased traffic. This is also not multi-homing.
Event loops (i.e., using select) can be sped up by putting the sockets in NDELAY mode and issuing the I/O calls until one gets EAGAIN (this will reduce the number of calls to select).
Memory copying is not a big issue with current CPUs: Cache misses and locking are the big issues. I've been working for a while with a TCP stack that copies twice on output and twice on input and the input copies are likely to be on different CPUs. All that copying is less expensive than the once per interrupt read of the hardware register that says what caused on the interrupt! The other big hits are masking interrupts and locking. The data copies are a distant forth in terms of cost.
What's really missing from the API is a way to read and write multiple packets in a single system call. Doesn't matter with TCP, but it does for any non-stream protocol.
The socket API is not inherently 1:1 connection-oriented; it clearly has supported UDP and broadcast since its inception, and multicast since shortly thereafter. It's clear that multicast is the right answer for most of these net-wide streaming applications, but apparently there aren't many programmers who understand multicast; it seems George isn't even aware of it...
As others have already pointed out, VMS has an excellent async I/O API and it has had it for over 30 years. So one doesn't need to look too hard to find alternatives that work well. Granted, async I/O in POSIX is pathetic... It would be pretty trivial to extend a POSIX kernel to allow a user to create mmap'd buffers for use with sockets. The whole issue of "the kernel losing its mapped memory if the user process goes away" is a total red herring - you don't give a region of kernel memory to the user process, you give a region of mapped user memory to the kernel. I proposed an API based on this notion a couple years ago, unfortunately I haven't had the time to implement it yet. At any rate, select() is not part of the sockets API, so identifying select() as a weakness of the sockets API is pretty ridiculous.
re: multihoming - that's obviously a protocol issue more than an API issue; again, when the majority of your applications are built on TCP which requires 1:1 endpoints, you don't have any other choice. And as others have pointed out, when you want to take advantage of multihoming you can just use bond interfaces and forget about it.
Sure, the landscape has evolved and there may be areas in which some APIs could be improved. But saying the days of sockets are over is pretty far-fetched, and blaming the API for narrowing programmers' mindsets is over the top.
There are far more significant/important bottlenecks in software performance today than at the socket layer. Given the prevalence of massively bloated ultra-high-level-language applications out there, even a perfect zero-copy network stack implementation will yield zero measurable benefit to the end user.
But wasn't there an industry working group that was trying to create an async sockets definition some years ago? I think it was called the Open group's Extended Socket API, or something like that. I would have liked for this paper to have addressed that work, to explain if (or why not) it addressed the concerns here.
Also I seem to recall other things which were industry attempts to address some of these issues. (Event Ports may fit into this also--not sure.) Again, it would have been nice for the paper to more comprehensively address those attempts.
Sorry I'm vague in this comment, I'm a program manager working on storage now, and it's been years since I worked in the HP-UX kernel on async I/O interfaces.
Displaying 10 most recent comments. Read the full list here