
TCP Offload to the Rescue

Andy Currid, iReady

Getting a toehold on TCP offload engines—and why we need them

In recent years, TCP/IP offload engines, known as TOEs, have attracted a good deal of industry attention and a sizable share of venture capital dollars. A TOE is a specialized network device that implements a significant portion of the TCP/IP protocol in hardware, thereby offloading TCP/IP processing from software running on a general-purpose CPU. This article examines the reasons behind the interest in TOEs and looks at challenges involved in their implementation and deployment.


Both TCP and IP are hardware-independent networking communication protocols defined by the IETF (Internet Engineering Task Force).1 TCP/IP is the predominant protocol suite for information exchange across the Internet; as such, the performance and efficiency of networking applications running over TCP/IP are of great interest. To understand the motivations behind TCP/IP offload, it’s important to realize that TCP/IP operates as one element within a network infrastructure that consists of many additional elements. For example:

• Networking interconnects, such as copper, optical, and wireless links.

• CPU, memory, and network interface hardware within nodes on the network.

• Operating system and application software running on these nodes.

All of these elements can have a significant effect on how applications operate over TCP/IP, independently of TCP/IP itself.


Let’s start by looking at the capacity of the networks over which TCP/IP operates. Local area networks are predominantly implemented using Ethernet, which has seen steady growth from 10 Mbps (megabits per second) in 1990, through 100 Mbps in the mid-1990s, to 1 Gbps (gigabit per second) today. Specifications for 10-Gbps Ethernet were ratified in 2002, but the technology has yet to achieve widespread deployment.

For each speed upgrade, widespread deployment has lagged initial introduction by several years. Gigabit Ethernet was introduced in 1998 but achieved widespread deployment only in the last year or two. Not coincidentally, this period saw large reductions in cost; the typical price for a Gigabit Ethernet NIC (network interface card) has fallen from around $500 at introduction to a few tens of dollars today.

Demand for network bandwidth has also increased. While an office desktop machine doesn’t yet require gigabit bandwidth, even modest-size organizations find that centralized file, Web, and database servers require gigabit connectivity to provide timely response to clients. Midrange and enterprise-class servers typically support multiple gigabit interfaces. The challenge is to operate those interfaces at capacity, while still providing sufficient computing power to run server applications—at a price that isn’t prohibitive.


The processing power of CPUs that run TCP/IP protocol software has also increased over time. Since Intel’s Pentium line of CPUs was launched, typical CPU clock speeds have increased about 5,000 percent from 60 MHz (megahertz) to around 3 GHz (gigahertz).

This increase in processing power appears to match the growth in network capacity, but the demands of applications that are dependent on TCP/IP networking have grown just as fast, if not faster. Moreover, research indicates that as CPU speeds increase, their ability to drive TCP/IP traffic increases at a slower rate.

Figure 1 illustrates this nonlinearity. TCP throughput and host CPU utilization were measured across a number of transfer sizes, using 800-MHz and 2.4-GHz CPUs running in otherwise identical hardware configurations. A relative cost was derived as follows:

Relative Cost = (CPU utilization % * CPU speed in MHz) / Throughput in Mbps

with the aim of showing how much compute power is required per megabit of networking throughput. (A larger number indicates a higher cost per megabit). While the 2.4-GHz CPU always matched or outperformed the 800-MHz CPU in absolute throughput, its performance does not scale linearly with the tripling of clock speed. In all cases, the relative CPU cost per megabit of throughput was higher for the faster CPU, in some cases more than double that observed at 800 MHz.
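The derivation is simple arithmetic; as a sketch (the measurement values here are hypothetical, for illustration only):

```python
# Relative cost per the formula above: compute power (MHz) consumed
# per megabit per second of TCP throughput.
def relative_cost(cpu_util_pct, cpu_mhz, throughput_mbps):
    return cpu_util_pct * cpu_mhz / throughput_mbps

# Hypothetical measurements, for illustration only:
slow = relative_cost(90, 800, 600)    # 800-MHz CPU, 90% busy, 600 Mbps
fast = relative_cost(60, 2400, 900)   # 2.4-GHz CPU, 60% busy, 900 Mbps

# The faster CPU moves more data in absolute terms, yet pays more per megabit:
assert slow == 120.0 and fast == 160.0 and fast > slow
```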

The principal reason for this diminishing return as CPU speeds increase is the divergence between CPU performance and that of memory and I/O subsystems.2 As CPU speeds have increased, so have memory and I/O speeds, but at a much slower rate. As a result, proportionally more CPU cycles are spent stalling for memory accesses to complete; these stall cycles show up directly as CPU utilization. For reasons we’ll explore later in detail, conventional TCP/IP processing is memory-intensive, so it is particularly sensitive to this effect.

The memory performance shortfall can be offset to some extent by using higher-performance memory cache hierarchies to hide memory latency and by using techniques such as hardware multithreading, which allows alternate execution threads to run while memory accesses for a stalled thread complete. Scaling cache memory to meet performance requirements is not particularly cost effective, however, particularly with networking applications that constantly churn data rather than iterating over the same data set.

Hardware multithreading provides a performance increase only when multiple execution threads are ready to work, and this is highly dependent on the nature of the application. For example, hardware multithreading is likely to offer some benefit for networked applications that support large numbers of relatively low-bandwidth clients by using pools of threads (Web servers, database servers). Conversely, it will offer little benefit in applications that support small numbers of high-bandwidth clients by a correspondingly small number of threads (block storage and clustering applications).


The ubiquity, performance, and low cost of Gigabit Ethernet hardware have made it attractive for use in applications that have traditionally used specialized (and correspondingly expensive) hardware other than Ethernet. SANs (storage area networks), which carry block storage traffic between servers and storage systems, have traditionally been implemented using the SCSI protocol over Fibre Channel fabrics. The iSCSI specification defines a way to route the same SCSI protocol over TCP/IP. Similarly, RDMA (remote direct memory access) technology running over TCP/IP promises to deliver cluster-based messaging and I/O performance that is comparable to that currently achieved with proprietary messaging protocols and interconnect hardware.

A common factor among these applications is that the incumbent technologies (Fibre Channel, proprietary clustering hardware) achieve very high performance at extremely low CPU overhead. Lower-performing TCP/IP-based alternatives may be adequate for some SAN and clustering applications, but to provide a direct alternative to the incumbent technologies, TCP/IP-based alternatives will require TCP offload.


The 10x performance increase offered by 10-Gigabit Ethernet makes it attractive for demanding applications such as storage networks and clustered server interconnects. There is currently no widespread adoption of 10-Gigabit Ethernet, in part because of the high prices of 10-gigabit NICs and network infrastructure equipment. Even if 10-gigabit equipment were cheaper, driving TCP traffic at 10-gigabit rates with conventional CPUs and memory technology is still very expensive. The memory performance bottleneck, troublesome at gigabit rates, has the potential to become a showstopper at 10-gigabit rates. TCP/IP offload offers a viable path to solving this problem, though longer-term solutions are likely to involve changes that go considerably beyond simply offloading the TCP/IP protocol suite.


There are clearly applications for which TCP/IP offload makes good sense, and some for which it appears to be mandatory. What are the key challenges to be overcome in order to bring TCP/IP offload technology to market?


Though it’s possible to dismiss implementation cost as a marketing concern (we’re engineers, right?), this would miss one of the critical aspects of developing networking technologies in general, and TCP/IP offload technology in particular. After all, the runaway success of Ethernet technology is mostly a result of its relatively low cost in terms of hardware and ease of deployment. To proliferate, TOEs have to provide solutions that conventional CPUs cannot provide, or can provide only at much higher cost.

So what are the factors that determine the cost of a TOE? The gate count of the design is a function of the design’s overall complexity, and it plays a large part in overall cost because it determines the silicon die size. The die size, together with the number of I/O pins required by the design, determines the size of the physical package the silicon requires. The package size, along with any required external support chips such as memory or Ethernet physical interface components, determines how much circuit board space is required to accommodate the TOE.

Two of these factors—design complexity and requirements for external memory—differentiate TOEs from existing Ethernet devices. To understand these better, let’s take a more detailed look at TCP/IP and its environment.


TCP aims to provide reliable delivery of data over networks that are inherently unreliable. The features TCP uses are moderately complex to implement in software and more so in hardware; this is especially true when compared with protocols such as Fibre Channel, which were designed to operate over reliable networks with hardware implementation in mind. An exhaustive treatment of TCP’s features is beyond the scope of this article; what follows is a snapshot intended to illustrate some of the implementation challenges for TCP offload.

Data flow in TCP is managed over connections that provide bidirectional byte stream communication between two peer endpoints. The endpoints maintain synchronization within the byte stream by means of a 32-bit sequence number that is incremented per byte transmitted. Each endpoint positively acknowledges received data by including the sequence number of the last byte successfully received in every packet it transmits to the peer endpoint.
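Because the sequence space is only 32 bits wide, it wraps around, so implementations must compare sequence numbers with modular arithmetic. A minimal sketch (the helper name is assumed, not from any real stack):

```python
# TCP sequence numbers live in a 32-bit space and wrap, so "does a
# follow b?" must be answered modulo 2**32.
MOD = 2 ** 32

def seq_after(a, b):
    """True if sequence number a logically follows b."""
    return a != b and (a - b) % MOD < 2 ** 31

assert seq_after(100, 50)           # ordinary case
assert seq_after(1, 0xFFFFFFFF)     # wrapped: 1 follows 4294967295
assert not seq_after(0xFFFFFFFF, 1)
```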

Flow control at the receiver is achieved by means of a sliding receive window. This is a byte count of how much data the endpoint is prepared to accept from the peer, and it is included in every data packet transferred between the peers. The receive window shrinks (“closes”) as data is transmitted to a peer, and expands (“opens”) as the peer consumes that data. If an endpoint consumes data slowly enough, the receive window may shrink to zero, in which case its peer cannot transmit any more data until the peer consumes the data and reopens the window. A transmitter that has data to send but has encountered a zero window periodically tests for the window opening again by means of window probe packets.
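The window accounting can be sketched in a few lines (class and method names are illustrative, not taken from any real stack):

```python
# A sketch of TCP receive-window accounting.
class ReceiveWindow:
    def __init__(self, size):
        self.size = size      # bytes the endpoint is prepared to accept
        self.buffered = 0     # received but not yet consumed

    def advertised(self):
        """The window value carried in every packet sent to the peer."""
        return self.size - self.buffered

    def on_receive(self, nbytes):   # window "closes" as data arrives
        assert nbytes <= self.advertised()
        self.buffered += nbytes

    def on_consume(self, nbytes):   # window "opens" as the app reads
        self.buffered -= nbytes

w = ReceiveWindow(65536)
w.on_receive(65536)
assert w.advertised() == 0          # peer must now resort to window probes
w.on_consume(16384)
assert w.advertised() == 16384      # consuming data reopens the window
```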

In addition to potential errors occurring at the peer endpoint, TCP must cope with packets being dropped by the intervening network as a result of congestion or error conditions. To do this, it incorporates two mechanisms to detect when packets have been dropped and require retransmission: failure to receive an acknowledgment within a certain timeout, or receipt of multiple acknowledgments for the same data. When retransmitting data, TCP uses a back-off algorithm for successive retransmits, eventually giving up after several minutes.

The acknowledgment timeout used to trigger retransmission is based on a dynamically calculated RTT (round-trip time) estimate that TCP maintains for each active connection. In addition to detecting dropped packets, TCP incorporates flow control mechanisms—the slow-start and congestion window mechanisms—which dynamically limit the rate at which the transmitter sends data into the network when it detects that packets are being dropped.
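The estimator most stacks use is the smoothed mean-and-deviation scheme codified in RFC 2988; a simplified sketch (initial-sample handling and clock granularity are omitted):

```python
# RTT smoothing and retransmission timeout in the style of RFC 2988.
ALPHA, BETA = 1 / 8, 1 / 4   # gains for the mean and deviation estimators

def update_rto(srtt, rttvar, sample):
    """Fold one RTT sample (seconds) into the estimators; return new state."""
    rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
    srtt = (1 - ALPHA) * srtt + ALPHA * sample
    rto = max(1.0, srtt + 4 * rttvar)   # RFC 2988 floors the RTO at 1 second
    return srtt, rttvar, rto

srtt, rttvar, rto = update_rto(0.100, 0.050, 0.100)
assert rto == 1.0   # short, steady RTTs still hit the 1-second floor
```

On each successive retransmission of the same data, the RTO is typically doubled (the back-off mentioned above) until the connection is abandoned.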

As network data rates have increased, the basic acknowledgment and retransmission mechanisms of TCP have proved to be somewhat inefficient; loss of a single packet can result in retransmission of many packets. To address this, optional mechanisms have been introduced: the timestamp option allows for greater accuracy in estimation of RTT, and the SACK (selective acknowledgment) option provides finer granularity for data retransmission.

The TCP features described take a considerable amount of gate complexity to implement in silicon. A key challenge for TOE design is determining which TCP features to implement in hardware to achieve high performance, and which can remain in software—to reduce design complexity and to ensure security against denials of service. (The security aspect of TOE implementation is a subject in itself, beyond the scope of this article.)

Aside from the logic complexity required to implement the various timers, counters, and algorithms, the TOE must also maintain the associated state information for each offloaded connection. This state information is typically about 256 bytes per connection, so a TOE that aims to support 64,000 simultaneous connections needs to maintain about 16 MB of connection state information.


Software, while certainly not stupid, changes at a relatively slow pace compared with hardware. This may appear counterintuitive, as it’s generally easier to update software than it is to develop new hardware. In reality, software forms a complex hierarchy, ranging from the lowest layers of the OS kernel and hardware device interfaces, through kernel services such as TCP/IP, to user-space applications such as Web and database servers. Each layer in this hierarchy abstracts itself to the ones above by means of an API (application programming interface). The relative stability of these APIs over time, especially at the user-space application layer, has made development of software easier and more cost effective.

Figure 2 shows a simplified view of the networking software hierarchy found in conventional systems. Key points include the sockets API that applications use to access the networking facilities of the operating system, and the network hardware device driver interface that links a specific hardware driver into the OS networking stack.


The sockets API (also known as just sockets) was pioneered in the University of California’s Berkeley Software Distribution of Unix; a derivative of the sockets API is included in all significant operating systems sold today, and the majority of networked applications are written to use this API. Sockets are straightforward to use, hiding many aspects of the underlying network stack from the application, but this ease of use comes at a price.

When receiving data, a sockets-based application need not take any account of the TCP and IP headers that accompany the packet when it arrives from the network. In a conventional stack, this is achieved by staging received data in kernel buffers and copying just the data payload to the application’s buffers. The kernel receive buffers also support another convenient feature of sockets: the application need not supply receive buffers to hold inbound data before that data is received from the network. If data should arrive prior to the application supplying receive buffers, it is buffered in the kernel and copied to the application’s buffers when the application eventually supplies them.

Similarly, when transmitting data, the application need not wait for the data to be acknowledged by the remote peer before discarding its own local copy. When accepting data for transmission from the application, the kernel immediately copies the data into kernel buffers, where it remains available for retransmission until acknowledged by the peer.
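Both conveniences are visible through the sockets API itself. In this sketch, a connected socket pair stands in for a real TCP connection (the kernel-buffering semantics are the same):

```python
import socket

# A connected pair of stream sockets stands in for a real TCP connection.
a, b = socket.socketpair()

# Transmit side: sendall() returns once the kernel has copied the data
# into its own buffers; the application may reuse its buffer immediately.
a.sendall(b"hello")

# Receive side: no buffer was supplied before the data arrived; the kernel
# staged it, and copies it out only when the application asks.
data = b.recv(5)
assert data == b"hello"
a.close(); b.close()
```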

These features normally lead to data crossing the host’s memory interface three times as it moves between the application and the network interface hardware. The data copy between application and kernel memory buffers requires a read followed by a write (two memory interface crossings), and the networking hardware’s direct read or write access to kernel memory results in another memory interface crossing.

Buffer copies on the transmit data path can be reduced by using novel memory management techniques, but these don’t work so well on the receive data path. Conventional systems therefore require around 375 MBps (megabytes per second) of memory bandwidth, three times the gigabit line rate, just to support data reception on a single gigabit interface. For 10-Gigabit Ethernet, this number jumps to 3.75 GBps.
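The arithmetic behind those figures, as a sketch:

```python
# Back-of-envelope: each memory-interface crossing carries the full
# link rate, so three crossings at 1 Gbps cost 3 Gbps of raw memory
# bandwidth, i.e. 375 megabytes per second.
def rx_memory_bw_mbytes(link_gbps, crossings=3):
    return crossings * link_gbps * 1e9 / 8 / 1e6   # MB per second

assert rx_memory_bw_mbytes(1) == 375.0      # gigabit Ethernet
assert rx_memory_bw_mbytes(10) == 3750.0    # 10-gigabit Ethernet
```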


TOE devices are generally capable of parsing and removing TCP and IP headers from inbound data before passing that data to the host, which removes one of the reasons for staging received data in kernel memory. If the TOE has sufficient buffer memory of its own, it can buffer inbound data in cases where the application has not yet supplied receive buffers, and transfer the data directly to the application’s buffers when they become available. Similarly, data transmitted by the application may be copied directly to transmit buffers on the TOE, rather than being staged in kernel buffers. Any data retransmissions are then serviced from the TOE’s buffers. (In both of these cases, there’s an assumption that the application’s buffers can be locked in memory to prevent them from being modified while the TOE is accessing them.)

It’s worth noting that this technique doesn’t remove memory copies from the networking data flow—it simply relocates them to the TOE’s memory subsystem. That’s still a win; it’s often cheaper to implement a small, high-performance memory subsystem specifically for TCP/IP traffic, rather than improve the memory performance of the entire computing platform. But what are the performance and size requirements for TOE buffer memory?

Data crosses the TOE’s memory interface once when it moves into the TOE from the network or host, and once when it moves out of the TOE, destined for the network or host. This leads to a 2x bandwidth requirement, as compared with the 3x requirement seen on host memory without TOE. To support the copy avoidance techniques just described, the required TOE memory size is proportional to the number of active sockets to be supported and the amount of data that may need to be buffered for each of those sockets. If we assume a maximum receive window size per socket of 64 KB, then 1 MB of TOE buffer memory is required to support inbound data for every 16 offloaded sockets. Various techniques can be used to lower this memory requirement, such as dropping packets when the TOE runs out of buffer memory, or forwarding those packets to the software TCP/IP stack to be handled conventionally. Again, there’s no free lunch—both of these techniques lead to a reduction in performance under high load conditions, which is precisely where TOE support is most needed.
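The sizing arithmetic, as a sketch (function name assumed):

```python
# Buffer memory a TOE needs to back full receive windows for its
# offloaded sockets (64-KB window per socket assumed, as in the text).
def toe_buffer_mb(sockets, window_kb=64):
    return sockets * window_kb / 1024

assert toe_buffer_mb(16) == 1.0         # 1 MB per 16 sockets, as above
assert toe_buffer_mb(64000) == 4000.0   # full windows for 64K sockets: ~4 GB
```

The second figure shows why real designs must either cap per-socket buffering or fall back to dropping or forwarding packets under load.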


Though it’s clear that use of TOE buffer memory can improve TCP performance at gigabit rates, the memory size requirements may become prohibitive at 10-gigabit rates. This is due to the nature of TCP flow-control mechanisms. To maintain a steady flow of data, receivers must advertise suitably large receive windows to allow transmitters to continually transmit data while still awaiting acknowledgment of previously sent data. The window size required to achieve maximum throughput is a function of how much data can be buffered in the network itself—which is in turn a function of the product of the network bandwidth and the network’s round-trip time (the network’s bandwidth-delay product). Without going into specifics, it’s obvious that for 10-gigabit links, this product is 10 times that of one-gigabit links. Because of the higher bandwidth-delay product, a TOE design operating on a 10-gigabit link would require 10 times the memory of the same TOE design on a one-gigabit link, if the TOE is to support maximum throughput.
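A sketch of that sizing (the round-trip time here is an assumed value, purely for illustration):

```python
# Window needed to keep a pipe full = bandwidth x round-trip time
# (the bandwidth-delay product).
def bdp_bytes(link_gbps, rtt_ms):
    bytes_per_ms = link_gbps * 1e9 / 8 / 1e3   # bytes the link moves per ms
    return bytes_per_ms * rtt_ms

# With an assumed 1-ms round trip:
assert bdp_bytes(1, 1.0) == 125_000.0       # ~122 KB in flight at 1 Gbps
assert bdp_bytes(10, 1.0) == 1_250_000.0    # ten times that at 10 Gbps
```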

The need for TOE buffer memory can be greatly reduced if we can make sure that the TOE can always place data directly into the application buffers. This capability is commonly referred to as DDP (direct data placement). The benefits of DDP are well understood from its use in existing technologies such as Fibre Channel and SCSI storage hardware; SCSI adapters routinely achieve throughput approaching 3 Gbps, at minimal CPU utilization and with minimal buffer memory on the adapter.

DDP is much harder to achieve with network applications over TCP/IP (data can arrive out of order, for example), because of the nature of the sockets API that applications use. One protocol that does achieve DDP over TCP/IP is iSCSI, which transports the SCSI storage protocol over TCP/IP. iSCSI benefits from the fact that storage applications generally don’t use the sockets API and are required to provide buffers for all data ahead of that being received from the network. The iSCSI protocol uses tags that indicate exactly where received data should be placed; iSCSI also has mechanisms to limit the expense of dealing with out-of-order TCP/IP data.3
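The tag mechanism can be illustrated in miniature (the names and layout here are invented for illustration and do not reflect the actual iSCSI wire format):

```python
# Tag-directed placement in miniature: the receiver registers a buffer
# per tag, and each arriving segment carries (tag, offset) saying where
# it lands, so out-of-order arrival needs no intermediate staging buffer.
buffers = {}

def register(tag, size):
    buffers[tag] = bytearray(size)

def place(tag, offset, payload):
    buffers[tag][offset:offset + len(payload)] = payload

register(7, 8)
place(7, 4, b"DATA")    # segments may arrive out of order...
place(7, 0, b"HDR!")    # ...and still land directly in the right spot
assert bytes(buffers[7]) == b"HDR!DATA"
```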

That’s great for iSCSI and storage over TCP/IP, but what about other applications? A good deal of industry effort is taking place now to define broadly applicable, standardized mechanisms for achieving DDP over TCP/IP, notably the efforts by the RDMA Consortium4 and within the IETF’s RDDP working group.5 Both of these efforts are defining additional protocol layers to operate above TCP/IP to provide the tags required to indicate where data should be placed, and to assist in dealing with out-of-order TCP data. SDP (Sockets Direct Protocol) is a parallel effort that defines mechanisms to allow sockets applications to take advantage of new DDP features in network transports.

Many TOE hardware designs already incorporate support for iSCSI. While it adds to the gate complexity of the design, the performance benefits are clear and market demand for iSCSI appears strong. When general DDP mechanisms such as RDMA are accepted, supporting them within a TOE is likely to be an incremental effort beyond that needed to support the iSCSI protocol.


TOE devices are unlikely to completely replace software TCP/IP implementations in the near future—that would not be cost effective. Instead, TOE devices will complement software implementations by offloading specific networking tasks and data flows that cannot be efficiently handled by software. This leads to a key challenge for TOE deployment: how to integrate TOE operation with that of existing software implementations to provide continuity of experience for users and continued support for networking features such as link aggregation, VLANs (virtual LANs), load balancing and failover, and forwarding of traffic between network interfaces.

OS designs generally provide stable, documented interfaces below TCP/IP at the network driver interface, and above TCP/IP at the sockets interface, but not within the core of the TCP/IP implementation. On open source platforms, the TCP/IP internals are freely accessible, but there’s no consensus yet on how TOE devices should be integrated. Because of this, TOE vendors follow ad hoc methods to support their devices, with the result that there is currently little chance of TOE devices from two different vendors interoperating within the same open source platform. On proprietary platforms, providing full-featured TCP/IP offload is extremely difficult without explicit support from the OS vendor. There are signs that such support is starting to emerge. Microsoft, for example, has proposed extensions to support TOE devices on the Windows platform.6


Implementation of TCP offload technology involves significant challenges, ranging from silicon design issues to protocol implementation and software integration. Successful realization of TCP offload is likely to ensure the continued dominance of TCP/IP in the networking world by providing higher performance for existing applications and enabling new applications for TCP/IP and Ethernet hardware.


1. Internet Engineering Task Force RFCs (requests for comments) define TCP/IP and associated protocols. If you don’t like reading standards documents, the following book is an excellent reference: Stevens, W. R. TCP/IP Illustrated, Volume 1. Addison-Wesley, Boston, MA, 1994.

2. Foong, A. P., Huff, T. R., Hum, H. H., Patwardhan, J. P., and Regnier, G. J. TCP Performance Re-Visited. (Provides a quantitative analysis of TCP performance bottlenecks, including CPU and memory subsystem factors.)

3. IETF IPS (IP Storage) Working Group. The IPS Working Group is chartered with defining protocols to encapsulate existing storage protocols, such as SCSI and Fibre Channel, over IP-based transports. Notable output of the working group includes the iSCSI, iFCP (Internet Fibre Channel Protocol), and related Internet drafts, and FCIP (Fibre Channel over IP), RFC 3643.

4. The RDMA (Remote Direct Memory Access) Consortium. This industry collaborative has published specifications for RDMA (DDP) over TCP/IP; the SDP (Sockets Direct Protocol), which enables sockets applications to take advantage of DDP transports; and iSCSI Extensions for RDMA (iSER).

5. The RDDP (Remote Direct Data Placement) Working Group. This IETF group is chartered to define a suite of protocols enabling remote direct data placement over arbitrary network transports, with SCTP and TCP being of primary interest.

6. Details on Microsoft’s proposed Chimney Architecture for TCP offload are largely released under nondisclosure agreement, but a Web search for “Microsoft Chimney” will turn up plenty of secondary information on the architecture.


ANDY CURRID is principal software architect at iReady Corporation, where he is focused on software support for iReady’s TCP/IP and iSCSI offload devices. He has over 12 years of experience in the software industry, including engineering, management, and technologist positions at Wind River, Tadpole Technology, and Fujitsu-ICL. Andy earned a B.S. in computing science at the University of Warwick.

© 2004 ACM 1542-7730/04/0500 $5.00


Originally published in Queue vol. 2, no. 3