The Emergence of iSCSI


Jeffrey S. Goldner, Microsoft Corporation

When most IT pros think of SCSI, images of fat cables with many fragile pins come to mind. Certainly, that's one manifestation—the oldest one. But modern SCSI, as defined by the SCSI-3 Architecture Model, or SAM, really considers the cable and physical interconnections to storage as only one level in a larger hierarchy. By separating the instructions or commands sent to and from devices from the physical layers and their protocols, you arrive at a more generic approach to storage communication.

Separating the wire protocol from the command protocol allows a common representation of SCSI independent of the actual physical carrier. The various command sets defined in the SCSI-3 protocol command suite all standardize the format of the commands and responses from the targets.
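To make the layering concrete, here is a minimal Python sketch (not from the article) of a transport-independent SCSI command: it packs a READ(10) command descriptor block whose 10 bytes are the same whether the carrier is parallel SCSI, FCP, or iSCSI. The field layout follows the SCSI block commands standard; the sample values are illustrative.

import struct

def build_read10_cdb(lba: int, num_blocks: int) -> bytes:
    """Build a SCSI READ(10) command descriptor block (CDB).

    The 10-byte CDB is defined by the SCSI block commands standard and is
    the same no matter which transport (SPI, FCP, iSCSI) carries it.
    """
    return struct.pack(
        ">BBIBHB",
        0x28,        # operation code: READ(10)
        0x00,        # flags (RDPROTECT/DPO/FUA), left clear here
        lba,         # 32-bit logical block address, big-endian
        0x00,        # group number
        num_blocks,  # 16-bit transfer length in blocks
        0x00,        # control byte
    )

# The same CDB bytes could be handed to a parallel-SCSI driver, wrapped in an
# FCP_CMND frame, or placed in an iSCSI command PDU; only the carrier changes.
cdb = build_read10_cdb(lba=2048, num_blocks=8)
assert len(cdb) == 10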

Today, many storage devices use some form of the SCSI protocols, the most notable exception being the ATA (also called IDE or EIDE) devices common in personal computers. Enterprise-class machines and many smaller servers use one or more SCSI-3 interconnects. Examples of these interconnects include the well-known SCSI Parallel Interface (SPI), Fibre Channel Protocol (FCP) mapping, and a new protocol, Internet SCSI (iSCSI), which maps the SCSI storage protocols over standard IP networks. Figure 1 shows the complete chart of SCSI standards [http://www.t10.org/scsi-3.htm].

Figure 1

The focus here is to introduce the details of the emerging iSCSI protocol and show how it fits into the merged realm of storage protocols and networking.

SCSI has a number of advantages over ATA, not the least of which is the ability to communicate outside the computer system itself. Other notable features are support for multiple hosts (initiators), allowance for clustering or device sharing, and command queuing, which allows multiple outstanding commands to be issued and completed out of order. (Although the latest versions of the ATA protocol also define queuing, its adoption is quite rare at this time.)
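As a toy illustration of command queuing (a simplification for this article, not anything from the SCSI specification), the following Python sketch issues several tagged commands and matches completions that arrive in an arbitrary order.

import random

# Several commands are outstanding at once; the device may complete them in
# any order, and the initiator matches each completion to its request by tag.
outstanding = {tag: f"READ block {tag * 8}" for tag in range(1, 5)}

completion_order = list(outstanding)
random.shuffle(completion_order)          # the device finishes them out of order

for tag in completion_order:
    request = outstanding.pop(tag)
    print(f"completed tag {tag}: {request}")

assert not outstanding                    # every queued command was accounted for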

Several common terms are worth explaining before diving deeper into iSCSI. SAM defines the concepts of initiators, targets, and logical units as follows:

• Initiator: the device, typically a host adapter and its driver, that originates SCSI commands and task management requests.

• Target: the device, typically a storage controller, that receives commands from initiators and routes them to the appropriate logical unit.

• Logical unit: the entity within a target, addressed by a logical unit number (LUN), that actually executes the commands; a single disk or a virtual volume, for example.

Figure 2 illustrates the relationship of initiators, targets, and logical units.

Figure 2

Disaggregation of storage and servers allows the physical resources of the enterprise data center to be distributed in more logical ways. For example, racks full of high-performance server blades can connect to storage devices through some type of SCSI-3 interconnect. To achieve the highest density of computing, you have to consider space, power, and cooling. In today's modern computing engines, the disk devices represent a significant consumer of these items.

Something else happens when you remove all the disk devices from the computers themselves. You gain flexibility and increase the capacity utilization of your resources. Many surveys have shown that system buyers tend to purchase excess disk capacity if they buy storage with compute servers. In fact, the amount of unused disk space averages about 60 percent. That's an average, however; some systems will run out of free storage space, while others use very little. By pooling the storage devices and virtualizing these resources, free disk space can be made available where it is needed. Buyers can purchase storage based on total capacity projections and reallocate free storage nearly at will. This has been one of the most tangible benefits of storage area networks (SANs).

A second obvious benefit of separating storage from the server is the ability to replace a server (with newer hardware or perhaps to replace a failed system) and assign the storage resources to the new server without having to back up and restore the data. Other advantages include being able to separate the storage over relatively long distances (important for disaster recovery); the ability to move and make copies of data at very high speeds; and the ability to share expensive devices other than disks, such as tape and media libraries.

ADDING FIBRE TO YOUR DIET

Since a few years before the development of iSCSI, the predominant interconnection technology in the data center—and the mother of SAN—has been Fibre Channel (misspelling by the inventors intentional). In the late 1980s, researchers at IBM and other companies were investigating new networking technologies and came up with the FC architecture, which was derived from other interconnects available at the time. The original vision was that of a general-purpose "fabric" of interconnections that could be used for a variety of purposes, with networking being the predominant use. As luck would have it, this new fabric had some distinct advantages for storage as well, because FC represents a lower-level protocol capable of handling multiple upper-level protocols (ULPs). Those advantages include high-speed, reliable, and in-order delivery of data; the ability to create large-scale and relatively efficient networks; and, most significantly, the creation of new paradigms that would become SANs.

As with any new interconnect, FC's success ultimately depends on interoperability among devices from multiple sources. And that's where the long saga starts. Not only was FC a new physical interconnect—at the edge of technological capabilities—it also required the creation of mappings to the different protocols that enable network and storage to use the fabric. Furthermore, the vision of a switched fabric got diverted while the switches were being perfected, and a different topology—Fibre Channel Arbitrated Loop (FC-AL)—was proposed. Although it was supposed to be simpler to implement, with hubs instead of switches, loops have their own unique problems (a discussion best left for another article).

So, why the need for another interconnect? First of all, the development and market acceptance of FC have been hindered by a lack of interoperability, coupled with the broad acceptance of Ethernet for networking. Furthermore, the fact that FC differs from Ethernet at the hardware layers means that specially trained personnel, entrenched in the specifics of the FC architecture, are required to deploy the fabrics successfully. The ability to run storage over Ethernet becomes enticing with the promise of simpler deployment and lower cost.

ENTER ETHERNET

If you take the point of view that interoperability and, worse, difficulty in managing FC fabrics are the two biggest inhibitors to widespread adoption of FC, the advantages to iSCSI jump off the page. Interoperability is a given in the Ethernet world. The cheapest adapters and even chipset implementations of Ethernet (read: practically free) just plug and play together, and that now includes gigabit network interface cards (NICs) integrated into the chipsets.

Also, the structured cabling standard for Ethernet is well understood and supported in the enterprise environment. You don't have to consult hundreds of pages of tables that identify combinations of adapters, switches, firmware, cables, drivers, and settings just to plug Ethernet together. For reasons that defy logic, that's the norm when you configure a fabric based on FC. (Well, maybe there is something to that. It makes for loyal customers because switching products is just too difficult once you have invested in one vendor.)

Of course, differences do exist among adapters, and Ethernet switches have certain features that differentiate the products, particularly in terms of manageability and performance. In general, however, Ethernet is a plug-and-play environment.

When iSCSI was first proposed, it wasn't at all clear which protocol would be used to transmit the SCSI packets over Ethernet. TCP/IP was ultimately selected even though it had some limitations. The decision was largely driven by the requirements of storage traffic: reliable and in-order delivery of data packets. In addition, congestion control was already part of TCP, so that as traffic increased, the lower layers of the protocol stack could sort it all out and leave the data and command encapsulation to the higher levels of the networking (or storage) stack. There was also the practical consideration of having a working protocol and products that support it, rather than having to create an entirely new infrastructure—or at least part of one.

Because iSCSI is built on top of TCP/IP, each command, response, and data packet needs to be encapsulated in the Ethernet frame. This encapsulation is shown in Figure 3. (For the purpose of illustration and because it's the most familiar network media, Ethernet is shown. In fact, there is nothing limiting iSCSI to being transported solely over Ethernet. ATM, Sonet, and, gasp, even FC can be used for the wire protocol, as long as that protocol can support IP traffic.)

Figure 3

The structure of Ethernet frames, datagrams, and TCP headers is best obtained from any networking text. Suffice it to say that the TCP header has enough information to indicate that an iSCSI protocol data unit (PDU) is contained within the TCP segment and has a header and optional data portion. The PDU is the basis for each communication between the initiator (host) and target (storage device). A single iSCSI PDU may require more than one Ethernet frame.

PROTOCOL DATA UNITS

Contained within each TCP segment is a PDU that performs some portion of a SCSI command, management function, command response, or data. Once an iSCSI session is set up, standard SCSI commands to read or write blocks can occur. A typical complete operation consists of the PDUs or packets that must be exchanged between the initiator and target, as shown in Figure 4.

Figure 4
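Each PDU begins with a fixed-size header. As a concrete illustration, the following Python sketch (not from the article) decodes the common fields of the 48-byte basic header segment defined in the iSCSI specification; opcode-specific fields are ignored, and the example header at the end uses values chosen purely for illustration.

import struct
from dataclasses import dataclass

@dataclass
class BasicHeaderSegment:
    immediate: bool
    opcode: int
    final: bool
    data_segment_length: int
    initiator_task_tag: int

def parse_bhs(header: bytes) -> BasicHeaderSegment:
    """Decode the common fields of a 48-byte iSCSI basic header segment.

    Field positions follow the iSCSI layout: the opcode and immediate bit in
    byte 0, the final (F) bit for most opcodes in byte 1, a 24-bit
    DataSegmentLength in bytes 5-7, and the initiator task tag in bytes 16-19.
    """
    if len(header) != 48:
        raise ValueError("iSCSI basic header segment must be 48 bytes")
    opcode_byte, flags = header[0], header[1]
    data_len = int.from_bytes(header[5:8], "big")
    task_tag = struct.unpack(">I", header[16:20])[0]
    return BasicHeaderSegment(
        immediate=bool(opcode_byte & 0x40),
        opcode=opcode_byte & 0x3F,
        final=bool(flags & 0x80),
        data_segment_length=data_len,
        initiator_task_tag=task_tag,
    )

# Hypothetical example: a NOP-Out-style header (opcode 0x00) with the
# immediate bit set, 512 data bytes, and task tag 7.
hdr = bytearray(48)
hdr[0] = 0x40 | 0x00
hdr[5:8] = (512).to_bytes(3, "big")
hdr[16:20] = (7).to_bytes(4, "big")
print(parse_bhs(bytes(hdr)))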

Because iSCSI will typically be transported over Ethernet, the maximum transmission unit (MTU), or payload, that Ethernet can carry is 1,500 bytes. (The use of "jumbo frames," although not supported by all hardware or device drivers, can increase the payload to 9,180 bytes per frame.) That payload must carry the IP and TCP headers on each frame; the iSCSI header in the first frame of the PDU; and any optional headers, data, and digests (explained later). The Ethernet framing itself adds further overhead outside the MTU. The size of the data transmitted in a PDU is limited to a value negotiated between the endpoints (MaxRecvDataSegmentLength), so a request for more data will require multiple PDUs.
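A rough back-of-the-envelope sketch in Python shows how these pieces interact. The header sizes and the negotiated MaxRecvDataSegmentLength used here are illustrative assumptions, not fixed values.

import math

ETHERNET_MTU = 1500           # standard frame payload
TCP_IP_HEADERS = 40           # 20-byte IP header + 20-byte TCP header, no options
ISCSI_BHS = 48                # basic header segment on each PDU
MAX_RECV_DATA_SEGMENT = 8192  # example negotiated MaxRecvDataSegmentLength

def frames_for_request(request_bytes: int) -> tuple[int, int]:
    """Return (number of data PDUs, approximate Ethernet frames) for a transfer."""
    pdus = math.ceil(request_bytes / MAX_RECV_DATA_SEGMENT)
    tcp_payload_per_frame = ETHERNET_MTU - TCP_IP_HEADERS
    total_tcp_bytes = request_bytes + pdus * ISCSI_BHS
    frames = math.ceil(total_tcp_bytes / tcp_payload_per_frame)
    return pdus, frames

# With these numbers, a 64-KB read needs 8 data PDUs spread over roughly 46 frames.
print(frames_for_request(64 * 1024))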

In addition to normal commands, several task management functions (such as LU reset) are defined in the iSCSI mapping to SCSI, so these will also be transported using PDUs. Other PDUs, such as login requests and responses and various messages, are defined as part of the iSCSI protocol as well.

Any conversation in the SCSI realm takes place by establishing the I-T nexus, the logical path from initiator to target. This nexus in iSCSI is called a session and must be established by a process known as login. iSCSI allows multiple connections (that is, multiple physical paths) to be established from initiators to targets. This is referred to as multiple connections per session, or MC/S, and is best thought of as multipathing built into the protocol itself. (MC/S must be accommodated in the iSCSI-specific drivers on the host operating system, so it's not actually a built-in multipathing solution.) This allows not only redundancy in case of a path failure, but also the ability to aggregate bandwidth over multiple physical links.
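The following Python sketch is a deliberate simplification of the MC/S idea, not how a real driver is structured: one session object spreads commands across several connections and keeps issuing them when one connection disappears. Real MC/S also involves connection allegiance and command sequencing, which are ignored here.

from itertools import cycle

class Session:
    def __init__(self, connections):
        self.connections = list(connections)
        self._next = cycle(self.connections)

    def send_command(self, pdu):
        conn = next(self._next)                 # fan commands out across connections
        print(f"sending {pdu} on {conn}")

    def drop_connection(self, conn):
        self.connections.remove(conn)
        self._next = cycle(self.connections)    # keep issuing on the survivors

session = Session(["tcp-conn-1", "tcp-conn-2"])
session.send_command("READ tag=1")
session.send_command("READ tag=2")
session.drop_connection("tcp-conn-1")
session.send_command("READ tag=3")              # still flows over tcp-conn-2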

With iSCSI, logins may be authenticated so that endpoint security can be established to prevent tampering with the data. In addition to establishing endpoint security, the login process involves an exchange of parameters between the initiator and the target. (See Figure 5.) This goes a long way toward keeping iSCSI interoperable.

Figure 5
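Login negotiation is carried as text key=value pairs. The Python sketch below illustrates the flavor of that exchange only; the actual rules differ per key (some values are declared, some take the minimum, some take the first common list entry), so treat the selection logic here as an illustrative assumption.

def negotiate(initiator_offer: dict, target_limits: dict) -> dict:
    agreed = {}
    for key, offered in initiator_offer.items():
        limit = target_limits.get(key)
        if isinstance(offered, list):       # preference-ordered list, e.g. digests
            agreed[key] = next((v for v in offered if v in limit), "None")
        elif isinstance(offered, int):      # numeric key: take the smaller value
            agreed[key] = min(offered, limit)
        else:
            agreed[key] = offered if offered == limit else "No"
    return agreed

initiator = {"HeaderDigest": ["CRC32C", "None"], "MaxBurstLength": 262144}
target = {"HeaderDigest": ["None"], "MaxBurstLength": 65536}
print(negotiate(initiator, target))
# {'HeaderDigest': 'None', 'MaxBurstLength': 65536}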

Communicating between endpoints requires a mechanism to locate targets. iSCSI has a wealth of such mechanisms, ranging from manually entering IP addresses into a file or BIOS setup screen to fully automated mechanisms:

• Static configuration, in which the administrator enters each target's IP address and port by hand.

• SendTargets discovery, in which the initiator connects to a configured discovery address and asks that target to report the targets it serves.

• Service Location Protocol (SLP), which allows targets to advertise themselves and initiators to query for them.

• Internet Storage Name Service (iSNS), a directory service in which targets register and initiators query, with notification of changes.

Use of IP addresses is not only boring, but also hard to administer. iSCSI has two schemes for naming nodes (initiators or targets). In both cases, names are unique to the end node: the iSCSI qualified name (iqn), built from a date and a reversed DNS domain owned by the naming authority (for example, iqn.2003-04.com.example:storage.disk1), and the IEEE EUI-64 format (eui), built from a 64-bit globally unique identifier.

If that weren't enough, each node has a fully qualified address consisting of the domain name (IP address or DNS name), the TCP port number (3260 by default), and one of the forms of the iSCSI name.
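As a small illustration, the Python sketch below loosely checks the shape of an iqn-style name and pastes together the address pieces just described. The names and the output format are purely illustrative; real tools have their own conventions.

import re

IQN_PATTERN = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.-]+(:.+)?$")

def is_valid_iqn(name: str) -> bool:
    """Loosely check the iqn.yyyy-mm.reversed-domain[:label] naming form."""
    return bool(IQN_PATTERN.match(name))

def target_portal(host: str, name: str, port: int = 3260) -> str:
    """Combine a node address, the default iSCSI port, and a node name."""
    return f"{host}:{port} {name}"

# Hypothetical example name; any real deployment would use its own domain.
assert is_valid_iqn("iqn.2003-04.com.example:storage.disk1")
print(target_portal("192.168.10.20", "iqn.2003-04.com.example:storage.disk1"))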

BUFFER MANAGEMENT AND ERROR HANDLING

To perform well, iSCSI implementations allocate memory to buffer multiple commands and data. iSCSI commands are "unsolicited"—that is, the target doesn't know they are coming. The command PDU itself is of limited size, whereas the data being transferred can be quite large. As part of the login negotiation, the target and initiator exchange information about how much buffering is available. This is important because it would be very slow going if one end of the connection had to acknowledge each frame before another could be sent to it. This becomes increasingly important as physical and networking hop distances increase. To achieve efficient flow of data, the sending node does not have to wait for a response as long as it believes the other end has free buffers.
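The sketch below illustrates that idea in Python terms: a sender keeps transmitting while it believes the peer still has free buffers and queues the rest. It is a toy credit model, not the actual iSCSI ready-to-transfer machinery.

from collections import deque

class CreditedSender:
    def __init__(self, receiver_buffers: int):
        self.credits = receiver_buffers     # buffers the peer advertised at login
        self.pending = deque()

    def send(self, pdu):
        if self.credits > 0:
            self.credits -= 1               # consume a buffer on the far end
            print(f"sent {pdu} ({self.credits} credits left)")
        else:
            self.pending.append(pdu)        # hold until the peer frees a buffer

    def on_buffer_freed(self):
        self.credits += 1
        if self.pending:
            self.send(self.pending.popleft())

sender = CreditedSender(receiver_buffers=2)
for i in range(4):
    sender.send(f"data-PDU-{i}")
sender.on_buffer_freed()                    # a response arrives, freeing a buffer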

In addition to the iSCSI-specific handling of buffers, TCP/IP has its own "windowing" mechanisms and congestion control, although these are in some way subject to the quality of the components in the middle. Therefore, don't expect a $200 Gigabit Ethernet switch to perform as well as a $4,000-per-port Cisco switch that contains a lot more buffering and TCP management capability.

Data integrity is, of course, paramount, and iSCSI has multiple mechanisms to make sure that the data sent is the data received. An iSCSI PDU can carry one or more digests, the security community's term for a form of checksumming more commonly known as cyclic redundancy checking (CRC). Digests can protect both the header and the data portion of a PDU. If the receiver doesn't calculate the same CRC value, the frame will need to be retransmitted or other error recovery performed.
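For the curious, the digest calculation itself is small. The Python sketch below is a bit-at-a-time CRC-32C, the Castagnoli polynomial the iSCSI specification settled on for its digests; production code uses table-driven or hardware-assisted versions, which is exactly the cost discussed next.

def crc32c(data: bytes) -> int:
    """Bit-at-a-time CRC-32C (Castagnoli), the checksum used for iSCSI digests.

    Unoptimized sketch: real implementations use lookup tables or offload
    hardware to avoid burning CPU cycles on every PDU.
    """
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value for the nine ASCII digits "123456789".
assert crc32c(b"123456789") == 0xE3069283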

Interestingly, digests are optional in iSCSI, although in most cases it would be hard to understand why you wouldn't want to use them. The downside to CRC, though, is that the calculation can be expensive, particularly where hardware to offload the calculation is not employed. One case for not using digests makes sense: When IP security (IPsec) is used for encryption, the integrity can be provided at that level and thus avoided at the iSCSI-protocol level. But let's step back from the details of iSCSI and look at the storage systems market and drivers.

PERFORMANCE FOR ANY PRICE RANGE

One of the most compelling aspects of iSCSI is the ability to buy into storage networking at many different price points. The use of iSCSI software drivers (initiators) adds this capability to any system with a NIC, and this means that storage networking can be introduced very cheaply or for free—at least on the host side. This is one clear advantage over FC, where you must have specialized host bus adapters (HBAs), owing to the unique nature of the interconnect.

Although there may be performance issues running through the operating system's network stack, with the rapid increase in CPU speeds there are usually cycles to burn. Add to that the advent of hyper-threading processors, which present two logical CPUs for the price of one. Using those excess clock ticks for iSCSI processing might turn out to be a good investment and trade-off for low-end requirements.

If the pure software approach doesn't pass muster for particular application loads such as heavy-duty e-mail servers and database management systems, a range of higher-performance options are available for those who can afford them. Now we are getting dangerously close to Fibre Channel pricing.

TCP offload engines (TOEs) are sometimes cited as the optimal technology to boost the performance of iSCSI, and possibly of regular network traffic as well. TOE chips are being developed for use on NICs, as well as on iSCSI HBAs. For some workloads, typically those involving larger transfer sizes (more than 8 KB), TOE acceleration can halve CPU utilization. Unfortunately, the effects are less pronounced with the typical data-transfer sizes used by database-type applications.

The appeal of a SAN based on RJ-45 connectors may also be one of its major weaknesses. In the Fibre Channel world, physical security has been easy enough to get right and has been largely sufficient. In fact, with physical access, intercepting data is still quite difficult because specialized analyzers are required to capture the traffic. There is no such thing as a promiscuous HBA. With any protocol that runs over TCP/IP, however, thoughts immediately turn to hackers, and more aggressive mechanisms are required to secure a SAN based on iSCSI. The security picture for an iSCSI network involves authentication, encryption, and possibly virtual LAN (VLAN) implementations. Each of these has its complications.

Authentication in the iSCSI space involves a protocol called Challenge Handshake Authentication Protocol (CHAP) or one of several optional protocols, such as SRP (Secure Remote Password), Kerberos, or Simple Public Key Mechanism (SPKM). Each of these is a general-purpose authentication protocol defined in an Internet request for comments (RFC) separate from iSCSI. The requirement to distribute keys to each endpoint (target or initiator) complicates the CHAP implementation; the use of a Remote Authentication Dial-in User Service (RADIUS) server will help here. Note that the iSCSI standard does not require any of these to be used, however, and some vendors are planning on skipping security altogether for their first product shipments. (Per the IETF specification, CHAP must be implemented by initiators and targets, but no one has to use it. Any device that does not implement CHAP will be non-conforming.) Keep in mind that authentication is concerned only with who talks to whom and has nothing to do with the actual conversation itself. If the network is compromised (snooped, for example), it doesn't much matter that the snooper can't talk directly to the device—it could still grab all the data for offline digestion if it is not protected by some other means.
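CHAP itself is simple to express. The Python sketch below computes a CHAP response as defined in the CHAP RFC: an MD5 hash over the identifier, the shared secret, and the challenge. iSCSI carries these values as text keys during login. The secret and challenge here are obviously illustrative, and a real implementation also handles encoding, mutual CHAP, and error cases.

import hashlib
import os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """Compute a CHAP response: MD5 over identifier || secret || challenge."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# The target issues a random challenge; the initiator proves knowledge of the
# shared secret without ever sending the secret itself.
challenge = os.urandom(16)
response = chap_response(identifier=1, secret=b"not-a-real-secret", challenge=challenge)
print(response.hex())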

Then there is the issue of encryption. IPsec is adopted in the iSCSI standard, but by some strange twist of language it is mandatory to implement yet optional to use. In other words, no one is expecting it to be used. That's because it (a) is difficult to manage; (b) sucks up performance; and (c) is viewed as unnecessary when storage networks are physically segregated from general computer networks. Many vendors do not intend to support IPsec at all, citing the lack of a clear requirement from the customer (whoever is paying for the product, not necessarily the end user). Although this would be a violation of the spec, it's hard to argue when cost is considered.

Finally, there is the question of VLANs in the Ethernet world. Although most VLAN implementations do not provide the utmost in security, they do offer a well-known model affording adequate protection in many cases. Of course, keeping intruders out of your SAN by physical isolation is still a good bet. Just beware of multi-homed computers that might provide an opening by performing a routing function.

The next stop-off on the iSCSI odyssey will likely be the addition of remote direct memory access (RDMA). It is being introduced in clustered computing environments and InfiniBand as a way to avoid the overheads inherent in the TCP/IP stack, which requires data to be copied into the kernel and then into application space on every host. RDMA promises to let applications on two ends of a network connection copy data directly between application-level buffers, securely, without getting the operating system kernel involved on every packet. This translates into both lower CPU utilization and lower latencies. Proposals are being turned into standards now, but it is likely to take several more years before any products are available.

The other, and perhaps larger, issue with RDMA is not a technical one, but a marketing problem. Vendors have been itching to sell iSCSI products but have been held off by the lack of both a ratified standard and operating system support. Recent announcements indicate that both of these hurdles have been cleared, however, so vendors can start shipping iSCSI products and recoup their R&D investments. This may make them reluctant to adopt something new like RDMA in the near future, particularly if it involves an additional outlay for specialized adapters.

CHALLENGES FOR iSCSI

Common folklore holds that Ethernet equipment is cheaper than Fibre Channel. Comparing the lowest ends of both would certainly bear that out, but anyone expecting to operate an enterprise storage infrastructure on a $79 switch from the local computer outlet should rethink his or her profession. The fact is, enterprises use high-end managed switches for both Ethernet and Fibre Channel.

This points to another common fallacy in the networking realm: despite the assumption that gigabit networking is already everywhere, gigabit LANs are not commonly deployed outside of enterprise data centers. Analysts don't expect significant deployments for maybe two more years. Gigabit NICs, on the other hand, are ubiquitous, even appearing on home computers as vendors integrate them into their core chipsets.

An additional concern is how effective TCP congestion control will be once iSCSI traffic is thrown at the networks in sufficient quantity. No one can talk about networking without thinking about latencies (essentially the time it takes for data to reach its destination). iSCSI proponents tend to speak in terms of low-latency devices and switches while also talking about long-distance applications for the technology. Little real-life data is available to validate any of these theories, but that will come in short order. Vendors will have to provide this information when they try to sell to corporate data centers.

BACK TO THE FUTURE

Anyone who has watched SANs evolve over the past eight or so years may find they are having a sense of déjà vu all over again. It is amusing listening to the arguments that iSCSI will somehow give all the promised benefits of SANs just by virtue of being based on Ethernet. Although management of the "fabric" would be somewhat covered using existing network management tools, that leaves the entire world of storage management unaccounted for. Likewise, the claim that customers can preserve their investments in storage by migrating old devices to iSCSI using expensive bridges is somewhat laughable, with the exception of even more expensive backup devices (libraries, tape drives). Connecting old disk storage to a SAN makes little sense. Modern storage devices offer new capabilities that just aren't possible with older hardware. But one clear benefit is that iSCSI now brings SAN-like interfaces and capability into the traditional realm of network-attached storage (NAS). Benefits once attributed to NAS—common network cabling, switches, and interfaces—are now available with a block protocol storage system. If anything, this opens up the market to a richer set of trade-off points.

The convergence theory for networking holds that in the future, all communication needs will be handled by a single interconnect. InfiniBand (IB) was supposed to be that interconnect, of course, but a funny thing happened while it was being developed: Ethernet in gigabit speeds at very affordable pricing became available, and it was easy enough to teach it new tricks with reasonable performance and latency (storage and interprocessor communication, for example). IB, however, had to work out a new interconnect, link protocols, management (which is handled on the fabric), and mapping of multiple upper-level protocols onto the link-level protocol, all at the same time while waiting for OS support. With Ethernet, most of those issues have long since been addressed. Re-creating the whole stack of network layers is also what continues to plague FC's more rapid expansion. The past few years have just begun to see the introduction of a routing capability in FC, while this has existed in the IP community for 20-plus years.

Whether the single fabric/single network arrives in the next few years or not until the end of the decade, Fibre Channel and Ethernet will coexist for at least the next three to five years. Table 1 compares their features. FC is firmly entrenched in many data centers and has a clear performance advantage. Out of the gate, FC was designed for 1-gigabit-per-second (Gbps) operation. Early proponents, such as Sun Microsystems, built "quarter-speed" devices running at 266 megabits per second because gigabit technology was still pretty exotic. Today's FC runs at 2 Gbps, with 10 Gbps already demonstrated. The current schedules call for a stop-off at 4 Gbps, which promises to be substantially cheaper than 10 Gbps. (All connections are full-duplex, so the performance could actually double the individual link rates.)

Table 1
Fibre Channel vs. iSCSI Feature Comparison

FC                                                          | iSCSI
1 or 2 Gbps full-duplex link speed                          | Up to 1 Gbps full-duplex
Link distance up to 10 km (long-wave), up to 200 km (DWDM)  | Any distance
Frame size: up to 2,112 bytes                               | Frame size: up to 9,180 bytes with "jumbo frames"
Not routable except by bridging protocols                   | Routable
Name service (Simple Name Server)                           | Name service (iSNS)
Requires special hardware (HBAs, switches)                  | Can run over any network interface (but runs best with special hardware/offload)
No wire encryption (except when bridged)                    | Encryption provided through IPsec

Of course, Ethernet is developing rapidly as well, and will go straight to 10 Gbps over the next few years. This does not mean that all connections will suddenly migrate to 10-Gigabit Ethernet, but rather the data-center backbones will move to the fatter pipes so that all traffic can be run on the same fabric. Besides, the physical interconnect now uses the same components regardless of whether it's Fibre Channel or Ethernet. Further, the comparison between iSCSI and Fibre Channel has to be made with accelerated adapters when the data center is being targeted, so costs will be similar for some time to come.

Some other interesting developments have happened in response to iSCSI. FC vendors are paying more attention to interoperability, and prices are dropping. Now that the development costs of FC technology have been amortized, there is every reason to believe that vendors can put substantial pressure on the iSCSI market. Advances in chip integration have given us 24-port/2-Gbps FC switches on a single chip, for example, and "Fibre Down"—computer system boards with integrated FC controllers—are now on the market. Interesting storage controller chips with integrated FC are also appearing, which will lead to lower-cost storage arrays in the near term.

Notwithstanding the continued presence of Fibre Channel, the long-term picture clearly points to iSCSI as a winner. At some point, when faster-speed Ethernet infrastructures are ubiquitous and RDMA boosts performance, the world might well move to iSCSI SANs. For now, more options all around are clearly favoring the customer who wants to take advantage of storage area networks.

• Draft standards for iSCSI and iSNS are available from the IETF IP Storage Working Group (http://www.ietf.org/html.charters/ips-charter.html).

• Ratified SCSI specifications are available from Global Engineering Documents (http://global.ihs.com/).

• Draft standards for SCSI activities are available from ANSI/INCITS T10 (http://www.t10.org/).

• iSCSI initiator software for use with Microsoft Windows is available for download as of June 2003 (http://www.microsoft.com/).


Originally published in Queue vol. 1, no. 4
