DAFS: A New High-Performance Networked File System

This emerging file-access protocol dramatically enhances the flow of data over a network, making life easier in the data center.

Steve Kleiman, Network Appliance

The Direct Access File System (DAFS) is a remote file-access protocol designed to take advantage of new high-throughput, low-latency network technology.

The motivation for the new protocol comes from three technology trends that have emerged in the past several years:

NETWORKED STORAGE

Storage today can be shared, managed, and scaled independently from the applications, operating systems, and machine architectures that are attached to the storage system. Applications can access this external storage using four basic methods.

The first method of access uses raw blocks. This typically has high performance, but the application must provide any data management capabilities, must be tuned to the particular storage configuration to gain the best performance, and must provide inter-application coordination if the data is to be shared.

The second access method is for a portion of the storage to be reserved for a particular node that then accesses the storage through a file system supported by that node's operating system. This typically provides good performance, but data must be managed from each node and direct data sharing is not supported.

The third method is to use a clustered file system, such as the Global File System (GFS), on each node. [See "The Global File System: A File System for Shared Disk Storage," by Steve Soltis, Grant Erickson, Ken Preslan, Matthew O'Keefe, and Thomas Ruwart, IEEE Transactions on Parallel and Distributed Systems, October 1997.]

This allows data sharing, but the file system must be the same (or at least be compatible) on every node; any failing node may damage the entire file system.

These first three methods use block protocols such as SCSI to communicate with external storage. The storage system itself understands little about the data being stored and therefore cannot help much with data management, sharing, or protection.

The last method uses a file-access protocol such as the Network File System (NFS), Common Internet File System (CIFS), or Web Distributed Authoring and Versioning (WebDAV). Because the storage system now understands the files it holds, it can take over much of the work of data management and protection, and the data can be shared even among heterogeneous systems. The trade-off is overhead:

File-access protocols that use the typical software network stacks tend to suffer higher CPU overhead per operation than do local file systems. In many cases that don't require high throughput, this is not an issue. In high-throughput configurations, however, the overhead can be significant.

The second motivating trend for DAFS is the rapid growth of the Internet. This growth has required service providers to develop architectures that are resilient to failure and can rapidly scale in both computing power and storage capacity. The resulting designs, called local file-sharing architectures, spread the service load across a set of application servers with shared file storage, as shown in Figure 1. The application servers can be large machines or relatively small ones. A more recent trend is to use blade servers, which integrate a set of servers, each on a hot-replaceable board, with a chassis containing a switched fabric. This architecture is resilient in that if one application server fails, another can take its place. It is also scalable, because extra computing resources become available simply by adding more application servers. Typical applications include e-mail, news, Web servers, geographical information systems, and clustered databases.

Figure 1

Standard file-access protocols allow data to be easily shared even among heterogeneous systems. The file-access paradigm is the same whether processes are sharing file data on a single machine or on a distributed system. This provides all the application servers access to a common pool of data for load balancing. It also allows another application server to take over data previously accessed by a failed application server.

NETWORKING TECHNOLOGY TRENDS

The third trend affecting DAFS is the advent of high-speed networks, such as 10-gigabit-per-second (Gbps) Ethernet and InfiniBand (IB). Such high speeds can put many stresses on existing network stacks.

For Ethernet and TCP/IP-based technologies, the first issue is the packet-processing rate. For example, full-duplex 10-Gbps Ethernet processing can have packet rates as high as 1 million packets per second, which can consume most, if not all, of the available CPU cycles. The Ethernet packet-processing burden can be mitigated by TCP/IP Offload Engines (TOEs): specialized chips that process TCP/IP packets and assemble them into a reliable serial byte stream at the required rates. TOEs remove the overhead associated with header processing, packet fragmentation and reassembly, checksum computation and checking, and stream multiplexing and demultiplexing. They are analogous to modern SCSI and Fibre Channel host adapters, which perform the same functions for fixed-format SCSI commands using smaller link-level transfer units. InfiniBand host channel adapters (HCAs) build these functions in as well.

A second problem caused by high speeds is the placement of received data. Typical networking stacks place incoming packets into generic packet memory and then copy the portion of the data that is relevant to the user into either operating system (OS) buffers or application buffers (or both). TOEs can strip the underlying TCP/IP headers, as shown in Figure 2, and scatter or gather the TCP payload to or from separate buffers. This still leaves the higher-level protocol headers embedded in the data, however. The OS or the application must interpret these headers before the embedded application data is copied into application buffers.

Figure 2

The impact of data copying can be significant at high throughput. For example, if an application were attempting to receive data at the full throughput of a 10-Gbps link, or roughly 1 gigabyte per second (GBps), it would impose a 3-GBps load on the system memory bus (1 GBps for the incoming link data, plus 1 GBps to read the data from the packet buffers, plus 1 GBps to write the data into application buffers). This is most of the memory bandwidth available on current high-volume servers.

Certainly, TOEs can be enhanced to interpret upper-level protocols. This has already been done for specific and relatively simple protocols such as Internet SCSI (iSCSI), where command data is separately placed into buffers. There are, however, a variety of upper-level protocols that can be much more complex than iSCSI, such as NFS, CIFS, HTTP, and application-specific protocols, each of which can change over time. This makes it difficult to build a single TOE that can interpret all of these headers well enough to place the application data in separate buffers. Alternatively, applications could be modified to interpret the headers themselves and tell the TOE where to place the data. Unfortunately, this adds application interactions with the TOE, or interrupts, for each command processed. This extra CPU load can swamp any data-placement gains unless large transfer sizes are used.

Recent developments in system area networks have produced a new networking paradigm: direct access transports (DATs). Examples of networks that support direct access include Virtual Interface on Fibre Channel (FC-VI), InfiniBand [http://www.ibta.org/], and the Remote Direct Data Placement (RDDP) protocol for the Internet [http://www.ietf.org/html.charters/rddp-charter.html and http://www.rdmaconsortium.org/].

Direct access transports support two fundamental capabilities beyond those of traditional networking architectures and TOEs: remote direct memory access (RDMA), which places data directly into the correct buffers on the remote node, and OS bypass, which lets applications issue operations to the network adapter without passing through the operating system on each request.

THE DAFS PROTOCOL

Remote file-access protocols have been in widespread use for more than 15 years for workgroup data sharing. The advantages noted previously have increasingly made remote file access attractive in the data center as well, for database and local file-sharing applications.

DAFS is designed to take advantage of direct access transports to achieve remote file access at CPU overheads that are as good as or better than what is achieved through block access, while retaining all the advantages of the shared file-access paradigm.

DAFS also extends file-access semantics specifically for high-performance applications such as databases and local file-sharing architectures. The DAFS version 1 protocol was published by the DAFS Collaborative in September 2001, with the participation of more than 85 companies [http://www.dafscollaborative.org/].

The DAFS protocol has its basis in NFS version 4 (NFSv4), the latest version of NFS, recently approved by the Internet Engineering Task Force (IETF). [See "The NFS Version 4 Protocol," by Brian Pawlowski, Spencer Shepler, Carl Beame, Brent Callaghan, Michael Eisler, David Noveck, David Robinson, and Robert Thurlow, Proceedings of the 2nd International System Administration and Networking Conference (SANE2000), p. 94, 2000.] NFSv4 introduces several advantages in addition to normal network file access, among them integrated file locking, mandated strong security, and compound operations that reduce protocol round trips.

The basic DAFS operation semantics are similar to those of NFSv4. The underlying Open Network Computing remote procedure call (ONC RPC) transport mechanism, however, has been redesigned to take full advantage of direct access transports.

Direct access transports are derived from memory-to-memory interconnection networks used by the research and high-performance computing communities. The first attempt to standardize direct access transport semantics was the Virtual Interface Architecture. This effort has been superseded by the DAT Collaborative, which has developed an open API for DAT semantics called the Direct Access Programming Library (DAPL) [http://www.datcollaborative.org/]. The DAT Collaborative is cooperating with the Open Group Interconnect Software Consortium to enhance the APIs.

All of these direct access standards define a connection-oriented transport mechanism that has three fundamental operations: send, RDMA write, and RDMA read. Each end of a connection has a request queue where commands containing the operation directives, arguments, and buffers are queued. Each request queue consists of three internal queues (a send queue, a receive queue, and an event queue) in the memory space of the application or kernel that is referencing the HCA, as shown in Figure 3. The send queue is where commands containing the desired DAT operations and any buffers they reference are queued. Once a new command is queued, the hardware is informed, usually by a store to a memory-mapped register on the HCA. The HCA signals the completion of an operation by enqueuing a status indication on the event queue. Notification of completions may be batched to reduce interrupts.

Figure 3
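
To make the queue structure concrete, the following C sketch models what one end of a connection might look like. It is a schematic illustration of the concepts above, not the DAPL API or any real adapter's interface; every name in it is invented for the sketch.

    /* Schematic model of a DAT endpoint and its three queues.
     * Illustrative only; not the DAPL API or a real HCA interface. */
    #include <stddef.h>
    #include <stdint.h>

    enum dat_op { DAT_SEND, DAT_RDMA_WRITE, DAT_RDMA_READ };

    struct dat_command {            /* one queued work request */
        enum dat_op op;
        void       *local_buf;      /* buffer associated with the command */
        size_t      len;
        uint64_t    remote_offset;  /* used only by RDMA operations */
        uint32_t    rmr_context;    /* names an exported remote region */
    };

    struct dat_event {              /* completion posted by the HCA */
        struct dat_command *cmd;
        int status;                 /* 0 = success */
    };

    #define QDEPTH 64
    struct dat_endpoint {                     /* one end of a connection */
        struct dat_command send_q[QDEPTH];    /* operations to perform */
        struct dat_command recv_q[QDEPTH];    /* buffers for inbound sends */
        struct dat_event   event_q[QDEPTH];   /* completion notifications */
        unsigned send_head, recv_head, event_tail;
    };

    /* Queue an operation and tell the adapter about it.  On real hardware
     * the notification is typically a store to a memory-mapped doorbell
     * register; completions later appear on the event queue, possibly
     * batched to reduce interrupts. */
    static void post_send(struct dat_endpoint *ep, struct dat_command c)
    {
        ep->send_q[ep->send_head++ % QDEPTH] = c;
        /* ring_doorbell(ep);  (memory-mapped store in a real adapter) */
    }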

Send operations transfer the contents of a buffer associated with the command on the initiating node to the target node. The receiving node places the data in the next available buffer on the receive queue. The target node must enqueue new buffers on the receive queue before another node can issue a send operation to it. This is similar to typical packet sending and reception, except that the buffer size is not constrained by any underlying packet size in the physical medium. The initiator and target must agree on a maximum size, however, and the sender must take care not to initiate more send operations than the target can receive.

The RDMA operations are different. The initiating node specifies the exact location of the buffer on the target node and may transfer data either to (RDMA write) or from (RDMA read) that buffer into a local buffer associated with the command. The RDMA operations use a remote address composed of a remote memory region context, which identifies a specific region of memory in the target node, and an offset and length within that region. The target node must explicitly export the memory regions; RDMA operations to non-exported regions will fail. For added protection, the context identifier usually contains a protection key that must match the one associated with the region when the RDMA operation takes place. The target must inform the initiator of available contexts before the initiator can use them. This is usually accomplished through send operations.

The memory regions must usually be locked to prevent paging while they are exported. The DAPL API provides this through a memory registration procedure. Memory registration also provides the correct virtual-to-physical address mappings to the channel adapter so that commands can reference buffers using process addresses. Memory registration can be expensive, so the best course is for applications to preregister the appropriate buffers. The application can then dynamically control external access permission to each buffer on a per-request basis.
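
The addressing and registration steps just described might look roughly like the following C sketch. The types and the register_and_export routine are hypothetical stand-ins for illustration; the real DAPL registration calls differ in detail.

    /* Hypothetical sketch of RDMA addressing and memory registration.
     * Not the DAPL API; names are invented for illustration. */
    #include <stddef.h>
    #include <stdint.h>

    struct rmr_context {            /* handed to the initiator via a send */
        uint32_t region_id;         /* identifies an exported region */
        uint32_t protection_key;    /* must match the region at RDMA time */
    };

    struct rdma_write_req {         /* what the initiator specifies */
        struct rmr_context remote;  /* which exported region on the target */
        uint64_t offset;            /* offset within that region */
        size_t   len;
        void    *local_buf;         /* local source (write) or sink (read) */
    };

    /* Target side: lock (pin) the buffer so it cannot be paged out, give
     * the adapter its virtual-to-physical mapping, and export the region.
     * Registration is expensive, so applications typically preregister
     * buffers once and then control remote access per request. */
    static struct rmr_context register_and_export(void *buf, size_t len)
    {
        (void)buf; (void)len;                    /* stand-in only */
        struct rmr_context ctx = { 1, 0x5eed };  /* placeholder id and key */
        return ctx;
    }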

USE OF DAT SEMANTICS

Like NFS, DAFS is a request-response protocol. In general, a request is sent to the server using a send operation, as is the reply back to the client. The server will allow only a limited number of outstanding operations, which is managed through a credit system. This limits the number of buffers in the server's receive queue. The client simply adds a buffer to its receive queue for every request it sends.

DAFS uses DAT semantics to place data directly where it belongs in both the client and the file server. For example, in a DAFS file read operation the request includes the client address of the buffer where the server should place the data via an RDMA write, as shown in Figure 4. This simple protocol places control information in receive queue buffers while letting the data the application desires go directly to the correct buffer.

Figure 4
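
As a rough sketch of the idea (not the actual DAFS wire format), the essential fields of such a read request might look like this: the client supplies the exported address of its own buffer so the server can RDMA-write the file data straight into place.

    /* Illustrative layout of the idea behind a DAFS read request.
     * A sketch of the concept, not the DAFS wire format. */
    #include <stdint.h>

    struct dafs_read_request {
        uint64_t file_handle;         /* which open file */
        uint64_t file_offset;         /* where in the file to read */
        uint32_t length;              /* how many bytes */
        /* Direct placement: the client's preregistered, exported buffer.
         * The server RDMA-writes the data here, then sends a short reply. */
        uint32_t client_rmr_context;  /* exported region on the client */
        uint64_t client_buf_offset;   /* offset within that region */
    };

The client issues such requests only while it holds server credits, so the server's receive queue cannot be overrun; only the small status reply travels through that queue.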

The file write operation is more complex. The client can choose to pass the data to be written in the request, which is appropriate for small amounts of data, or it can send the address of the buffer in the request and allow the server to retrieve the data into the correct server buffer with an RDMA read.
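
A minimal sketch of that client-side choice follows; the cutoff value is an assumption for illustration, not something the protocol specifies.

    /* Sketch of the client-side choice between carrying write data in the
     * request itself and letting the server pull it with an RDMA read.
     * The cutoff value and names are assumptions, not part of DAFS. */
    #include <stddef.h>

    enum write_method { WRITE_INLINE, WRITE_RDMA_READ };

    #define INLINE_WRITE_MAX 8192   /* hypothetical cutoff */

    static enum write_method choose_write_method(size_t len)
    {
        /* Small writes: cheapest to copy the data into the send buffer.
         * Large writes: pass the buffer's exported address and let the
         * server RDMA-read it directly into the right server buffer. */
        return (len <= INLINE_WRITE_MAX) ? WRITE_INLINE : WRITE_RDMA_READ;
    }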

DAFS has several semantic extensions that go beyond NFSv4 and other remote file-access protocols. In general, the enhancements are designed to help support high-performance I/O, especially in databases and clustered applications. Let's review some of the more important ones.

Enhanced file locking. The typical semantics of local or remote file-system locking can make recovery from failure difficult. Suppose an application on a single computer uses file locking to coordinate file access between two processes, and that one of the processes obtains the lock and starts modifying the file. If a power failure occurs at this point, the system will restart and one of the processes will obtain the lock. The data may be corrupted, however, because the integrity of the file was not reestablished when the previous lock was broken. Applications normally have to create auxiliary "lock" files, a relatively slow process, to detect this kind of failure.

DAFS extends locking semantics by providing two new forms of locking in the protocol. The first form is a persistent lock that notifies the lock requestor that a previous lock was broken. The requestor may then run a procedure to reestablish file integrity. The second form is an auto-recovery lock that restores the file to the same state that it was in before the lock was broken.
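
A sketch of how an application might use the two lock flavors appears below; dafs_lock, its flags, and the return codes are invented stand-ins for the protocol's lock operations, not the actual DAFS API.

    /* Hedged sketch of the two DAFS lock flavors.  All names here are
     * stand-ins invented for illustration. */
    #include <stdint.h>

    enum lock_kind   { LOCK_PERSISTENT, LOCK_AUTORECOVERY };
    enum lock_result { LOCK_GRANTED, LOCK_GRANTED_PREVIOUS_BROKEN };

    /* Stand-in for the remote lock operation. */
    static enum lock_result dafs_lock(uint64_t fh, enum lock_kind kind)
    {
        (void)fh; (void)kind;
        return LOCK_GRANTED;
    }

    static void reestablish_integrity(uint64_t fh) { (void)fh; }

    static void update_shared_file(uint64_t fh)
    {
        /* Persistent lock: if a previous holder failed while holding the
         * lock, the grant says so and the application repairs the file. */
        if (dafs_lock(fh, LOCK_PERSISTENT) == LOCK_GRANTED_PREVIOUS_BROKEN)
            reestablish_integrity(fh);

        /* With LOCK_AUTORECOVERY the server instead restores the file to
         * its pre-lock state, so no application recovery pass is needed. */
    }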

Batch I/O. High-performance applications deal with large numbers of I/O events simultaneously. DAFS provides an asynchronous batch I/O mechanism in the protocol that allows an application to launch many requests simultaneously. The mechanism allows the application to specify whether the server will return a notification when all or some of the I/O has completed. This gives applications fine-grained control over their I/O pipelines. This mechanism is especially appropriate for databases, which typically use an asynchronous buffer-writer process to write out database record buffers once a transaction is logged.
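
A minimal sketch of the batch I/O idea, assuming hypothetical names (dafs_io, dafs_batch_submit) rather than the actual DAFS API:

    /* Sketch of asynchronous batch I/O: launch many requests at once and
     * ask to be notified after a chosen number complete.  All names are
     * illustrative stand-ins, not the DAFS API. */
    #include <stddef.h>
    #include <stdint.h>

    struct dafs_io {
        uint64_t file_handle;
        uint64_t offset;
        uint32_t length;
        void    *buffer;       /* preregistered, RDMA-addressable buffer */
    };

    /* Stand-in: submit 'count' requests; notify the caller once
     * 'notify_after' of them have completed, letting the rest drain in
     * the background. */
    static int dafs_batch_submit(struct dafs_io *reqs, size_t count,
                                 size_t notify_after)
    {
        (void)reqs; (void)count; (void)notify_after;
        return 0;
    }

    /* A database-style buffer writer might flush 128 dirty blocks but
     * wait for only the first 32, keeping its I/O pipeline full. */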

Cache hints. DAFS provides protocol operations that allow applications to control the storage cache hierarchy. Applications can specify that particular ranges of file data will be needed soon and should be prefetched into the server's cache, or that other ranges will not be needed so that the server may purge them from its cache.

Cluster fencing. A clustered application usually uses a cluster manager to decide which nodes are behaving well and should be considered members of the cluster, and which nodes have failed or are otherwise not cooperating and must be ejected from cluster membership. Misbehaving nodes must be prevented from accessing any resources, such as file servers, that are shared with cluster members to prevent them from corrupting the application. This is called fencing. DAFS supports fencing by having a client access control list on each server resource. A cluster manager can control the nodes that have permission to access the shared file server by manipulating these lists.
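
A sketch of the fencing idea follows, using illustrative data structures rather than the protocol's actual access-list encoding.

    /* Sketch of cluster fencing with a per-resource client access list.
     * Data structures and names are illustrative, not the DAFS encoding. */
    #include <stdbool.h>
    #include <string.h>

    #define MAX_MEMBERS 16

    struct fence_list {
        char members[MAX_MEMBERS][32];   /* admitted client identities */
        int  count;
    };

    /* Cluster manager: install the current healthy membership, implicitly
     * fencing off any node no longer on the list. */
    static void fence_install(struct fence_list *fl,
                              const char names[][32], int n)
    {
        fl->count = (n > MAX_MEMBERS) ? MAX_MEMBERS : n;
        for (int i = 0; i < fl->count; i++) {
            strncpy(fl->members[i], names[i], sizeof fl->members[i] - 1);
            fl->members[i][sizeof fl->members[i] - 1] = '\0';
        }
    }

    /* File server: refuse requests from clients that have been fenced. */
    static bool fence_allows(const struct fence_list *fl, const char *client)
    {
        for (int i = 0; i < fl->count; i++)
            if (strcmp(fl->members[i], client) == 0)
                return true;
        return false;
    }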

DAFS IMPLEMENTATIONS

DAFS can be implemented in several ways. The two most important ways are shown in Figure 5. The first, called kDAFS, is a file system loaded in the operating system. Both implementations shown use DAPL to provide the API for DAT semantics. A kDAFS implementation is transparent to applications, but it goes through the normal OS control and data paths and cannot take advantage of OS bypass. In addition, the application cannot take advantage of the enhanced semantics unless the OS API also supports such functions. The performance of such an implementation should be similar to that of local file systems on block-based storage.

Figure 5

The second implementation style, called uDAFS, is a user library. The DAFS Collaborative also defined a DAFS API to provide a common interface to such libraries [see Direct Access File System: Application Programming Interface, DAFS Collaborative, October 2001, http://www.dafscollaborative.org/]. With uDAFS, the application can use OS bypass and all extended semantics. Potentially, the performance can be better than local file systems. The downside is that the application must adapt itself to the new API. Fortunately, many high-performance applications have an internal OS adaptation interface, which could provide a mechanism to do the adaptation without changing the bulk of the application.

Performance measurement has a variety of dimensions, which we cannot adequately cover here. Several published papers cover DAFS and RDMA file-access protocol performance in more depth; some also compare RDMA with alternative techniques. [For more information, see the following sources: An Adaptation of VIA to NFS on Linux, Fujitsu Ltd., http://www.pst.fujitsu.com/english/nfs/; Meet the DAFS Performance with DAFS/VI Kernel Implementation Using Clan, Fujitsu Ltd., http://www.pst.fujitsu.com/english/dafsdemo/; "Design and Implementation of a Direct Access File System," by Kostas Magoutis, Proceedings of BSDCon 2002 Conference; "Structure and Performance of the Direct Access File System," by Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, Margo Seltzer, Jeff Chase, Richard Kisley, Andrew Gallatin, Rajiv Wickremisinghe, and Eran Gabber, Usenix Technical Conference, pp. 1-14, June 2002; "Making the Most of Direct-access Network-attached Storage," by Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo Seltzer, Proceedings of the Second Conference on File and Storage Technologies (FAST), Usenix Association, March 2003; "The Direct Access File System," by Matt DeBergalis, Peter Corbett, Steve Kleiman, Arthur Lent, Dave Noveck, Tom Talpey, and Mark Wittle, Proceedings of the Second Conference on File and Storage Technologies (FAST), Usenix Association, March 2003.]

The following results provide a quick indication that DAFS achieved its goal of lowering the overhead of shared file access.

Access type                                   Application server CPU (µsec/op)
Raw disk access with volume manager (VxVM)    113
Local FS (ufs)                                 89
Raw disk access                                76
User-level DAFS client (with VI/TCP HCA)       34

This table shows the client CPU overhead in microseconds for a 4-KB I/O using an online transaction-processing-like workload and asynchronous I/O. The DAFS results used the Emulex GN9000 VI/TCP HCA, which implements RDMA over 1-Gbps Ethernet and TCP/IP. The other results used direct-attached disks. The results show that user-level DAFS access costs less than half as much as even raw (i.e., block) disk access, and roughly a third as much as local file access.

DAFS STATUS

DAFS products have been shipping for more than a year, using the Emulex VI/TCP host channel adapter. The effort over the next year is to take advantage of the InfiniBand HCAs that have recently become available. The DAT Collaborative has open-source implementations of DAPL, which are being ported to the new HCAs. Open-source Linux kDAFS and uDAFS efforts are under way.

The RDMA community has seen a great deal of activity over the past year. The RDMA Consortium (RDMAC), with participation of some 60 companies, has defined a protocol for RDMA over TCP/IP. The IETF has formed a working group for RDDP that includes RDMA semantics. The RDMA Consortium has submitted its work to the IETF for review.

The IETF NFS working group has recently completed its work on NFSv4 and is now defining a charter for the next version of NFS. Support for RDMA and for extended semantics similar to those of DAFS is among the candidate requirements for the new charter. In addition, Sun Microsystems is working on a version of ONC RPC that uses RDMA, which would allow zero-copy file access in current versions of NFS.

The new high-speed networks have transport bandwidth equal to or better than modern storage networks. High-speed networking enables architectures—such as local file sharing—that use volume servers and switching components to achieve performance and resilience previously seen only in expensive high-end servers. Traditional network protocol stacks have had higher CPU overheads at high throughput than storage transports. The DAFS protocol uses TOEs and DAT semantics to bring these overheads to levels that are the same as or even better than standard block storage transports, while retaining the file access protocol advantages of virtualization, data sharing, and protection. Protocols such as DAFS also allow for a more direct and higher-level application-to-storage conversation, which is critical to I/O efficiency at these data rates.

Several Internet resources will provide further information:

http://www.dafscollaborative.org/ The DAFS Collaborative publishes the DAFS protocol specification and API. The site also contains an open-source software development kit for DAFS and links to other reference implementations.

http://www.datcollaborative.org/ The DAT Collaborative publishes an API for accessing DAT semantics for both the application and OS kernel environments (aka uDAPL and kDAPL). The site contains links to open-source reference implementations of DAPL.

http://www.opengroup.org/icsc/ The Open Group's Interconnect Software Consortium has several working groups relevant to DAT interconnects. The Interconnect Transport Working Group is specifying an API for DAT semantics that is compatible with the DAT Collaborative's DAPL. The Sockets API Extensions Working Group is specifying an extension to the Unix sockets API to support zero copy using underlying RDMA, DDP, or other mechanisms. The Fabric Management API Working Group is specifying a fabric management API initially targeted at InfiniBand.

http://www.nfsv4.org/ This site is a good resource to follow the latest developments in the NFS community. It contains links to the specification as well as interoperability forums.

http://www.rdmaconsortium.org/ The RDMA Consortium publishes the RDMA and DDP protocol specifications. It also publishes a protocol for marking the boundaries of upper-layer data units on a byte-stream protocol such as TCP. This is used by the RDMA and DDP protocols so that receivers can process their headers efficiently.

http://www.ietf.org/html.charters/rddp-charter.html This site contains the IETF's RDDP working group documents including submissions and charter.

http://www.infinibandta.org/home The InfiniBand Trade Association publishes the InfiniBand specification.

http://www.ncits.org/ T11 is the technical committee within the International Committee for Information Technology Standards (INCITS) responsible for device-level interfaces such as Fibre Channel. It publishes the FC-VI specification for RDMA over Fibre Channel.

Originally published in Queue vol. 1, no. 4