Over the past several years, a new type of storage device has entered laptops and data centers, fundamentally changing expectations regarding the power, size, and performance dynamics of storage. The SSD (solid-state drive) is a technology that has been around for more than 30 years but remained too expensive for broad adoption.
That changed with the introduction of consumer products such as the Apple iPad and iPhone, which led to the widespread availability of cheap nonvolatile memory. Manufacturers have used this consumer-grade material to produce SSDs and to make them look and act as much as possible like HDDs (hard-disk drives). Under the surface, however, they are completely different.
An HDD uses a head mounted on a mechanical actuator to access rotating magnetic storage media. In contrast, an SSD uses nonvolatile memory (i.e., NAND flash) as its storage media. The lack of moving parts and the use of silicon as the media give the device the "solid-state" name. This attribute makes SSDs less fragile than HDDs. As such, SSDs are common today in mobile devices such as smartphones and digital cameras.
SD (Secure Digital) and CF (CompactFlash) memory cards are smaller and less complex versions of an SSD. The key variable in choosing between an SSD and a less complex device is the application's performance requirements. A digital camera's need for storage is considerably less demanding than that of a multicore laptop or server. These differences have a significant impact on the design of SSDs.
Both the HDD, the core building block of nonvolatile storage in computer systems today, and the SSD are part of a class of storage called block devices. These devices use logical addressing to access data and abstract the physical media, using small, fixed, contiguous segments of bytes as the addressable unit. Each block device consists of three major parts: storage media, a controller for managing the media, and a host interface for accessing the media.
Storage media is the key factor behind the performance and cost advantages of SSDs. Most SSDs are built around widely available NAND flash memory, developed in the late 1980s as an electron-based trapped-charge storage media. The NAND cell stores electrons on a capacitor indefinitely in a no-power state. The charge is then sensed by circuitry on the NAND chip.
Writing to NAND flash is accomplished by either adding electrons to (programming) or removing electrons from (erasing) the memory cell using high-voltage pulses. NAND flash is read by using a simple analog-to-digital converter as a bias voltage is applied to the cell.
Different types of flash use different numbers of thresholds to determine the value in a cell. SLC (single-level cell) stores a single bit of data and has two threshold values. MLC (multilevel cell) stores two bits of data and has four threshold values. Newer flash memories, called TLC (triple-level cell), store three bits of data with eight or more threshold values. In general, the number of bits n resolved by the A/D converter can be described as:
n = log2(#threshold voltages)
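As a quick check of the relationship above, the following Python sketch computes the bits per cell from the threshold counts quoted for SLC, MLC, and TLC. The function name `bits_per_cell` is illustrative only, and TLC is taken at its minimum of eight threshold values.

```python
import math

def bits_per_cell(threshold_values: int) -> int:
    """Return n = log2(number of threshold values) for a flash cell."""
    # Reject counts that are not a power of two, since n must be a whole number of bits.
    if threshold_values < 2 or threshold_values & (threshold_values - 1):
        raise ValueError("expected a power of two that is at least 2")
    return int(math.log2(threshold_values))

cell_types = {"SLC": 2, "MLC": 4, "TLC": 8}
bits = {name: bits_per_cell(levels) for name, levels in cell_types.items()}
print(bits)
```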
A write-verify operation is used to program or erase the NAND flash. A pulse of high voltage is applied to the cell, and this process is repeated until the proper value has been programmed into the cell. The big drawback of this technique is that a cell must be erased before it can be reprogrammed. Also, the more levels that a flash cell supports, the smaller and more precise the high-voltage pulses must be. Making these pulses more precise slows the programming of the flash and reduces its write performance.
Disk drives use symmetrical access—the minimum read and write accesses are the same size. NAND flash memory, on the other hand, like most nonvolatile memory, uses asymmetrical access—the sizes of the minimum read and write to the media are different. This asymmetry is a result of the architecture of the memory array. Like most memory, NAND consists of a two-dimensional array with a bit line as one dimension of access and a word line as the other. The difference in NAND is that multiple cells share the bit line. This NAND string consists of 32-64 cells. The smallest read is typically an 8-KB page, based on the word-line length. A write to the array requires linearly programming all the pages along the NAND string, making the smallest write size 32-64 pages. This is called a program/erase block.
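The read/write asymmetry described above can be worked out with simple arithmetic. This sketch uses the 8-KB page and the 32-64 page range from the text; the function name is made up for illustration.

```python
PAGE_SIZE_KB = 8  # typical smallest read, per the text

def program_erase_block_kb(pages_per_block: int, page_kb: int = PAGE_SIZE_KB) -> int:
    """Size of the smallest write unit: every page along the NAND string."""
    return pages_per_block * page_kb

for pages in (32, 64):
    size = program_erase_block_kb(pages)
    print(f"{pages} pages -> {size} KB block, {size // PAGE_SIZE_KB}x the smallest read")
```

With these figures the smallest write is 256-512 KB, which is 32-64 times larger than the smallest read.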
A die is a piece of silicon, cut from a wafer, that consists of one to four memory arrays of approximately 4,000 blocks and the elements necessary to make the array usable. These non-storage elements consist of: controller logic for managing operations on the memory array; high-voltage generators for programming and reading from the array; sense detectors for reading the threshold values from the cell; cache buffers for temporarily storing the data bits to and from the memory array; and a high-speed interface for reading and writing data out of the die.
A single die is capable of reading up to 400 MB/sec but is capable of writing at only 10-20 MB/sec because of the complexity of programming. Latency is a major benefit of NAND flash. Typical latencies are 20-200 microseconds for reads and 1-10 milliseconds for writes. This performance compares favorably with HDDs, where reads and writes are both typically measured in tens of milliseconds.
NAND flash dies have a relatively small capacity, holding up to 16 GB per die. An SSD contains 8-256 dies to meet storage requirements for computing. Because multiple dies will be active at the same time, larger-capacity SSDs tend to yield better performance. Multiple active dies can cause thermal problems, however, as the high-voltage generators in each die are inefficient. During heavy program and erase operations an SSD limits the number of active dies to avoid overheating or excessive peak current draws.
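To make the scaling argument above concrete, here is a rough Python sketch that multiplies out the figures quoted in the text. It assumes write throughput scales linearly with the number of active dies, and the cap on simultaneously active dies (`max_active_dies`) is a hypothetical parameter standing in for the thermal and peak-current limit. Neither assumption comes from a specific product.

```python
DIE_CAPACITY_GB = 16  # upper figure quoted in the text

def capacity_gb(dies: int) -> int:
    """Raw capacity of an SSD built from `dies` dies."""
    return dies * DIE_CAPACITY_GB

def aggregate_write_mb_s(dies: int, per_die_mb_s: int, max_active_dies: int) -> int:
    """Idealized write throughput when only some dies may be active at once."""
    active = min(dies, max_active_dies)
    return active * per_die_mb_s

print(capacity_gb(64))                   # raw GB for a 64-die drive
print(aggregate_write_mb_s(64, 15, 32))  # large drive, limited by the active-die cap
print(aggregate_write_mb_s(8, 15, 32))   # small drive, limited by its die count
```

Under these assumptions the 64-die drive writes four times faster than the 8-die drive, even though only half of its dies are allowed to program at once.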
Electron-based storage has many limitations, as do all storage media. The biggest limitations are found in the program and erase operations. Over time, the high-voltage pulses will burn out the oxide layer, reducing the cells' ability to isolate the electrons. Electrons will also become trapped in the oxide layer, adding resistance to the threshold measurement and causing a misread of the threshold value from the cell.
Burnout is the most misunderstood attribute of NAND flash. Manufacturers often specify the number of recommended program and erase cycles per cell for warranty purposes but do not specify the amount of time that a cell will retain data. Even without excessive use, stored electrons will eventually dissipate from the cell, and the data will be lost. The number of program and erase cycles simply accelerates the time before the data fades. Data can be retained for years in lightly cycled cells, while heavily cycled cells may retain data for only a few months. Long exposure to high temperatures also accelerates the decay of data. Numerous reads of the same NAND block can cause electrons to escape, changing the value stored within the cell, which leads to a failure mechanism called read disturbance.
As manufacturing technology improves, allowing the cell size to shrink and reducing the cost, the number of electrons stored per cell shrinks, as does the isolation oxide size per cell. This makes it harder to match the performance and longevity of the previous generation. These limitations, as well as many others, make NAND flash an impractical storage technology by itself. Burnout, data fade, and read disturbance are solvable problems for the SSD, however, as constantly and consistently moving data can prevent such failures. Thus, a symbiotic relationship between NAND flash and the controller is necessary for managing and working around the imperfections of the media.
The flash controller is the critical component that makes imperfect NAND flash robust and reliable. The controller is a complex embedded system with standalone processing and firmware for managing all aspects of the SSD. It is designed to protect and control the underlying NAND flash media.
Like a disk drive, the flash controller is most commonly implemented as a SoC (system-on-a-chip) design. The controller consists of multiple hardware-accelerated functional blocks, coupled to one or more embedded processor cores. These live within a single ASIC (application-specific integrated circuit) die to provide the lowest controller cost. Large SRAM (static RAM) for executing the SSD firmware is included in the ASIC, but often external DRAM (dynamic RAM) is used for caching both user data and internal SSD metadata. Higher-end SSDs include a backup power system of batteries or capacitors to ensure that the user data in the volatile cache is flushed out to the NAND array during unexpected power loss.
The host interface is the physical interface from the host system to the SSD. The majority of SSDs leverage existing storage standards and interfaces, such as SATA (serial ATA) and SAS (serial attached SCSI). They use traditional block-access protocols. The low-level primitives and high-speed serializer/deserializer are accelerated in hardware, with high-level block-access protocols implemented by firmware running on the embedded processor.
A newer storage interface for SSDs, not used with HDDs, is PCIe (Peripheral Component Interconnect Express). This is the same general-purpose I/O interface used on laptops and servers. This interface is a more efficient interconnect, since it allows systems to phase out the host bus adapter, an additional controller required on SATA and SAS devices to translate these bus protocols. Removing these bus controllers decreases latency and power consumption.
The flash channel is the controller and interface circuitry dedicated to communication with a physical subset of NAND flash on the SSD. NAND dies are connected to a controller via a parallel I/O interface capable of 400 MB/sec of throughput. This interface is shared by four to eight NAND dies, with only one die able to communicate with the flash-channel controller at any one time.
NAND dies with multiple memory arrays are capable of performing multiple operations at once, but they are not fully independent. The die and the channel controller use a lightweight protocol where data is transferred to/from buffers on the NAND die. The controller has to manage and correctly sequence reads, programs, and erase operations to all dies on the channel for the best utilization. Complex array operations, such as program and erase, will render the single memory array or plane busy until the operation is complete. This can take up to tens of milliseconds. Some SSD controllers have dedicated hardware sequencers or microcontrollers that can reorder die operations for optimal performance.
The most common SSD configuration is eight channels, but an SSD controller can have 4-32 channels to meet performance requirements.
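The benefit of sequencing operations across the dies on a channel can be illustrated with a toy timing model. This is a sketch under stated assumptions, not a description of any real controller: the bus is held only while a page is transferred (about 20 microseconds for an 8-KB page at 400 MB/sec, rounded), programming takes 1 millisecond (the low end of the write latency quoted earlier), and pages are issued round-robin across the dies.

```python
def channel_program_time_us(pages: int, dies: int, transfer_us: int, program_us: int) -> int:
    """Time to program `pages` pages round-robin over `dies` dies sharing one bus."""
    bus_free = 0               # time at which the shared bus is next idle
    die_free = [0] * dies      # time at which each die finishes its current program
    for i in range(pages):
        d = i % dies
        start = max(bus_free, die_free[d])   # need both the bus and the target die
        bus_free = start + transfer_us       # bus is held only during the transfer
        die_free[d] = bus_free + program_us  # die stays busy until programming ends
    return max(die_free)

print(channel_program_time_us(8, 1, 20, 1000))  # one die: every program is serialized
print(channel_program_time_us(8, 4, 20, 1000))  # four dies: programs overlap
```

In this model, eight pages take 8,160 microseconds on a single die but 2,100 microseconds across four dies, because the long array operations overlap while only the short transfers contend for the bus.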
NAND flash has higher performance but also a higher bit error rate than other storage media. The higher error rate requires the SSD to correct bit errors at gigabytes per second, equivalent to the speed of NAND flash. As such, the error-correction hardware responsible for encoding and decoding all reads and writes to the flash is often the largest portion of the SSD controller. Some controllers implement an ECC (error-correcting code) hardware engine for every flash channel to improve parallel performance, while others implement a single ECC engine shared by all the channels to reduce costs.
The most commonly used ECC is BCH (Bose-Chaudhuri-Hocquenghem). It is preferred for its speed and ease of implementation, but this comes at the cost of inefficiencies in the redundant storage required for ECC.
Flash-die designers make assumptions about the amount of BCH correctability needed for a given generation of NAND flash and add redundant space within each page and block to account for it. As dies have shrunk with each successive generation of flash, the number of errors and the need for error correction have grown; the minimum overhead required for correction (rather than storage) has increased five times over the past five years.
In addition, controller designs using BCH alone historically assumed that bit failures were uniform. Newer techniques offer more efficiency and correctability based on better understanding of bit failure characteristics and locality of data within the NAND flash die.
The use of LDPC (low-density parity-check) code methods along with more advanced information from the NAND die has led to 8-10 times the correctability of previous BCH methods. LDPC has some drawbacks, however: the correction performance is slower; it requires a significant amount of controller die space to implement; and NAND die designers are often reluctant to share information that would make the LDPC work effectively, as the parametric data needed is often considered a trade secret. A controller might include both BCH and LDPC capability and use LDPC only when the BCH techniques fail, ensuring fast performance with high data reliability.
Another novel error-correction approach taken from HDD arrays is using XOR parity across a group of NAND dies. This technique should yield better correction with the ability to survive a full NAND die failure. Also, a data scrambler is used to "whiten" the data before it is written to the die. Whitening the data protects against writing certain user data patterns to the NAND die that can cause high bit failures that result from interference between adjacent cells in the memory array.
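The XOR parity idea above can be shown in a few lines of Python. This is a minimal sketch: each die contributes one tiny "page" of made-up bytes, one extra page holds the parity, and the layout is an assumption for illustration rather than any vendor's scheme.

```python
from functools import reduce

def xor_pages(pages):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))

# One page per die (contents are arbitrary sample data).
die_pages = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_pages(die_pages)

# Simulate losing die 1 entirely, then rebuild its page from the survivors plus parity.
failed = 1
survivors = [page for i, page in enumerate(die_pages) if i != failed]
rebuilt = xor_pages(survivors + [parity])
print(rebuilt == die_pages[failed])
```

Because XOR is its own inverse, any single missing page in the group can be recovered this way, at the cost of one page of parity per group.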
An SSD essentially has a complex file system running internally on the controller. The firmware running this file system is one of the key differentiators among different SSDs today.
The primary function of the FTL (flash translation layer) is to map logical blocks from the system to physical NAND pages and blocks. This mapping has the challenge of handling multiple sizes of requests and alignments because of the asymmetrical I/O access limitations of NAND flash. The system uses logical blocks consisting of 512 bytes or 4 KB, which then get mapped into 8-KB NAND pages, and finally need to be written into a block consisting of 64 or 128 pages.
SSDs have no standard technique for handling this issue. A common method is to statically map contiguous logical blocks into page-aligned allocation units the size of the NAND page. Once the SSD has enough allocation units, they are combined into a NAND block-size unit before being written to the flash. Most SSDs today use a derivative of the LFS (log-structured file system) as the basis for the FTL because the append-only write design works well given the erase-before-programming limitation of NAND flash.
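A heavily simplified sketch of the mapping just described might look like the following. It assumes 4-KB logical blocks, 8-KB pages, and 64 pages per NAND block, and it appends each allocation unit at a single write pointer. The class and method names are hypothetical, and the step of buffering a full NAND block before writing is omitted for brevity.

```python
LOGICAL_BLOCK = 4096                      # bytes per logical block (host view)
PAGE = 8192                               # bytes per NAND page
PAGES_PER_BLOCK = 64                      # pages per NAND block
LBAS_PER_UNIT = PAGE // LOGICAL_BLOCK     # logical blocks per allocation unit

class LogFTL:
    """Toy log-structured FTL: allocation units are only ever appended."""

    def __init__(self):
        self.map = {}        # allocation unit -> (nand_block, page)
        self.next_page = 0   # append-only write pointer, counted in pages

    def write_unit(self, unit):
        location = divmod(self.next_page, PAGES_PER_BLOCK)
        self.map[unit] = location   # any older copy of this unit is now stale
        self.next_page += 1
        return location

    def lookup(self, lba):
        # Static mapping: contiguous logical blocks share one page-sized unit.
        unit, slot = divmod(lba, LBAS_PER_UNIT)
        block, page = self.map[unit]
        return block, page, slot * LOGICAL_BLOCK

ftl = LogFTL()
ftl.write_unit(5)
ftl.write_unit(9)
ftl.write_unit(5)            # rewrite: lands on a new page, the old page becomes stale
print(ftl.lookup(11))        # LBA 11 is the second logical block of unit 5
```

Rewriting unit 5 never touches the page that held its first copy, which is how the append-only design sidesteps the erase-before-programming limitation. The stale page is left for later cleanup.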
Because writes can be written only to empty blocks, the FTL must maintain a pool of free blocks. If the FTL runs out of free NAND blocks, or if the SSD is inactive for a period of time, the firmware starts performing background garbage-collection operations to reclaim sparsely filled NAND blocks. These blocks are reclaimed by merging the data into new blocks and erasing the old blocks, thus creating a pool of free blocks for use by the FTL.
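The reclaim step above can be sketched as a greedy merge: pick the sparsest blocks, copy their live pages into one fresh block, and return the victims to the free pool. This is an illustrative assumption about victim selection, not a documented policy of any drive. `None` marks a page holding no live data, blocks are shrunk to four pages, and the matching FTL map update is left out.

```python
PAGES_PER_BLOCK = 4  # tiny block for illustration

def valid_pages(block):
    """Pages in a block that still hold live data."""
    return [page for page in block if page is not None]

def garbage_collect(blocks, free_pool, victims=2):
    """Merge the `victims` sparsest blocks into one fresh block."""
    chosen = sorted(blocks, key=lambda b: len(valid_pages(blocks[b])))[:victims]
    live = [page for b in chosen for page in valid_pages(blocks[b])]
    if len(live) > PAGES_PER_BLOCK:
        raise ValueError("live data does not fit in one block")
    dest = free_pool.pop(0)
    blocks[dest] = live + [None] * (PAGES_PER_BLOCK - len(live))
    for b in chosen:
        del blocks[b]          # stands in for erasing the old block
        free_pool.append(b)    # the erased block rejoins the free pool
    return dest

blocks = {
    0: ["a", None, None, None],
    1: [None, "b", "c", None],
    2: ["d", "e", "f", "g"],
}
free_pool = [3]
garbage_collect(blocks, free_pool)
print(blocks, free_pool)
```

One free block is consumed and two are returned, so the pool grows by one block. The three copied pages are extra writes the host never asked for, which is the cost that garbage collection adds.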
Early SSDs had performance inconsistency, particularly under excessive stress, because garbage collection either ran out of blocks or used channel/die bandwidth. SSDs today overprovision the physical NAND flash to ensure there are enough free blocks to prevent performance penalties from garbage collection. Most consumer SSDs overprovision less than 5 percent extra NAND flash, whereas enterprise-class SSDs overprovision up to 50 percent for performance-critical applications. SSD benchmarks now take into account the impact of garbage collection and require preconditioning the FTL before taking performance measurements.
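For a sense of scale, the overprovisioning percentages above translate into physical capacity as follows. The sketch assumes the percentage is expressed relative to user-visible capacity, which is one common convention but is not stated in the text, and the 200-GB drive is a made-up example.

```python
def physical_capacity_gb(user_gb: float, overprovision_percent: float) -> float:
    """Physical NAND needed for a given user capacity and overprovisioning level."""
    return user_gb + user_gb * overprovision_percent / 100

print(physical_capacity_gb(200, 5))    # consumer-style overprovisioning
print(physical_capacity_gb(200, 50))   # enterprise-style overprovisioning
```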
The FTL uses a number of methods to optimize performance despite the physical challenges of NAND flash:
• The firmware tracks the number of times each NAND block has been programmed and erased, and it spreads writes evenly across all NAND blocks in the SSD, increasing SSD longevity.
• A certain number of NAND blocks in a die are defective from the die-manufacturing process. Also, blocks can go bad during SSD operation. The FTL must track these bad blocks and substitute good blocks.
• NAND blocks are marked for garbage collection, based on the age of the data in them, to avoid data-retention issues.
• The FTL optimizes throughput and die usage. One common method is to statically interleave allocation units across multiple channels to ensure the best possible throughput, and to fill as much as possible of the individual blocks.
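Two of the methods in the list above, wear leveling and static channel interleaving, lend themselves to a short sketch. The version below is one plausible approach rather than any particular firmware's design: free blocks are kept in a min-heap keyed by erase count so the least-worn block is always allocated next, and allocation units are assigned to channels by simple modulo. All names are hypothetical.

```python
import heapq

class WearLeveler:
    """Hands out the free block with the fewest program/erase cycles."""

    def __init__(self, block_ids):
        self.heap = [(0, block) for block in block_ids]  # (erase_count, block)
        heapq.heapify(self.heap)

    def allocate(self):
        return heapq.heappop(self.heap)

    def release(self, erase_count, block):
        # Erasing the block before reuse costs one more cycle.
        heapq.heappush(self.heap, (erase_count + 1, block))

def channel_for(unit: int, channels: int = 8) -> int:
    """Static interleave: consecutive allocation units go to consecutive channels."""
    return unit % channels

leveler = WearLeveler(range(3))
order = []
for _ in range(4):
    count, block = leveler.allocate()
    order.append(block)
    leveler.release(count, block)
print(order)
print([channel_for(unit) for unit in range(10)])
```

Even though block 0 is freed immediately after its first use, it is not chosen again until blocks 1 and 2 have each been cycled once, which is the even spreading of writes the first bullet describes.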
Because HDDs have been assumed to be primary storage media, today's software and hardware are engineered to optimize the performance of these storage devices:
• File systems and applications use complicated heuristics to move the mechanical disk head the minimum distance possible, improving reads and writes.
• Adjacent requests are merged into a single larger request in a process called I/O coalescing, in turn building up the large sequential writes that HDDs handle best.
• There is also an assumption that HDDs will use linear logical addressing where the beginning of the address range is the outside diameter of the disk platter (the fastest part of the drive), and the end of the address range is the inner diameter (the slowest part of the drive).
These techniques, however, do not optimize SSD usage and can hinder SSD performance:
• In an SSD, the logical addressing ends up being completely random, based on the write-access pattern and how the FTL places the data on the NAND flash. This renders locality-based algorithms ineffective.
• Queuing a large number of I/O requests to SSDs can lead to inconsistencies in latency and performance (both I/O coalescing and the optimization of disk-head placement require substantial queuing of I/O requests).
File systems and applications need to be rewritten to take true advantage of the performance and benefits SSDs can offer. Linux has done the most work on optimizing scheduling for SSDs. The operating system has tunables for turning off I/O coalescing and locality-based scheduling heuristics, thus improving the predictability and performance of SSDs.
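As an illustration of the kind of tunables mentioned above, the sketch below builds the sysfs writes that select a pass-through I/O scheduler and disable request merging for one block device. The `scheduler` and `nomerges` files under `/sys/block/<device>/queue/` exist on current Linux kernels, but the accepted scheduler names vary by kernel version ("none" on multi-queue kernels, "noop" on older ones), and `sda` is a placeholder device name. Writing these files requires root, so `apply_settings` is defined here but not called.

```python
def ssd_queue_settings(device: str, scheduler: str = "none") -> dict:
    """Map sysfs paths to the values to write for an SSD-friendly queue."""
    queue = f"/sys/block/{device}/queue"
    return {
        f"{queue}/scheduler": scheduler,  # skip locality-based reordering
        f"{queue}/nomerges": "2",         # 2 disables all request merging
    }

def apply_settings(settings: dict) -> None:
    """Write each value to its sysfs file (requires root privileges)."""
    for path, value in settings.items():
        with open(path, "w") as handle:
            handle.write(value)

settings = ssd_queue_settings("sda")
for path, value in settings.items():
    print(f"{path} <- {value}")
```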
Some new interface additions for file systems and applications break the HDD emulation model and assist SSD performance:
• TRIM/UNMAP. A common problem with SSDs was that a host file system could delete data but had no way of telling the storage device that the data was no longer needed. The TRIM/UNMAP interface lets the SSD clear the LBA (logical block address) entries in the FTL, giving it more free space to use in garbage collection and reducing write amplification. OS X, Microsoft Windows, and Linux have implemented TRIM.
• Scatter Gather. Block-access command protocols limit requests to a single contiguous range of logical blocks, requiring excessive command overhead for small random requests serviced by SSDs. Scatter Gather adds commands for "gathering" multiple noncontiguous requests into a single command, reducing overhead and improving performance.
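The effect of TRIM/UNMAP on the FTL can be sketched with a toy mapping table. The structures here are assumptions for illustration: `mapping` takes a logical block address to a (NAND block, page) location, and `valid_count` tracks how many live pages each NAND block still holds.

```python
def trim(mapping, valid_count, first_lba, count):
    """Drop FTL entries for a trimmed LBA range and update live-page counts."""
    for lba in range(first_lba, first_lba + count):
        location = mapping.pop(lba, None)
        if location is not None:
            block, _page = location
            valid_count[block] -= 1   # that page no longer needs to be preserved

mapping = {0: (7, 0), 1: (7, 1), 2: (8, 0)}
valid_count = {7: 2, 8: 1}
trim(mapping, valid_count, first_lba=0, count=2)
print(mapping, valid_count)
```

After the trim, NAND block 7 holds no live pages, so garbage collection can erase it without copying anything first. That saved copying is where the reduction in write amplification comes from.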
While solid-state designs clearly benefit from some of the data-management techniques pioneered in the drive industry, the HDD emulation model will be broken as we move from the world of computer systems designed for rotating mechanical storage to a world of all solid-state storage.
In the future, NAND flash memory will inevitably reach physical limits. NAND dies are continually shrinking to lower costs, creating endurance and reliability issues that cannot be compensated for by the SSD controller or firmware. Newer memory technologies still in their infancy, such as PCM (phase-change memory) and ReRAM (resistive RAM), show great promise in moving beyond such limitations. They do so in part by shedding the erase-before-programming and asymmetrical access requirements of the NAND flash used in SSDs.
In turn, this progress inevitably will continue the evolutionary/revolutionary paradigm seen today in the transition from rotating media to solid-state devices. These new forms of media will no doubt borrow from, and build upon, the techniques implemented in NAND-based SSDs. At the same time, the shift to these newer technologies will require moving beyond the techniques developed today to deal with the unique challenges of NAND. New programming models and interfaces will need to be built to take full advantage of new storage media that offer the speed of DRAM coupled with the data retention of flash.
Michael Cornwell is director of technology and strategy for Pure Storage. He was previously with Sun Microsystems where he served as lead technologist for flash memory and led the creation of what was at the time the world's fastest storage system, the Sun Storage F5100 Flash Array. Before joining Sun he was the manager of storage engineering for the iPod division of Apple where he was instrumental in the adoption of NAND flash in Apple products. He formerly worked at Quantum Corporation as a storage architect. He holds a bachelor's degree in computer science from the University of California at Santa Cruz. He has 45 U.S. patents awarded in flash and other storage technologies.
© 2012 ACM 1542-7730/11/1000 $10.00
Originally published in Queue vol. 10, no. 10.