The Morning Paper

SageDB and NetAccel

Learned models within the database system; network-accelerated query processing

Adrian Colyer

For the last few years, we have been avid followers of Adrian Colyer's blog, The Morning Paper. We read and admire The Morning Paper for the same reasons we began Queue—it delivers practical ideas from academia to the computing practitioner in a useful, no-nonsense fashion. We know this approach works and our ever-growing readership supports this as well. In The Morning Paper, Adrian distills academic papers in a way that makes them naturally accessible to the general practitioner. It should be read by every Queue reader, and we are delighted to welcome Adrian and The Morning Paper to our publication.

- Stephen Bourne and Theo Schlossnagle, Queue editorial board co-chairs

 

CIDR (the Conference on Innovative Data Systems Research) runs only once every two years, and luckily for us 2019 is one of those years. I've selected two papers from this year's conference (held January 13-16) that highlight bold and exciting directions for data systems.

In "SageDB: A Learned Database System," the authors explore the potential of embracing machine learning within the very core of a database system: learned models that pervade the algorithms and data structures of the system itself. It hints at an interesting future for systems software in general, with machine-learned components augmenting and adapting systems to fit their workloads and execution environments.

My second choice is "The Case for Network-accelerated Query Processing." With programmable switches we have a whole new option for processing placement in distributed systems—we can now perform certain tasks "in the wire"! With NetAccel, the authors show how parts of the query processing for an MPP (massively parallel processing) database can be done at line rate in packet-processing hardware. The resulting performance improvements are hard to ignore.

- Adrian Colyer, The Morning Paper

SageDB: a learned database system, Kraska et al., CIDR'19

About this time last year, a paper entitled 'The case for learned index structures' generated a lot of excitement and debate. Today's paper choice builds on that foundation, putting forward a vision where learned models pervade every aspect of a database system.

The core idea behind SageDB is to build one or more models about the data and workload distribution and, based on them, automatically build the best data structures and algorithms for all components of the database system. This approach, which we call "database synthesis," will allow us to achieve unprecedented performance by specializing the implementation of every database component to the specific database, query workload, and execution environment.

For the want of a model

In the absence of runtime learning and adaptation, database systems are engineered for general purpose use and do not take full advantage of the specific characteristics of the workload and data at hand. The size of the opportunity for SageDB is the gap between such an approach and what is possible when designing a specialised solution with full knowledge of the data distribution and workload.

Consider an extreme case: we want to store and query ranges of fixed-length records with continuous integer keys. Using a conventional index here makes no sense as the key itself can be used as an offset. A C program loading 100M integers into an array and summing over a range runs in about 300ms. Doing the same operation in Postgres takes about 150 seconds: a 500x overhead for the general purpose design.
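
To make the gap concrete, here is a minimal sketch (mine, not the paper's) of that extreme case: with fixed-length records and continuous integer keys, the key itself is the array offset, so a range query needs no index structure at all.

    # A minimal sketch of the extreme case above: record i is stored at
    # offset i, so the "index lookup" is pure offset arithmetic.
    # (The paper's experiment used 100M integers; we use fewer here.)
    records = list(range(1_000_000))

    def range_sum(lo_key, hi_key):
        # key -> position is the identity function; no index needed.
        return sum(records[lo_key:hi_key + 1])

    print(range_sum(10, 19))   # sums the records with keys 10..19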

...we can optimize almost any algorithm or data structure used by the database with knowledge of the exact data distribution. These optimizations can sometimes even change the complexity class of well-known data processing algorithms.

Knowledge of the data distribution comes in the form of a (learned) model. Armed with such a model, the authors argue that we can automatically synthesise index structures, sorting and join algorithms, and even entire query optimisers, leveraging the data distribution patterns for performance gains.

Overfitting is good

What kind of model makes sense? A histogram, for example, is a very simple model, but for the use cases discussed here it is either too coarse-grained or too big to be useful. At the other end of the spectrum, deep and wide neural nets come with high costs (though these are expected to decrease with advances in hardware). Combine this with the fact that for this use case, 'overfitting' is good! We want to capture the nuances of our exact data as precisely as possible. (The research program to date is largely focused on analytic workloads; some degree of generalisation is clearly beneficial once we start to consider updates.)

As of today, we found that we often need to generate special models to see significant benefits.

As an example, consider the RMI (recursive model index) from 'The case for learned index structures':

  1. Fit a simple model (linear regression, simple neural net, etc.) over the data.
  2. Use the model's prediction to pick another model, an expert, which more accurately models the subset of the data.
  3. Repeat the process until a leaf model makes the final prediction. (A minimal sketch follows below.)
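
A minimal two-stage sketch of that recipe (my own illustration, using linear models throughout; the paper explores far more sophisticated variants) might look like this: the root model routes a key to one of a set of leaf models, the chosen leaf predicts a position in the sorted array, and a local search around the prediction finishes the lookup.

    import numpy as np

    class TwoStageRMI:
        """Toy recursive model index: a root linear model routes a key to
        one of num_leaves leaf linear models, and the chosen leaf predicts
        the key's position in the sorted key array."""

        def __init__(self, keys, num_leaves=100):
            self.keys = np.sort(np.asarray(keys, dtype=float))
            self.n = len(self.keys)
            self.num_leaves = num_leaves
            pos = np.arange(self.n)
            self.root = np.polyfit(self.keys, pos, 1)      # stage 1
            routed = self._route(self.keys)
            self.leaves = []
            for i in range(num_leaves):                    # stage 2: experts
                mask = routed == i
                if mask.sum() >= 2:
                    self.leaves.append(np.polyfit(self.keys[mask], pos[mask], 1))
                else:
                    self.leaves.append(self.root)          # too few keys: reuse root

        def _route(self, keys):
            frac = np.polyval(self.root, keys) / self.n    # predicted quantile
            return np.clip((frac * self.num_leaves).astype(int),
                           0, self.num_leaves - 1)

        def lookup(self, key):
            leaf = self.leaves[int(self._route(np.array([key]))[0])]
            guess = int(np.clip(np.polyval(leaf, key), 0, self.n - 1))
            # The model only needs to land nearby; walk to the exact spot.
            while guess > 0 and self.keys[guess] > key:
                guess -= 1
            while guess < self.n - 1 and self.keys[guess + 1] <= key:
                guess += 1
            return guess   # position of largest stored key <= search key

Here lookup returns a position the way a B-Tree lookup would, and the final local walk is what makes a merely approximate model acceptable.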

RMI is just a starting point. For example, it is possible to make the top or bottom model more complex, replace parts of the models at a particular stage with other types of models, use quantization, vary the feature representation, combine models with other data structures, and so on. We therefore believe we will see an explosion of new ideas on how to most efficiently generate models for database components to achieve the right balance between precision, low latency, space, and execution time for a given workload.

Data access

Last year's paper on 'The case for learned index structures' showed that an RMI-based index can outperform state-of-the-art B-Tree implementations by a factor of two while being orders of magnitude smaller (note that the updated arXiv version contains new results). Subsequent work has extended this to data stored on disk, compression, inserts, and multi-dimensional data.

For multi-dimensional data, the baseline is an R-Tree (as opposed to a B-Tree). R-Trees map rectangles to a list of index ranges such that the index of every point lying in the rectangle is contained in the union of these ranges. We can replace an R-Tree with a learned model, just as we could the B-Tree. One of the tricks that makes the RMI B-Tree replacement work is that it is sufficient for the model to get us 'in the right locality' and then we can do a local search around the prediction to finish the job. For R-Trees, we also need a layout that enables efficient localised search.

While many possible projection strategies exist, we found that successively sorting and partitioning points along a sequence of dimensions into equally-sized cells produces a layout that is efficient to compute, learnable (e.g., in contrast to z-order, which is very hard to learn), and tight (i.e., almost all points in the union of the index ranges satisfy the query).
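
Taken at face value, that layout can be computed by recursively sorting on one dimension, splitting into equally sized cells, and recursing on the next dimension within each cell. Here is a rough sketch of my reading of it (the authors' actual implementation will differ in the details):

    import numpy as np

    def grid_layout(points, dims, cells_per_dim):
        """Order record indices by successively sorting along each
        dimension in dims and partitioning into equally sized cells."""
        points = np.asarray(points)

        def recurse(idx, d):
            if d == len(dims) or len(idx) <= 1:
                return [idx]
            # Sort the records in this cell along the current dimension...
            order = idx[np.argsort(points[idx, dims[d]], kind="stable")]
            cells = []
            # ...then split into equal cells and recurse on the next one.
            for cell in np.array_split(order, cells_per_dim):
                cells.extend(recurse(cell, d + 1))
            return cells

        return np.concatenate(recurse(np.arange(len(points)), 0))

    # Example: lay out 2-D points, partitioning on x first, then on y.
    pts = np.random.rand(1000, 2)
    storage_order = grid_layout(pts, dims=[0, 1], cells_per_dim=8)

A rectangle query then touches a small number of contiguous cells in this storage order, which is what makes the index ranges 'tight'.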

The authors implemented such a learned index over an in-memory column store with compression, and compared it to a full column scan, a clustered index (sorting by the column providing the best overall performance), and an R-Tree. The benchmarks used 60 million records from the lineitem table of the TPC-H benchmark, with query selectivity of 0.25%.

The learned index beats the next best performing implementation by 34x (note the log scales on the charts) and has only a tiny space overhead compared to the clustered solution.

Further analysis revealed that the learned index beats the clustered index on almost every type of query — the exception is when the clustered dimension in the clustered index is the only dimension in the query.

Query execution

This is one of my favourite parts of the paper, because it demonstrates how learned models can even help in the humble and age-old task of sorting. The approach is to use a learned model to put the records into roughly the right order, and then correct the almost-sorted data as a final step. For this, an efficient local sort such as insertion sort can be used, which is very fast on almost-sorted arrays.
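
A minimal sketch of that strategy (my own, not the authors' implementation): use a CDF model to drop each value into roughly its final position, then repair the almost-sorted output with insertion sort.

    from statistics import NormalDist
    import random

    def learned_sort(xs, cdf):
        """Model-assisted sort: cdf(x) estimates the fraction of values
        below x, so cdf(x) * n is roughly x's final position."""
        n = len(xs)
        buckets = [[] for _ in range(n)]
        for x in xs:
            buckets[min(n - 1, int(cdf(x) * n))].append(x)
        out = [x for b in buckets for x in b]   # almost sorted already
        # Repair pass: insertion sort is near-linear on almost-sorted data.
        for i in range(1, n):
            x, j = out[i], i - 1
            while j >= 0 and out[j] > x:
                out[j + 1] = out[j]
                j -= 1
            out[j + 1] = x
        return out

    data = [random.gauss(0.0, 1.0) for _ in range(100_000)]
    model = NormalDist(0.0, 1.0).cdf        # a "perfectly learned" CDF here
    assert learned_sort(data, model) == sorted(data)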

The figure below shows the results of a learned approach to sorting for increasingly large data sizes of 64-bit doubles randomly sampled from a normal distribution. In the comparison, Timsort is the default sort for Java and Python, and std::sort comes from the C++ standard library. The learned variant is 18% faster than the next best (radix sort in this case) on average.

(This doesn't include the time taken to learn the model.)

Learned models can also be used to improve joins. For example, consider a merge join over two stored join columns, with a model per column. We can use the models to skip data that will not join (the authors don't detail how the equivalent of 'local patching' is supposed to work in this scenario; it's not immediately obvious to me).
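
The paper doesn't spell out the mechanism, but one plausible reading (entirely my own guess) is that the per-column models let the join jump instead of scan: when one side falls behind, predict where the current key lives in the other column and skip ahead, backing off by the model's error bound.

    def model_merge_join(a, b, model_b, err=8):
        """Merge join over sorted, duplicate-free columns a and b, where
        model_b(key) predicts key's position in b to within err slots."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                # b has fallen behind a: jump near a[i]'s predicted spot
                # rather than advancing one tuple at a time.
                j = max(j, model_b(a[i]) - err)
                while j < len(b) and b[j] < a[i]:
                    j += 1
        return out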

The authors also experimented with workload-aware schedulers, implementing a reinforcement learning-based scheduling system using a graph neural network:

Our system represents a scheduling algorithm as a neural network that takes as input information about the data (e.g., using a CDF model) and the query workload (e.g., using a model trained on previous executions of queries) to make scheduling decisions.

On a sample of 10 TPC-H queries, the learned scheduler improved average job completion time by 45% over Spark's default FIFO scheduler.

The strategy that the scheduler learned to get this improvement was to combine completing short jobs quickly with maximising cluster efficiency, learning to run jobs near their parallelism 'sweet spot.'

Query optimiser

Traditional query optimizers are extremely hard to build and maintain, and often yield sub-optimal query plans. The brittleness and complexity of the optimizer makes it a good candidate to be learned...

Initial experiments starting with a traditional cost model and refining it over time through learning showed that model quality can be improved, but that big gains would require significant improvements to cardinality estimation. The research direction now (no reported results as yet) is to explore hybrid-model-based approaches to cardinality estimation. These hybrid models combine a learned model of the underlying data patterns and correlations with exception/outlier lists that capture extreme (and hard-to-learn) anomalies of the particular instance of the data.
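
No results are reported yet, but the shape of the idea is easy to sketch (hypothetical names, my own illustration): consult an explicit outlier list first, and fall back to the learned model otherwise.

    class HybridCardinalityEstimator:
        """Hypothetical hybrid estimator: a small exception list captures
        hard-to-learn anomalies; a learned model handles everything else."""

        def __init__(self, learned_model, outliers):
            self.learned_model = learned_model   # callable: predicate -> estimate
            self.outliers = outliers             # dict: predicate -> exact count

        def estimate(self, predicate):
            if predicate in self.outliers:
                return self.outliers[predicate]
            return self.learned_model(predicate)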

Other areas

Other suggested areas where learned models may prove beneficial in the future include approximate query processing, predictive modelling, and workloads including inserts and updates.

The last word

SageDB presents a radical new approach to build database systems, by using ML models combined with program synthesis to generate system components. If successful, we believe this approach will result in a new generation of big data processing tools, which can better take advantage of GPUs and TPUs, provide significant benefits in regard to storage consumption and space, and, in some cases, even change the complexity class of certain data operations.

The case for network-accelerated query processing, Lerner et al., CIDR'19

Datastores continue to advance on a number of fronts. Some that come to mind are adapting to faster networks (e.g., 'FaRM: Fast Remote Memory') and persistent memory (see, e.g., 'Let's talk about storage and recovery methods for non-volatile memory database systems'), deeply integrating approximate query processing (e.g., 'ApproxHadoop: Bringing approximations to MapReduce frameworks' and 'BlinkDB'), embedding machine learning in the core of the system (e.g., 'SageDB'), and offloading processing into the network (e.g., KV-Direct), one particular example of exploiting hardware accelerators. Today's paper gives us an exciting look at the untapped potential of network-accelerated query processing. We're going to need all that data-structure synthesis and cost-model-based exploration coupled with self-learning to unlock the potential that arises from all of these advances in tandem!

NetAccel uses programmable network devices to offload some query patterns for MPP databases into the switch.

Thus, for the first time, moving data through networking equipment can contribute to query execution. Our preliminary results show that we can improve response times on even the best agreed upon plans by more than 2x using 25Gbps networks. We also see the promise of linear performance improvement with faster speeds. [Emphasis added]

That promise of linear performance hints at an 8x speedup with a 100Gbps network in a data center!

Programmable switches and MAUs

Modern programmable network devices have packet-processing hardware in the form of match-action units, or MAUs. An MAU combines a match engine (is this a packet of interest?) with an action engine (what should I do with it?). The match engine holds data in a table format. MAUs are programmable in the sense that the table layout, the type of lookup to perform, and the processing done on a match can all be specified. Combining several MAUs in pipeline fashion yields a programmable dataplane.
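
As a mental model only (a toy software analogue of my own, not P4 and not how the hardware works), an MAU behaves like a lookup table of actions, and a dataplane is a fixed pipeline of such tables:

    class MAU:
        """Toy match-action unit: match on one header field, apply an action."""

        def __init__(self, match_field):
            self.match_field = match_field
            self.table = {}                            # match value -> action

        def install(self, value, action):
            self.table[value] = action

        def process(self, packet):
            action = self.table.get(packet.get(self.match_field))
            return action(packet) if action else packet   # miss: pass through

    def dataplane(packet, maus):
        for mau in maus:                               # fixed-length pipeline
            packet = mau.process(packet)
        return packet

    # Example: one MAU rewrites the egress port for a matched destination.
    mau = MAU("dst")
    mau.install("10.0.0.2", lambda p: {**p, "port": 7})
    print(dataplane({"dst": "10.0.0.2"}, [mau]))   # {'dst': '10.0.0.2', 'port': 7}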

When programming MAUs we must take care to ensure that progress can be made at line speed. This requires constant-time lookups and deterministic actions: for example, no loops or dynamic resource allocation are permitted, as these could make a program's runtime non-deterministic. The P4 language can be used to program collections of MAUs, and its language restrictions and compiler ensure that a given program can run properly on a given target. Current hardware devices have on the order of 12-20 MAUs.

Processing in the network

Traditionally, the fastest plans in MPP databases are those with the least amount of data movement across nodes, and network switches are passive elements that just route packets. The tuples generated by query execution are an opaque payload from the perspective of the network.

A query such as Query 20 from the TPC-H benchmark (Fig 2a), which joins five relations, has a query plan designed to minimise data motion (Fig 2b). With programmable switches though, it becomes possible to implement some parts of the query in the network. This requires an updated query plan (Fig 2c) to take advantage of the possibility.



For the first time, entire segments of a query plan can be performed on the switch, with potentially strong consequences to data placement, query execution, and optimization. The tuples are processed on the switch at "line-speed" — up to 100 Gbps in our case — yielding performance improvements that increase with network speed.

Introducing NetAccel

NetAccel is an MPP database (Greenplum) extended with three novel components:

The network scheduler decides how the MAUs will be organised. One strategy is to implement fixed common query patterns, e.g., join-and-group-by; alternatively, the MAU allocation could be adaptive. In the prototype a fixed join-and-group-by pattern is used. The scheduler also puts in place a strategy for dealing with overflowing tuples, that is, what should happen when incoming tuples do not fit in the data space allocated at the MAU. One simple strategy is to overflow to the control plane, which can be reached simply by routing packets to it. Once a tuple has overflowed, any further operation on that tuple is also redirected.

The deparser manages communication between the MPP node and the switch. This includes the tuples themselves, as well as the instructions for what is to be done with them.

An important consideration here is the choice of network protocol. For instance, simply making a tuple the payload of a TCP packet would not work: the switch drops many packets... and moreover creates new packets dynamically... TCP being stateful, making such changes to the packet flow without the receiver treating them as anomalies would be an overhead. Better network protocol stacks exist for our case. We could use a traditional IP stack and a connectionless protocol such as UDP, or use a lightweight protocol directly atop Ethernet. We are currently exploring the latter...

At 100 Gbps and tuples of less than 40 bytes, we may need to forward more than 148M tuples per second per port. (That figure matches the line rate for minimum-size Ethernet frames: each 64-byte frame plus 20 bytes of preamble and inter-frame gap occupies 672 bits, and 100 Gbps / 672 bits is roughly 148.8M frames per second.) Normal OS and TCP/IP stacks cannot operate at this pace; NetAccel bypasses both.

The fun part of course is the network-accelerated operators themselves. The authors discuss operators for hash-joins, hash-based aggregation, data motion, and data reloading.

Hash joins require a minimum of two MAUs: one to store the hash table and one to keep track of overflow. Capacity can be extended in each case by using a chain of MAUs to store larger tables: each additional MAU in the chain stores one more position in the collision chain for every location in the hash table.

If insertion fails at all MAUs (i.e., the collision chain is full), then the packet is overflowed: the packet metadata is updated to indicate that an insertion was not possible, and we also record that the chain to which the packet hashes is full. The paper gives the full algorithm.
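
The following is a rough reconstruction (mine, not the paper's listing) of the insertion logic just described, with each dict standing in for the table held by one MAU in the chain:

    NUM_MAUS = 4                  # chain length: assumed for illustration
    TABLE_SIZE = 1024

    maus = [dict() for _ in range(NUM_MAUS)]   # one slot array per MAU
    full_chains = set()                        # chains known to be full

    def insert(key, value):
        h = hash(key) % TABLE_SIZE
        if h in full_chains:
            return overflow(key, value)        # skip a known-full chain
        for mau in maus:                       # walk the collision chain
            if h not in mau:
                mau[h] = (key, value)
                return "stored"
        full_chains.add(h)                     # record that the chain is full
        return overflow(key, value)

    def overflow(key, value):
        # Route to the control plane; subsequent operations on this
        # tuple are redirected there as well.
        return "overflowed"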

See section 3.3 in the paper for the details of the hash-based aggregation algorithm.

Data motion operators assist in sending the results of an operation to downstream nodes, and data reloading operators are used to relocate data within the switch.

Prototype and evaluation

The prototype implements a join-and-group-by-one query pattern on the MAUs of a Tofino switch. The pattern uses 10 MAUs, as shown in the figure below.



The experiments use a four-node Greenplum Parallel Database instance (three segment nodes and one control node), running Query 20 from the TPC-H benchmark with a scale factor of 100.

For the most expensive join-and-group-by segment of this query, the normal Greenplum query plan completes in 1702ms, whereas the network accelerated plan completes in 834ms.

In the normal plan, Greenplum tries to move as little data as possible and to perform the join-and-group-by locally; this is the accepted best practice for distributed plans. Conversely, the accelerated plan pushes both relations onto the switch. The effective network speed the deparser achieved in that case was within 2% of the nominal maximum of 25 Gbps, and the accelerated plan ran 2.04x faster.

A second experiment explores the allocation of MAU tables to either the join or group-by stages, finding that it is more advantageous to assign extra MAUs to the join rather than to the group-by.

A final experiment investigates the effects of varying network speed. In the chart below, the 'original' bar shows the plan originally chosen by Greenplum, and 'normal' is the best possible plan of the query without using the network acceleration. The 'accel' bar shows the performance achieved with network acceleration, and the 'min' bar shows the theoretical minimum time it would take if the network was systematically saturated.

The current limitation to getting the 'accel' performance closer to 'min' is the deparser, which plateaus at 29 Gbps.

Realizing the full potential speedup (the difference between the 'min' bar at a given speed and the 'accel' one) requires some future work (e.g., by using RDMA).

Research agenda

NetAccel opens up several fundamental research directions to be explored.

While realizing the full potential of our vision will take years, we are excited by the prospect of NetAccel and by the new data processing avenues opening up with the advent of next-generation networking equipment.

I have to say, I can't wait to see where this will take us. I have a feeling it's going to be a big part of large scale data processing in the future.


Adrian Colyer is a venture partner with Accel in London, where it's his job to help find and build great technology companies across Europe and Israel. (If you're working on an interesting technology-related business he would love to hear from you: you can reach him at [email protected].) Prior to joining Accel, he spent more than 20 years in technical roles, including CTO at Pivotal, VMware, and SpringSource.

Copyright © 2019 held by owner/author. Publication rights licensed to ACM.

Reprinted with permission from https://blog.acolyer.org


Originally published in Queue vol. 17, no. 1









