
Escaping the Singularity


Extract, Shoehorn, and Load

Data doesn't always fit nicely into a new home.

Pat Helland

Moving data from system to system is an important and growing part of the computing landscape. This is traditionally known as ETL (extract, transform, and load). While many systems are extremely good at this process, the source for the extraction and the destination for the load frequently have different representations for their data. It is common for the transformation to squeeze, truncate, or pad the data to make it fit into the target. This is like using a shoehorn to force a foot into a shoe that is too small. Sometimes it's a needed step. Frequently it's a real pain!


Goes-Outta and Goes-Inta

Two major parts of ETL are the extraction and the load. These processes are where the rubber meets the participating data stores.

Extraction pulls data out of a source system. This may be relational data kept in a database. If so, it may be converted to an object relational format where each object transforms the join of multiple relational rows into a cohesive thing. Data is frequently organized as messages when it is sucked out. It's also common for data to be extracted from key-value stores where it is kept in a semi-structured representation.
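
As a concrete (if entirely hypothetical) sketch, the following Python pulls relational rows out of a source database, folds the join of a customer and its orders into one cohesive object, and emits each object as a message. The database file, tables, and columns are invented for illustration.

import json
import sqlite3

# Open a hypothetical source database of customers and orders.
conn = sqlite3.connect("source.db")
conn.row_factory = sqlite3.Row

def extract_customer_objects(conn):
    """Yield one nested object per customer, folding in its order rows."""
    for cust in conn.execute("SELECT id, name, email FROM customers"):
        orders = conn.execute(
            "SELECT order_id, total FROM orders WHERE customer_id = ?",
            (cust["id"],))
        yield {
            "id": cust["id"],
            "name": cust["name"],
            "email": cust["email"],
            "orders": [dict(row) for row in orders],
        }

# Each extracted object is self-contained and can travel as a message.
for obj in extract_customer_objects(conn):
    message = json.dumps(obj)
    # ... hand the message to the transform step ...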

Load happens when the data is placed into the target system. The target will have its own metadata describing the shape and form of the data in its belly. If the target is an analytics system, then its data will likely be loaded into a relational form.

While it may be counterintuitive, it is frequently useful to take relational data out of a system as objects; convert, massage, and shoehorn the data from one object representation to another; and load it into the target system in relational form.

The new shoehorned data is then used for analytic queries.

 

Descriptive versus Prescriptive Metadata

To make this work, you need metadata1—for both the source being extracted and the target being loaded.

The metadata for the extracted source is descriptive. The data exists; the metadata describes its shape, form, and meaning. Extracted data is copied out, and the descriptive metadata records both the shape the data had in the source and the shape it takes as it is extracted.

The metadata for the loaded data is prescriptive. The ETL system makes it fit the output metadata's shape and form exactly as it is prescribed to do. Only when the system knows what the results should look like can it do the work.

ETL systems always need to know the current shape of the input, as well as the shape the data should become.
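
A small Python sketch makes the distinction tangible: the descriptive side is observed from whatever record actually arrives, the prescriptive side is declared up front by the target, and comparing the two tells the ETL system what work remains. The field names here are hypothetical.

# Descriptive metadata: derived by looking at the extracted data.
def describe(record):
    """Report the observed shape of a record: its fields and their types."""
    return {field: type(value).__name__ for field, value in record.items()}

# Prescriptive metadata: declared by the target; the load must match it.
TARGET_SCHEMA = {"id": "int", "name": "str", "email": "str", "order_count": "int"}

extracted = {"id": 7, "name": "Ada", "orders": [{"order_id": 1, "total": 9.5}]}

observed = describe(extracted)                    # what the data is
to_pad = TARGET_SCHEMA.keys() - observed.keys()   # fields that must be added
to_drop = observed.keys() - TARGET_SCHEMA.keys()  # fields that must be discarded
print(to_pad, to_drop)  # {'email', 'order_count'} and {'orders'}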

 

Loading Data: Targeting Relational Queries

Frequently the output data to be loaded into the target system is relational in shape. Many analytics systems expect relational data for analysis and reporting. Relational is a great format for both planned and ad hoc queries.

 

What If the Shoe Doesn't Fit?

When incoming data and its descriptive metadata don't match the outgoing shape specified by the prescriptive metadata, you need to do something about it.

When a shoe is too large, people will shove padding into it. Similarly, when the incoming data doesn't have all the information required for the outgoing shape and form, you add stuff. This may be a default value or a null value.

If a shoe is too small and the foot is too large, sometimes you use a shoehorn to force the foot into the shoe, comfort be damned. This is a real pain! Similarly, when the incoming data has too much information, the system needs to discard data that doesn't fit the outgoing metadata.

The process of discarding or padding data is very common.

All too often, the descriptive metadata for the input is not a perfect match to the prescriptive metadata for the desired output!
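
The shoehorn itself can be sketched in a few lines of Python, under the assumption that the prescriptive metadata carries a default for every target field (None when nothing better exists). The names are illustrative.

# Defaults drawn from the prescriptive metadata, one per target field.
TARGET_DEFAULTS = {"id": None, "name": "", "email": None, "order_count": 0}

def shoehorn(record, defaults):
    """Force a record into the prescribed shape: pad missing fields
    with defaults and silently drop whatever does not fit."""
    return {field: record.get(field, default)
            for field, default in defaults.items()}

row = shoehorn({"id": 7, "name": "Ada", "orders": [1, 2]}, TARGET_DEFAULTS)
# row == {'id': 7, 'name': 'Ada', 'email': None, 'order_count': 0}
# 'orders' was discarded; 'email' and 'order_count' were padded.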

 

Extracting Data: A Hodgepodge of Sources

Sometimes data is extracted from many sources, each with the same or different input metadata describing the stuff being extracted. It's essential that the data from the various sources be modified to fit the shape prescribed by the target metadata.

Note that normalizing the data to relational form may be difficult with different input data from different systems. The needed information may be missing from some input source.
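
As a sketch of that difficulty, the following hypothetical Python normalizes records from two differently shaped sources into one target shape; source B never captured an email address, so that column can only ever be padded with null.

def from_source_a(rec):
    """Map a source-A record onto the target shape."""
    return {"id": rec["customer_id"], "name": rec["full_name"],
            "email": rec["email"]}

def from_source_b(rec):
    """Map a source-B record onto the target shape; email was never captured."""
    return {"id": rec["uid"], "name": rec["display_name"], "email": None}

source_a = [{"customer_id": 7, "full_name": "Ada Lovelace",
             "email": "ada@example.com"}]
source_b = [{"uid": 42, "display_name": "Charles Babbage"}]

# One unified relation, ready for load, with nulls where a source fell short.
unified = ([from_source_a(r) for r in source_a] +
           [from_source_b(r) for r in source_b])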

 

Conclusion

ETL takes disparate sources and destinations and moves data from one to the other. Frequently there is only a partially useful mapping between their metadata. Sometimes data needs to be discarded to traverse the path from source to destination. Other times the source data may need to be augmented with null values or default values. It's also possible that the mapping is complex and loses much of the meaning kept in the original representation as the data is reshaped and re-formed.

Metadata for the extracted source data is descriptive: it describes the data as it exists. Metadata for the data loaded into the target is prescriptive: it prescribes the required target shape and form. The challenge is that the described input may be an ill fit for the prescribed output.

It turns out that the business value of ill-fitting data is extremely high. The process of taking the input data, discarding what doesn't fit, adding default or null values for missing stuff, and generally shoehorning it to the prescribed shape is important. The prescribed shape is usually one that is amenable to analysis for deeper meaning.

It is the shoehorning that gives the data the shape it needs to be understood consistently.

 

References

1. Helland, P. 2011. If you have too much data then "good enough" is good enough. acmqueue 9(5); https://queue.acm.org/detail.cfm?id=1988603.

 

Related articles

Immutability Changes Everything
We need it, we can afford it, and the time is now.
Pat Helland
https://queue.acm.org/detail.cfm?id=2884038

Data in Flight
How streaming SQL technology can help solve the Web 2.0 data crunch
Julian Hyde
https://queue.acm.org/detail.cfm?id=1667562

Other People's Data
Companies have access to more types of external data than ever before. How can they integrate it most effectively?
Stephen Petschulat
https://queue.acm.org/detail.cfm?id=1655240

 

Pat Helland has been implementing transaction systems, databases, application platforms, distributed systems, fault-tolerant systems, and messaging systems since 1978. For recreation, he occasionally writes technical papers. He currently works at Salesforce.

Copyright © 2019 held by owner/author. Publication rights licensed to ACM.

Originally published in Queue vol. 17, no. 2




