Check out Pat's
Scattered Thoughts on Distributed Systems

pathelland.substack.com

Escaping the Singularity

  Download PDF version of this article PDF

XML and JSON Are Like Cardboard

Cardboard surrounds and protects stuff as it crosses boundaries.

Pat Helland


In today's world, cardboard is an ever-important part of life. Given the major investment of resources and money, you might question whether it's worth it. It turns out that the efficiencies and savings from cardboard outstrip the costs to manufacture and later recycle it.

Semi-structured representations of data are not the cheapest format. There's typically a lot of extra stuff like angle brackets contained in it. JSON, XML, and other semi-structured representations allow for wonderful flexibility and dynamic interpretation. The efficiencies and savings gained from flexibility more than make up for the overhead.

Surround and Protect

Cardboard surrounds and protects stuff as it moves across boundaries. No one uses cardboard to move parts around within a factory. Instead, they use custom-designed containers that are specially purposed for the parts being produced. Cardboard is used to protect the stuff as it leaves the factory.

JSON and XML are used to protect data when it moves across trust boundaries. Semi-structured data wraps a single message or a single item in a key-value store in a way that allows for flexibility and extensibility. Inside an application, relational data is more tightly controlled and well formed. Evolving your relational data inside the trust and management boundary of an app is tractable.

SQL and relational data are easier and better for processing data within a trust boundary. XML and JSON are more flexible and dynamic as they capture the information and its metadata. This makes it easy and flexible to squirt data across trust boundaries.

Self-Defining and Self-Identifying

Cardboard is usually self-describing. Your new TV has printing on the outside of the box telling you what's inside the box. As you move your old TV to your new home, you write "TV" on the outside of the moving box.

JSON and XML are usually self-describing. This can be done by referencing a schema or by examining the attributes expressed within the document/file itself.

Generic versus Custom

The last time we moved to a new home, I bought a bunch of boxes of varying sizes, tape, wrapping paper and padding, and a bunch of marking pens. Like most other folks in the throes of packing and moving, we worked hard to describe the contents of every box we filled, but we occasionally messed up and omitted some items from the list as everything went into a box. Most things fit well into one of the standard boxes, although some of our household items involved really creative uses of cardboard, tape, and padding as we worked to protect our stuff.

Manufactured items frequently have custom-made cardboard protection. My wife loves the vacuum cleaners from one particular manufacturer. Indeed, the shape, form, and workings of the vacuums can be fun and surprising. To me, half the fun is disassembling the cardboard protection used inside the cardboard box. There are dozens of special pieces of cardboard wedged into every nook and cranny of the vacuum. Man, that vacuum is well protected! I suspect they have a factory just to create the specialized pieces of cardboard. I also suspect that the savings from avoiding damage are well worth it.

XML grew out of the document markup world. It descended from SGML (Standard Generalized Markup Language), which was originally intended to separate the text of a document from its formatting. XML is very strongly oriented around letting you "do your own thing" with the format.  

Yet, on top of the flexible "do your own thing" approach, there are mechanisms to impose rigor and constraints on XML documents. XML Schema came into being in the early 2000s as a means of ensuring consistency for a set of messages. A document is validated if it conforms to an XML schema definition. In this way, some usages of XML are constrained to fit a particular shape and form.

One of the wonderful things about XML and JSON is their flexibility. In some applications, they support a tightly prescribed schema much like the cardboard protecting the vacuum cleaner. In other applications, they allow you to toss in all your family goods, including the kitchen sink. Sometimes, there's a tightly prescribed schema of required data while the sender can toss in extensions to its heart's content.

Crossing Boundaries

In general, semi-structured data is used to cross boundaries in your computing environment. Documents containing human-readable stuff are kept on websites. REST calls are made across services that may or may not reside within the same company.

The loose coupling of semi-structured data allows the sending and receiving services to evolve separately with much lower friction. Changing tightly coupled stuff requires coordination that is just plain hard. 

Crossing Boundaries with Key-Value Stores

Frequently, semi-structured data in documents or files is stored in a file system or a key-value store. It is valuable to have readers and writers of these docs/files decoupled in their metadata. To have the shape and form of the data described in the contents of the docs and files makes it possible to evolve the various users with less friction than you would see if the metadata were strict and rigid. This is why we see the success of semi-structured representations for stored stuff.

It's Not the Size that Counts!

It turns out that the weight and size of the cardboard are not that big of a deal. You have surely had the experience of receiving some small item such as a computer chip packaged in a box that weighs a LOT more than the stuff being protected. It makes economic sense to protect the tiny thing well.

Large e-commerce sites ship tens of thousands of different things of different sizes. Still, they find it more efficient to use a relatively small number of box sizes. Consequently, it's common to open the box and find a tiny thing and a whole bunch of padding. 

Similarly, you shouldn't be too worried about the bulkiness of your files and documents. The embedded metadata can take a lot of space. Lord knows, an XML file has a lot of angle brackets! Still, the value accrued from the features of semi-structured data is worth it. As long as the world doesn't run out of angle brackets, it will be all right.

Gotta Take Care of Your Stuff!

In cardboard, the safety and care for stuff is the important reason for its existence. Similarly, in XML and JSON the safety and care of the data, both in transit and in storage, are why we bother. 

Now, if only we could figure out efficient recycling for used angle brackets, we'd be good to go...

Related articles

The Power of Babble
Expect to be constantly and pleasantly befuddled
Pat Helland
http://queue.acm.org/detail.cfm?id=3003188

Rules for Mobile Performance Optimization
An overview of techniques to speed page loading
Tammy Everts, Radware
http://queue.acm.org/detail.cfm?id=2510122

Schema.org: Evolution of Structured Data on the Web
Big data makes common schemas even more necessary.
R.V. Guha, Dan Brickley, and Steve Macbeth, Microsoft
http://queue.acm.org/detail.cfm?id=2857276

Pat Helland has been implementing transaction systems, databases, application platforms, distributed systems, fault-tolerant systems, and messaging systems since 1978. For recreation, he occasionally writes technical papers. He currently works at Salesforce.

Copyright © 2017 held by owner/author. Publication rights licensed to ACM.

acmqueue

Originally published in Queue vol. 15, no. 4
Comment on this article in the ACM Digital Library





More related articles:

Pat Helland - Identity by Any Other Name
New emerging systems and protocols both tighten and loosen our notions of identity, and that’s good! They make it easier to get stuff done. REST, IoT, big data, and machine learning all revolve around notions of identity that are deliberately kept flexible and sometimes ambiguous. Notions of identity underlie our basic mechanisms of distributed systems, including interchangeability, idempotence, and immutability.


Raymond Blum, Betsy Beyer - Achieving Digital Permanence
Today’s Information Age is creating new uses for and new ways to steward the data that the world depends on. The world is moving away from familiar, physical artifacts to new means of representation that are closer to information in its essence. We need processes to ensure both the integrity and accessibility of knowledge in order to guarantee that history will be known and true.


Graham Cormode - Data Sketching
Do you ever feel overwhelmed by an unending stream of information? It can seem like a barrage of new email and text messages demands constant attention, and there are also phone calls to pick up, articles to read, and knocks on the door to answer. Putting these pieces together to keep track of what’s important can be a real challenge. In response to this challenge, the model of streaming data processing has grown in popularity. The aim is no longer to capture, store, and index every minute event, but rather to process each observation quickly in order to create a summary of the current state.


Heinrich Hartmann - Statistics for Engineers
Modern IT systems collect an increasing wealth of data from network gear, operating systems, applications, and other components. This data needs to be analyzed to derive vital information about the user experience and business performance. For instance, faults need to be detected, service quality needs to be measured and resource usage of the next days and month needs to be forecast.





© ACM, Inc. All Rights Reserved.