Semi-structured Data

Vol. 3 No. 8 – October 2005

Semi-structured Data

Learning from the Web:
The Web has taught us many lessons about distributed computing, but some of the most important ones have yet to fully take hold.

In the past decade we have seen a revolution in computing that transcends anything seen to date in terms of scope and reach, but also in terms of how we think about what makes up “good” and “bad” computing.

by Adam Bosworth

Managing Semi-Structured Data:
I vividly remember during my first college class my fascination with the relational database.

In that class I learned how to build a schema for my information, and I learned that to obtain an accurate schema there must be a priori knowledge of the structure and properties of the information to be modeled. I also learned the ER (entity-relationship) model as a basic tool for all further data modeling, as well as the need for an a priori agreement on both the general structure of the information and the vocabularies used by all communities producing, processing, or consuming this information.

by Daniela Florescu

Order from Chaos:
Will ontologies help you structure your semi-structured data?

There is probably little argument that the past decade has brought the “big bang” in the amount of online information available for processing by humans and machines. Two of the trends that it spurred (among many others) are: first, there has been a move to more flexible and fluid (semi-structured) models than the traditional centralized relational databases that stored most of the electronic data before; second, today there is simply too much information available to be processed by humans, and we really need help from machines. On today’s Web, however, most of the information is still for human consumption in one way or another.

by Natalya Noy

The Cost of Data:
Semi-structured data is the result of economics.

In the past few years people have convinced themselves that they have discovered an overlooked form of data. This new form of data is semi-structured. Bosh! There is no new form of data. What folks have discovered is really the effect of economics on data typing—but if you characterize the problem as one of economics, it isn’t nearly as exciting. It is, however, much more accurate and valuable. Seeing the reality of semi-structured data clearly can actually lead to improving data processing. As long as we look at this through the fogged vision of a “new type of data,” however, we will continue to misunderstand the problem and develop misguided solutions to address it. It’s time to change this.

by Chris Suver

Why Your Data Won’t Mix:
New tools and techniques can help ease the pain of reconciling schemas.

When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies—or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel.

by Alon Halevy

XML <and Semi-Structured Data>:
XML provides a natural representation for hierarchical structures and repeating fields or structures.

Vocabulary designers can require XML data to be perfectly regular, or they can allow a little variation, or a lot. In the extreme case, an XML vocabulary can effectively say that there are no rules at all beyond those required of all well-formed XML. Because XML syntax records only what is present, not everything that might be present, sparse data does not make the XML representation awkward; XML storage systems are typically built to handle sparse data gracefully.

by C. M. Sperberg-McQueen

Kode Vicious Unscripted:
The problem? Computers make it too easy to copy data.

Some months, when he’s feeling ambitious, Kode Vicious reads through all of your letters carefully, agonizing for days over which to respond to. Most of the time, though, he takes a less measured approach. This usually involves printing the letters out, throwing them up in the air, and seeing which land face up, repeating the process until only two remain. And occasionally, KV dispenses with reader feedback altogether, as is the case this month.

by George Neville-Neil