Semi-structured Data

Vol. 3 No. 8 – October 2005

Articles

Learning from the Web

The Web has taught us many lessons about distributed computing, but some of the most important ones have yet to fully take hold.

ADAM BOSWORTH, GOOGLE

In the past decade we have seen a revolution in computing that transcends anything seen to date, not only in scope and reach but also in how we think about what makes up “good” and “bad” computing. The Web taught us several unintuitive lessons:

1. Simple, relaxed, sloppily extensible text formats and protocols often work better than complex and efficient binary ones. Because there are no barriers to entry, they are ideal for bottom-up adoption: an initiative can quickly form around them and reach a tipping point. In other words, anyone can write HTML, no matter how syntax-challenged they may be, because the browsers are so forgiving; and even writing an HTTP server is within the reach of orders of magnitude more people than, say, writing a CORBA or DCOM server. What’s more, if the text format doesn’t work, one can easily mail the HTTP request or HTML to friends, who will examine it in their text tool of choice and explain what is incorrect. In short, having a format that “normal” people can inspect, understand, augment, and author is hugely important to adoption in a bottom-up world.
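
To make the point concrete, here is a minimal sketch in Python (ours, not Bosworth’s) of how little stands between a curious user and a text protocol: the whole HTTP request is a few lines of readable text, and the reply can be inspected in any text editor. It assumes nothing beyond the standard library and the public example.com host.

    # A minimal sketch (not from the article) of how approachable a text
    # protocol is: an entire HTTP request is just a few lines of plain text
    # that anyone can read, write, or mail to a friend when it misbehaves.
    import socket

    request = (
        "GET / HTTP/1.0\r\n"
        "Host: example.com\r\n"
        "\r\n"
    )

    with socket.create_connection(("example.com", 80)) as conn:
        conn.sendall(request.encode("ascii"))
        reply = b""
        while chunk := conn.recv(4096):
            reply += chunk

    # The reply, status line and headers included, is equally readable text.
    print(reply.decode("utf-8", errors="replace")[:500])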

by Adam Bosworth

Managing Semi-Structured Data

DANIELA FLORESCU, ORACLE

I vividly remember, during my first college class, my fascination with the relational database—an information oasis that guaranteed a constant flow of correct, complete, and consistent information at our disposal. In that class I learned how to build a schema for my information, and I learned that obtaining an accurate schema requires a priori knowledge of the structure and properties of the information to be modeled. I also learned the ER (entity-relationship) model as a basic tool for all further data modeling, as well as the need for an a priori agreement on both the general structure of the information and the vocabularies used by all communities producing, processing, or consuming it.

Several years later I was working with an organization whose goal was to create a large repository of food recipes. The intent was to include recipes from around the world and their nutritional information, as well as the historical and cultural aspects of food creation.
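
To illustrate how much must be decided a priori, consider a hypothetical sketch (ours, not drawn from the article) that pins down a recipe schema in Python’s built-in sqlite3 before a single recipe arrives; anything that does not fit the chosen columns has nowhere to go.

    # A hypothetical sketch of the up-front commitments a relational schema
    # demands: every recipe must fit these columns before one row is stored.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE recipe (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            country   TEXT,           -- cultural origin reduced to one value
            calories  INTEGER         -- nutrition reduced to a single number
        );
        CREATE TABLE ingredient (
            recipe_id INTEGER REFERENCES recipe(id),
            name      TEXT NOT NULL,
            quantity  TEXT            -- "a pinch", "200 g", "to taste"
        );
    """)
    # A dish's history, a regional variant, a photograph: each either gets
    # its own new table (and a schema change) or is simply dropped.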

by Daniela Florescu

Order from Chaos

Will ontologies help you structure your semi-structured data?

NATALYA NOY, STANFORD UNIVERSITY

There is probably little argument that the past decade has brought the “big bang” in the amount of online information available for processing by humans and machines. Two of the trends it has spurred (among many others) are these: first, a move toward more flexible and fluid (semi-structured) models, away from the traditional centralized relational databases that previously stored most electronic data; and second, there is now simply too much information for humans to process on their own, and we really need help from machines. On today’s Web, however, most of the information is still intended for human consumption in one way or another.

Both of these trends are reflected in the vision of the Semantic Web, a form of Web content that will be processed by machines, with ontologies as its backbone. Tim Berners-Lee, James Hendler, and Ora Lassila described the “grand vision” for the Semantic Web in a 2001 Scientific American article: Ordinary Web users instruct their personal agents to talk to one another, as well as to a number of other integrated online agents—for example, to find doctors that are covered by their insurance; to schedule their doctor appointments to satisfy both constraints from the doctor’s office and their own personal calendars; to request prescription refills, ensuring no harmful drug interactions; and so on. For this scenario to be possible, the agents need to share not only the terms—such as appointment, prescription, time of day, and insurance—but also the meaning of these terms. For example, they need to understand that the time constraints are all in the same time zone (or to translate between time zones), to know that the term “plans accepted” in the knowledge base of one doctor’s agent means the same as “health insurance” for the patient’s agent (and not “insurance,” which refers to car insurance), and to realize it is related to the term “do not accept” for another doctor, which contains a list of excluded plans.
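
At its simplest, the agreement those agents need can be pictured as an explicit mapping between vocabularies. The Python sketch below is a toy illustration of ours, not something from the article, and the terms are the hypothetical ones used in the scenario above.

    # A toy sketch of term alignment: each agent has its own vocabulary, and
    # a shared mapping records which labels carry the same meaning.
    TERM_MAPPING = {
        "health insurance": "plans accepted",   # same concept, different label
    }

    def translate(patient_term: str) -> str:
        """Rewrite a patient-agent term into the doctor agent's vocabulary."""
        return TERM_MAPPING.get(patient_term, patient_term)

    assert translate("health insurance") == "plans accepted"
    assert translate("appointment") == "appointment"   # shared term, unchanged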

by Natalya Noy

Curmudgeon

The Cost of Data

CHRIS SUVER, MICROSOFT

In the past few years people have convinced themselves that they have discovered an overlooked form of data. This new form of data is semi-structured. Bosh! There is no new form of data. What folks have discovered is really the effect of economics on data typing—but if you characterize the problem as one of economics, it isn’t nearly as exciting. It is, however, much more accurate and valuable. Seeing the reality of semi-structured data clearly can actually lead to improving data processing (in the literal meaning of the term). As long as we look at this through the fogged vision of a “new type of data,” however, we will continue to misunderstand the problem and develop misguided solutions to address it. It’s time to change this.

For data to be operated on reliably, either in an application or in a tool (such as a database), it must be typed, or for high reliability, it must be strongly typed. This is necessary because at some point the underlying hardware has to choose the right circuit to process the bits. It has long since been demonstrated that any data can be typed—in fact, strongly typed. Still, today most data is not typed (i.e., structured). This is simply because it is not worth spending the resources to apply typing to the data: the value of the data does not justify the investment.
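
As a hypothetical illustration (ours, not Suver’s), the sketch below shows what applying typing actually costs: the raw line is nearly free to store, but turning it into a strongly typed record means declaring a structure, parsing, and validating, and someone has to decide that work is worth the data’s value. The field names and the sample record are invented.

    # A hypothetical sketch of the cost of typing data: the raw line is free
    # to store, but reliable processing requires structure, parsing, and
    # validation that someone must pay for.
    from dataclasses import dataclass
    from datetime import date

    raw = "2005-10-18,ACME,42,19.95"        # untyped: just characters

    @dataclass
    class Order:                            # typed: structure chosen up front
        day: date
        customer: str
        quantity: int
        unit_price: float

    def parse(line: str) -> Order:
        d, customer, qty, price = line.split(",")
        return Order(date.fromisoformat(d), customer, int(qty), float(price))

    order = parse(raw)
    print(order.quantity * order.unit_price)   # arithmetic is now reliable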

by Chris Suver

Articles

Why Your Data Won’t Mix

New tools and techniques can help ease the pain of reconciling schemas.

ALON HALEVY, UNIVERSITY OF WASHINGTON

When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies—or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel.

This article begins by reviewing several common scenarios in which resolving semantic heterogeneity is crucial for building data-sharing applications. We then explain why resolving semantic heterogeneity is difficult and review some recent research and commercial progress in addressing the problem. Finally, we point out the key problems and opportunities in this area.
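
As a small, hypothetical illustration of the problem (ours, not the article’s), the Python sketch below shows two sources describing the same item with different field names and units, and the hand-written mapping needed before their records can be combined.

    # Semantic heterogeneity in miniature: same item, two schemas.
    source_a = {"title": "Queue Vol. 3", "author": "ACM", "price_usd": 9.95}
    source_b = {"name": "Queue Vol. 3", "writtenBy": "ACM", "priceCents": 995}

    # Hand-written correspondence from source B's fields to source A's.
    B_TO_A = {
        "name": "title",
        "writtenBy": "author",
        "priceCents": ("price_usd", lambda cents: cents / 100),
    }

    def to_schema_a(record_b: dict) -> dict:
        out = {}
        for key, value in record_b.items():
            target = B_TO_A[key]
            if isinstance(target, tuple):       # rename plus unit conversion
                field, convert = target
                out[field] = convert(value)
            else:                               # plain rename
                out[target] = value
        return out

    assert to_schema_a(source_b) == source_a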

by Alon Halevy

XML and Semi-Structured Data

C. M. SPERBERG-MCQUEEN, WORLD WIDE WEB CONSORTIUM

XML, as defined by the World Wide Web Consortium in 1998, is a method of marking up a document or character stream to identify structural or other units within the data. XML makes several contributions to solving the problem of semi-structured data, the term database theorists use to denote data that exhibits any of the following characteristics:

  • Numerous repeating fields and structures in a naive hierarchical representation of the data, which lead to large numbers of tables in a second- or third-normal form representation
  • Wide variation in structure
  • Sparse tables

XML provides a natural representation for hierarchical structures and repeating fields or structures. Further, XML document type definitions (DTDs) and schemas allow fine-grained control over how much variation to allow in the data: Vocabulary designers can require XML data to be perfectly regular, or they can allow a little variation, or a lot. In the extreme case, an XML vocabulary can effectively say that there are no rules at all beyond those required of all well-formed XML. Because XML syntax records only what is present, not everything that might be present, sparse data does not make the XML representation awkward; XML storage systems are typically built to handle sparse data gracefully.
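
A small sketch (ours, not the author’s) of that flexibility in practice: in the invented XML fragment below, each record carries only the fields it actually has, one field repeats, and a consumer written with Python’s standard xml.etree.ElementTree simply checks what is present.

    # Variation and sparseness are unremarkable in XML: absent fields are
    # simply absent, and repeating fields simply repeat.
    import xml.etree.ElementTree as ET

    doc = """
    <people>
      <person><name>Ada</name><email>ada@example.org</email></person>
      <person><name>Grace</name><phone>555-0100</phone><phone>555-0101</phone></person>
    </people>
    """

    for person in ET.fromstring(doc).findall("person"):
        name = person.findtext("name")
        phones = [p.text for p in person.findall("phone")]   # repeating field
        email = person.findtext("email")                      # may be absent
        print(name, phones, email)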

by C. M. Sperberg-McQueen

Kode Vicious

Kode Vicious Unscripted

The problem? Computers make it too easy to copy data.

Some months, when he’s feeling ambitious, Kode Vicious reads through all of your letters carefully, agonizing for days over which to respond to. Most of the time, though, he takes a less measured approach. This usually involves printing the letters out, throwing them up in the air, and seeing which land face up, repeating the process until only two remain. And occasionally, KV dispenses with reader feedback altogether, as is the case this month. Not to worry, though: he still digs reading and responding to your monthly koding kwestions, so keep ’em coming to kv@acmqueue.com.

by George Neville-Neil