I vividly remember during my first college class my fascination with the relational database—an information oasis that guaranteed a constant flow of correct, complete, and consistent information at our disposal. In that class I learned how to build a schema for my information, and I learned that to obtain an accurate schema there must be a priori knowledge of the structure and properties of the information to be modeled. I also learned the ER (entity-relationship) model as a basic tool for all further data modeling, as well as the need for an a priori agreement on both the general structure of the information and the vocabularies used by all communities producing, processing, or consuming this information.
In the past decade we have seen a revolution in computing that transcends anything seen to date in terms of scope and reach, but also in terms of how we think about what makes up “good” and “bad” computing. The Web taught us several unintuitive lessons:
XML, as defined by the World Wide Web Consortium in 1998, is a method of marking up a document or character stream to identify structural or other units within the data. XML makes several contributions to solving the problem of semi-structured data, the term database theorists use to denote data that exhibits any of the following characteristics:
There is probably little argument that the past decade has brought the “big bang” in the amount of online information available for processing by humans and machines. Two of the trends that it spurred (among many others) are: first, there has been a move to more flexible and fluid (semi-structured) models than the traditional centralized relational databases that stored most of the electronic data before; second, today there is simply too much information available to be processed by humans, and we really need help from machines. On today’s Web, however, most of the information is still for human consumption in one way or another.
When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies—or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel.
In the past few years people have convinced themselves that they have discovered an overlooked form of data. This new form of data is semi-structured. Bosh! There is no new form of data. What folks have discovered is really the effect of economics on data typing—but if you characterize the problem as one of economics, it isn’t nearly as exciting. It is, however, much more accurate and valuable. Seeing the reality of semi-structured data clearly can actually lead to improving data processing (in the literal meaning of the term). As long as we look at this through the fogged vision of a “new type of data,” however, we will continue to misunderstand the problem and develop misguided solutions to address it. It’s time to change this.
The problem? Computers make it too easy to copy data.