January/February 2018 issue of acmqueue


Semi-structured Data




Originally published in Queue vol. 3, no. 8



Andrew McCallum - Information Extraction
Distilling structured data from unstructured text

Alon Halevy - Why Your Data Won't Mix
When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies—or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas.

Natalya Noy - Order from Chaos
There is probably little argument that the past decade has brought the “big bang” in the amount of online information available for processing by humans and machines. Two of the trends that it spurred (among many others) are: first, there has been a move to more flexible and fluid (semi-structured) models than the traditional centralized relational databases that stored most of the electronic data before; second, today there is simply too much information available to be processed by humans, and we really need help from machines.

Adam Bosworth - Learning from the Web
In the past decade we have seen a revolution in computing that transcends anything seen to date in terms of scope and reach, but also in terms of how we think about what makes up “good” and “bad” computing. The Web taught us several unintuitive lessons:


(newest first)

karthiga | Thu, 01 Nov 2012 10:04:45 UTC

Your paper helped me to get the concept of semi-structured data. You said that "XML makes several contributions to solving the problem of semi-structured data". I recently read that XML itself has a semi-structured structure. Could you please clarify my doubt?

C. M. Sperberg-McQueen | Tue, 29 Sep 2009 00:48:51 UTC

Thank you for your comment. As the remainder of the article should make clear, I am borrowing the term "semi-structured data" from database theorists, and attempting to follow their usage. I don't personally find the concept compelling. So while I can try to clarify what I understand the common usage of the term to be, I can't and won't argue that the concept is especially well grounded or satisfactorily distinguished from other concepts.

> Why can't structured data also have numerous repeating fields?

As I understand the usage of the terms, accounting for repeating fields by factoring the information into multiple relations is part of the process of structuring the data. So data with repeating fields is not structured, and the same data is structured after being reduced to third normal form.
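
A minimal Python sketch of the factoring step described above. The person records, the `phones` field, and the helper name `factor_repeating_field` are all hypothetical, chosen only for illustration; the sketch shows a single repeating field being split into its own relation, not a full reduction to third normal form.

```python
# Hypothetical records: each person carries a repeating "phones" field.
nested = [
    {"id": 1, "name": "Ada", "phones": ["555-0100", "555-0101"]},
    {"id": 2, "name": "Grace", "phones": []},
]


def factor_repeating_field(records, key, field):
    """Split a repeating field out into its own relation.

    Returns (parents, children): parents keep the single-valued
    attributes, children hold one (key, value) row per repetition.
    """
    parents, children = [], []
    for rec in records:
        # Parent row: everything except the repeating field.
        parents.append({k: v for k, v in rec.items() if k != field})
        # Child rows: one per repeated value, linked back by the key.
        for value in rec.get(field, []):
            children.append({key: rec[key], field: value})
    return parents, children


people, phones = factor_repeating_field(nested, "id", "phones")
```

After the call, `people` holds one row per person and `phones` holds one row per phone number, so the repetition lives in the number of rows rather than inside a single record.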

> What do you mean by "a naive hierarchical representation of data" in contrast to fully structured data?

I mean a representation of information created with a view to organizing the information and labeling it clearly, but without (or before) any attempt to normalize it or make it systematic according to any particular theoretical analysis.

> I also don't quite understand the two last characterizations. Could you please provide some more explanation/examples? That would be very helpful.

By "wide variation in structure" I mean that different instances of the same kind of entity have different structures.

For example, we may find that in the collection of data to be represented, one instance of 'person' contains a name and affiliation, another a name, job title, and address, and a third merely a name. Or: one poem consists of a title and fourteen lines, another has three lines and no title, and a third is divided into cantos, then into stanzas, and then finally into lines. Some plays consist of five acts, some of three, some of a single scene. Some entries in a bibliographic database require that we record the title of the article, the title of the journal, the date, the volume, and the pages on which the article was published. Other articles are published not in journals but in collections of papers, for which we must record the collection title, publisher, and editor(s). Some volumes of journals consist of special collections of papers with an individual volume title as well as the journal title, thus requiring both. Some papers are published in a journal and then republished in collections, sometimes multiple times, sometimes in translation.
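
One way to make this kind of variation visible is to group records by the set of fields they actually carry. The following is only a sketch: the three 'person' records mirror the example above with made-up values, and `group_by_shape` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical 'person' instances with differing structures.
persons = [
    {"name": "A. Author", "affiliation": "Example University"},
    {"name": "B. Editor", "job_title": "Editor", "address": "1 Main St"},
    {"name": "C. Reader"},
]


def group_by_shape(records):
    """Group records by the set of attribute names they contain."""
    shapes = defaultdict(list)
    for rec in records:
        # frozenset(rec) is the hashable set of the record's keys.
        shapes[frozenset(rec)].append(rec)
    return dict(shapes)


shapes = group_by_shape(persons)
for fields, members in shapes.items():
    print(sorted(fields), len(members))
```

Here every record has its own shape, so a single fixed list of attributes for 'person' would fit none of them exactly.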

If we assume with most database theory that in a database, we must record the same set of attributes for each entity described, then such variation in the attributes of entities represents a challenge, or an affront, to the theory of normalization. Empirically, the result is that literary works and bibliographic data are seldom represented satisfactorily in relational database management systems. (There are of course lots of databases for bibliographic data; I don't think I'm alone in thinking that most of them are cumbersome and unsatisfactory in small or large ways, at least for people who care about the details of good bibliographic practice.)

By "sparse tables" I mean simply conventional relational tables in which relatively few cells actually have values, while many or even most are null. If we attempt to reduce bibliographic entries to a single tabular structure by force, or by joining the many tables to which they are reduced in third normal form, we will end up with a table in which most column values are null.
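
A small sketch of how forcing heterogeneous entries into one table produces null cells. The two bibliographic entries, their field names, and the helpers `to_single_table` and `null_fraction` are illustrative assumptions, not taken from any real bibliographic schema.

```python
# Hypothetical entries: a journal article and a paper in a collection.
entries = [
    {"article_title": "On Trees", "journal_title": "J. Examples",
     "date": "2005", "volume": "3", "pages": "10-20"},
    {"article_title": "On Graphs", "collection_title": "Selected Papers",
     "publisher": "Example Press", "editors": "E. Ditor"},
]


def to_single_table(records):
    """Flatten records into one table over the union of all columns."""
    columns = sorted({col for rec in records for col in rec})
    # Missing attributes become None, standing in for SQL NULL.
    rows = [[rec.get(col) for col in columns] for rec in records]
    return columns, rows


def null_fraction(rows):
    """Return the share of cells that are None."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells) if cells else 0.0


columns, rows = to_single_table(entries)
sparsity = null_fraction(rows)
```

With only two kinds of entry, 7 of the 16 cells are already null; each additional kind of entry adds columns that are empty for most rows.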

It's unlikely that there is any bright clear line separating what theorists refer to as "structured" and what they refer to as "semi-structured" data. A large and indeterminate number of gradations are possible.

I hope this helps.

Normen Müller | Mon, 31 Aug 2009 07:55:14 UTC

Dear Mr. Sperberg-McQueen,

I am currently trying to figure out the difference between structured and semi-structured data. At the very beginning of your article, you stated:

"[...] the problem of semi-structured data, the term database theorists use to denote data that exhibits any of the following characteristics:

o Numerous repeating fields and structures in a naive hierarchical representation of the data, which lead to large numbers of tables in a second- or third-normal form representation
o Wide variation in structure
o Sparse tables"

Why can't structured data also have numerous repeating fields? What do you mean by "a naive hierarchical representation of data" in contrast to fully structured data? I also don't quite understand the two last characterizations. Could you please provide some more explanation/examples? That would be very helpful.

Thank you in advance for your time, Normen


© 2018 ACM, Inc. All Rights Reserved.