Semi-structured Data

Sort By:

Information Extraction:
Distilling structured data from unstructured text

In 2001 the U.S. Department of Labor was tasked with building a Web site that would help people find continuing education opportunities at community colleges, universities, and organizations across the country. The department wanted its Web site to support fielded Boolean searches over locations, dates, times, prerequisites, instructors, topic areas, and course descriptions. Ultimately it was also interested in mining its new database for patterns and educational trends. This was a major data-integration project, aiming to automatically gather detailed, structured information from tens of thousands of individual institutions every three months.

by Andrew McCallum | December 16, 2005


Why Your Data Won’t Mix:
New tools and techniques can help ease the pain of reconciling schemas.

When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies—or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas.

by Alon Halevy | December 8, 2005


Order from Chaos:
Will ontologies help you structure your semi-structured data?

There is probably little argument that the past decade has brought the “big bang” in the amount of online information available for processing by humans and machines. Two of the trends that it spurred (among many others) are: first, there has been a move to more flexible and fluid (semi-structured) models than the traditional centralized relational databases that stored most of the electronic data before; second, today there is simply too much information available to be processed by humans, and we really need help from machines.

by Natalya Noy | December 8, 2005


XML <and Semi-Structured Data>:
XML provides a natural representation for hierarchical structures and repeating fields or structures.

Vocabulary designers can require XML data to be perfectly regular, or they can allow a little variation, or a lot. In the extreme case, an XML vocabulary can effectively say that there are no rules at all beyond those required of all well-formed XML. Because XML syntax records only what is present, not everything that might be present, sparse data does not make the XML representation awkward; XML storage systems are typically built to handle sparse data gracefully.

by C. M. Sperberg-McQueen | December 8, 2005


Learning from the Web:
The Web has taught us many lessons about distributed computing, but some of the most important ones have yet to fully take hold.

In the past decade we have seen a revolution in computing that transcends anything seen to date in terms of scope and reach, but also in terms of how we think about what makes up “good” and “bad” computing.

by Adam Bosworth | December 8, 2005


Managing Semi-Structured Data:
I vividly remember during my first college class my fascination with the relational database.

In that class I learned how to build a schema for my information, and I learned that to obtain an accurate schema there must be a priori knowledge of the structure and properties of the information to be modeled. I also learned the ER (entity-relationship) model as a basic tool for all further data modeling, as well as the need for an a priori agreement on both the general structure of the information and the vocabularies used by all communities producing, processing, or consuming this information.

by Daniela Florescu | December 8, 2005