In 2001 the U.S. Department of Labor was tasked with building a Web site that would help people find continuing education opportunities at community colleges, universities, and organizations across the country. The department wanted its Web site to support fielded Boolean searches over locations, dates, times, prerequisites, instructors, topic areas, and course descriptions. Ultimately it was also interested in mining its new database for patterns and educational trends. This was a major data-integration project, aiming to automatically gather detailed, structured information from tens of thousands of individual institutions every three months.
The first and biggest problem was that much of the data wasn’t available even in semi-structured form, much less normalized, structured form. Although some of the larger organizations had internal databases of their course listings, almost none of them had publicly available interfaces to their databases. The only universally available public interfaces were Web pages designed for human browsing. Unfortunately, but as expected, each organization used different text formatting. Some of these Web pages contained two-dimensional text tables; many others used a stylized collection of paragraphs for each course offering; still others had a single paragraph of English prose containing all the information about each course.
The task thus required extracting structured information from English that had been formatted in a mixture of two-dimensional layout and free-running prose—a daunting technical challenge, but one that was ultimately solved successfully. More details about the solution follow, but first, let’s place this problem in context.
Articles in the October 2005 issue of ACM Queue addressed problems with semi-structured data—data that is loosely formatted in XML or CSV (comma-separated value) tables, unnormalized, in different schemas, perhaps also noisy with duplicate records. But the majority of the world’s information is even less structured than this—it is in so-called “natural language text”—written English and other languages, in Web pages, corporate memos, news articles, research reports, e-mail, blogs, and historical documents.
These text documents can be effectively searched and ranked by modern search engines, but fielded searches, range-based or join-based structured queries, data mining, and decision support typically require much more detailed and fine-grained processing. The information locked in natural language must first be transformed into structured, normalized database form.
Information extraction aims to do just this—it is the process of filling the fields and records of a database from unstructured or loosely formatted text. Thus (as shown in figure 1), it can be seen as a precursor to data mining: Information extraction populates a database from unstructured or loosely structured text; data mining then discovers patterns in that database. Information extraction involves five major subtasks (which are also illustrated in figure 2): segmentation, finding the boundaries of the text snippets that will fill a database field; classification, determining which database field is the correct destination for each snippet; association, determining which fields belong together in the same record; normalization, putting extracted values into a standard format; and deduplication, collapsing redundant records so that the database contains only one entry for each real-world entity.
Historically, information extraction most often has been studied for news articles from which organizations, locations, and individual names are extracted and related to each other, but more recently information extraction has been applied to many text formats, including Web pages, government reports, scientific articles, e-mail, and legal documents. There are many compelling applications of information extraction, including the U.S. Department of Labor’s continuing education course extraction. Here are some others:
Some simple extraction tasks can be solved by writing regular expressions. Extraction from moderately more complex text sources, yet that have sufficient formatting regularity, can be addressed accurately with hand-tuned, programmed rules. For example, if you wanted to extract book titles and author names from the Web pages at Amazon.com, you could rely on the fact that they appear with exactly consistent formatting (title just under the blue bar, in bold; author hyperlinked underneath, preceded by the word by), and write a fairly straightforward Perl script in about 30 to 60 minutes that would do the job using these formatting regularities. But if you also wanted to gather information from BarnesandNoble.com, and thousands of other booksellers, you would have to write new rules for each one—and then rewrite them every time one of the sellers changed its Web layout.
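For a page whose layout is perfectly regular, such a wrapper really can be this simple. The sketch below is hypothetical—the tag structure and class names are invented for illustration, not Amazon's actual markup—but it shows the flavor of a formatting-based extractor:

```python
import re

# Hypothetical product-page fragment; the tags and class names are
# invented for illustration, not any real bookseller's markup.
html = """
<div class="product"><b class="title">The Mythical Man-Month</b>
by <a href="/author/brooks">Frederick P. Brooks Jr.</a></div>
<div class="product"><b class="title">Programming Perl</b>
by <a href="/author/wall">Larry Wall</a></div>
"""

# The wrapper relies entirely on formatting regularity: a bold title
# followed by the word "by" and a hyperlinked author name.
pattern = re.compile(
    r'<b class="title">(?P<title>[^<]+)</b>\s*'
    r'by <a href="[^"]*">(?P<author>[^<]+)</a>')

for m in pattern.finditer(html):
    print(m.group("title"), "|", m.group("author"))
```

The fragility described above is plain here: rename the class, drop the bold tag, or reorder the fields, and the pattern silently stops matching.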
When the human resources for this level of rule-writing (and ongoing rewriting) aren’t available, or when the formatting clues are unreliable or not present, information extraction must rely on the language itself—the words, word order, grammar—perhaps also combined with whatever weak, irregular formatting clues are present. For example, the U.S. Department of Labor task previously described falls in this category. The hand-tuned rule-writing approach is sometimes used in this situation (and has been successfully used by companies such as ClearForest Corporation and SRA International); however, as the language patterns get subtler, as the exceptions of English usage pile up, and as the rules interact with each other more and more, the rule writing can get extremely complex. It is not unusual for such systems to include literally thousands of written rules, with subtle interconnections that make editing the rules extremely error-prone.
Over the past decade there has been a revolution in the use of statistical and machine-learning methods for information extraction.5,6,7,8,9,10 These are methods that automatically tune their own rules or parameters to maximize performance on a set of example texts that have been correctly labeled by hand. In other words, instead of trying to tune the complex extraction rules manually, you show the machine what to do on specific example texts by performing the extraction task yourself. The machine then generalizes from these examples, appropriately tuning its own rules and parameters. For complex extraction tasks, many examples may be required (on the order of hundreds or thousands), but labeling data is often still significantly easier than hand-tuning rules, and it can be done by less-skilled, part-time labor. In many cases, machine learning obtains significantly higher accuracy than human-tuned methods.
Some of these machine-learning methods use decision trees11 or if-then-else rules.12,13 Such an approach is often followed in systems that use machine learning to create formatting-based extractors (called wrappers), as described in the Amazon.com example. Increasingly popular are machine-learning methods that use large numbers of relatively simple features of the input but assign subtly interacting, real-valued weights to these features. For example, the word said is a weak indicator that a person’s name may be coming next. These methods assign some weight with which the preceding-word-is-said feature votes for the next word being a person’s name; then, by combining evidence from many such appropriately weighted, weakly indicative features, very accurate extraction decisions can be made.
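A minimal sketch of this kind of weighted-feature voting, with invented features and hand-set weights standing in for weights that a real system would learn from labeled data:

```python
import math

# Toy weights for the decision "is the next token the start of a person
# name?" In a real system these are learned; here they are hand-set
# for illustration, and the feature names are made up.
weights = {
    "prev_word_is_said": 1.2,       # weak positive indicator
    "token_is_capitalized": 2.0,
    "token_in_first_name_list": 3.5,
    "prev_word_is_the": -2.5,       # "the" rarely precedes a name
}
bias = -3.0

def prob_person_start(features):
    """Combine many weakly indicative features with a logistic function."""
    score = bias + sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))

# "... said John ..." -- several weak features vote together.
p = prob_person_start(
    ["prev_word_is_said", "token_is_capitalized", "token_in_first_name_list"])
print(round(p, 3))
```

No single feature is decisive, but their weighted combination pushes the probability close to 1; swap in the negative feature and it collapses toward 0.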
One such statistical model with simple features is the HMM (hidden Markov model)—a finite-state machine with probabilities on the state transitions and probabilities on the per-state word emissions. HMMs became widely used in the 1990s for extraction from English prose.14,15 States of the machine are assigned to different database fields, and the highest-probability state path associated with a sequence of words indicates which sub-sequences of the words belong to those database fields.
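A toy HMM in this style, with two states (a COURSE field and a background OTHER state) and made-up probabilities, decoded with the standard Viterbi algorithm:

```python
import math

# Toy HMM for labeling words in a course listing. States correspond to
# database fields; all probabilities here are invented for illustration.
states = ["COURSE", "OTHER"]
start = {"COURSE": 0.5, "OTHER": 0.5}
trans = {"COURSE": {"COURSE": 0.3, "OTHER": 0.7},
         "OTHER":  {"COURSE": 0.2, "OTHER": 0.8}}
emit = {"COURSE": {"CS121": 0.4, "Java": 0.3, "Programming": 0.3},
        "OTHER":  {"An": 0.3, "introductory": 0.3, "course": 0.3,
                   "CS121": 0.05, "Java": 0.025, "Programming": 0.025}}

def viterbi(words):
    """Return the highest-probability state path for a word sequence."""
    # V[s] = (log-probability of the best path ending in state s, that path)
    V = {s: (math.log(start[s]) + math.log(emit[s].get(words[0], 1e-9)), [s])
         for s in states}
    for w in words[1:]:
        V = {s: max((V[p][0] + math.log(trans[p][s])
                     + math.log(emit[s].get(w, 1e-9)),
                     V[p][1] + [s])
                    for p in states)
             for s in states}
    return max(V.values())[1]

path = viterbi(["CS121", "Java", "Programming", "An", "introductory", "course"])
print(path)
```

The best path labels the first three words COURSE and the rest OTHER, so the word sub-sequence "CS121 Java Programming" is extracted into the course-title field.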
More recently there has been interest in combining the advantages of finite-state machines with more complex features—a prospect that is enabled by conditional-probability models, including maximum entropy Markov models16 and conditional random fields.17 These models have ranked highly in information extraction competitions (for example, the BioCreative competition to extract protein names from bioinformatics research papers). Conditional random fields have been used to integrate more of the stages of information extraction, including not only segmentation and classification, but also normalization and deduplication, using models beyond just finite-state machines.
The U.S. Department of Labor course extraction problem was solved by a company called WhizBang Labs using a combination of several machine-learning components. To find the Web pages likely to contain course listings, text classification was used in conjunction with a spider. Statistical language modeling methods hypothesized segmentations and classifications of the different fields, which were also fed to a classifier responsible for coarse-scale segmentation of one course from another. A method called scoped learning was then used to learn formatting (wrapper-like) regularities on the fly from each page, without human intervention. Logistic-regression classifiers were used to complete the association and deduplication phases. (Conditional random fields were not used only because they had not yet been developed.) In the end, the project was deemed a success—data was extracted with sufficient accuracy so that it could be deposited directly into the Web site’s structured database.
The accuracy of automated extraction methods varies drastically depending on the regularity of the text input and the strength of the extraction method used. Extraction from formatted, highly regular database-generated Web pages (such as those on Amazon.com) can be done with perfect accuracy. Extraction from other somewhat regular text, such as postal address blocks or research paper citations, usually has a percentage accuracy in the mid- to high-90s. Accuracies in the mid-90s are now standard for extracting names of people, companies, and locations from standard news articles. (Extracting these entity names from Web pages, however, is more difficult, yielding accuracies in the 80s.) Extraction of protein names is more difficult, since their naming scheme is more irregular; the accuracies in a recent competition were in the 80s.
Success in the association stage of extraction is generally more difficult because a correct final answer also requires correct segmentation and classification of all of the fields that should be associated. Furthermore, in many domains, such as news articles, the evidence for certain associations or relations may require understanding complex subtleties of English usage and meaning. Accuracy for relation extraction in news stories is typically in the 60s.18
Deduplication, on the other hand, often performs more accurately when there are more fields to process. Deduplication of entity names in news articles often has accuracy in the 70s or 80s.19 Deduplication of multi-field research paper citations is in the 90s.20
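A simple illustration of why extra fields help: a hypothetical pairwise matcher (the records, threshold, and similarity measure below are all invented for illustration) that scores two records by average per-field token overlap.

```python
def jaccard(a, b):
    """Token-overlap similarity between two field values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def same_record(r1, r2, threshold=0.5):
    """Average per-field similarity; each extra field adds evidence."""
    fields = r1.keys() & r2.keys()
    score = sum(jaccard(r1[f], r2[f]) for f in fields) / len(fields)
    return score >= threshold

# Two noisy variants of the same citation, plus a different paper.
a = {"author": "A. McCallum", "title": "Automating the construction of internet portals",
     "venue": "Information Retrieval Journal"}
b = {"author": "McCallum A.", "title": "Automating the Construction of Internet Portals",
     "venue": "Information Retrieval"}
c = {"author": "S. Lawrence", "title": "Online or invisible", "venue": "Nature"}

print(same_record(a, b), same_record(a, c))
```

With author, title, and venue all contributing, the noisy variants still score well above threshold, while a single-field comparison (say, venue alone) would be far less reliable.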
Perfect accuracy in anything but super-regular or trivial applications will never be attained—the subtleties and exceptions of human language are too deep. Information-extraction customers who demand perfect accuracy should be reminded that existing manual practices are also full of errors. For example, typical hospital patient contact records and corporate databases are filled with typographic errors and duplicates. Even when paid specialists (linguists) are tasked with labeling text (to be used in training a machine-learning extractor), there is a surprising amount of disparity between their results on the same documents; agreement among multiple humans each manually labeling news articles for entity relations is typically only in the 70s at best. In many cases, a synergistic combination of automated methods and human processing can yield the best accuracy and throughput.21
Increasingly, information-extraction solutions are being made available commercially. If you are thinking about using one of these, here are some questions to ask yourself and the supplier:
You should also consider open source solutions and in-house development. Free information-extraction systems include GATE (http://gate.ac.uk/ie/annie.html), MALLET (http://mallet.cs.umass.edu), MinorThird (http://minorthird.sourceforge.net), and Road Runner (http://www.dia.uniroma3.it/db/roadRunner). There are also other open source solutions for document classification (http://www.cs.cmu.edu/~mccallum/bow/) and document retrieval and matching (http://lucene.apache.org, http://www.lemurproject.org/indri).
Information extraction has made much progress in the past decade, and further research and industrial creativity continue to push this progress. Extraction is being applied to increasingly complex problems and is being designed for more sophisticated yet easy use by nontechnical end users. Active research trends include the following:
Estimating uncertainty, managing multiple hypotheses. As already discussed, extraction will never be perfectly accurate, and some of the most problematic consequences of this occur when the final answer is the result of a cascade of processing steps, through which errors accumulate to a high, unusable level. There is increasing interest in methods that maintain multiple extraction hypotheses from one step to the other, and use probabilistic information to combine hypotheses. Rather than having errors accumulate, this approach can actually allow later processing steps to correct errors made in earlier steps.22,23 These methods will increase the accuracy of association and relation extraction, and enable new applications providing deeper analysis and integrating extraction directly with data mining.
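A schematic of the idea, with made-up segmentations and probabilities: instead of passing only the single best segmentation to the next stage, keep the top few with their probabilities and let downstream evidence rescore them.

```python
import re

# Hypothetical two-stage pipeline: a segmenter proposes several
# hypotheses with probabilities, and a later stage rescores them
# instead of trusting only the segmenter's single best guess.
segmentations = [
    (0.50, ["CS383 Artificial Intelligence", "TuTh 1030-1200"]),
    (0.45, ["CS383", "Artificial Intelligence TuTh 1030-1200"]),
    (0.05, ["CS383 Artificial", "Intelligence TuTh 1030-1200"]),
]

def downstream_score(fields):
    """Later-stage evidence: a field mixing a long title word with a
    meeting-time pattern is unlikely; penalize each such field."""
    bad = sum(1 for f in fields
              if re.search(r"\d{4}-\d{4}", f) and re.search(r"[A-Za-z]{5,}", f))
    return 0.1 ** bad

# Combine probabilities across stages rather than committing early.
rescored = [(p * downstream_score(fields), fields)
            for p, fields in segmentations]
best = max(rescored)[1]
print(best)
```

Here the segmenter's top two hypotheses are nearly tied, but the downstream stage's evidence cleanly separates them—exactly the kind of error correction that committing to a single early hypothesis would forfeit.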
Easier training, semi-supervised learning, interactive extraction. Machine-learning methods often provide the most accurate extractors, but gathering the necessary training data can be time-consuming and tedious—especially if impatient end users are the ones doing the data labeling. New research in semi-supervised machine learning reduces the amount of required labeled data by cleverly (some would say nearly magically) leveraging large quantities of unlabeled data to improve learning efficiency.24,25 Also, methods called “interactive information extraction” begin with imperfect automatically labeled data, then make manual corrections faster by highlighting low-confidence fields, and furthermore, use soft constraints to automatically correct additional fields after the human corrects just one.26
This article has mostly discussed traditional information extraction that builds a structured database. When the goal, however, is topical trend analysis, or a rough summary of a large collection of documents, an interesting alternative is to use a loose mixture of text extraction and data mining. These are methods that leverage whatever limited structured information is available (such as the dates, senders, and recipients of e-mail messages, or simply document boundaries) and then use data-mining techniques that are robust enough to operate directly on the raw text associated with this limited structure.
For example, latent Dirichlet allocation is a document-clustering method that gives a bird’s-eye view of the topics discussed in a document collection (the topics are represented by collections of automatically discovered, prominent keywords). The Author-Recipient-Topic model27 gives a topical summary of a large collection of e-mail, identifying prominent senders and recipients associated with different topics, and identifying people in this e-mail social network who have similar roles. For example, given the text in the body of the e-mail messages, as well as the people in the To, From, and CC headers, the Author-Recipient-Topic model automatically discovers a set of topics that summarizes the communications within a social network, as well as who talked about which topics to whom. Among the applications for this model would be expert-finding in large corporations.
The Group-Topic model28 discovers groups rather than roles, and has been used to identify associations between topics and like-minded legislators by mining 16 years of voting records in the U.S. Senate, along with the text of the corresponding bills. Kleinberg’s Word Burst algorithm automatically detects trends over time;29 for example, when applied to the text of U.S. Presidential State of the Union addresses, it shows the word slaves bursting from 1859-1863, the word atomic bursting from 1947-1959, and inflation from 1971-1980. Intelliseek.com has applied similar methods to trend analysis from blogs.
The World Wide Web is the world’s largest repository of knowledge, and it is being constantly augmented and maintained by millions of people. However, it is in a form intended for human reading, not in a database form with records and fields that can be easily manipulated and understood by computers. In spite of the promise of the Semantic Web, the use of English and other natural language text will continue to be a major medium for communication and knowledge accumulation on the Web, in e-mail, news articles, and elsewhere.
Eventually we will reach the point at which the answer to almost any question will be available online somewhere, but we will have to wade through more and more material to find it. The next step in improved search tools will be a transition from keyword search on documents to higher-level queries: queries where the search hits will be objects, such as people or companies instead of simply documents; queries that are structured and return information that has been integrated and synthesized from multiple pages; and queries that are stated as natural language questions (“Who were the first three female U.S. Senators?”) and answered with succinct responses.
The first half of the Internet revolution consisted of the creation of a wide area network for easy data sharing, enabling human access to an immense store of knowledge and services. The second half of the Internet revolution has yet to come. It will happen when there is machine access to this immense knowledge base, and we are thus able to perform pattern analysis, knowledge discovery, reasoning, and semi-automated decision-making on top of it. Information extraction will be a key part of the solution making this possible.
ANDREW McCALLUM is an associate professor at the University of Massachusetts, Amherst, and director of the Information Extraction and Synthesis Laboratory. He was previously vice president of research and development at WhizBang Labs, a company that used machine learning for information extraction from the Web. In the late 1990s he was a research scientist at Justsystem Pittsburgh Research Center, where he spearheaded the creation of CORA, an early research paper search engine that used machine learning for spidering, extraction, classification, and citation analysis. He was a post-doctoral fellow at Carnegie Mellon University after receiving his Ph.D. from the University of Rochester in 1995. He is an action editor for the Journal of Machine Learning Research. For the past 10 years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, document classification, finite state models, semi-supervised learning, and social network analysis. (www.cs.umass.edu/mccallum).
Computer Science Dept, Dartmouth Course Listings Fall 2006
CS121 Java Programming
MWF 1330-1430, Sutton
An introductory course, covering conditionals,
iteration, and I/O. No prerequisites.
CS383 Artificial Intelligence
TuTh 1030-1200, Cash
Logic, search, Bayesian networks, machine learning
and robotics. Requires CS245.
CS392 Computational Linguistics
WF 1500-1630, Quass
Covers N-grams, hidden Markov models, parsing,
and translation. Prerequisites are CS383 & Stat202.
Cross-listed as Ling380.
Linguistics Dept, Dartmouth College
Course Listings Fall 2006
Ling101 Intro to Linguistics is taught by Dr. Wei Li in Smith Hall, Rm 202. This course introduces phonology, morphology, and common grammatical patterns. Meets Mondays and
Wednesdays from 10:30am to 12:00pm.
Ling380 Computational Linguistics, taught by Dr. Dallan
Quass in Sutton Rm 102, this course covers N-grams,
hidden Markov models, parsing, and machine translation. Meets Wednesdays and Fridays 3pm to 4:30pm. You must first take CS383 and Stat202. Cross-listed as CS392.
Appelt, D. E., Hobbs, J. R., Bear, J., Israel, D., Kameyama, M., Kehler, A., Martin, D., Myers, K., and Tyson, M. 1995. Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann.
Blei, D., Bagnell, D., and McCallum, A. 2002. Learning with scope, with application to information extraction and classification. Uncertainty in Artificial Intelligence (UAI).
Blei, D., Ng, A., and Jordan, M. 2003. Latent Dirichlet allocation. JMLR 3: 993–1022.
Bruninghaus, S., and Ashley, K. D. 2001. Improving the representation of legal case texts with information extraction methods. Proceedings of the 8th International Conference on Artificial Intelligence and Law.
Carreras, X., Marques, L., and Padro, L. 2002. Named entity extraction using adaboost. Proceedings of CoNLL-2002: 167–170.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. 1998. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98): 509–516.
Culotta A., and McCallum, A. 2005. Reducing labeling effort for structured prediction tasks. AAAI.
Freitag, D. 1998. Information extraction from HTML: Application of a general learning approach. Proceedings of the 15th Conference on Artificial Intelligence (AAAI-98): 517–523.
Freitag. D. 1998. Machine Learning for Information Extraction in Informal Domains. Ph.D. thesis, Carnegie Mellon University.
Haghighi, A., Toutanova, K., and Manning, C. D. 2005. A joint model for semantic role labeling. Ninth Conference on Computational Natural Language Learning: 173–176.
Kushmerick, N., Weld, D., and Doorenbos, R. 1997. Wrapper induction for information extraction. Proceedings of the 15th International Conference on Artificial Intelligence: 729–735.
Lawrence, S. 2001. Online or invisible? Nature 411(6837): 521.
McCallum, A., Nigam, K., Rennie, J., and Seymore, K. 2000. Automating the construction of Internet portals with machine learning. Information Retrieval Journal 3: 127–163.
McCallum, A., and Wellner, B. 2004. Conditional models of identity uncertainty with application to noun co-reference. Neural Information Processing Systems (NIPS).
McDonald, R., and Pereira, F. 2005. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 6 (Supplement 1).
Miller, S., Guinness, J., and Zamanian, A. 2004. Name tagging with word clusters and discriminative training. Proceedings of HLT-NAACL.
Minkov, E., Wang, R., and Cohen, W. 2004. Extracting personal names from e-mails: Applying named entity recognition to informal text. In preparation; http://www.cs.cmu.edu/~wcohen/pubs-x.html.
Muslea, I., Minton, S., and Knoblock, C. 1998. Stalker: Learning extraction rules for semi-structured, Web-based information sources. Proceedings of the AAAI Workshop on AI and Information Integration.
Ng, V., and Cardie, C. 2002. Improving machine learning approaches to co-reference resolution. Proceedings of the Fortieth Annual Meeting of the Association for Computational Linguistics.
Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. 2002. Identity uncertainty and citation matching. Advances in Neural Information Processing.
Pinto, D., McCallum, A., Lee, X., and Croft, W. B. 2003. Combining classifiers in text categorization. Submitted to SIGIR ’03: Proceedings of the Twenty-sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Ray, S., and Craven, M. 2001. Representing sentence structure in hidden Markov models for information extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence. Morgan Kaufmann.
Soderland, S. 1997. Learning to extract text-based information from the World Wide Web. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining.
Sutton, C., and McCallum, A. 2005. Composition of conditional random fields for transfer learning. Empirical Methods in Natural Language Processing.
Yeh, A., Morgan, A., Colosimo, M., and Hirschman, L. 2005. BioCreative task 1a: Gene mention finding evaluation. BMC Bioinformatics 6 (Supplement 1).
Zelenko, D., Aone, C., and Richardella, A. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research 3: 1083–1106.
Originally published in Queue vol. 3, no. 9.