References and Bibliography

Here are some relevant papers:

Brad Adelberg, NoDoSE - A tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents, SIGMOD Conference, 1998.
This paper describes the Northwestern Document Structure Extractor (NoDoSE), which is an interactive tool for semi-automatically determining the structure of semistructured documents and then extracting their data. The author describes the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that he has developed.
Naveen Ashish, Craig A. Knoblock, Semi-automatic Wrapper Generation for Internet Information Sources, Second IFCIS Conference on Cooperative Information Systems (CoopIS), 1997.
The authors present an approach for semi-automatically generating wrappers for structured internet sources. The main contribution of this paper is a set of heuristics by which the system can hypothesize a source's implicit structure from the formatting information in its pages.
Naveen Ashish, Craig A. Knoblock, Wrapper Generation for Semi-structured Internet Sources, SIGMOD Record, Vol. 26, No. 4, December 1997.
This paper presents a semi-automatic approach to wrapper generation for WWW sources. It reports on the development of an implemented wrapper generation toolkit that provides a semi-automatic, interactive wrapper generation facility. The authors present experimental results to provide an idea of the amount of effort required to generate a wrapper for a new source.
Paolo Atzeni, Giansalvatore Mecca, Cut and Paste, Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), May 1997.
The authors develop EDITOR, a language for manipulating semi-structured documents. EDITOR programs search and restructure a document. They are based on two simple ideas: "search" instructions select regions of interest in a document, and "cut and paste" instructions restructure them.
Robert B. Doorenbos, Oren Etzioni, Daniel S. Weld, A Scalable Comparison-Shopping Agent for the World-Wide Web, Proceedings of the First International Conference on Autonomous Agents, 1997.
This paper introduces ShopBot, a fully implemented, domain-independent comparison-shopping agent. ShopBot achieves good performance without sophisticated natural language processing and requires only minimal knowledge about different product domains. Instead, ShopBot relies on a combination of heuristic search, pattern matching, and inductive learning techniques.
David W. Embley, Douglas M. Campbell, Stephen W. Liddle, Randy D. Smith, Ontology-Based Extraction and Structuring of Information from Data-Rich Unstructured Documents, Submitted for Publication, February 1998.
This paper presents an ontology-based approach to extracting and structuring information from unstructured documents that are data rich and narrow in ontological breadth. The authors parse the application ontology, which describes the objects, relationships, and constraints in a domain of interest, to generate recognition rules for constants and context keywords and to extract structural and constraint information. Given the generated rules and an unstructured document, the authors apply a recognizer to extract the constants and keywords, and then apply a structure builder to match constant values with attributes, to associate attribute-value pairs as relations, and to populate a generated database schema with the extracted data according to the constraints of the application ontology.
Ashish Gupta, Venky Harinarayan, Anand Rajaraman, Virtual Database Technology, SIGMOD Record, Vol. 26, No. 4, December 1997.
This paper describes Junglee's virtual database (VDB) technology, which makes disparate, heterogeneous information sources behave like a single relational database system. VDB has two components: the data integration system and the data publishing system. The data integration system combines data from several underlying data sources and provides a unified, consistent relational database interface. The data publishing system uses publishing rules to schedule data acquisition, transformation, and dissemination.
Jean-Robert Gruser, Louiqa Raschid, Maria Esther Vidal, Wrapper Generation for Web Data Sources, Technical Report, UMIACS, University of Maryland, 1998.
The authors present technology to define, prototype, and generate wrappers for Web sources. The authors define a standard wrapper interface to specify the capability of Web data sources, develop a wrapper generation toolkit of graphical interfaces and specification languages, and develop a set of utilities to automatically generate and construct a wrapper appropriate to each Web-based data source.
Joachim Hammer, Hector Garcia-Molina, Junghoo Cho, Rohan Aranha, Arturo Crespo, Extracting Semistructured Information from the Web, Proceedings of the First Workshop on Management of Semistructured Data, May 1997.
This paper describes a configurable tool for extracting semistructured data from a set of HTML pages and for converting the extracted information into database objects. The extractor in this paper provides a currently missing link between the Web and the applications which have no direct access to the Web data.
Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos, Wrapper Induction for Information Extraction, Proceedings of the 1997 International Joint Conference on Artificial Intelligence (IJCAI), 1997.
This paper introduces wrapper induction, a new technique for automatically constructing wrappers, and identifies head-left-right-tail (HLRT), a wrapper class that is efficiently learnable, yet expressive enough to handle numerous actual Internet information resources. The system learns a wrapper by generalizing from example query responses. A PAC model bounds the number of examples needed to generate a satisfactory wrapper.
Stephen Soderland, Learning to Extract Text-based Information from the World Wide Web, Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, August 1997.
This paper introduces Webfoot, a preprocessor that parses web pages into logically coherent segments based on page layout cues. Output from Webfoot is then passed on to CRYSTAL, a natural language processing (NLP) system that learns text extraction rules from examples. Webfoot and CRYSTAL transform the text into a formal representation that is equivalent to relational database entries.
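
As a rough illustration of the head-left-right-tail (HLRT) wrapper class named in the Kushmerick, Weld, and Doorenbos entry above, the sketch below shows how such a wrapper might be applied to a page once its delimiters are known. It is a simplified reading of the idea, not the authors' implementation, and it does not cover the induction step that learns the delimiters from examples. The function name run_hlrt, the sample page, and the delimiter strings are all assumptions made up for this example.

```python
from typing import List, Tuple


def run_hlrt(page: str, head: str, tail: str,
             delims: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Apply an HLRT-style wrapper: skip the head, then repeatedly extract
    one value per (left, right) delimiter pair until the tail is reached."""
    # Skip everything up to and including the head delimiter.
    pos = page.find(head)
    if pos < 0:
        return []
    pos += len(head)

    rows: List[Tuple[str, ...]] = []
    first_left = delims[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple begins before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start < 0:
                return rows  # malformed page: give up on the partial tuple
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


# Hypothetical sample page and delimiters (assumptions for illustration).
page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Some Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B></BODY></HTML>")
rows = run_hlrt(page, head="<P>", tail="<HR>",
                delims=[("<B>", "</B>"), ("<I>", "</I>")])
print(rows)
```

The head and tail delimiters keep the bold text in the page header and footer from being mistaken for data, which is the reason for using them in addition to the per-attribute left and right delimiters.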

Comments are welcome. Updated Thu Nov 11 11:43:24 MST 1999