Record-Boundary Discovery in Web Documents

6/16/99



Table of Contents

Record-Boundary Discovery in Web Documents

Record-Boundary Discovery; Larger Goal: Information Extraction

Desired Objective: Query the Web Like a Database

Approach and Limitations: Automatic Ontology-Based Wrapper Generation

Application Ontology: Object-Relationship Model Instance

Application Ontology: Data Frames

Ontology Parser

Record Extractor

Record Extractor: High Fan-Out Heuristic

Record Extractor: Record-Separator Heuristics

IT: Identifiable “html separator” Tags

HT: Highest-count Tags

SD: Standard Deviation

OM: Ontological Match

RP: Repeating-tag Patterns

Record Extractor: Consensus Heuristic

Record Extractor: Example Consensus Heuristic

Record Extractor: Results

Constant/Keyword Recognizer

Database-Instance Generator

Database-Instance Generator

Recall & Precision

Results: Car Ads

Car Ads: Comments

Results: Computer Job Ads

Results: Obituaries

Cautions

Conclusions

Author: David W. Embley

Email: embley@cs.byu.edu

Home Page: http://www.deg.byu.edu
