Record-Boundary Discovery in Web Documents
Record-Boundary Discovery (Larger Goal: Information Extraction)
Desired Objective: Query the Web Like a Database
Approach and Limitations: Automatic Ontology-Based Wrapper Generation
Application Ontology: Object-Relationship Model Instance
Application Ontology: Data Frames
Ontology Parser
Record Extractor
Record Extractor: High Fan-Out Heuristic
Record Extractor: Record-Separator Heuristics
IT: Identifiable “html separator” Tags
HT: Highest-count Tags
SD: Standard Deviation
OM: Ontological Match
RP: Repeating-tag Patterns
Record Extractor: Consensus Heuristic
Record Extractor: Example Consensus Heuristic
Record Extractor: Results
Constant/Keyword Recognizer
Database-Instance Generator
Database-Instance Generator
Recall & Precision
Results: Car Ads
Car Ads: Comments
Results: Computer Job Ads
Results: Obituaries
Cautions
Conclusions
Email: embley@cs.byu.edu
Home Page: http://www.deg.byu.edu