Record-Boundary Discovery in Web Documents
by Yuan Jiang
December 1, 1998
Introduction
A Sample Web Document
Building Wrappers: Part of the Problem Is Record Separation
Heuristic for Locating Groups of Records
Individual Heuristic: Highest-Count Tags (HT)
Individual Heuristic: Identifiable “Separator” Tags (IT)
Individual Heuristic: Standard Deviation (SD)
Individual Heuristic: Repeating-Tag Pattern (RP)
Individual Heuristic: Ontology-Matching (OM)
Example: Ontology-Matching (OM) (Continued)
Combined Heuristic
Combined Heuristic: Initial Experiments
Initial Experiments: Results for Obituaries and Car Ads
Initial Experiments: Certainty Factors
Experimental Results for All Combined Heuristics
Record-Boundary Discovery Algorithm
Example
Experimental Results: Obituaries and Car Ads
Experimental Results: Job Ads and Course Descriptions
Experimental Results: Success Rates
Conclusions
Email: embley@cs.byu.edu
Home Page: http://osm7.cs.byu.edu/deg/