Record-Boundary Discovery in Web Documents by Yuan Jiang December 1, 1998

12/10/98



Table of Contents

Record-Boundary Discovery in Web Documents by Yuan Jiang December 1, 1998

Introduction

A Sample Web Document

Building Wrappers: Part of the problem is record separation

Heuristic for Locating Groups of Records

Highest-count Tags (HT): Individual Heuristic

Identifiable “Separator” Tags (IT): Individual Heuristic

Standard Deviation (SD): Individual Heuristic

Repeating-Tag Pattern (RP): Individual Heuristic

Ontology-Matching (OM): Individual Heuristic

Example: Ontology-Matching (OM) (Continued)

Combined Heuristic

Initial Experiments: Combined Heuristic

Results for Obituaries and Car Ads: Initial Experiments

Certainty Factors: Initial Experiments

Experimental results for all the combined heuristics

Record-Boundary Discovery Algorithm

Example

Experimental Results: Obituaries and Car Ads

Experimental Results: Job Ads and Course Descriptions

Experimental Results: Success Rates

Conclusions

Author: Stephen Yuan Jiang

Email: embley@cs.byu.edu

Home Page: http://osm7.cs.byu.edu/deg/