Conclusions

We described a heuristic approach to discover record boundaries in unstructured Web documents.

Our main contribution is a set of individual heuristics, together with a way to combine them into a single method for discovering record boundaries.
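As an illustration only, the idea of combining individual heuristics can be sketched as follows. The sketch assumes each heuristic maps a document's tag sequence to per-tag confidence values in [0, 1], and that confidences are merged with the rule a + b - ab; this section does not specify that rule. The two toy heuristics and all names (`frequency_heuristic`, `preferred_tag_heuristic`, `pick_separator`) are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# A heuristic maps a tag sequence to {candidate separator tag: confidence in [0, 1]}.
Heuristic = Callable[[List[str]], Dict[str, float]]


def frequency_heuristic(tags: List[str]) -> Dict[str, float]:
    """Toy heuristic (assumption): more frequent tags are likelier separators."""
    if not tags:
        return {}
    counts = Counter(tags)
    return {tag: count / len(tags) for tag, count in counts.items()}


def preferred_tag_heuristic(tags: List[str]) -> Dict[str, float]:
    """Toy heuristic (assumption): fixed confidences for a few known tags."""
    preferred = {"hr": 0.8, "p": 0.4}
    present = set(tags)
    return {tag: conf for tag, conf in preferred.items() if tag in present}


def combine_scores(score_maps: List[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-heuristic confidences with the assumed rule a + b - a*b."""
    combined: Dict[str, float] = {}
    for scores in score_maps:
        for tag, conf in scores.items():
            prev = combined.get(tag, 0.0)
            combined[tag] = prev + conf - prev * conf
    return combined


def pick_separator(tags: List[str], heuristics: List[Heuristic]) -> Optional[str]:
    """Return the tag with the highest combined confidence, or None if no candidates."""
    combined = combine_scores([h(tags) for h in heuristics])
    if not combined:
        return None
    return max(combined, key=combined.get)


if __name__ == "__main__":
    tags = ["hr", "b", "br", "hr", "b", "br", "hr"]
    print(pick_separator(tags, [frequency_heuristic, preferred_tag_heuristic]))
```

In this sketch each heuristic makes a single pass over the tag sequence, and the merge step touches each candidate once per heuristic, so the total work grows linearly with the number of tags.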

Under normal assumptions, the process is O(n), where n is the size of a document.

In every experiment we conducted, this approach attained an accuracy of 100%.
