Shuyi Zheng, Ruihua Song, Ji-Rong Wen, C. Lee Giles, 2009
Yan-Kai Lai, Yu-An Chou
Abstract
Introduction
Data Representation
System Overview
Record Wrapper Induction
Algorithm
Record Clustering & Wrapper Generation
Constructing Wrapper Libraries
Record Extraction
Record Disambiguation
Experiments
Conclusion
Most traditional wrapper techniques have difficulty handling web records because the HTML source has no clear boundary that separates one record from another.
This system is able to effectively extract records and identify their internal semantics at the same time.
Our record-level wrapper technique makes the following contributions:
We propose a novel "broom" structure to represent a record.
We propose using context words to disambiguate different attributes that are embedded in similar HTML tag trees.
When a page has more than one record, we assign unique IDs ("record id") to them.
A broom has two parts: the "head" and the "stick".
The broom head is a record region consisting of sub-trees of a DOM-tree.
The broom stick is a tag-path starting from the root tag HTML to the top of the record region.
Wrappers are also represented in such broom structures.
A wrapper should only be used to extract records that have the same tag-path as the wrapper itself.
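The broom structure and the tag-path check described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `Node`, `Broom`, and `wrapper_applies` are hypothetical, and it assumes the stick is stored as a tuple of tag names ending at the node under which the head's sub-trees hang.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Minimal DOM node (hypothetical type): a tag name plus child nodes."""
    tag: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Broom:
    """A record (or a wrapper) represented as a stick plus a head."""
    # Stick: tag-path from the root HTML tag down to the top of the record region.
    stick: Tuple[str, ...]
    # Head: the record region, i.e. a forest of sibling sub-trees.
    head: List[Node]


def wrapper_applies(wrapper: Broom, candidate: Broom) -> bool:
    """A wrapper is only tried on candidates whose tag-path equals its own."""
    return wrapper.stick == candidate.stick


# Example: a wrapper learned from list items only applies under the same path.
wrapper = Broom(("html", "body", "div", "ul"), [Node("li")])
same_path = Broom(("html", "body", "div", "ul"), [Node("li"), Node("li")])
other_path = Broom(("html", "body", "table"), [Node("tr")])
```

Comparing sticks first is a cheap filter: candidates with a different tag-path are skipped before any comparison of the heads is attempted.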
Record Region: Given a labeled DOM-tree and a record ID i, the record region of record i is the smallest set of sub-trees (a forest) that satisfies the following conditions: (1) they fully cover all nodes of record i; (2) they are consecutive siblings rooted at the boundary node of record i.
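One way to read this definition is sketched below. It is a minimal sketch under stated assumptions, not the paper's algorithm: `Node`, `covers`, and `record_region` are hypothetical names, labels are assumed to be stored as a set of record IDs on each node, and the boundary node is assumed to be the lowest node at which the labeled nodes of record i spread over more than one child (or the parent of a single labeled node).

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Minimal labeled DOM node (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    record_ids: Set[int] = field(default_factory=set)  # labels on this node


def covers(node: Node, rid: int) -> bool:
    """True if this node or any descendant is labeled with record ID rid."""
    return rid in node.record_ids or any(covers(c, rid) for c in node.children)


def record_region(root: Node, rid: int) -> List[Node]:
    """Smallest run of consecutive sibling sub-trees covering all nodes of rid."""
    if rid in root.record_ids:
        return [root]  # the labeled node itself is the whole region
    hits = [i for i, c in enumerate(root.children) if covers(c, rid)]
    if not hits:
        return []  # record rid does not occur under this node
    if len(hits) == 1:
        # Everything lies under one child, so a smaller region exists lower down.
        return record_region(root.children[hits[0]], rid)
    # Labeled nodes span several children: take the consecutive siblings
    # from the first to the last hit (unlabeled siblings in between included).
    return root.children[hits[0]: hits[-1] + 1]


# Example: record 1 spans two spans with an unlabeled <br> between them.
a = Node("span", record_ids={1})
b = Node("br")
c = Node("span", record_ids={1})
d = Node("span", record_ids={2})
container = Node("div", [a, b, c, d])
page = Node("html", [Node("body", [container])])
region_1 = record_region(page, 1)
region_2 = record_region(page, 2)
```

In this example the unlabeled `br` node is included in the region of record 1, since condition (2) requires the sub-trees to be consecutive siblings.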
Our approach selectively considers surrounding text during wrapper induction.
When multiple alignments share the same smallest aligning cost, the one with fewer text mismatches is chosen as the final solution.
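The tie-breaking rule can be illustrated with a simplified sketch. Note the assumptions: the sketch aligns flattened token sequences rather than trees, and the cost model (unit cost for gaps and for structural mismatches, zero structural cost for any two text tokens) is chosen for illustration only. The names `pair_cost` and `best_alignment_cost` are hypothetical. Costs are kept as pairs (aligning cost, text mismatches) and compared lexicographically, so text mismatches only matter among alignments with equal aligning cost.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, value), where kind is "tag" or "text"
Cost = Tuple[int, int]   # (aligning cost, text mismatches)

GAP: Cost = (1, 0)       # assumed unit cost for leaving a token unaligned


def pair_cost(x: Token, y: Token) -> Cost:
    """Cost of aligning two tokens with each other (assumed cost model)."""
    if x[0] == "text" and y[0] == "text":
        # Text tokens match structurally; differing content is a text mismatch.
        return (0, 0 if x[1] == y[1] else 1)
    return (0, 0) if x == y else (1, 0)


def add(a: Cost, b: Cost) -> Cost:
    return (a[0] + b[0], a[1] + b[1])


def best_alignment_cost(a: List[Token], b: List[Token]) -> Cost:
    """Edit-distance style DP minimizing (cost, text mismatches) lexicographically."""
    n, m = len(a), len(b)
    dp = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, 0)
    for j in range(1, m + 1):
        dp[0][j] = (j, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                add(dp[i - 1][j - 1], pair_cost(a[i - 1], b[j - 1])),  # align pair
                add(dp[i - 1][j], GAP),                                # skip a[i-1]
                add(dp[i][j - 1], GAP),                                # skip b[j-1]
            )
    return dp[n][m]


# Two alignments have aligning cost 1 here; the one pairing "Price:" with
# "Price:" has no text mismatch and is therefore preferred.
left = [("text", "Price:"), ("text", "$10")]
right = [("text", "Price:")]
result = best_alignment_cost(left, right)
```

Tuple comparison in Python is lexicographic, which is what lets a single `min` call apply the aligning cost first and the text-mismatch count second.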
Dataset
We collected our experimental data from 16 real-life large-scale websites belonging to four different domains.
This paper describes a record-level wrapper induction system which is able to effectively extract records and identify their internal semantics at the same time.
Compared to traditional page-level wrapper methods, the proposed approach not only saves much of the manual labeling effort but also performs data extraction more efficiently.