Efficient Record-Level Wrapper Induction

Shuyi Zheng, Ruihua Song, Ji-Rong Wen, C. Lee Giles, 2009

Yan-Kai Lai, Yu-An Chou

Web information is often presented in the form of records, e.g., a product record on a shopping website or a personal profile on a social networking website.
In our system, we use a novel "broom" structure to represent both records and generated wrappers.

Much Web information is presented as Web records, which appear in both detail pages and list pages.

The task of extracting records from Web pages is usually carried out by programs called wrappers.
The process of learning a wrapper from a group of similar pages is called wrapper induction.

Most traditional wrapper techniques have issues dealing with web records since there is no clear boundary for partitioning different records from the HTML source.

This system is able to effectively extract records and identify their internal semantics at the same time.
Our record-level wrapper technique makes the following contributions:
- We propose a novel "broom" structure to represent a record.
- We propose using context words to disambiguate different attributes that are embedded in similar HTML tag trees.

When a page has more than one record, we assign each of them a unique ID (a "record ID").
A broom has two parts: the "head" and the "stick".
- The broom head is a record region consisting of sub-trees of a DOM-tree;
- The broom stick is a tag-path starting from the root tag HTML to the top of the record region.
Wrappers are also represented in such broom structures.
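
The broom structure described above can be sketched as a small data type. This is only an illustrative sketch: the class names `Node` and `Broom` and their fields are hypothetical, and it assumes the stick ends at the node directly above the record region.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal DOM node: a tag name plus child nodes (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Broom:
    """A record (or a wrapper) represented as a stick plus a head."""
    stick: List[str]  # tag-path from the root tag down to the top of the record region
    head: List[Node]  # record region: sibling sub-trees of the DOM-tree


# One record made of two sibling sub-trees located under html/body/div.
record = Broom(
    stick=["html", "body", "div"],
    head=[Node("h2"), Node("p", [Node("span")])],
)
print("/".join(record.stick), [n.tag for n in record.head])
```

Keeping the stick as a plain list of tag names makes it cheap to compare two brooms by tag-path before looking at their heads.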

For a specific website, different types of records may have the same sub-tree structure.
Records in a website can be grouped by their tag-paths.
- A wrapper should only be used to extract records that have the same tag-path as the wrapper itself.
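
Grouping by tag-path can be illustrated with a short sketch. The function names `group_by_tag_path` and `can_extract` and the sample data are hypothetical; the sketch assumes a tag-path is simply a sequence of tag names.

```python
from collections import defaultdict


def group_by_tag_path(records):
    """Group record IDs by their tag-path (a sequence of tag names)."""
    groups = defaultdict(list)
    for record_id, tag_path in records:
        groups[tuple(tag_path)].append(record_id)
    return dict(groups)


def can_extract(wrapper_path, record_path):
    """A wrapper is only applied to records sharing its exact tag-path."""
    return tuple(wrapper_path) == tuple(record_path)


# Hypothetical labeled records: (record ID, tag-path).
records = [
    (1, ["html", "body", "div", "ul", "li"]),
    (2, ["html", "body", "div", "ul", "li"]),
    (3, ["html", "body", "table", "tr"]),
]
groups = group_by_tag_path(records)
for path, ids in groups.items():
    print("/".join(path), ids)
```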

Definition
1. Boundary Node: Given a labeled DOM-tree and a record ID i, the boundary node of record i is the root node of the minimal sub-tree that fully covers all nodes of record i.
2. Record Region: Given a labeled DOM-tree and a record ID i, the record region of record i is the smallest set of sub-trees (a forest) that satisfies the following conditions: (1) they fully cover all nodes of record i; (2) they are consecutive siblings rooted at the boundary node of record i.
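
The two definitions above can be read as a small procedure. The following is a minimal sketch, not the authors' implementation: the `Node` class and the function names are hypothetical, and it assumes that record labels sit on leaf nodes and that a record spans at least two leaves.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A labeled DOM node (hypothetical type); record_id is None if unlabeled."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    record_id: Optional[int] = None


def covers(node, record_id):
    """True if the sub-tree rooted at node contains a node of the record."""
    return node.record_id == record_id or any(
        covers(child, record_id) for child in node.children
    )


def boundary_node(root, record_id):
    """Descend while a single child sub-tree still covers the whole record."""
    node = root
    while True:
        hits = [c for c in node.children if covers(c, record_id)]
        if len(hits) != 1:
            return node
        node = hits[0]


def record_region(root, record_id):
    """Consecutive child sub-trees of the boundary node spanning the record."""
    top = boundary_node(root, record_id)
    idx = [i for i, c in enumerate(top.children) if covers(c, record_id)]
    return top.children[idx[0]: idx[-1] + 1] if idx else []


# Two records laid out as siblings under the same div.
page = Node("html", [Node("body", [Node("div", [
    Node("h2", record_id=1), Node("br"), Node("p", record_id=1),
    Node("h2", record_id=2), Node("p", record_id=2),
])])])
print(boundary_node(page, 1).tag, [n.tag for n in record_region(page, 1)])
```

Note that the unlabeled `br` falls inside the region of record 1, because the region must consist of consecutive siblings.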

Because both template detection and wrapper generation are based on a well-defined pairwise similarity metric, this approach can optimize them jointly under the criterion of extraction accuracy.

The main task of this construction process is to merge different tag-paths into a tree structure.
This is a top-down process of merging the shared prefixes of multiple tag-paths.
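
Merging shared prefixes top-down is essentially building a prefix tree. A minimal sketch, assuming tag-paths are lists of tag names and using nested dictionaries as the tree (the function name `merge_tag_paths` is hypothetical):

```python
def merge_tag_paths(paths):
    """Merge tag-paths into a prefix tree stored as nested dictionaries."""
    root = {}
    for path in paths:
        node = root
        for tag in path:
            # Reuse the branch if this prefix was already seen, else create it.
            node = node.setdefault(tag, {})
    return root


paths = [
    ["html", "body", "div", "ul"],
    ["html", "body", "div", "table"],
    ["html", "body", "span"],
]
tree = merge_tag_paths(paths)
print(tree)
```

Paths that share a prefix such as html/body/div end up under a single branch, so each distinct tag-path corresponds to one root-to-leaf walk in the tree.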

Our approach selectively considers surrounding text during wrapper induction.
When there are multiple possible alignments with the same smallest aligning cost, the one with the least text mismatch is chosen as the final solution.
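
The tie-breaking rule can be expressed as a lexicographic comparison. This sketch assumes the candidate alignments and their scores have already been computed by some alignment procedure that is not shown; the `Alignment` type and its fields are hypothetical.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    name: str
    cost: float         # structural aligning cost
    text_mismatch: int  # count of aligned positions whose text differs


def choose_alignment(candidates: List[Alignment]) -> Alignment:
    """Pick the lowest cost; among ties, pick the fewest text mismatches."""
    return min(candidates, key=lambda a: (a.cost, a.text_mismatch))


candidates = [
    Alignment("A", cost=2.0, text_mismatch=3),
    Alignment("B", cost=2.0, text_mismatch=1),
    Alignment("C", cost=3.0, text_mismatch=0),
]
best = choose_alignment(candidates)
print(best.name)
```

Text mismatch is used only as a secondary key here, so an alignment with a higher structural cost is never preferred just because its text agrees better.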

Dataset
- We collected our experimental data from 16 real-life large-scale websites belonging to four different domains.

M_e: the extracted metadata
M_g: the manually labeled ground-truth metadata
Suppose record r_e in M_e is aligned with record r_g in M_g; the attribute-level precision (P) and recall (R) for record r_e can be calculated with the following equations.
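
The equations themselves are not reproduced in these notes. As a hedged reconstruction based on the usual set-based definitions of precision and recall, and not necessarily the paper's exact formulation, let r_e be an extracted record aligned with a ground-truth record r_g, and let A(r) be a hypothetical notation for the set of labeled attribute values of record r:

```latex
P(r_e) = \frac{|A(r_e) \cap A(r_g)|}{|A(r_e)|},
\qquad
R(r_e) = \frac{|A(r_e) \cap A(r_g)|}{|A(r_g)|}
```

For example, if a record has four extracted attributes, three of which match the five ground-truth attributes, this gives P = 3/4 and R = 3/5.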

This paper describes a record-level wrapper induction system that effectively extracts records and identifies their internal semantics at the same time.
Compared to traditional page-level wrapper methods, the proposed approach not only saves much of the manual labeling effort but also performs data extraction more efficiently.