Efficient Record-Level Wrapper Induction
Shuyi Zheng Ruihua Song Ji-Rong Wen C. Lee Giles , 2009
Yan-Kai Lai, Yu-An Chou
Outline
- 
	Abstract 
- 
	Introduction 
- 
	Data Representation 
- 
	System Overview 
- 
	Record Wrapper Induction 
- 
	Algorithm 
- 
	Record Clustering & Wrapper Generation 
- 
	Constructing Wrapper Libraries 
- 
	Record Extraction 
- 
	Record Disambiguation 
- 
	Experiments 
- 
	Conclusion 
Abstract
- Web information is often presented in the form of record, e.g., a product record on a shopping website or a personal profile on a social utility website.
- In our system, we use a novel 「broom」 structure to represent both records and generated wrappers.
Introduction
- Much Web information is presented in the form of a Web record which exists in both detail and list pages.

Introduction
- The task of extracting records from web pages is usually implemented by programs called wrappers.
- The process of leaning a wrapper from a group of similar pages is called wrapper induction
Introduction
- 
	Most traditional wrapper techniques have issues dealing with web records since there is no clear boundary for partitioning different records from the HTML source. 


Introduction
- 
	This system is able to effectively extract records and identify their internal semantics at the same time. 
- 
	Our record-level wrapper technique makes the following contributions: - 
		We propose a novel 「broom」 structure to represent a record 
- 
		We propose using context words to disambiguate different attributes that are embedded in similar HTML tag trees. 
 
- 
		
Data Representation
- 
	When a page has more than one record, we assign unique IDs (「record id」) to them. 
- 
	A broom has two parts: the 「head」 and the 「stick」. - 
		The broom head is a record region consisting of sub-trees of a DOM-tree; 
- 
		The broom stick is a tag-path starting from the root tag HTML to the top of the record region. 
 
- 
		
- 
	Wrappers are also represented in such broom structures. 
Data Representation

Data Representation
- For a specific website, different types of records may have the same sub-tree structure.
- 
Records in a website can be grouped by their tag- paths. 
	- 
		A wrapper should be used to only extract records which have the same tag-paths as itself. 
 
- 
		
System Overview

System Overview

Record Wrapper Induction
- 
Definition 
	- Boundary Node:Given a labeled DOM- tree and a record ID i, then the boundary node of record i is the root node of a minimal sub-tree which can fully cover all nodes of record i.
- 
		Record Region:Given a labeled DOM- tree and a record ID i, then the record region of record i is the smallest set of sub-trees (a forest) which satisfies the following conditions: (1) They can fully cover all nodes of record i (2) They are consecutive siblings rooted at the boundary node of record i. 
 
Record Wrapper Induction

Record Wrapper Induction

Record Wrapper Induction

Algorithm

Record Clustering & Wrapper Generation
- As both template detection and wrapper generation are based on a well-defined pair-wise similarity metrics, that approach can achieve a joint optimization by the criterion of extraction accuracy.
Constructing Wrapper Libraries
- The main task of this construction process is to merge different tag-paths into a tree structure
- This is a top-down process of merging same prefixes of multiple tag-paths
Constructing Wrapper Libraries

Record Extraction


Record Disambiguation
- 
	Our approach considers surrounding text in wrapper induction selectively. 
- 
	There are multiple possible alignments with the same smallest aligning cost, the one with less text mismatch will be chosen as the final solution. 

Experiments
- 
	Dataset - 
		We collected our experimental data from 16 real-life large- scale websites belonging to four different domains. 
 
- 
		
Experiments
- :the extracted metadata
- :the manually labeled ground-truth metadata
- suppose record in is aligned with record in , the attribute-level precision ( ) and recall ( ) for record re can be calculated with the following equations









Experiments

Experiments

Experiments


Experiments

Conclusion
- 
	This paper describes a record-level wrapper induction system which is able to effectively extract records and identify their internal semantics at the same time. 
- 
	Compared to traditional page-level wrapper methods, the proposed approach not only saves a lot of effort made in manually labeling but also performs data extraction more efficiently. 
Copy of deck
By tz5514
Copy of deck
- 1,703
 
   
   
  