Yu-Chang (Andy) Ho
May 8th, 2019 in Colombia
Consultant @ ECLAC, United Nations
M.S. in Computer Science @
University of California, Davis, CA, USA
B.S. in Computer Science @
National Central University, Taiwan
Former Research Assistant @
Academia Sinica, Taiwan
(Crawling??)
Collecting a list of webpage links (Uniform Resource Locator, URL).
We call such a program a
"Web Crawler".
Web Crawler
World-wide Web (WWW)
List of Targets to Visit
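A minimal sketch of the link-collecting step, using only the Python standard library. The page source and the `example.com` addresses are hypothetical, and `LinkCollector` and `collect_links` are names made up for this illustration. A real crawler would also download each page and keep visiting the links it finds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/wiki/..." into full URLs.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    """Return the list of absolute URLs linked from one page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page source, used only for illustration.
page = '<a href="/wiki/Web_crawler">Crawler</a> <a href="https://example.org/">Other</a>'
targets = collect_links(page, "https://example.com/index.html")
print(targets)
```

The resulting list plays the role of the "List of Targets to Visit".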
Getting the data from the web!
You will have a target.
Wikipedia
Scraping is a form of copying, in which specific data is gathered and copied from the web.
Web Crawler
Content Parsing
Basic Cleaning
World-wide Web (WWW)
Web-scraper
What else?
What will you need?
A programmer!
Web Crawler
Content Parsing
Basic Cleaning
Web-scraper
A webpage is essentially a text file.
<html>
<head>
<title>This is an example website</title>
</head>
<body>
<div id='name'><h1>Yu-Chang Ho</h1></div>
<hr height="2px">
<h2>Education</h2>
<div id='edu1'>M.S.in Computer Science, UCD</div>
<div id='edu2'>B.S.in Computer Science, NCU</div>
<h2>Work Experiences</h2>
<div id='work1'>Consultant in ECLAC, UN</div>
</body>
</html>
An HTML Element:
<div id='name'>Yu-Chang Ho</div>
<div> is a type of element, a container for other elements or text.
Other examples: <table>, <img>, <p>, ......
id='name' is an identifier of a specific element.
<html>
<head>
<title>This is an example website</title>
</head>
<body>
<div id='name'><h1>Yu-Chang Ho</h1></div>
<hr height="2px">
<h2>Education</h2>
<div id='edu1'>M.S.in Computer Science, UCD</div>
<div id='edu2'>B.S.in Computer Science, NCU</div>
<h2>Work Experiences</h2>
<div id='work1'>Consultant in ECLAC, UN</div>
</body>
</html>

| name | edu1 | edu2 | work1 |
|---|---|---|---|
| Yu-Chang Ho | M.S. in Computer Science, UCD | B.S. in Computer Science, NCU | Consultant in ECLAC, UN |
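The step from page source to table row can be sketched as follows, assuming the parser only needs the text of elements that carry an `id`. `IdTextExtractor` and `VOID_TAGS` are hypothetical names for this illustration, and the HTML is the example page from the slide. Real pages are messier, and a dedicated parsing library is often more practical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be tracked as "open".
VOID_TAGS = {"hr", "br", "img", "meta", "link", "input"}


class IdTextExtractor(HTMLParser):
    """Maps each element id to the text found inside that element."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._open = []  # (tag, id or None) for every currently open element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open.append((tag, dict(attrs).get("id")))

    def handle_endtag(self, tag):
        open_tags = [t for t, _ in self._open]
        if tag in open_tags:
            # Close the most recent matching element and anything inside it.
            index = len(open_tags) - 1 - open_tags[::-1].index(tag)
            del self._open[index:]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attach the text to the innermost enclosing element that has an id.
        for _, element_id in reversed(self._open):
            if element_id is not None:
                self.record[element_id] = self.record.get(element_id, "") + text
                break


source = """<html>
<head>
<title>This is an example website</title>
</head>
<body>
<div id='name'><h1>Yu-Chang Ho</h1></div>
<hr height="2px">
<h2>Education</h2>
<div id='edu1'>M.S.in Computer Science, UCD</div>
<div id='edu2'>B.S.in Computer Science, NCU</div>
<h2>Work Experiences</h2>
<div id='work1'>Consultant in ECLAC, UN</div>
</body>
</html>"""

extractor = IdTextExtractor()
extractor.feed(source)
row = extractor.record
print(row)
```

Each key of `row` becomes a column name and each value becomes a cell. The title and the `<h2>` headings are skipped because they sit outside any element with an `id`.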
Web Crawler
Content Parsing
Basic Cleaning
Web-scraper
Application Programming Interface (API)??
The API is a service to easily access data!
Data comes in a uniform format such as CSV, JSON, or XML.
Not every website provides this service......
Providing an API indicates the willingness of the company/website to share its data!
id,name,price (USD),sold,country
'MCR423578486',Parrot Drone Bebop 2 Fpv Wi-fi...,762,4.0,Costa Rica
'MCR423607385',Drone Dji Mavic Air Camera - C...,1150.0,,Costa Rica
'MCR423609599',Drone Cuadricóptero,34,21.0,Costa Rica
'MCR423598615',Minidrone Quadcopter,33,3.0,Costa Rica
'MCR423541901',Parrot Minidrones Rolling...,99.72736445485222,1.0,Costa Rica
'MCR423630621',X8 Pro Dron Cargador De Bateria Nuevo,8.310613704571017,,Costa Rica
'MCR423632316',Drone Sharper Image.con Cámara.,108.03797815942323,,Costa Rica
'MCR423632295',Drone Mini X10. ,36.56670030011248,,Costa Rica
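A short sketch of reading such CSV data with Python's built-in `csv` module. The two data rows are copied from the sample above. The `load_items` helper, the `price_usd` key, and the choice to treat an empty `sold` field as "unknown" are assumptions made for this illustration.

```python
import csv
import io

# Header plus two rows from the sample; a real script would open a file instead.
sample = """id,name,price (USD),sold,country
'MCR423607385',Drone Dji Mavic Air Camera - C...,1150.0,,Costa Rica
'MCR423598615',Minidrone Quadcopter,33,3.0,Costa Rica
"""


def load_items(text):
    """Parse CSV text into a list of dictionaries with typed values."""
    items = []
    # quotechar="'" strips the single quotes that wrap the id column.
    for row in csv.DictReader(io.StringIO(text), quotechar="'"):
        items.append({
            "id": row["id"],
            "name": row["name"].strip(),
            "price_usd": float(row["price (USD)"]),
            # An empty "sold" field is kept as None rather than guessed as 0.
            "sold": int(float(row["sold"])) if row["sold"] else None,
            "country": row["country"],
        })
    return items


items = load_items(sample)
print(items)
```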
{
"id": "MSV418788",
"name": "Accesorios Náuticos",
"picture": null,
"permalink": null,
"total_items_in_this_category": 0,
"children_categories": [],
"attribute_types": "none",
"meta_categ_id": null,
"attributable": false
}
A set of key/value pairs.
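As a small illustration, the sample above can be loaded with Python's built-in `json` module, after which every value is reachable through its key. The variable names are made up for this sketch.

```python
import json

# The JSON text from the sample above, as an API might return it.
response_text = """{
  "id": "MSV418788",
  "name": "Accesorios Náuticos",
  "picture": null,
  "permalink": null,
  "total_items_in_this_category": 0,
  "children_categories": [],
  "attribute_types": "none",
  "meta_categ_id": null,
  "attributable": false
}"""

category = json.loads(response_text)

# JSON null, false, and [] become Python None, False, and an empty list.
print(category["name"])
print(category["picture"] is None)
print(len(category["children_categories"]))
```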
<?xml version="1.0" encoding="UTF-8"?>
<root>
<row>
<id>'MCR423578486'</id>
<name>Parrot Drone Bebop 2 Fpv ...</name>
<symbol (original)>CRC</symbol (original)>
<price (original)>459000.0</price (original)>
<price (USD)>762.9143380796195</price (USD)>
<sold>4.0</sold>
<country>Costa Rica</country>
</row>
</root>
Similar to HTML.
An API always comes with documentation on how to use it:
The address to access
The parameters you can customize
(sorting, filtering, ......)
The format of the response
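The three pieces of documentation above usually end up in a single request URL. The sketch below only builds that URL. The address `api.example.com` and the parameter names `q`, `sort`, `limit`, and `format` are hypothetical, since each API defines its own in its documentation.

```python
from urllib.parse import urlencode

# Hypothetical address; the real one comes from the API documentation.
BASE_URL = "https://api.example.com/items/search"


def build_request_url(base_url, **params):
    """Append the given parameters to the address as a query string."""
    # Parameters left as None are simply omitted from the URL.
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{base_url}?{query}" if query else base_url


url = build_request_url(BASE_URL, q="drone", sort="price_asc", limit=50, format="json")
print(url)
```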
Has limitations on access:
token = '72b17b7e17b58e9de79c83678737d418'
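How the token is sent depends on the API. As one common possibility, and purely as an assumption here, it may travel in an `Authorization` header. The sketch builds such a request without sending it; the address and the `authorized_request` helper are hypothetical, and other APIs expect the token as a URL parameter instead.

```python
import urllib.request


def authorized_request(url, token):
    """Build (but do not send) a request that carries the access token."""
    # Assumption: the API accepts the token as a Bearer header.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


request = authorized_request("https://api.example.com/items", "YOUR-TOKEN")
print(request.get_header("Authorization"))
```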
| API | Web-scraping |
|---|---|
| Permission to use the data is granted | Permission to use the data is not granted |
| No need to make a parser | Need to design a parser for the webpage source |
| No need to clean the data | Need to make sure the data obtained is correct |
| Easy to use (from a programmer's perspective), but not always provided | Some companies can prevent the data from being scraped |
| Has rules to follow, so not a security issue for the website | A kind of attack on their service! |
Of course, if provided......
| Target | Data Points through Scraping | Data Points through API |
|---|---|---|
| MercadoLibre (Items) | 192092 (192K) | 53423091 (53.4M) |
| MercadoLibre (Sellers) | 25679 (25K) | 2474157 (2.4M) |
| Kiva | N/A | 1684119 (1.6M) |
| | N/A | 104724059 (104M) |
| Freelancer.com | N/A | 191424 (191K) |
| WORKANA | 20456 (20K) | N/A |
Over 150 Million Data Points!
Database
Multi-threaded Crawler
Content Parser
Basic Data Cleaning
Normalization
Data File Output
Visualization!
Web-scraper
Error Handling & Retry
De-dupe
Aggregation
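Three of the boxes above (multi-threaded crawler, error handling & retry, de-dupe) can be sketched together in a few lines. This is an illustration under stated assumptions, not the pipeline used in the project: `fetch` stands for whatever function downloads one URL, and `fetch_with_retry` and `crawl` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url); on an exception, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # give up and let the caller record the failure
            time.sleep(delay)


def crawl(urls, fetch, workers=4):
    """Fetch every distinct URL in parallel; return (results, failures)."""
    unique_urls = list(dict.fromkeys(urls))  # de-dupe, keeping first-seen order
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {url: pool.submit(fetch_with_retry, fetch, url) for url in unique_urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:
                failures[url] = error
    return results, failures
```

In real use `delay` would be larger than zero, so that retries do not hammer the server.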
Time-consuming!!!
Magic happens in background!
Opening a webpage is essentially downloading files from the server.
Google's
Web Server
1. Open www.google.com
Your Browser
2. Send the webpage source & the browser will run it
3. The Result
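The same download can be done without a browser. A minimal sketch with Python's standard library follows; `download_page` is a hypothetical helper name, and the fallback to UTF-8 when the server does not state an encoding is an assumption.

```python
import urllib.request


def download_page(url, timeout=10):
    """Download the resource at `url` and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the encoding the server declares, or assume UTF-8 if none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Example (requires a network connection):
# source = download_page("https://www.google.com")
# print(source[:200])
```

Unlike the browser, this code only receives the webpage source and does not run it, so content that is produced by scripts in the page will be missing.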
Sample Website:
My Teammates (from Left to Right):
Matthew Reese, Karla Rascon-Garcia, Me, Veronika Vilgis, Xin Jin
Parsing Algorithm
Record URL to log
Continue to next URL
Failure!
Write Data
Record Data to log
Input
Failure!
Finished?
Log Files are your best friend!
Yes!
No
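One possible reading of the flowchart above, sketched as a loop: a parsing failure records the URL to the log and moves on, a writing failure records the data to the log, and the loop ends when the input is exhausted. The `parse` and `write` callables and the `run_scraper` name are hypothetical placeholders.

```python
import logging


def run_scraper(urls, parse, write, logger):
    """Process every URL; log failures instead of stopping. Returns failed URLs."""
    failed_urls = []
    for url in urls:
        try:
            record = parse(url)  # Parsing Algorithm
        except Exception as error:
            # Failure! -> Record URL to log -> Continue to next URL
            logger.error("parse failed, url=%s, error=%s", url, error)
            failed_urls.append(url)
            continue
        try:
            write(record)  # Write Data
        except Exception as error:
            # Failure! -> Record Data to log, so the record is not lost
            logger.error("write failed, data=%r, error=%s", record, error)
    return failed_urls  # Finished? Yes!
```

With `logging.basicConfig(filename="scraper.log")` called beforehand and `logging.getLogger("scraper")` passed as `logger`, the failed URLs and unwritten records end up in a file that can be replayed later.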