API, Web-scraping, and Workshop
Yu-Chang Ho (Andy)
Oct. 15, 2019
UC Davis
M.S. in Computer Science @ University of California, Davis, CA, USA
Former Consultant @ ECLAC, United Nations
data APIs
database dump files or static data file downloads
web-scraping
Application Programming Interface (API)??
An API is a service for easily accessing data!
The data comes in a uniform format such as CSV, JSON, or XML.
Not all websites provide this service.
Providing one indicates that the company/website is willing to share its data!
{
"id": "MSV418788",
"name": "Accesorios Náuticos",
"picture": null,
"permalink": null,
"total_items_in_this_category": 0,
"children_categories": [],
"attribute_types": "none",
"meta_categ_id": null,
"attributable": false
}
A set of key/value pairs
(like a dictionary in Python).
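To make the dictionary comparison concrete, here is a minimal sketch that loads the example response above with Python's standard `json` module. The variable names are only illustrative.

```python
import json

# The example response from the slide, as the raw text an API would return.
raw_response = """
{
  "id": "MSV418788",
  "name": "Accesorios Náuticos",
  "picture": null,
  "permalink": null,
  "total_items_in_this_category": 0,
  "children_categories": [],
  "attribute_types": "none",
  "meta_categ_id": null,
  "attributable": false
}
"""

category = json.loads(raw_response)  # JSON object -> Python dict

print(type(category).__name__)       # the parsed result is a dict
print(category["name"])              # a value is looked up by its key
print(category["picture"] is None)   # JSON null becomes Python None
```

Note how JSON `null`, `false`, and `[]` map to Python `None`, `False`, and an empty list.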
Always comes with documentation on how to use it:
The address to access
Options you can customize (sorting, filtering, ...)
The format of the response
Limitations on access:
token = '72b17b7e17b58e9de79c83678737d418'
(Crawling??)
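The documented pieces of an API (the address, the customizable options, the response format, and the access token) usually end up combined in one request URL. The sketch below only builds such a URL and sends nothing. The address `https://api.example.com/categories` and the parameter names `sort`, `limit`, `format`, and `token` are assumptions for illustration; the real ones come from each API's documentation.

```python
from urllib.parse import urlencode

# Hypothetical values: the base address and parameter names are placeholders,
# not a real service. Real names come from the API's documentation.
BASE_URL = "https://api.example.com/categories"
TOKEN = "72b17b7e17b58e9de79c83678737d418"  # sample token shown on the slide


def build_request_url(base_url, token, **options):
    """Combine the address, the customizable options, and the access token."""
    params = dict(options)
    params["token"] = token
    return base_url + "?" + urlencode(params)


url = build_request_url(BASE_URL, TOKEN, sort="name", limit=10, format="json")
print(url)
```

Using `urlencode` rather than joining strings by hand takes care of escaping characters that are not allowed in a URL.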
Collecting a list of webpage links (Uniform Resource Locators, URLs).
We call such a program a "Web Crawler".
[Diagram: Web Crawler, World-wide Web (WWW), List of Targets to Visit]
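As a rough illustration of what a crawler does, the sketch below collects every link on one page and turns it into a full URL, producing a list of targets to visit. It uses only Python's standard library. The page content and the `example.com` addresses are made up, and a real crawler would download each page before parsing it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links such as "/users/1" become absolute URLs.
                self.links.append(urljoin(self.page_url, value))


def collect_links(page_url, html_text):
    collector = LinkCollector(page_url)
    collector.feed(html_text)
    return collector.links


# Hypothetical page content; a real crawler would download it first.
sample_html = """
<html><body>
  <a href="/users/1">User 1</a>
  <a href="/users/2">User 2</a>
  <a href="https://example.com/about">About</a>
</body></html>
"""

targets = collect_links("https://example.com/users/", sample_html)
print(targets)
```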
Getting the data from the web!
You will have a target.
Wikipedia
Scraping is a form of copying, in which specific data is gathered and copied from the web.
[Diagram: Web-scraper (Web Crawler, Content Parsing, Basic Cleaning) and the World-wide Web (WWW)]
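The content parsing and basic cleaning stages named in the diagram could look roughly like the sketch below. It is deliberately simplified: the page source is a made-up string, the crawling stage is left out, and a regular expression stands in for a proper HTML parser, which is only reasonable for very small, predictable pages.

```python
import html
import re


def parse_content(page_source):
    """Content parsing: pull the text inside the first <h1> element."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", page_source, re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else ""


def basic_clean(text):
    """Basic cleaning: decode HTML entities and normalize whitespace."""
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical page source with messy spacing and an escaped ampersand.
page_source = "<html><body><h1>\n  Tom &amp; Jerry   Shop  </h1></body></html>"

raw = parse_content(page_source)
print(repr(raw))
print(repr(basic_clean(raw)))
```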
What else?
<html>
<head>
<!-- Settings, link files -->
</head>
<body>
<!-- Main Content -->
</body>
</html>
CSS (Cascading Style Sheets) is a style sheet language that defines the style (font size, font weight, ...) of an HTML element.
.bold-text {
font-weight: bold;
}
<div class="bold-text">Test 1</div>
<div class="bold-text">Test 2</div>
A webpage is essentially a text file.
<html>
<head>
<title>This is an example website</title>
</head>
<body>
<div id='name'><h1>Yu-Chang Ho</h1></div>
<hr height="2px">
<h2>Education</h2>
<div id='edu1'>M.S. in Computer Science, UCD</div>
<div id='edu2'>B.S. in Computer Science, NCU</div>
<h2>Work Experiences</h2>
<div id='work1'>Consultant in ECLAC, UN</div>
</body>
</html>
An HTML element (tag):
<div id='name'>Yu-Chang Ho</div>
<div> is a type of element, a container for other elements or text.
Other examples: <table>, <img>, <p>, ...
id='name' is an identifier of a specific element.
<html>
<head>
<title>This is an example website</title>
</head>
<body>
<div id='name'><h1>Yu-Chang Ho</h1></div>
<hr height="2px">
<h2>Education</h2>
<div id='edu1'>M.S. in Computer Science, UCD</div>
<div id='edu2'>B.S. in Computer Science, NCU</div>
<h2>Work Experiences</h2>
<div id='work1'>Consultant in ECLAC, UN</div>
</body>
</html>
| name | edu1 | edu2 | work1 |
|---|---|---|---|
| Yu-Chang Ho | M.S. in Computer Science, UCD | B.S. in Computer Science, NCU | Consultant in ECLAC, UN |
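One possible way to get from the page source to the table row above is to collect the text of every `<div>` that carries an `id`. The sketch below does that with Python's built-in `html.parser`. The class name `IdTextParser` is made up for this example, the page source is copied into a string with the degree names spaced as in the table, and the approach assumes the `<div>` elements are not nested inside each other.

```python
from html.parser import HTMLParser

PAGE_SOURCE = """
<html>
<head><title>This is an example website</title></head>
<body>
<div id='name'><h1>Yu-Chang Ho</h1></div>
<hr height="2px">
<h2>Education</h2>
<div id='edu1'>M.S. in Computer Science, UCD</div>
<div id='edu2'>B.S. in Computer Science, NCU</div>
<h2>Work Experiences</h2>
<div id='work1'>Consultant in ECLAC, UN</div>
</body>
</html>
"""


class IdTextParser(HTMLParser):
    """Collect the text found inside every <div> that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.current_id = None  # id of the <div> we are currently inside
        self.row = {}           # maps each id to the text collected for it

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if tag == "div" and element_id:
            self.current_id = element_id
            self.row[element_id] = ""

    def handle_endtag(self, tag):
        if tag == "div":
            self.current_id = None

    def handle_data(self, data):
        # Text outside an identified <div> (headings, title) is ignored.
        if self.current_id is not None:
            self.row[self.current_id] += data.strip()


parser = IdTextParser()
parser.feed(PAGE_SOURCE)
print(parser.row)
```

Each key of `parser.row` corresponds to a column header of the table, and each value to the cell below it.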
1. Observe the webpage source yourself
2. Understand the pattern of the URL
3. Implement the web-scraper
4. Create the data file
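Steps 2 and 4 can be sketched in a few lines. The URL pattern `https://example.com/users/{user_id}` is purely hypothetical, and step 3 (downloading and parsing each page) is replaced by a hand-written row, so this only shows the shape of the workflow rather than a complete scraper.

```python
import csv
import io

# Hypothetical URL pattern: assume each profile lives at .../users/<number>.
URL_PATTERN = "https://example.com/users/{user_id}"


def build_urls(first_id, last_id):
    """Step 2: generate every target URL from the pattern."""
    return [URL_PATTERN.format(user_id=i) for i in range(first_id, last_id + 1)]


def write_csv(rows, fieldnames, out_file):
    """Step 4: write the scraped rows (a list of dicts) as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


urls = build_urls(1, 3)

# Step 3 would download and parse each URL; a hand-written row stands in here.
rows = [{"name": "Yu-Chang Ho", "edu1": "M.S. in Computer Science, UCD"}]

# An in-memory buffer keeps the example free of side effects;
# use open("profiles.csv", "w", newline="") to create a real file.
buffer = io.StringIO()
write_csv(rows, ["name", "edu1"], buffer)

print(urls)
print(buffer.getvalue())
```

The `csv` module quotes the education field automatically because it contains a comma.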
Please note that not all browsers use the same name!
Please visit the following website to download the source code:
Then, please review the code in "example_user_profile.ipynb".
| API | Web-scraping |
|---|---|
| Permission to use the data is granted | Permission to use the data is not granted |
| No need to write a parser | Need to design a parser for the webpage source |
| No need to clean the data | Need to make sure the data obtained is correct |
| Easy to use (from a programmer's perspective), but not always provided | Some companies can prevent the data from being scraped |
| Has rules to follow, so not a security issue for the website | A kind of attack on their service! |
Think about what will happen when you enter a URL in your browser and hit Enter.
Magic happens in the background!
Opening a webpage is essentially downloading files from the server.
[Diagram: Your Browser and Google's Web Server]
1. Your browser opens www.google.com
2. The server sends the webpage source & the browser runs it
3. The result
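To see that opening a webpage really is a file download, the sketch below plays both roles on one machine: a tiny local server stands in for the remote web server, and `download` does what the browser does before rendering. Everything here is an illustration: the page content, the handler, and the local address are assumptions, and a real browser additionally fetches images, style sheets, and scripts and then runs the result.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

PAGE = b"<html><body><h1>Hello from the server</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """A tiny stand-in for a web server: it answers every GET with PAGE."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def download(url):
    """What the browser does first: request the URL and read the file sent back."""
    opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
    with opener.open(url, timeout=5) as response:
        return response.status, response.read().decode("utf-8")


# Port 0 asks the operating system for any free port on this machine.
server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status, source = download(f"http://127.0.0.1:{server.server_port}/")

server.shutdown()
server.server_close()

print(status)
print(source)
```

The text printed at the end is the same webpage source a browser would receive and then render.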
Please note that not all browsers use the same name!
Looks the same as v1, right?