Web Scraping with Python
Kartik Singhal
kartik@brown.edu
Outline
Webpages
Canvas of the Internet
- HTML
- Hyperlinks
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p>Hello world!</p>
<p>
<a href="https://www.wikipedia.org/">A link to Wikipedia!</a>
</p>
</body>
</html>
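As a rough illustration of how a program sees the page above, the sketch below pulls the hyperlink out of that same HTML using only Python's standard library. The class name `LinkCollector` and the variable `PAGE` are made up for this example and are not part of the talk.

```python
from html.parser import HTMLParser

# The example page from the slide, stored as a string (no network needed).
PAGE = """<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p>Hello world!</p>
<p>
<a href="https://www.wikipedia.org/">A link to Wikipedia!</a>
</p>
</body>
</html>"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag we are currently inside
        self._text = []    # text pieces seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(PAGE)
links = collector.links
print(links)
```

Following hyperlinks like this one from page to page is what lets a scraper move across a site.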
Why Scrape the Web?
Document Object Model
The DOM
© Wikipedia
© Douglas Crockford
Python
- Easy to learn, almost English-like
- Great for quick experiments
- Awesome documentation and community
- Huge collection of modules (libraries):
  - Requests
  - BeautifulSoup
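A minimal sketch of how the two modules might work together: Requests downloads a page and BeautifulSoup parses it. The function names `fetch_html` and `title_and_links` are invented for illustration, and both packages are assumed to be installed (for example with `pip install requests beautifulsoup4`).

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def title_and_links(html):
    """Return the page title and a list of (href, text) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    links = [
        (a["href"], a.get_text(strip=True))
        for a in soup.find_all("a", href=True)  # skip <a> tags without href
    ]
    return title, links


# Example usage (needs network access, so it is left as a comment):
# title, links = title_and_links(fetch_html("https://www.wikipedia.org/"))
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing step on a saved HTML string without hitting the website each time.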
Analyzing Webpages
- Tags <…>
- #id
- .class
- text
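The four handles listed above (tag, id, class, text) can each be used to pick elements out of a page. Below is a small sketch with BeautifulSoup, run on an invented snippet of HTML; the ids, classes and file names in it are assumptions for illustration, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Made-up markup, just to have something to select from.
HTML = """
<div id="gallery">
  <p class="caption">First caption</p>
  <p class="caption">Second caption</p>
  <img src="one.gif" alt="One">
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

by_tag = soup.find_all("p")                   # by tag: every <p> element
by_id = soup.find(id="gallery")               # by #id: the element with id="gallery"
by_class = soup.select("#gallery .caption")   # by .class, using a CSS selector
texts = [p.get_text(strip=True) for p in by_class]  # the text inside each match
alt = soup.find("img")["alt"]                 # attributes are read like dict keys

print(texts, alt)
```

A browser's "Inspect element" tool is a convenient way to find which tag, id or class wraps the data you are after.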
Code Demo
https://github.com/k4rtik/fetch-topicals
Output Formats
JSON
{
"1976": {
"0": {
"alt": "Use your head!",
"caption": "When helmets were made compulsory in Bombay",
"url": "http://www.amul.com/files/hits/amul-hits-1251.gif"
},
"1": {
"alt": "The Big Payoff",
"caption": "The Big Payoff",
"url": "http://www.amul.com/files/hits/amul-hits-1250.gif"
}
},
"1986": {
"0": {
"alt": "Wahi hota hai jo Manzure Ilahi hota hai.",
"caption": "The Pakistani cricketer Mansur Ilahi in great form.",
"url": "http://www.amul.com/files/hits/amul-hits-1145.jpg"
},
"1": {
"alt": "In all Holmes Amul's elementary, my dear.",
"caption": "During the time when a film on Shrelock Holms was being screened.",
"url": "http://www.amul.com/files/hits/amul-hits-1144.jpg"
}
}
}
# in Python
import json
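A short sketch of how nested output like the above could be written and read back with the `json` module. The dictionary holds the first entry from the slide; the variable names are chosen for this example only.

```python
import json

# One entry from the output above, as a nested Python dictionary.
topicals = {
    "1976": {
        "0": {
            "alt": "Use your head!",
            "caption": "When helmets were made compulsory in Bombay",
            "url": "http://www.amul.com/files/hits/amul-hits-1251.gif",
        }
    }
}

# Serialize to a JSON string; indent and sort_keys keep it human-readable.
text = json.dumps(topicals, indent=2, sort_keys=True)

# Parse the string back into Python dictionaries.
restored = json.loads(text)
print(restored["1976"]["0"]["alt"])
```

To write straight to a file instead of a string, `json.dump(topicals, file_object)` takes an open file as its second argument.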
Others
- CSV
- Text
- Your own format
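For comparison, here is a sketch of the same kind of data flattened into CSV with the standard `csv` module. The column names (`year`, `alt`, `url`) are an assumption for this example, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Two scraped records, flattened so each row carries its year.
rows = [
    {"year": "1976", "alt": "Use your head!",
     "url": "http://www.amul.com/files/hits/amul-hits-1251.gif"},
    {"year": "1976", "alt": "The Big Payoff",
     "url": "http://www.amul.com/files/hits/amul-hits-1250.gif"},
]

buffer = io.StringIO()  # stands in for open("topicals.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["year", "alt", "url"])
writer.writeheader()    # first line: year,alt,url
writer.writerows(rows)  # one line per record

csv_text = buffer.getvalue()
print(csv_text)
```

CSV suits flat, table-like data that will be opened in a spreadsheet; JSON is usually the easier fit when the data is nested.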
Challenges
- Dealing with bad markup on webpages
- JavaScript-driven content
- Webpages that require login
But there is sometimes hope: APIs!
Other Tutorials/Resources
- https://first-web-scraper.readthedocs.org/en/latest/
  Covers basics of command line, Python and uses CSV module
- https://github.com/ThaWeatherman/scrapers/
  Collection of scrapers for sites like craigslist, reddit, etc.
- Lots of other resources, Google is your friend!
Web Scraping with Python
By Kartik Singhal
Workshop given to participants of Citizen + Virtual at Brown Design Workshop on Jan 16, 2016