Web Scraping

with

Kartik

kartik@brown.edu

Outline

Webpages

Canvas of the Internet

  • HTML
  • Hyperlinks
<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
    <p>
      <a href="https://www.wikipedia.org/">A link to Wikipedia!</a>
    </p>
  </body>
</html>
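
A minimal sketch of how a program might read the page above: it pulls out every hyperlink with its link text. It uses Python's built-in `html.parser` so it runs without installing anything; the `LinkCollector` class is a hypothetical helper written for this illustration, not part of any library.

```python
from html.parser import HTMLParser

# The sample page from the slide above, as a string.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
    <p>
      <a href="https://www.wikipedia.org/">A link to Wikipedia!</a>
    </p>
  </body>
</html>"""


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Following the collected links from page to page is, roughly, how a crawler moves across the web.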

Why Scrape the Web?

Document Object Model
The DOM
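
One way to picture the DOM is as a tree of nested elements. The sketch below prints the sample "Hello world!" page as an indented tree. `TreePrinter` is a hypothetical helper built on the standard-library `html.parser`, and it assumes every tag in the markup is explicitly closed.

```python
from html.parser import HTMLParser

# The sample page from earlier, compacted onto one line.
PAGE = (
    '<html><head><title>This is a title</title></head>'
    '<body><p>Hello world!</p>'
    '<p><a href="https://www.wikipedia.org/">A link to Wikipedia!</a></p>'
    '</body></html>'
)


class TreePrinter(HTMLParser):
    """Records one indented line per element or text node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this element are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


printer = TreePrinter()
printer.feed(PAGE)
print("\n".join(printer.lines))
```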

© Wikipedia

© Douglas Crockford

Python

  • Easy to learn, almost English-like
  • Great for quick experiments
  • Awesome documentation and community
  • Huge collection of modules (libraries):
    • Requests
    • BeautifulSoup

Analyzing Webpages

  • Tags <…>

  • #id

  • .class

  • text
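
A rough sketch of the four handles above (tag, `#id`, `.class`, text) in action. The markup and the `find_text` helper are made up for illustration, and the code uses the standard-library `html.parser` rather than BeautifulSoup so it runs as-is. It assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this example.
HTML = """
<div id="hits">
  <p class="caption">The Big Payoff</p>
  <p class="caption">Use your head!</p>
  <p class="note">Not a caption</p>
</div>
"""


class TextFinder(HTMLParser):
    """Collects the text of elements matching a tag, #id or .class selector."""

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.results = []
        self._depth = 0  # greater than 0 while inside a matching element
        self._buffer = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        if self.selector.startswith("#"):
            return attrs.get("id") == self.selector[1:]
        if self.selector.startswith("."):
            return self.selector[1:] in (attrs.get("class") or "").split()
        return tag == self.selector

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested tag inside a match
        elif self._matches(tag, attrs):
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the collected text and collapse runs of whitespace.
                self.results.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def find_text(html, selector):
    finder = TextFinder(selector)
    finder.feed(html)
    return finder.results


print(find_text(HTML, "p"))         # by tag
print(find_text(HTML, ".caption"))  # by class
print(find_text(HTML, "#hits"))     # by id
```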

Code Demo

https://github.com/k4rtik/fetch-topicals

Output Formats

JSON

{
    "1976": {
        "0": {
            "alt": "Use your head!",
            "caption": "When helmets were made compulsory in Bombay",
            "url": "http://www.amul.com/files/hits/amul-hits-1251.gif"
        },
        "1": {
            "alt": "The Big Payoff",
            "caption": "The Big Payoff",
            "url": "http://www.amul.com/files/hits/amul-hits-1250.gif"
        }
    },
    "1986": {
        "0": {
            "alt": "Wahi hota hai jo Manzure Ilahi hota hai.",
            "caption": "The Pakistani cricketer Mansur Ilahi in great form.",
            "url": "http://www.amul.com/files/hits/amul-hits-1145.jpg"
        },
        "1": {
            "alt": "In all Holmes Amul's elementary, my dear.",
            "caption": "During the time when a film on Shrelock Holms was being screened.",
            "url": "http://www.amul.com/files/hits/amul-hits-1144.jpg"
        }
    }
}
# in Python
import json
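
A small sketch of producing output shaped like the JSON above with the `json` module. The variable names are arbitrary, and only one record is included to keep it short.

```python
import json

# Scraped records grouped by year, mirroring the structure shown above.
topicals = {
    "1976": {
        "0": {
            "alt": "Use your head!",
            "caption": "When helmets were made compulsory in Bombay",
            "url": "http://www.amul.com/files/hits/amul-hits-1251.gif",
        },
    },
}

# Serialize to an indented string; json.dump(topicals, file_object)
# would write the same data to an open file instead.
text = json.dumps(topicals, indent=4, sort_keys=True)
print(text)

# Reading the string back gives an equal dictionary.
restored = json.loads(text)
```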

Others

  • CSV
  • Text
  • Your own format
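
For comparison, the same kind of records could be written as CSV with the standard `csv` module. This is only a sketch: the column names are an assumption, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

# Two records taken from the JSON sample, flattened into rows.
rows = [
    {"year": "1976", "alt": "Use your head!",
     "url": "http://www.amul.com/files/hits/amul-hits-1251.gif"},
    {"year": "1976", "alt": "The Big Payoff",
     "url": "http://www.amul.com/files/hits/amul-hits-1250.gif"},
]

buffer = io.StringIO()  # stands in for an open file
writer = csv.DictWriter(buffer, fieldnames=["year", "alt", "url"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```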

Challenges

  • Dealing with bad markup on webpages
  • JavaScript-driven content
  • Webpages that require login

 

But there is sometimes hope: APIs!

Other Tutorials/Resources

  • https://first-web-scraper.readthedocs.org/en/latest/
    Covers basics of command line, Python and uses CSV module
  • https://github.com/ThaWeatherman/scrapers/
    Collection of scrapers for sites like craigslist, reddit, etc.
  • Lots of other resources; Google is your friend!

Web Scraping with Python

By Kartik Singhal

Workshop given to participants of Citizen + Virtual at Brown Design Workshop on Jan 16, 2016
