Data scraping with Python3

Anna Keuchenius

PhD candidate Sociology

@AKeuchenius

 

Javier Garcia-Bernardo

PhD candidate Political Science

@JavierGB_com

Computational Social Science Amsterdam

CSSamsterdam.github.io

Slides available at

slides.com/jgarciab/web_scraping

github.com/jgarciab/web_scraping

Notebook available at

https://CSSamsterdam.github.io/

Structure

  • Beginners (10:00-14:30):
    • What is data scraping
    • Static websites.
      • HTML
      • Libraries: requests and BeautifulSoup
      • Extracting tables from websites
    • Dynamic websites
      • APIs and Javascript
      • Behaving like a human: selenium
      • Tapping into the APIs: requests and ad-hoc libraries
    • Legality and morality of data scraping

  • Advanced (15:00-17:00):
    • Tapping into hidden APIs: requests
    • Session IDs
    • Setting up crawlers:
      • CRON jobs
      • Processes
      • scrapy
    • Common problems when scraping:
      • Proxies
      • Speed up requests
      • Selenium problems (scroll down, ElementNotVisible exception etc.)
      • Robust scraping
  • CREA (17:00 onwards)

What is data scraping?

  • Easy
  • Great return on investment
  • A very powerful experience
  • Extracting data from the web in an automatic manner

Part 1: Static websites

https://www.boulderhumane.org/animals/adoption/dogs

1.1 HTML

Drawing from: http://www.scriptingmaster.com/html/basic-structure-HTML-document.asp

  • html: tells the browser that the document is a web page
  • head: metadata, style, title...
  • body: content

1.1 HTML: Tree structure

Tag list:

  • div → basic container
  • a → link to url
  • p → paragraph
  • h1/h2/h3/h4/h5/h6 → titles   
  • img → image
  • table → tables

Attribute names:

  • href → url
  • src → source of image
  • class → usually sets the style
  • align / color... → gives style
  • id / name → names
  • title → text on hover
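
As a rough illustration of the tags and attributes listed above, the sketch below walks a tiny, made-up HTML page with Python's built-in html.parser and prints every tag, indented by its depth in the tree, together with its attributes. The page content and the TagCollector class are invented for this example; the workshop itself parses pages with BeautifulSoup.

```python
from html.parser import HTMLParser

# A tiny, made-up page using the tags and attributes listed above.
PAGE = """
<html>
  <head><title>Adoptable dogs</title></head>
  <body>
    <div class="dog" id="dog-1">
      <h1>Rex</h1>
      <p>Friendly and playful.</p>
      <a href="/dogs/rex" title="More about Rex">Details</a>
      <img src="/img/rex.jpg">
    </div>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Collect (depth, tag, attributes) for every opening tag."""

    # Tags that never have a closing tag, so they do not open a new level.
    VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag, dict(attrs)))
        if tag not in self.VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS:
            self.depth -= 1


collector = TagCollector()
collector.feed(PAGE)
for depth, tag, attrs in collector.tags:
    print("  " * depth + tag, attrs)
```

The indentation of the printed output mirrors the tree structure: div sits inside body, and h1, p, a and img sit inside div.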

1.2 How to extract the information?

Step 1: Download data --> library requests

Step 2: Parse data --> library BeautifulSoup

1.2.1 Requests

More info at http://docs.python-requests.org

1.2.2 BeautifulSoup

Let's get the intro from 

CSSamsterdam.github.io

find returns an HTML element

Some useful things to do with it

find_all returns a list of HTML elements
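
A minimal sketch of the difference between find and find_all, assuming the bs4 package is installed. The HTML snippet and its URLs are made up and stand in for a page downloaded with requests.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for a downloaded page.
html = """
<div id="intro">
  <h1>Computational Social Science Amsterdam</h1>
  <p class="lead">We organise workshops.</p>
  <a href="https://example.org/events">Events</a>
  <a href="https://example.org/people">People</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element (or None if nothing matches)
first_link = soup.find("a")
print(first_link.text)     # text inside the tag
print(first_link["href"])  # value of an attribute

# Filters can be combined: tag name plus attribute values
intro = soup.find("div", {"id": "intro"})
lead = intro.find("p", class_="lead")
print(lead.text)

# find_all returns a list of all matching elements
all_links = soup.find_all("a")
print([link["href"] for link in all_links])
```

Because find can return None, check the result before reading .text or an attribute when the element may be missing on some pages.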

1.2.2 Finding the right tag: Inspect element

Time to play

1.3: Tables from websites

https://en.wikipedia.org/wiki/List_of_sandwiches

import pandas as pd

# read_html takes the URL and returns a list of tables; [0] keeps the first table.
# header=0 means that the first line is the header.
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_sandwiches",
                      header=0)[0]

Part 2: Dynamic websites

- 2.1: APIs/Javascript

- 2.2: Behaving like a human

- 2.3: Tapping into the API

    - 2.3.1 Explicit APIs

    - Part 4: "Hidden" APIs (advanced)

2.1 API/JavaScript

Source: https://about.gitlab.com/2016/06/03/ssg-overview-gitlab-pages-part-1-dynamic-x-static/

API

HTTP request

https://www.uvm.edu/directory

Part 4 (advanced)

2.2 Behaving like a human

selenium

Requirements (one of the below):

 

Some characteristics of HTML scraping with Selenium:

  • It can handle JavaScript
  • It gets the HTML back after the JavaScript has been rendered
  • It can behave like a person
  • It can be slow

2.2 selenium

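from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a browser (assumes Firefox and geckodriver are installed)
driver = webdriver.Firefox()

# Get the xkcd website
driver.get("https://xkcd.com/")

# Find the title of the image
element = driver.find_element(By.XPATH, '//*[@id="comic"]/img')
element.get_attribute("title")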

Time to play

2.3: Tapping into the API

Source: https://about.gitlab.com/2016/06/03/ssg-overview-gitlab-pages-part-1-dynamic-x-static/

API

2.3 working with APIs

API

  • The company tells you the language of the server (the API)
  • You communicate with the server through HTTP requests
  • The company usually sets some restrictions
  • Example: the Twitter API, https://developer.twitter.com/en/docs/tweets/search/overview

 

 

 

2.3.1 Explicit APIs

Theory:

Time to play

3: Ethics and legality

Is it legal?

  • Possibly... recent court cases have supported the accused, even when the Terms of Use disallowed data scraping
  • Personal data cannot be protected by copyright (but... consult a lawyer)
  • Read the Terms of Use of the website
  • When dealing with personal data make sure to comply with the EU General Data Protection Regulation (GDPR)  
    • Data processing is lawful, fair and transparent.
    • For research: You are required to go through ethics review (IRB) before collecting the data.
  • Consult a lawyer in case of doubt

 

Is it ethical?

  • For personal use or research, it is typically ethical to scrape data (in my opinion)
  • Use rule #2: Don't be a dick:
    • Use an API if one is provided, instead of scraping data
    • Use a reasonable crawl rate (~2 requests per minute, or whatever robots.txt specifies) and scrape at night.
    • Identify your web scraper (e.g. headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5); myproject.github.io; Javier Garcia-Bernardo (garcia@uva.nl)"})
    • Ask for permission:
      • If the ToS or robots.txt disallows scraping
      • Before republishing the data

Adapted from: https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/
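
The crawl-rate and robots.txt points above can be checked in code. The sketch below uses the standard library's urllib.robotparser on a made-up robots.txt that is parsed offline; the user agent string, the example.org URLs and the helper names allowed and polite_delay are placeholders, and the 30-second default simply mirrors the "~2 requests per minute" suggestion.

```python
from urllib.robotparser import RobotFileParser

# Identify the scraper; the project name and e-mail are placeholders.
USER_AGENT = "myproject-scraper (myproject.github.io; me@example.org)"

# Made-up robots.txt, parsed offline for the example. Against a real site,
# use rp.set_url("https://<site>/robots.txt") followed by rp.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if robots.txt lets our user agent fetch this URL."""
    return rp.can_fetch(USER_AGENT, url)


def polite_delay(default=30):
    """Seconds to wait between requests: the robots.txt value, else a default."""
    delay = rp.crawl_delay(USER_AGENT)
    return delay if delay is not None else default


urls = ["https://example.org/animals", "https://example.org/private/data"]
to_fetch = [u for u in urls if allowed(u)]
print(to_fetch, polite_delay())

# In the real loop: download each URL with headers={"user-agent": USER_AGENT}
# and call time.sleep(polite_delay()) between requests.
```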

Break

Advanced workshop

4. Tapping into "hidden" APIs

Advantage: Very fast and more reliable

4: Finding the API: HTTP Requests

HEADER

  • Usually not needed, but you should identify yourself in case the website owner wants to contact you
  • The user-agent is often used to restrict access
    • You can change it in every request. Problem: ethics
  • Sometimes the website only works if you are redirected from another site (more on that later)

COOKIES

  • Usually not needed

PARAMETERS

  • This is the "language" of the API; the field names are what we want here
  • Sometimes you just add the parameters in the URL:

https://www.uvm.edu/directory/api/query_results.php?name=john+smith&department=
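
A minimal sketch of how the URL above can be built with requests, assuming the requests package is installed. The request is only prepared, not sent, so the final URL can be inspected first. The endpoint and field names come from the URL above; the user-agent string is a placeholder.

```python
import requests

# Endpoint and field names taken from the URL above.
API_URL = "https://www.uvm.edu/directory/api/query_results.php"
params = {"name": "john smith", "department": ""}
headers = {"user-agent": "myproject-scraper (me@example.org)"}  # placeholder identity

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", API_URL, params=params, headers=headers).prepare()
print(prepared.url)

# To actually send it (one request, remembering rule #2):
# response = requests.get(API_URL, params=params, headers=headers)
# print(response.text)
```

Passing the parameters as a dictionary lets requests take care of the URL encoding, which makes it easy to loop over many names later.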

4: Create queries

Step 1: Find the cURL command

Step 2: Create the Python requests: http://curl.trillworks.com/

Step 3: Get the data remembering rule #2

Session IDs

Easiest way to deal with session IDs: behave like a human, i.e. use selenium.

 

If that is too slow, use the "hidden API" method with variable parameters.

selenium

Detect session id from page

Again, use https://curl.trillworks.com/ and add SID as variable
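
One possible way to detect the session ID and pass it on as a variable is sketched below. It assumes the ID sits in a hidden form field named SID; the page source, the field layout and the extract_sid helper are made up for illustration, and real sites may store the ID in a cookie or in the URL instead.

```python
import re

# Made-up page source; in practice this is driver.page_source (selenium)
# or response.text (requests).
page_source = '<form><input type="hidden" name="SID" value="a1b2c3d4"></form>'


def extract_sid(html):
    """Return the value of the hidden SID field, or None if it is absent."""
    match = re.search(r'name="SID"\s+value="([^"]+)"', html)
    return match.group(1) if match else None


sid = extract_sid(page_source)

# Plug the session ID into the parameters of the request that was
# converted from cURL; the "query" field is a placeholder.
params = {"SID": sid, "query": "example"}
print(params)
```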

5. Crawlers

5.1: CRON jobs

- The software utility cron is a time-based job scheduler in Unix-like computer operating systems.
- Easy, robust
https://stackoverflow.com/questions/21648410/write-python-script-that-is-executed-every-5-minutes
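
A sketch of a script written to be started by cron: each run downloads once and stores the raw HTML in a file named after the start time. The crontab line in the comment, the paths, the directory name and the function names are all placeholders.

```python
# scrape.py: one scraping run, meant to be started by cron.
# Example crontab entry (paths are placeholders), running every 5 minutes:
#   */5 * * * * /usr/bin/python3 /home/user/scrape.py >> /home/user/scrape.log 2>&1
from datetime import datetime
from pathlib import Path

OUTPUT_DIR = Path("raw_html")  # placeholder directory


def output_path(now):
    """One file per run, named after the time the run started."""
    return OUTPUT_DIR / now.strftime("page_%Y%m%d_%H%M%S.html")


def run_once(download, now=None):
    """Download the page once and store the raw HTML."""
    now = now or datetime.now()
    path = output_path(now)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(download(), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Stand-in for the real download, e.g. requests.get(url).text
    run_once(lambda: "<html>placeholder</html>")
```

Keeping each run short and independent means a crash only loses one run; cron starts the next one as scheduled.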

5.2: Processes

- All run within Python

5.3: Scrapy

- An open source and collaborative framework for extracting the data you need from websites. In a fast, "simple", yet extensible way.


- Best solution, but steep learning curve

 

 

https://doc.scrapy.org/en/1.5/intro/overview.html

6. Advanced topics

6.1A: Proxies

import logging
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

# Collect the proxies and only log critical errors
req_proxy = RequestProxy()
req_proxy.set_logger_level(logging.CRITICAL)

# Request a website through a random proxy (link is the URL to download)
r = req_proxy.generate_proxied_request(link)

6.1B: Proxies

from tor_control import TorControl
import requests
tc = TorControl()
print(requests.get("https://api.ipify.org").text)
# > 163.172.162.106

tc.renew_tor()
print(requests.get("https://api.ipify.org").text)
# > 18.85.22.204

6.2: Speed up requests 

  • Use case: collecting information from many different websites.

  • Problem: requests is blocking (it waits until the website responds)

  • Solution: run many threads

    • But this is not straightforward

    • Best: grequests (asynchronous HTTP requests)

import grequests

urls = [
    'http://www.heroku.com','http://python-tablib.org', 'http://httpbin.org', 
    'http://python-requests.org', 'http://fakedomain/','http://kennethreitz.com'
]
rs = (grequests.get(u) for u in urls)
grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
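
For comparison, the "many threads" idea can be sketched with the standard library's ThreadPoolExecutor. This is an alternative to grequests, not what the slide uses. The fetch function below is a stand-in that only sleeps; in a real scraper it would call something like requests.get(url). The URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a blocking download such as requests.get(url)."""
    time.sleep(0.2)  # pretend the server takes 0.2 s to answer
    return url, 200


urls = ["http://example.org/page/{}".format(i) for i in range(10)]

start = time.perf_counter()
# Up to 10 downloads wait for their "servers" at the same time.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))  # results keep the order of urls
elapsed = time.perf_counter() - start

print(len(results), "responses in", round(elapsed, 2), "seconds")
```

Run one after another, the ten simulated downloads would take about two seconds; with threads they overlap. This fits the use case of many different websites, since each server still receives only one request.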

6.3: Dealing with selenium

6.3.1 Stale Element Exception

We get this exception when the element has been destroyed or has not been completely loaded yet. Possible solutions: refresh the website, or wait until the page loads.

## Option 1:
import selenium.common.exceptions
import selenium.webdriver
import selenium.webdriver.support.ui
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions

# Define a function that waits until an element is present in the page
def _wait_for_element(xpath, wait):
    try:
        polling_f = expected_conditions.presence_of_element_located((By.XPATH, xpath))
        elem = wait.until(polling_f)
    except selenium.common.exceptions.TimeoutException:
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" presence wait timeout.'.format(xpath))
    return elem

# Same, but waits until the element can be clicked
def _wait_for_element_click(xpath, wait):
    try:
        polling_f = expected_conditions.element_to_be_clickable((By.XPATH, xpath))
        elem = wait.until(polling_f)
    except selenium.common.exceptions.TimeoutException:
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" clickable wait timeout.'.format(xpath))
    return elem

# Define short and long timeouts (in seconds)
wait_timeouts = (30, 180)

# Open the driver (change the executable path to geckodriver_mac or geckodriver.exe).
# This is Selenium 3 syntax; Selenium 4 passes the path through a Service object.
driver = selenium.webdriver.Firefox(executable_path="./geckodriver")

# Define short and long waits (for the times you have to wait for the page to load)
short_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[0], poll_frequency=0.05)
long_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[1], poll_frequency=1)

# And this is how you get an element
element = _wait_for_element('//*[@id="selFundID"]', short_wait)

6.3.2 ElementNotVisibleException

Sometimes not all elements are loaded yet (e.g. those loaded by AJAX) and we need to wait. We could use time.sleep(), but for how long? Response times can be highly unstable.

Alternative solution: wait until a specific element is loaded on the page (an explicit wait).

More info at https://selenium-python.readthedocs.io/waits.html

6.3.3 Scrolling down to load the whole page

import time

def scroll_down(driver, scroll_pause_time=0.5):
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for the page to load
        time.sleep(scroll_pause_time)

        # Calculate the new scroll height and compare it with the last one;
        # stop when the page no longer grows
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

6.3.4 Downloading files without asking

from selenium import webdriver

options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list":
           [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
           "download.default_directory": "./download_directory/" ,
           "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)

# The driver path is passed the Selenium 3 way
driver = webdriver.Chrome("./chromedriver", options=options)

6.3.5 Headless Chrome (not opening more windows)

def enable_download_in_headless_chrome(driver, download_dir):
    # Add missing support for Chrome's "send_command" to the selenium webdriver
    driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')

    params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
    command_result = driver.execute("send_command", params)

[...]

options.add_experimental_option("prefs", profile)

options.add_argument("--headless")

driver = webdriver.Chrome("./chromedriver", options=options)
enable_download_in_headless_chrome(driver, "./download_directory/")

6.3.6 Pop-up windows (e.g. to log in)

from selenium.webdriver.common.by import By

# Click somewhere
driver.find_element(By.XPATH, "xxxx").click()

# Switch to the new window
driver.switch_to.window(driver.window_handles[1])

# Do whatever
driver.find_element(By.XPATH, 'xxxxxx').click()

# Go back to the main window
driver.switch_to.window(driver.window_handles[0])

6.4: Robust scraping

  • Don't make your scraper language dependent
     
  •  Save raw html
     
  •  Use drilldown method to identify/extract elements on the page
     
  • Avoid xpath (my opinion)
     
  • Track your progress, so that the scraper can crash and restart from where it left off
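
Two of the points above, saving the raw HTML and tracking progress, can be combined in a short sketch. The file names, the directory name and the functions load_done and scrape_all are invented for the example, and download stands for whatever function fetches a page (e.g. one built on requests).

```python
import json
from pathlib import Path


def load_done(progress_file):
    """Return the set of URLs that were already scraped."""
    path = Path(progress_file)
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def scrape_all(urls, download, progress_file="done.json", html_dir="raw_html"):
    """Scrape the URLs, skipping those finished in an earlier run."""
    done = load_done(progress_file)
    out = Path(html_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        if url in done:
            continue
        html = download(url)  # may raise; progress so far is already on disk
        # Save the raw HTML so it can be re-parsed later without re-downloading
        (out / "page_{}.html".format(i)).write_text(html, encoding="utf-8")
        done.add(url)
        Path(progress_file).write_text(json.dumps(sorted(done)), encoding="utf-8")
    return done
```

If the scraper crashes halfway, calling scrape_all again with the same arguments skips every URL already listed in the progress file.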

Discussion time

Workshop on data scraping - Amsterdam CSS

By Javier GB
