Data scraping with Python3

Anna Keuchenius

PhD candidate Sociology

@AKeuchenius

Javier Garcia-Bernardo

PhD candidate Political Science

@JavierGB_com

Computational Social Science Amsterdam

CSSamsterdam.github.io

Slides available at

slides.com/jgarciab/web_scraping

github.com/jgarciab/web_scraping

Notebook available at

https://CSSamsterdam.github.io/

Structure

Beginners (10:00-14:30):
- What is data scraping
- Static websites.
  - HTML
  - Libraries: requests and beautifulSoup
  - Extracting tables from websites
- Dynamic websites
  - APIs and Javascript
  - Behaving like a human: selenium
  - Tapping into the APIs: requests and ad-hoc libraries
- Legality and morality of data scraping

Structure

Advanced (15:00-17:00):
- Tapping into hidden APIs: requests
- Session IDs
- Setting up crawlers:
  - CRON jobs
  - Processes
  - scrapy
- Common problems when scraping:
  - Proxies
  - Speed up requests
  - Selenium problems (scroll down, ElementNotVisible exception etc.)
  - Robust scraping
CREA (17:00 onwards)

What is data scraping?

Easy

Great return on investment

A very powerful experience

Extracting data from the web in an automatic manner

Part 1: Static websites

https://www.boulderhumane.org/animals/adoption/dogs


1.1 HTML

Drawing from: http://www.scriptingmaster.com/html/basic-structure-HTML-document.asp

html tag tells the browser that it's a website

metadata, style, title...

content

Example: https://www.boulderhumane.org/animals/adoption/dogs

1.1 HTML: Tree structure


Tag list:

div → basic container
a → link to url
p → paragraph
h1/h2/h3/h4/h5/h6 → titles
img → image
table → tables

Attribute names:

href → url
src → source of image
class → usually sets the style
align / color... → gives style
id / name → names
title → text on hover

1.2 How to extract the information?

Step 1: Download data --> library requests

Step 2: Parse data --> library beautifulSoup

1.2.1 Requests

More info at http://docs.python-requests.org
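A minimal sketch of Step 1, assuming the requests library is installed. The helper name download_html, the optional session argument and the contact details in the user-agent string are illustrative choices, not something taken from the workshop notebook.

```python
import requests

# Hypothetical identification string; replace with your own project and e-mail
HEADERS = {"user-agent": "myproject.github.io; Your Name (you@example.org)"}


def download_html(url, session=None, timeout=10):
    """Download a page and return its HTML as text."""
    # Reuse a session if one is given, otherwise create a new one
    session = session or requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    # Raise an exception for 4xx/5xx answers instead of returning an error page
    response.raise_for_status()
    return response.text
```

Usage would look like html = download_html("https://www.boulderhumane.org/animals/adoption/dogs"); the returned text is what gets handed to the parser in Step 2.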

1.2.2 BeautifulSoup

Let's get the intro from

CSSamsterdam.github.io

1.2.2 BeautifulSoup

find returns an HTML element

Some useful things to do with it

find_all returns a list of HTML elements
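A small sketch of find and find_all, assuming the bs4 package is installed. The HTML snippet is made up so that the example runs without downloading anything; the class name "dog" and the animal names are hypothetical.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page (not the real shelter website)
html = """
<html>
  <body>
    <div class="dog">
      <a href="/animals/1">Rex</a>
      <img src="/img/rex.jpg" title="Rex, 3 years old">
    </div>
    <div class="dog">
      <a href="/animals/2">Luna</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element (or None if nothing matches)
first_dog = soup.find("div", {"class": "dog"})

# Some useful things to do with an element
name = first_dog.find("a").text                 # text inside the tag
link = first_dog.find("a")["href"]              # value of an attribute
img_title = first_dog.find("img").get("title")  # .get returns None if missing

# find_all returns a list with every matching element
all_names = [a.text for a in soup.find_all("a")]
```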

1.2.2 Finding the right tag: Inspect element


Time to play

1.3: Tables from websites

https://en.wikipedia.org/wiki/List_of_sandwiches

table = pd.read_html("https://en.wikipedia.org/wiki/List_of_sandwiches",
                      header=0)[0]

URL

read_html returns a list, keep the first table

The first line is the header

Part 2: Dynamic websites

- 2.1: APIs/Javascript

- 2.2: Behaving like a human

- 2.3: Tapping into the API

- 2.3.1 Explicit APIs

- Part 4: "Hidden" APIs (advanced)

2.1 API/JavaScript

Source: https://about.gitlab.com/2016/06/03/ssg-overview-gitlab-pages-part-1-dynamic-x-static/

API

2.1 API/JavaScript

HTTP request

https://www.uvm.edu/directory

Part 4 (advanced)

2.2 Behaving like a human

selenium

Requirements (one of the below):

Firefox + geckodriver (https://github.com/mozilla/geckodriver/releases)
Chrome + chromedriver (https://sites.google.com/a/chromium.org/chromedriver/)

Some characteristics of HTML scraping with Selenium. It:

(a) can be slow
(b) can handle JavaScript
(c) gets HTML back after the JavaScript has been rendered
(d) can behave like a person

2.2 selenium

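from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a browser (assumes geckodriver is on your PATH)
driver = webdriver.Firefox()

# Get the xkcd website
driver.get("https://xkcd.com/")

# Find the title of the image
element = driver.find_element(By.XPATH, '//*[@id="comic"]/img')
element.get_attribute("title")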

Time to play

2.3: Tapping into the API

Source: https://about.gitlab.com/2016/06/03/ssg-overview-gitlab-pages-part-1-dynamic-x-static/

API

2.3 working with APIs

API

The company tells you the language of the server (the API)
You communicate with the server through HTTP requests
The company usually sets some restrictions
Example: Twitter API
https://developer.twitter.com/en/docs/tweets/search/overview

2.3.1 Explicit APIs

Practice:

Find a library: https://dev.twitter.com/resources/twitter-libraries
We'll use https://github.com/tweepy/tweepy

2.3.1 Explicit APIs

Theory:

Time to play

3: Ethics and legality

Is it legal?

Possibly... recent court cases have supported the accused, even when the Terms of Use disallowed data scraping
Personal data cannot be protected by copyright (but... consult a lawyer)
Read the Terms of Use of the website
When dealing with personal data make sure to comply with the EU General Data Protection Regulation (GDPR)
- Data processing is lawful, fair and transparent.
- For research: You are required to go through ethics review (IRB) before collecting the data.
Consult a lawyer in case of doubt

3: Ethics and legality

Is it ethical?

For personal use or research, it is typically ethical to scrape data (in my opinion)
Use rule #2: Don't be a dick:
- Use an API if one is provided, instead of scraping data
- Use a reasonable crawl rate (~2 requests per minute, or whatever robots.txt specifies) and scrape at night.
- Identify your web scraper (e.g. headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5); myproject.github.io; Javier Garcia-Bernardo (garcia@uva.nl)"})
- Ask for permission:
  - If the ToS or robots.txt disallows scraping
  - Before republishing the data
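The robots.txt and crawl-rate points above can be checked from Python with the standard library. This is a sketch under assumptions: the robots.txt content is made up, and the names USER_AGENT, allowed and polite_delay are hypothetical. For a real site you would call rp.set_url(...) and rp.read() instead of rp.parse(...).

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "myproject.github.io"  # hypothetical name of your scraper

# Made-up robots.txt content, parsed from a list of lines
robots_txt = """
User-agent: *
Crawl-delay: 30
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)


def allowed(url):
    """True if robots.txt lets our scraper fetch this URL."""
    return rp.can_fetch(USER_AGENT, url)


def polite_delay(default=30):
    """Seconds to wait between requests (default: ~2 requests per minute)."""
    delay = rp.crawl_delay(USER_AGENT)
    return delay if delay is not None else default
```

In a scraping loop you would skip every URL for which allowed(url) is False and call time.sleep(polite_delay()) between requests.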

Adapted from: https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/

Break

Advanced workshop

4. Tapping into "hidden" APIs

Advantage: Very fast and more reliable

4: Finding the API: HTTP Requests

HEADER

Usually not needed, but you should identify yourself in case the website owner wants to contact you
The user-agent is often used to restrict who can access the site
- You can change it in every request. Problem: ethics
Sometimes the website only works if you are redirected from another site (more on that later)

COOKIES

Usually not needed

PARAMETERS

Here is the "language" of the API; the field names are what we want here
Sometimes you just add the parameters in the URL:

https://www.uvm.edu/directory/api/query_results.php?name=john+smith&department=
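A sketch of how such parameters can be passed from Python, assuming the requests library is installed. The endpoint and field names come from the directory URL in this section; the values are only an example. The request is built but not sent, so you can inspect the URL that requests would ask for.

```python
import requests

# Endpoint and field names from the directory URL; values are an example
url = "https://www.uvm.edu/directory/api/query_results.php"
params = {"name": "john smith", "department": ""}

# Build the request without sending it
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)

# To actually send it (remember rule #2):
# response = requests.get(url, params=params)
```

Passing a dict through params= leaves the URL encoding (spaces, special characters) to the library instead of string concatenation.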

4: Create queries

Step 1: Find the cURL command

4: Create queries

Step 2: Create the Python requests: http://curl.trillworks.com/

Step 3: Get the data remembering rule #2

Session IDs

Easiest way to deal with session IDs: behave like a human >> use selenium

If that is too slow, use the "hidden API" method with variable parameters.

selenium

Detect session id from page

Again, use https://curl.trillworks.com/ and add SID as variable
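A sketch of the "detect the session ID, then add it as a variable" idea using only the standard library. Everything specific here is an assumption: the page source, the field name sid, and the example.org endpoint are made up, since every website names and places its session ID differently. In practice page_source would come from the downloaded page (e.g. the text of a response or the page source of the selenium driver).

```python
import re
from urllib.parse import urlencode

# Made-up page source containing a hidden session ID field
page_source = (
    '<form action="/search">'
    '<input type="hidden" name="sid" value="a1b2c3d4">'
    '</form>'
)

# Detect the session ID in the page
match = re.search(r'name="sid" value="([^"]+)"', page_source)
sid = match.group(1) if match else None

# Use the session ID as a variable parameter of the hidden-API request
params = {"sid": sid, "query": "john smith"}
query_url = "https://example.org/api/search?" + urlencode(params)
```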

5. Crawlers

5.1: CRON jobs

- The software utility cron is a time-based job scheduler in Unix-like computer operating systems.
- Easy, robust

https://stackoverflow.com/questions/21648410/write-python-script-that-is-executed-every-5-minutes

5.2: Processes

- All run within Python
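One possible reading of "all run within Python" is a long-running Python process that calls the scraping job at a fixed interval. The sketch below is an illustration under that assumption; the function name run_every and its arguments are hypothetical.

```python
import time


def run_every(job, interval_seconds, max_runs=None):
    """Call job() repeatedly, one run every interval_seconds.

    With max_runs=None the loop runs until the process is stopped.
    Returns the number of runs that were started.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        try:
            job()
        except Exception as error:
            # Keep the crawler alive if a single run fails
            print("Run failed:", error)
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Sleep for whatever is left of the interval
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval_seconds - elapsed))
    return runs
```

Compared with cron, this keeps everything in one script, but the script has to stay running and has to survive its own errors, which is why the job is wrapped in try/except.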

5.3: Scrapy

- An open source and collaborative framework for extracting the data you need from websites. In a fast, "simple", yet extensible way.

- Best solution, but steep learning curve

https://doc.scrapy.org/en/1.5/intro/overview.html

6. Advanced topics

6.1A: Proxies

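import logging
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

# Collect the proxies and only log critical errors
req_proxy = RequestProxy()
req_proxy.set_logger_level(logging.CRITICAL)

# Request a website through one of the proxies
link = "https://api.ipify.org"  # example URL (an assumption); replace with your target
r = req_proxy.generate_proxied_request(link)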

Library: http_request_randomizer
Uses public proxy websites:
- Many will be blocked already
Don't use it for evil (rule #2)

6.1B: Proxies

from tor_control import TorControl
import requests
tc = TorControl()
print(requests.get("https://api.ipify.org").text)
> 163.172.162.106

tc.renew_tor()
print(requests.get("https://api.ipify.org").text)
> 18.85.22.204

Use TOR
Instructions to configure it: https://github.com/jgarciab/tor
Don't use it for evil (rule #2)

6.2: Speed up requests

Use case: you want to collect information from many different websites.
Problem: requests is blocking (it waits until the website responds)
Solution: run many threads
- But this is not straightforward
- Best: grequests (asynchronous HTTP requests)

import grequests

urls = [
    'http://www.heroku.com','http://python-tablib.org', 'http://httpbin.org', 
    'http://python-requests.org', 'http://fakedomain/','http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)

grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
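For comparison, the "run many threads" option can be sketched with the standard library. This is not grequests: it swaps in concurrent.futures, and the fetch function is a stand-in that only sleeps, so the example runs without touching any website. The URLs are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a blocking download such as requests.get(url).text."""
    time.sleep(0.2)  # pretend the website takes 0.2 s to respond
    return "content of " + url


urls = ["http://example.org/page{}".format(i) for i in range(10)]

start = time.monotonic()
# Five worker threads wait for five websites at the same time
with ThreadPoolExecutor(max_workers=5) as pool:
    # map returns the results in the same order as urls
    pages = list(pool.map(fetch, urls))
elapsed = time.monotonic() - start
```

Ten downloads of 0.2 s each would take about 2 s one after another; with five threads they finish in roughly two rounds. Keep the number of workers per website low (rule #2).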

6.3: Dealing with selenium

We get a stale element exception when the element has been destroyed or hasn't been completely loaded. Possible solutions: refresh the website, or wait until the page loads

## Option1:
import selenium.common.exceptions
import selenium.webdriver
import selenium.webdriver.common.by
import selenium.webdriver.common.desired_capabilities
import selenium.webdriver.support.ui
from selenium.webdriver.support import expected_conditions

# Define a function to wait for an element to be present in the page
def _wait_for_element(xpath, wait):
    try:
        polling_f = expected_conditions.presence_of_element_located((selenium.webdriver.common.by.By.XPATH, xpath))
        elem = wait.until(polling_f)
    except selenium.common.exceptions.TimeoutException:
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" presence wait timeout.'.format(xpath))
    return elem

# Define a function to wait for an element to be clickable
def _wait_for_element_click(xpath, wait):
    try:
        polling_f = expected_conditions.element_to_be_clickable((selenium.webdriver.common.by.By.XPATH, xpath))
        elem = wait.until(polling_f)
    except selenium.common.exceptions.TimeoutException:
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" clickable wait timeout.'.format(xpath))
    return elem

#define short and long timeouts
wait_timeouts=(30, 180)

#open the driver (change the executable path to geckodriver_mac or geckodriver.exe)
driver = selenium.webdriver.Firefox(executable_path="./geckodriver")

#define short and long waits (for the times you have to wait for the page to load)
short_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[0], poll_frequency=0.05)
long_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[1], poll_frequency=1)



#And this is how you get an element
element = _wait_for_element('//*[@id="selFundID"]',short_wait)

6.3.1 Stale Element Exception

6.3: Dealing with selenium

Sometimes not all elements are loaded (e.g. by AJAX) and we need to wait. We could use time.sleep() but for how long? Response time can be highly unstable.

Alternative solution: Wait until a specific element is loaded on the page:

6.3.2 ElementNotVisibleException

More info at https://selenium-python.readthedocs.io/waits.html

6.3: Dealing with selenium

import time

# Scrolls until the page height stops growing (uses the global driver)
def scroll_down(SCROLL_PAUSE_TIME = 0.5):
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height: break
        last_height = new_height

6.3.3 scrolling down to load all site

6.3: Dealing with selenium

from selenium import webdriver

options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list":
           [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
           "download.default_directory": "./download_directory/" ,
           "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
   
driver = webdriver.Chrome("./chromedriver",chrome_options=options)

6.3.4 downloading files without asking

6.3: Dealing with selenium

def enable_download_in_headless_chrome(driver, download_dir):
    # Add missing support for the Chrome "send_command" to the selenium webdriver
    driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')

    # Tell Chrome to allow downloads and where to save them
    params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
    command_result = driver.execute("send_command", params)

[...]

options.add_experimental_option("prefs", profile)

options.add_argument("--headless")

driver = webdriver.Chrome("./chromedriver",chrome_options=options)
enable_download_in_headless_chrome(driver, "./download_directory/")

6.3.5 headless chrome (not opening more windows)

6.3: Dealing with selenium

#Click somewhere
driver.find_element_by_xpath("xxxx").click()

#Switch to the new window
driver.switch_to.window(driver.window_handles[1])

#Do whatever
driver.find_element_by_xpath('xxxxxx').click()

#Go back to the main window
driver.switch_to.window(driver.window_handles[0])

6.3.6 pop up windows (e.g. to log in)

6.4: Robust scraping

Don't make your scraper language dependent
Save the raw HTML
Use the drilldown method to identify/extract elements on the page
Avoid XPath (my opinion)
Track your progress, so that the scraper can crash and restart from where it left off
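A sketch of two of these points, saving the raw HTML and tracking progress, using only the standard library. The function name scrape_all, the done.json progress file and the hashed file names are hypothetical choices; download is any function that takes a URL and returns HTML text.

```python
import hashlib
import json
from pathlib import Path


def scrape_all(urls, download, out_dir):
    """Download every URL once, saving raw HTML and remembering progress."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    progress_file = out_dir / "done.json"

    # Load the URLs that were finished in a previous run, if any
    if progress_file.exists():
        done = set(json.loads(progress_file.read_text()))
    else:
        done = set()

    for url in urls:
        if url in done:
            continue  # already scraped before the crash or restart
        html = download(url)
        # Save the raw HTML so it can be re-parsed later without re-downloading
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        (out_dir / name).write_text(html, encoding="utf-8")
        # Record progress after every page
        done.add(url)
        progress_file.write_text(json.dumps(sorted(done)))
    return done
```

If the scraper crashes halfway, running the same call again skips everything listed in done.json and continues with the remaining URLs.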

Discussion time

Workshop on data scraping - Amsterdam CSS

By Javier GB
