Anna Keuchenius
PhD candidate Sociology
@AKeuchenius
Javier Garcia-Bernardo
PhD candidate Political Science
@JavierGB_com
Computational Social Science Amsterdam
CSSamsterdam.github.io
https://CSSamsterdam.github.io/
https://www.boulderhumane.org/animals/adoption/dogs
Drawing from: http://www.scriptingmaster.com/html/basic-structure-HTML-document.asp
The html tag tells the browser that the document is an HTML page
head: metadata, style, title, ...
body: content
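To make the structure concrete, here is a minimal sketch (not from the slides) that parses a tiny HTML document with BeautifulSoup (introduced a bit further below) and pulls something out of the head and the body:

from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>My first website</title></head>
  <body><p>Hello, content!</p></body>
</html>
"""

# Parse the document
soup = BeautifulSoup(html_doc, "html.parser")
# The title lives in the head
print(soup.head.title.text)   # My first website
# The content lives in the body
print(soup.body.p.text)       # Hello, content!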
Tag list:
Attribute names:
Step 1: Download the data --> library requests
Step 2: Parse the data --> library BeautifulSoup
More info at http://docs.python-requests.org
Let's get the intro from
CSSamsterdam.github.io
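A minimal sketch of those two steps applied to CSSamsterdam.github.io (the tag holding the intro is an assumption; inspect the page and adjust the selector):

import requests
from bs4 import BeautifulSoup

# Step 1: download the page
r = requests.get("https://CSSamsterdam.github.io/")

# Step 2: parse the HTML
soup = BeautifulSoup(r.text, "html.parser")

# Assumption: the intro is the first paragraph on the page
intro = soup.find("p")
print(intro.text)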
https://en.wikipedia.org/wiki/List_of_sandwiches
import pandas as pd

# read_html downloads the URL and returns a list of DataFrames;
# header=0 tells pandas that the first line is the header,
# and [0] keeps the first table
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_sandwiches",
                     header=0)[0]
- 2.1: APIs/Javascript
- 2.2: Behaving like a human
- 2.3: Tapping into the API
- 2.3.1: Explicit APIs
- Part 4: "Hidden" APIs (advanced)
Source: https://about.gitlab.com/2016/06/03/ssg-overview-gitlab-pages-part-1-dynamic-x-static/
[Diagram: HTTP requests flowing from the browser to the server, with an API behind dynamic pages]
https://www.uvm.edu/directory
Part 4 (advanced)
selenium
Requirements (one of the below):
Some characteristics of HTML scraping with Selenium:
# Open a driver (see the setup later in these slides)
from selenium import webdriver
driver = webdriver.Firefox(executable_path="./geckodriver")

# Get the xkcd website
driver.get("https://xkcd.com/")
# Find the title of the image
element = driver.find_element_by_xpath('//*[@id="comic"]/img')
element.get_attribute("title")
[The dynamic/static diagram again, with the API highlighted (source: https://about.gitlab.com/2016/06/03/ssg-overview-gitlab-pages-part-1-dynamic-x-static/)]
Practice:
Theory:
Is it legal?
Is it ethical?
Adapted from: https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/
Advantages: much faster and more reliable
Usually not needed
https://www.uvm.edu/directory/api/query_results.php?name=john+smith&department=
Step 1: Find the cURL command
Step 2: Create the Python request with http://curl.trillworks.com/
Step 3: Get the data, remembering rule #2
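A minimal sketch of these steps against the directory endpoint above (the parameter names come from the captured URL; the response format is an assumption, so inspect it before parsing):

import requests

params = {"name": "john smith", "department": ""}
r = requests.get("https://www.uvm.edu/directory/api/query_results.php",
                 params=params)
print(r.status_code)
print(r.text[:500])   # look at the raw response before deciding how to parse it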
Easiest way to deal with session IDs: behave like a human >> use selenium
If that is too slow, use the 'hidden API' method with variable parameters.
selenium
Again, use https://curl.trillworks.com/ and add the SID as a variable (a sketch follows below)
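A minimal sketch of the second approach. The cookie name "SID", the endpoint and the parameter name are hypothetical placeholders; read the real ones off the cURL command you copied from your browser:

import requests

session = requests.Session()                 # a Session keeps cookies between requests
session.get("https://www.example.com/")      # the first visit sets the session cookie
sid = session.cookies.get("SID")             # hypothetical cookie name

r = session.get("https://www.example.com/api/data",   # hypothetical hidden endpoint
                params={"sid": sid})                   # hypothetical parameter name
print(r.status_code)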
- The software utility cron is a time-based job scheduler in Unix-like computer operating systems.
- Easy, robust
https://stackoverflow.com/questions/21648410/write-python-script-that-is-executed-every-5-minutes
- Everything runs within Python
- An open source and collaborative framework for extracting the data you need from websites. In a fast, "simple", yet extensible way.
- Best solution, but steep learning curve (a minimal spider is sketched below)
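A minimal scrapy sketch (not from the slides), reusing the sandwich list from before; the table selector is an assumption, so check it against the page. Save it as sandwich_spider.py and run it with: scrapy runspider sandwich_spider.py -o sandwiches.json

import scrapy

class SandwichSpider(scrapy.Spider):
    name = "sandwiches"
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sandwiches"]

    def parse(self, response):
        # Assumption: the sandwiches are listed in tables with class "wikitable"
        for row in response.css("table.wikitable tr"):
            name = row.css("td a::text").extract_first()
            if name:
                yield {"name": name}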
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy
import logging

# Collect the proxies and log only critical errors
req_proxy = RequestProxy()
req_proxy.set_logger_level(logging.CRITICAL)

# Request a website (replace link with the URL you want to scrape)
link = "https://example.com"
r = req_proxy.generate_proxied_request(link)
Library: http_request_randomizer
Uses public proxy websites:
Many will be blocked already
Don't use it for evil (rule #2)
from tor_control import TorControl
import requests
tc = TorControl()
print(requests.get("https://api.ipify.org").text)
> 163.172.162.106
tc.renew_tor()
print(requests.get("https://api.ipify.org").text)
> 18.85.22.204
Use TOR
Instructions to configure it: https://github.com/jgarciab/tor
Don't use it for evil (rule #2)
Use case: collect info from many different websites.
Problem: requests is blocking (it waits until the website responds)
Solution: run many threads
But that is not straightforward
Best option: grequests (asynchronous HTTP requests)
import grequests

urls = [
    'http://www.heroku.com', 'http://python-tablib.org', 'http://httpbin.org',
    'http://python-requests.org', 'http://fakedomain/', 'http://kennethreitz.com'
]
# Build the (unsent) requests
rs = (grequests.get(u) for u in urls)
# Send them all concurrently; unreachable sites come back as None
grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
We get this error when the element has been destroyed or hasn't been completely loaded. Possible solutions: refresh the website, or wait until the page loads.
## Option 1:
import selenium.common.exceptions
import selenium.webdriver
import selenium.webdriver.common.by
import selenium.webdriver.support.ui
from selenium.webdriver.support import expected_conditions

# Define a function to wait for an element to load
def _wait_for_element(xpath, wait):
    try:
        polling_f = expected_conditions.presence_of_element_located((selenium.webdriver.common.by.By.XPATH, xpath))
        elem = wait.until(polling_f)
    except:
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" presence wait timeout.'.format(xpath))
    return elem

# The same, but wait until the element is clickable
def _wait_for_element_click(xpath, wait):
    try:
        polling_f = expected_conditions.element_to_be_clickable((selenium.webdriver.common.by.By.XPATH, xpath))
        elem = wait.until(polling_f)
    except:
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" presence wait timeout.'.format(xpath))
    return elem

# Define short and long timeouts
wait_timeouts = (30, 180)

# Open the driver (change the executable path to geckodriver_mac or geckodriver.exe)
driver = selenium.webdriver.Firefox(executable_path="./geckodriver")

# Define short and long waits (for the times you have to wait for the page to load)
short_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[0], poll_frequency=0.05)
long_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[1], poll_frequency=1)

# And this is how you get an element
element = _wait_for_element('//*[@id="selFundID"]', short_wait)
Sometimes not all elements are loaded (e.g. by AJAX) and we need to wait. We could use time.sleep() but for how long? Response time can be highly unstable.
Alternative solution: Wait until a specific element is loaded on the page:
More info at https://selenium-python.readthedocs.io/waits.html
import time

def scroll_down(SCROLL_PAUSE_TIME=0.5):
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
from selenium import webdriver

options = webdriver.ChromeOptions()
profile = {
    # Disable Chrome's PDF Viewer so PDFs are downloaded instead of opened
    "plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
    "download.default_directory": "./download_directory/",
    "download.extensions_to_open": "applications/pdf",
}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome("./chromedriver", chrome_options=options)
def enable_download_in_headless_chrome(driver, download_dir):
    # Add missing support for chrome "send_command" to selenium webdriver
    driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
    params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
    command_result = driver.execute("send_command", params)

[...]
options.add_experimental_option("prefs", profile)
options.add_argument("--headless")
driver = webdriver.Chrome("./chromedriver", chrome_options=options)
enable_download_in_headless_chrome(driver, "./download_directory/")
#Click somewhere
driver.find_element_by_xpath("xxxx").click()
#Switch to the new window
driver.switch_to_window(driver.window_handles[1])
#Do whatever
driver.find_element_by_xpath('xxxxxx').click()
#Go back to the main window
driver.switch_to_window(driver.window_handles[0])