Scraping with Selenium

KnoxPy

April 5, 2018

Gavin Wiggins

gavinw.me

Requests

Beautiful Soup

Pandas

NLTK

Selenium

Selenium WebDriver is a collection of bindings to drive a browser

  • Operates a web browser natively just like a user would
  • Language bindings available for Java, C#, Ruby, Python, JavaScript

Selenium Grid runs tests on many servers at the same time

Selenium IDE is a Firefox add-on to record and play back tests

Selenium Remote Control is a client/server system to control web browsers locally or remotely

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# go to the google home page
driver.get("http://www.google.com")

# the page is ajaxy so the title is originally this:
print(driver.title)

# find the element whose name attribute is q (the google search box)
# (find_element_by_name was removed in Selenium 4; use find_element with By.NAME)
inputElement = driver.find_element(By.NAME, "q")

# type in the search
inputElement.send_keys("cheese!")

# submit the form (although google automatically searches now without submitting)
inputElement.submit()

try:
    # we have to wait for the page to refresh, the last thing that seems to be updated is the title
    WebDriverWait(driver, 10).until(EC.title_contains("cheese!"))

    # You should see "cheese! - Google Search"
    print(driver.title)

finally:
    driver.quit()

What else can we do with Selenium?

Scraping the CodeStock WebStock site

Must log in to view submissions
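A minimal sketch of how a login step might look with Selenium. The helper name `log_in`, the login URL, and the field names "username" and "password" are all hypothetical, not taken from the real site; inspect the actual login page to find the right locators. The locator strings are written out so the helper works with any object that offers the WebDriver `get` and `find_element` methods.

```python
# Locator strategy string; this is the value of selenium's By.NAME.
BY_NAME = "name"


def log_in(driver, login_url, username, password):
    """Open a login page, fill in the form, and submit it.

    Assumes the form has inputs named "username" and "password"
    (hypothetical names chosen for illustration).
    """
    driver.get(login_url)
    driver.find_element(BY_NAME, "username").send_keys(username)
    password_box = driver.find_element(BY_NAME, "password")
    password_box.send_keys(password)
    # submit() sends the form that contains this element
    password_box.submit()


# Usage (needs selenium and a browser driver installed):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   log_in(driver, "https://example.com/login", "me", "secret")
```

After logging in, the same `driver` keeps the session, so later `driver.get(...)` calls can reach pages that require authentication.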

Submissions page

Click "more" button to view full abstract.
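Clicking each "more" button and then reading the expanded text could be sketched like this. The function name `scrape_abstracts` and the CSS selectors "button.more" and "div.abstract" are assumptions for illustration; the real page will use its own markup. The sketch also assumes a click expands the abstract in place without reloading the page.

```python
# Locator strategy string; this is the value of selenium's By.CSS_SELECTOR.
BY_CSS = "css selector"


def scrape_abstracts(driver):
    """Expand every abstract, then return the text of each one.

    The selectors below are hypothetical placeholders.
    """
    # Click all "more" buttons so the full abstracts are in the page
    for button in driver.find_elements(BY_CSS, "button.more"):
        button.click()

    # Read the text only after everything has been expanded
    return [element.text for element in driver.find_elements(BY_CSS, "div.abstract")]


# Usage (needs selenium and a browser driver installed):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("https://example.com/submissions")
#   abstracts = scrape_abstracts(driver)
```

If the page loads content slowly, a `WebDriverWait` before collecting the text would likely be needed.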

Video of scraping abstracts

Demo...

Summary

Submissions

  • Number of submissions = 370
  • Max submissions per speaker = 15 
  • Most popular track = Developer
  • Most common key words = Azure, .NET, ASP.NET, Angular, and SQL

Lineup

  • Number of accepted talks = 89
  • Max talks per speaker = 2
  • Most popular track = ?
  • Most common key words = .NET, C#, SQL, Elm, and ASP.NET
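Numbers like these can be computed from scraped records with `collections.Counter`. The sketch below uses made-up records and a hypothetical `summarize` helper; it is not the code or data behind the figures above. Keywords are counted with a plain whitespace split on titles rather than NLTK, and the list of records is assumed to be non-empty.

```python
from collections import Counter

# Made-up records for illustration only; not the real CodeStock data.
submissions = [
    {"speaker": "A", "track": "Developer", "title": "Intro to Azure Functions"},
    {"speaker": "A", "track": "Developer", "title": "ASP.NET Core Tips"},
    {"speaker": "B", "track": "Data", "title": "SQL Performance Tuning"},
]


def summarize(records):
    """Return simple counts for a non-empty list of submission records."""
    speakers = Counter(r["speaker"] for r in records)
    tracks = Counter(r["track"] for r in records)
    # Naive keyword count: lowercase words from the titles
    words = Counter(w.lower() for r in records for w in r["title"].split())
    return {
        "count": len(records),
        "max_per_speaker": max(speakers.values()),
        "top_track": tracks.most_common(1)[0][0],
        "top_words": [word for word, _ in words.most_common(5)],
    }


stats = summarize(submissions)
print(stats["count"], stats["max_per_speaker"], stats["top_track"])
```

A real keyword count would probably also drop stop words such as "to" and "the" before ranking.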

CodeStock is still WebStock

Use the Selenium web driver and Python to scrape data from websites.