Scraping and storing data using Amazon Web Services (AWS)

My Problem

Task:

Collect data from the WMATA API every 10 seconds, for a while... months

 

Requirements: 

  • don't burn out my laptop 
  • store data somewhere more accessible than my hard drive
  • ... but somewhere more suitable than GitHub
  • little impedance to my data science workflow
  • free
  • scalable


 

  • access the WMATA API with Python
  • parse the JSON and save it as .csv (sketched after the example output below)
  • ... all using my laptop
from python_wmata import Wmata

## connect to the WMATA API with your key
apikey = 'xxxxxxxxxyyyyyyyyzzzzzzz'
api = Wmata(apikey)

## get bus predictions for one stop
stopid = '1003043'
buspred = api.bus_prediction(stopid)

>>> buspred['StopName']
u'New Hampshire Ave + 7th St'

>>> buspred['Predictions'][1]
{u'TripID': u'6783517', u'VehicleID': u'6493', u'DirectionNum': u'1', u'RouteID': u'64', u'DirectionText': u'South to Federal Triangle', u'Minutes': 10}
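
The bullets above mention parsing the JSON and saving it as .csv; the slides don't show that step, so here is a minimal sketch using Python's csv module. The output filename, the pipe delimiter, and the column choices are my own assumptions, not from the original.

import csv
import os
import datetime

## append the predictions for one stop to a pipe-delimited .csv
## (bus64_predictions.csv is a made-up filename)
outfile = 'bus64_predictions.csv'
fields = ['time', 'StopName', 'RouteID', 'DirectionText', 'VehicleID', 'TripID', 'Minutes']
write_header = not os.path.exists(outfile)

with open(outfile, 'ab') as f:
    writer = csv.DictWriter(f, fieldnames=fields, delimiter='|', extrasaction='ignore')
    if write_header:
        writer.writeheader()
    for pred in buspred['Predictions']:
        row = dict(pred)                            ## one prediction per row
        row['time'] = str(datetime.datetime.now())  ## timestamp the observation
        row['StopName'] = buspred['StopName']
        writer.writerow(row)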


Solution 1.0: laptop

Solution 1.1: Shiny Server
 

  • runs 24/7
  • re-code API calls in R
  • use Shiny Server to collect and store data
  • write to .csv or database
  • feels like a hack
  • R only
  • works on the ERI Shiny Server, but not on shinyapps.io


Solution 1.better: AWS
 

  • runs 24/7 on an EC2 Linux instance and writes to SimpleDB
  • doesn't burn up or crash my laptop
  • can run multiple scrapers on the same instance
  • can be more dangerous ... in a good way
  • scales
  • maintain one version of code that runs on EC2 and locally
  • configure the EC2 instance with any software or packages
  • free, for a while
  • some setup overhead


## Python code deployed to EC2 to collect and store WMATA bus data 

import datetime
import time
import boto.sdb
from pytz import timezone
from python_wmata import Wmata

def buspred2simpledb(wmata, stopID, dom, freq=10, mins=10, wordy=True):
    ## poll the WMATA API every `freq` seconds for `mins` minutes
    stime = datetime.datetime.now(timezone('EST'))
    while datetime.datetime.now(timezone('EST')) < stime + datetime.timedelta(minutes=mins):
        time.sleep(freq)
        try:
            buspred = wmata.bus_prediction(stopID)
            npred = len(buspred['Predictions'])
            if npred > 0:
                for i in range(npred):
                    thetime = str(datetime.datetime.now(timezone('EST')))
                    items = buspred['Predictions'][i]
                    items.update({'time': thetime})
                    items.update({'stop': stopID})
                    itemname = items['VehicleID'] + '_' + str(items['Minutes']) + '_' + thetime
                    dom.put_attributes(itemname, items)  ## actually writing to the SimpleDB domain
                    if wordy: print items
        except Exception as e:
            print [str(datetime.datetime.now()), 'some error...', str(e)]

## establish connection to WMATA API
api = Wmata('xxxyyyzzzmyWMATAkey')

## establish connection to AWS SimpleDB
conn = boto.sdb.connect_to_region(
    'us-east-1',
    aws_access_key_id='xxxyyyzzzzTHISisYOURaccessKEY',
    aws_secret_access_key='xxxyyyzzzzTHISisYOURsecretKEY'
)

## connect to simpleDB domain wmata2
mydom = conn.get_domain('wmata2') 

## Letting it rip...scrape
buspred2simpledb(api, stopID='1003043', dom=mydom, freq=5, mins=60*24, wordy=True)


Solution 1.better: AWS
 

What is AWS

SimpleDB

...a lot of things


What is AWS

Elastic Compute Cloud (EC2)

  • where you run your code
  • pre-configured instances (virtual machines) 
  • Amazon Linux, Ubuntu, Windows, ...
  • free tier for 1 year


What is AWS

Elastic Compute Cloud (EC2)

  • customize with software and packages (Python, R, etc.)
  • save the image configuration & stop the instance when you're not using it (see the sketch below)
  • run RStudio Server or other web-based software from your browser
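
A quick aside on the "stop the instance when you're not using it" point: it can also be done programmatically. Below is a minimal sketch using the boto package's EC2 API; this is my own illustration rather than a step from the original slides, and the region, keys, and the instance ID 'i-xxxxxxxx' are placeholders.

import boto.ec2

## connect to EC2 in your region (keys are placeholders)
conn = boto.ec2.connect_to_region(
    'us-east-1',
    aws_access_key_id='xxxyyyzzzzTHISisYOURaccessKEY',
    aws_secret_access_key='xxxyyyzzzzTHISisYOURsecretKEY'
)

## stop an instance when you're done, start it again when you need it
conn.stop_instances(instance_ids=['i-xxxxxxxx'])
conn.start_instances(instance_ids=['i-xxxxxxxx'])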


What is AWS

Simple Storage Service (S3)

  • file system where you store stuff
  • like Dropbox... actually, Dropbox itself stores files on S3
  • this presentation is stored on S3
  • files are accessible from a URL on the web
  • 5 GB free
## move data from AWS S3 to laptop using AWS CLI
aws s3 cp s3://yourbucket/scraped_data.csv C:/Users/abrooks/Documents/scraped_data.csv

## read data from file in R
df <- read.csv('http://wmatatest.s3-website-us-east-1.amazonaws.com/bus64_17Sep2014.txt', sep='|')

## an image from this presentation hosted on S3
https://s3.amazonaws.com/media-p.slid.es/uploads/ajb073/images/793870/awsservices.png
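
The block above uses the AWS CLI and a public URL; the same files can also be moved with the boto package from Python. This is a minimal sketch under my own assumptions (bucket name, key names, and credentials are placeholders), not a step from the original slides.

import boto

## connect to S3 (keys are placeholders)
conn = boto.connect_s3(
    aws_access_key_id='xxxyyyzzzzTHISisYOURaccessKEY',
    aws_secret_access_key='xxxyyyzzzzTHISisYOURsecretKEY'
)
bucket = conn.get_bucket('yourbucket')

## download an object from the bucket to the local machine
key = bucket.get_key('scraped_data.csv')
key.get_contents_to_filename('scraped_data.csv')

## upload a local file to the bucket
newkey = bucket.new_key('bus64_17Sep2014.txt')
newkey.set_contents_from_filename('bus64_17Sep2014.txt')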


What is AWS

SimpleDB

  • lightweight non-relational database
  • SQL < SimpleDB < NoSQL
  • flexible schema
  • automatically indexes your data
  • "SQLish" queries from the API (see the query sketch below)
  • ... made easy with packages in Python (boto) and R
  • ... even a Chrome extension, sdbNavigator
  • easy to set up, easy to write into
  • working with the API without a package is nontrivial
  • extracting data wholesale into a table-like object is clunky
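
To illustrate the "SQLish" queries bullet, here is a minimal sketch of a filtered select with boto against the wmata2 domain used elsewhere in these slides; the keys are placeholders and the WHERE clause is just an example of mine.

import boto.sdb

## connect to SimpleDB and open the domain (keys are placeholders)
conn = boto.sdb.connect_to_region(
    'us-east-1',
    aws_access_key_id='xxxyyyzzzzTHISisYOURaccessKEY',
    aws_secret_access_key='xxxyyyzzzzTHISisYOURsecretKEY'
)
dom = conn.get_domain('wmata2')

## "SQLish" query: only predictions for route 64 at one stop
query = "select * from `wmata2` where RouteID = '64' and stop = '1003043'"
for item in dom.select(query):
    print item['time'], item['Minutes']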

 

 


What is AWS

Relational Database Service (RDS)

  • traditional relational databases in the cloud: MySQL, Oracle, Microsoft SQL Server, or PostgreSQL
  • tested:
    • creating a SQL Server instance: easy
    • connecting to it through SQL Server Management Studio: easy
    • inserting and extracting data: hard
    • API calls for data: not possible
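
The slides tested RDS through SQL Server Management Studio; as a rough sketch only, here is how one might connect to a SQL Server RDS instance from Python with pyodbc instead. The driver name, endpoint, database, table, and credentials are all placeholders I made up, and none of this comes from the original slides.

import pyodbc

## connect to a SQL Server RDS instance (everything below is a placeholder)
conn = pyodbc.connect(
    'DRIVER={SQL Server Native Client 11.0};'
    'SERVER=yourinstance.xxxxxxxx.us-east-1.rds.amazonaws.com,1433;'
    'DATABASE=yourdb;UID=youruser;PWD=yourpassword'
)
cur = conn.cursor()

## insert a row and read a few back from a hypothetical table
cur.execute("insert into bus_predictions (RouteID, Minutes) values (?, ?)", ('64', 10))
conn.commit()
cur.execute("select top 5 * from bus_predictions")
for row in cur.fetchall():
    print row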


Good for

standard data science configurations

 

 

  • useful for communicating our tech specs to clients
  • clients can use config files for EC2 instances to outfit the same system in their environment
  • OR use pre-configured EC2 instances 
    • smoother model deployment
    • scale up resources on demand
      • host websites, web apps, software
      • schedule refreshes and data extracts
    • less time waiting for IT to approve software


 

Good for

personal projects

  • EC2 and S3 take the load off your laptop
    • run your code overnight, over days, weeks, ...
    • you can make mistakes
      • just start another instance
      • won't take down the company server
      • learn more
    • store your collected data publicly on S3 (or not)
  • relatively simple set-up for the ~intermediate techie
  • scale up projects as you need more resources


 

How to do it

  • set up an AWS account
    • free, but you need a credit card
    • you can use your regular amazon.com account
  • start with EC2
  • then S3
  • download (if on Windows)
    • PuTTY:
      • PuTTYgen to generate keys
      • PuTTY to connect to the instance
      • PSCP to move files from your laptop to EC2
    • Amazon EC2 CLI tools
      • make it easy to talk to AWS from your laptop

first things first


 

How to do it

EC2: getting started


 

How to do it

EC2: move code and data 

  • develop and test code locally
  • push to EC2 with PuTTY PSCP
  • move output from EC2 to laptop
## move your python script from laptop to EC2.
## directories with spaces need to be quoted with ""
## at the Windows command prompt...

pscp -r -i C:\yourfolder\yourAWSkey.ppk  C:\yourfolder\yourscript.py ec2-user@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:/home/ec2-user/yourscript.py




 

How to do it

EC2: using screen

  • run jobs in the background after you close the Linux shell
  • use multiple shell windows from a single SSH session
  • print status noisily to the console from your Python scripts and monitor each one in its own screen
  • I found screen easier than nohup
## in your EC2 Linux instance...

## create a new screen and run script
screen
python noisy_script1.py

## press ctrl+a, then d, to detach from the screen

## list screens
screen -list

## get back into a screen by its ID (from screen -list)
screen -r 1001






 

How to do it

EC2: configure instance with Python packages

  • Python comes pre-loaded on the EC2 Linux instance
  • install packages at the Linux command line
  • OR, if you frequently spin up new instances, save the configuration as a .sh file and use it to configure them
## at the EC2 Linux command line...

## get pip
wget https://bootstrap.pypa.io/get-pip.py

## run pip install script
sudo python get-pip.py

## install pytz package
sudo pip install pytz


 

How to do it

EC2: configure instance with your favorite tools

# this tutorial shows how to set up RStudio on an Ubuntu server
# http://randyzwitch.com/r-amazon-ec2/
# at the EC2 Ubuntu instance command line...

#Create a user, home directory and set password
sudo useradd rstudio
sudo mkdir /home/rstudio
sudo passwd rstudio
sudo chmod -R 0777 /home/rstudio
 
#Update all files from the default state
sudo apt-get update
sudo apt-get upgrade
 
#Add CRAN mirror to custom sources.list file using vi
sudo vi /etc/apt/sources.list.d/sources.list
 
#Add the following line (or your favorite CRAN mirror)
#press i to enter insert mode so vi lets you type
#press Esc, then ZZ, to save the file and exit
deb http://cran.rstudio.com/bin/linux/ubuntu precise/
 
#Update files to use CRAN mirror
#Don't worry about error message
sudo apt-get update
 
#Install latest version of R
#Install without verification
sudo apt-get install r-base
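#The linked tutorial continues from here by downloading and installing
#RStudio Server itself; those remaining steps aren't reproduced on this slide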


 

How to do it

S3: getting started


 

How to do it

S3: read data directly from a URL

## R code

df <- read.csv('http://wmatatest.s3-website-us-east-1.amazonaws.com/bus64_17Sep2014.txt', sep='|')
head(df)

                        time Minutes VehicleID             DirectionText RouteID  TripID
1 2014-09-17 08:30:43.537000       7      2145 South to Federal Triangle      64 6783581
2 2014-09-17 08:30:43.537000      20      7209 South to Federal Triangle      64 6783582


 

How to do it

S3: move files to/from S3 and EC2/laptop

### at the Windows command prompt (or the EC2 instance command line)
aws s3 cp mtcars.csv s3://yourbucket/mtcars.csv

### R code to set your AWS keys on the RStudio Ubuntu instance so system() calls to the AWS CLI work from within R
# only need to do this once
Sys.setenv(AWS_ACCESS_KEY_ID = 'xxXXyyYYzzZZYourAWSaccessKEY')
Sys.setenv(AWS_SECRET_ACCESS_KEY = 'xxxxYYYYYYYZZZZZZourAWSprivateKEY')

# move data to S3 storage
system('aws s3 cp mtcars.csv s3://wmatatest/mtcars.csv --region us-east-1')


 

How to do it

S3: access your data/files from R 

  • connecting directly to S3 from R is less well supported
    • RAmazonS3 - shaky, sparse documentation
    • RS3 - only works on Linux 
    • s3.r - took some finagling: reads data in as a single string that then needs to be parsed
  • ... but copying files (data) from S3 to your local computer is easy... though that somewhat defeats the purpose of S3
  • you can also access data directly from its URL
  • let me know if you find better methods!


 

How to do it

SimpleDB: getting started


 

How to do it

SimpleDB: access with Python

  • install the boto Python package
  • follow the boto SimpleDB tutorial
    • create domains (tables)
    • add items (attributes/rows/records)
    • retrieve data
  • easy to write into
  • clunky to extract from in bulk (see the pandas sketch after the code below)
import boto.sdb

conn = boto.sdb.connect_to_region(
    'us-east-1',
    aws_access_key_id='xxxxyyyzzzzYourAWSaccessKEY',
    aws_secret_access_key='xxxxyyyyzzzzYourAWSsecretKEY'
)

## establish connection to domain
dom = conn.get_domain('wmata2')

## select all items/rows from domain (table)
query = 'select * from `wmata2`'
rs = dom.select(query)

## rs is a lazy result set: you have to iterate over every item to extract values, which is slow (about 45 seconds for 27,000 items)
mins = []
for i in rs: mins.append(i['Minutes'])
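
Since bulk extraction is the clunky part, here is one workaround of mine for getting the select results into a table-like object, assuming pandas is installed on the instance; each item returned by dom.select() behaves like a dict, so the whole result set can be collected into a DataFrame. Every item is still fetched one by one, so it stays slow for big domains.

import pandas as pd

## collect the query results into a DataFrame
rs = dom.select('select * from `wmata2`')
df = pd.DataFrame([dict(item) for item in rs])

## SimpleDB stores everything as strings, so convert Minutes back to numbers
df['Minutes'] = df['Minutes'].astype(int)
print df.head()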


 

How to do it

SimpleDB: access with R 

  • R support is limited
  • you could access the API directly without a package... a non-trivial task


 

Documentation

  • the official Amazon documentation is comprehensive and helpful
  • Stack Overflow and blogs found through Googling were useful for filling the gaps in the documentation specific to my setup and machine

 

  • HTML slides made with the reveal.js online editor slides.com ... coincidentally also hosted on AWS S3 

Web Scraping and Persisting Data in the Cloud

By ajb073

presented at tech time @ ERI November 7, 2014
