Scraping and storing data using Amazon Web Services (AWS)

My Problem

Task:

Collect data from WMATA API every 10 seconds for a while... months

 

Requirements: 

  • don't burn out laptop 
  • store data somewhere more accessible than hard drive
  • ... but somewhere more suitable than Github
  • little impedance to data science workflow
  • free
  • scalable

problem | solution | what | why | how 

 

  • access WMATA API with Python
  • parse JSON and save as .csv
  • ... all using my laptop
from python_wmata import Wmata
apikey = 'xxxxxxxxxyyyyyyyyzzzzzzz'
api = Wmata(apikey)
stopid = '1003043'
buspred = api.bus_prediction(stopid)

>>> buspred['StopName']
u'New Hampshire Ave + 7th St'

>>> buspred['Predictions'][1]
{u'TripID': u'6783517', u'VehicleID': u'6493', u'DirectionNum': u'1', u'RouteID': u'64', u'DirectionText': u'South to Federal Triangle', u'Minutes': 10}
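
The "parse JSON and save as .csv" step could look roughly like the sketch below. It is written for Python 3 and assumes the response has the shape shown above (a `StopName` plus a list of `Predictions`). The helper name `predictions_to_csv` and the column order are illustrative choices, not part of the python_wmata package.

```python
import csv
import io

# Column order is an illustrative choice; the keys match the response shown above.
FIELDS = ['RouteID', 'DirectionText', 'DirectionNum', 'Minutes', 'VehicleID', 'TripID']

def predictions_to_csv(buspred, out):
    """Write one CSV row per prediction to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=['StopName'] + FIELDS, lineterminator='\n')
    writer.writeheader()
    for pred in buspred['Predictions']:
        # Missing keys become empty cells instead of raising KeyError.
        row = {name: pred.get(name, '') for name in FIELDS}
        row['StopName'] = buspred['StopName']
        writer.writerow(row)

# Hypothetical response with the same shape as the one above.
sample = {
    'StopName': 'New Hampshire Ave + 7th St',
    'Predictions': [
        {'TripID': '6783517', 'VehicleID': '6493', 'DirectionNum': '1',
         'RouteID': '64', 'DirectionText': 'South to Federal Triangle', 'Minutes': 10},
    ],
}

buffer = io.StringIO()
predictions_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open('bus64.csv', 'w', newline='')` in place of the `StringIO` buffer (the file name is a placeholder).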


laptop | shiny | aws

Solution 1.0: laptop

Solution 1.1: shiny server
 

  • runs 24/7
  • re-code API calls in R
  • use Shiny Server to collect and store data
  • write to .csv or database
  • feels like a hack
  • only R
  • works on ERI Shiny server, but not shinyapps.io


Solution 1.better: AWS
 

  • runs 24/7 on EC2 Linux instance and writes to SimpleDB

  • doesn't burn up or crash my laptop

  • can run multiple scrapers using the same instance

  • can be more dangerous ... in a good way

  • scale

  • maintain one version of code that runs on EC2 and locally

  • configure EC2 instance w/ any software or packages

  • free, for a while

  • overhead


## Python code deployed to EC2 to collect and store WMATA bus data 

import datetime
import time
import boto.sdb
from pytz import timezone
from python_wmata import Wmata

def buspred2simpledb(wmata, stopID, dom, freq=10, mins=10, wordy=True):
    stime = datetime.datetime.now(timezone('EST'))
    while datetime.datetime.now(timezone('EST')) < stime + datetime.timedelta(minutes=mins):
        time.sleep(freq)            
        try:
            buspred = wmata.bus_prediction(stopID) 
            npred = len(buspred['Predictions'])
            if npred>0:
                for i in range(0,npred):
                    thetime = str(datetime.datetime.now(timezone('EST')))
                    items = buspred['Predictions'][i]
                    items.update({'time':thetime})
                    items.update({'stop':stopID})
                    itemname = items['VehicleID'] + '_' + str(items['Minutes']) + '_' + thetime
                    dom.put_attributes(itemname, items) ## actually writing to simpledb domain
                    if wordy: print(items)
        except Exception as err:
            ## log the failure and keep polling
            print([str(datetime.datetime.now()), 'some error...', str(err)])

## establish connection to WMATA API
api = Wmata('xxxyyyzzzmyWMATAkey')

## establish connection to AWS SimpleDB
conn = boto.sdb.connect_to_region(
    'us-east-1',
    aws_access_key_id='xxxyyyzzzzTHISisYOURaccessKEY',
    aws_secret_access_key='xxxyyyzzzzTHISisYOURsecretKEY'
)

## connect to simpleDB domain wmata2
mydom = conn.get_domain('wmata2') 

## Letting it rip...scrape
buspred2simpledb(api, stopID='1003043', dom=mydom, freq=5, mins=60*24, wordy=True)
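
A variation on the scraper above, sketched for Python 3 so the loop can be exercised without the WMATA API, SimpleDB, or real waiting. The names `build_items` and `poll` and the injected `fetch`, `store`, `clock` and `sleep` arguments are hypothetical, not part of boto or python_wmata. Unlike the original, this version fetches once before the first sleep.

```python
import datetime

def build_items(buspred, stop_id, now):
    """Return (item_name, attributes) pairs, one per prediction."""
    stamp = str(now)
    pairs = []
    for pred in buspred.get('Predictions', []):
        attrs = dict(pred)  # copy so the API response is not mutated
        attrs.update({'time': stamp, 'stop': stop_id})
        name = '_'.join([str(attrs['VehicleID']), str(attrs['Minutes']), stamp])
        pairs.append((name, attrs))
    return pairs

def poll(fetch, store, stop_id, freq, mins, clock, sleep):
    """Call fetch(stop_id) every `freq` seconds for `mins` minutes.

    Returns the number of items handed to store(name, attrs).
    """
    deadline = clock() + datetime.timedelta(minutes=mins)
    stored = 0
    while clock() < deadline:
        try:
            for name, attrs in build_items(fetch(stop_id), stop_id, clock()):
                store(name, attrs)
                stored += 1
        except Exception as err:  # keep polling after a failed request
            print([str(clock()), 'some error...', str(err)])
        sleep(freq)
    return stored

# Stand-ins for testing: a fake clock that only advances when "sleeping".
fake_now = [datetime.datetime(2014, 9, 17, 8, 30, 0)]

def fake_clock():
    return fake_now[0]

def fake_sleep(seconds):
    fake_now[0] += datetime.timedelta(seconds=seconds)

def fake_fetch(stop_id):
    return {'Predictions': [{'VehicleID': '6493', 'Minutes': 10, 'RouteID': '64'}]}

saved = {}
count = poll(fake_fetch, saved.__setitem__, '1003043',
             freq=10, mins=1, clock=fake_clock, sleep=fake_sleep)
print(count)
```

On EC2 the stand-ins would be replaced with `wmata.bus_prediction`, `dom.put_attributes`, a clock such as `lambda: datetime.datetime.now(timezone('EST'))`, and `time.sleep`.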


Solution 1.better: AWS
 

What is AWS

Simple DB

...a lot of things


everything | EC2 | S3 | SimpleDB | RDS

What is AWS

Elastic Compute Cloud (EC2)

  • where you run your code
  • pre-configured instances (virtual machines) 
  • Ubuntu, Linux, Windows, ...
  • 1 year free


What is AWS

Elastic Compute Cloud (EC2)

  • customize with software and packages (Python, R, etc)
  • save image configuration & stop when not using it
  • run RStudio Server or other web-based software from a browser


What is AWS

Simple Storage Service (S3)

  • file system where you store stuff
  • like dropbox... actually it is
  • this presentation stored on S3
  • files accessible from a URL on the web
  • 5GB free
## move data from AWS S3 to laptop using AWS CLI
aws s3 cp s3://yourbucket/scraped_data.csv C:/Users/abrooks/Documents/scraped_data.csv

## read data from file in R
df <- read.csv('http://wmatatest.s3-website-us-east-1.amazonaws.com/bus64_17Sep2014.txt', sep='|')

## an image from this presentation hosted on S3
https://s3.amazonaws.com/media-p.slid.es/uploads/ajb073/images/793870/awsservices.png
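
The two URL shapes used above can be assembled from a bucket name and an object key. This is a Python 3 sketch: the function names are made up for illustration, and the hostname patterns are copied from the two links above, so other regions or bucket settings may need different hostnames. The object also has to be publicly readable for the link to work.

```python
from urllib.parse import quote

def s3_path_style_url(bucket, key):
    """Build a path-style URL like the image link above."""
    return 'https://s3.amazonaws.com/{}/{}'.format(bucket, quote(key))

def s3_website_url(bucket, key, region='us-east-1'):
    """Build a static-website URL like the one passed to read.csv above."""
    return 'http://{}.s3-website-{}.amazonaws.com/{}'.format(bucket, region, quote(key))

image_url = s3_path_style_url('media-p.slid.es',
                              'uploads/ajb073/images/793870/awsservices.png')
data_url = s3_website_url('wmatatest', 'bus64_17Sep2014.txt')
print(image_url)
print(data_url)
```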


What is AWS

SimpleDB

  • lightweight non-relational database
  • SQL < SimpleDB < NoSQL
  • flexible schema
  • automatically indexes
  • "SQLish" queries from API
  • ... made easy with packages in Python (boto) and R
  • ... even Chrome sdbNavigator extension
  • easy to set up, easy to write into
  • working with the API without a package is nontrivial
  • extracting data wholesale into a table-like object is clunky
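
A "SQLish" select expression can be built as a plain string. The sketch below quotes names with backticks and values with single quotes, doubling any quote character that appears inside. The helper names and the example filter are hypothetical, and since SimpleDB stores every value as a string, comparisons are string comparisons.

```python
def quote_name(name):
    """Quote a domain or attribute name with backticks, doubling any backtick inside it."""
    return '`' + name.replace('`', '``') + '`'

def quote_value(value):
    """Quote an attribute value with single quotes, doubling any single quote inside it."""
    return "'" + str(value).replace("'", "''") + "'"

def build_select(domain, equals=None, limit=None):
    """Build a select expression with optional equality filters and a limit."""
    query = 'select * from ' + quote_name(domain)
    if equals:
        # Sort the filters so the generated string is deterministic.
        conditions = [quote_name(attr) + ' = ' + quote_value(val)
                      for attr, val in sorted(equals.items())]
        query += ' where ' + ' and '.join(conditions)
    if limit is not None:
        query += ' limit ' + str(int(limit))
    return query

query = build_select('wmata2', equals={'RouteID': '64', 'stop': '1003043'}, limit=100)
print(query)
```

The resulting string can be passed to a client library's select call, for example `dom.select(query)` in boto.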

 

 


What is AWS

Relational Database Service (RDS)

  • traditional relational databases in the cloud:  MySQL, Oracle, Microsoft SQL Server, or PostgreSQL
  • tested:
    • Creating SQL Server instance: easy
    • Connecting through SQL Server Management Studio: easy
    • Inserting and extracting data: hard
    • API calls for data: not possible


Good for

standard data science configurations

 

 

  • useful for communicating our tech specs to clients
  • clients can use config files for EC2 instances to outfit the same system in their environment
  • OR use pre-configured EC2 instances 
    • smoother model deployment
    • scale up resources on demand
      • host websites, web apps, software
      • schedule refreshes and data extracts
    • less time waiting for IT to approve software


standardization & deployment | personal projects

 

Good for

personal projects

  • EC2 and S3 take the load off your laptop
    • run your code overnight, over days, weeks, ...
    • you can make mistakes
      • just start another instance
      • won't take down the company server
      • learn more
    • store your collected data publicly on S3 (or not)
  • relatively simple set-up for the ~intermediate techie
  • scale up projects as you need more resources


 

How to do it

  • set up an AWS account
    • free, but need credit card
    • can use your regular amazon.com account
  • start with EC2
  • then S3
  • download (if on Windows)
    • PuTTY:
      • PuTTYgen to generate keys
      • PuTTY to connect to the instance
      • PSCP to move files from laptop to EC2
    • Amazon EC2 CLI tools
      • makes communication easy from laptop to AWS

first things first


start | EC2 | S3 | SimpleDB  

 

How to do it

EC2: getting started


 

How to do it

EC2: move code and data 

  • develop and test code locally
  • push to EC2 with PuTTY PSCP
  • move output from EC2 to laptop
## move your python script from laptop to EC2.
## directories with spaces need to be quoted with ""
## at the Windows command prompt...

pscp -r -i C:\yourfolder\yourAWSkey.ppk  C:\yourfolder\yourscript.py ec2-user@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:/home/ec2-user/yourscript.py




 

How to do it

EC2: using screen

  • run jobs in background after closing Linux shell
  • use multiple shell windows from a single SSH session
  • print status noisily to the console from each Python script and monitor each job in its own screen
  • I found screen easier than nohup
## in your EC2 Linux instance...

## create a new screen and run script
screen
python noisy_script1.py

## press ctrl+a, then d, to detach from the screen

## list screens
screen -list

## reattach to a screen using its id from the list
screen -r 1001






 

How to do it

EC2: configure instance with Python packages

  • Python comes loaded on EC2 Linux instance
  • install packages at the Linux command line
  • OR save configuration as a .sh file and use to configure new instances if you frequently spin up new instances 
## at the EC2 Linux command line...

## get pip
wget https://bootstrap.pypa.io/get-pip.py

## run pip install script
sudo python get-pip.py

## install pytz package
sudo pip install pytz
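
Before launching a long-running scraper on a fresh instance, it can help to check that the third-party packages are importable. This is a minimal Python 3 sketch: `missing_packages` is a hypothetical helper and it only handles top-level package names.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# pytz and boto are the third-party packages the scraper imports.
required = ['pytz', 'boto']
missing = missing_packages(required)
if missing:
    print('install first: sudo pip install ' + ' '.join(missing))
else:
    print('all packages available')
```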


 

How to do it

EC2: configure instance with your favorite tools

# this tutorial shows how to install RStudio on an Ubuntu server
# http://randyzwitch.com/r-amazon-ec2/
# at the EC2 Ubuntu instance command line...

#Create a user, home directory and set password
sudo useradd rstudio
sudo mkdir /home/rstudio
sudo passwd rstudio
sudo chmod -R 0777 /home/rstudio
 
#Update all files from the default state
sudo apt-get update
sudo apt-get upgrade
 
#Add CRAN mirror to custom sources.list file using vi
sudo vi /etc/apt/sources.list.d/sources.list
 
#Add following line (or your favorite CRAN mirror). 
# press i to enter insert mode so vi lets you type
# esc + ZZ to exit and save file
deb http://cran.rstudio.com/bin/linux/ubuntu precise/
 
#Update files to use CRAN mirror
#Don't worry about error message
sudo apt-get update
 
#Install latest version of R
#Install without verification
sudo apt-get install r-base


 

How to do it

S3: getting started


 

How to do it

## R code

df <- read.csv('http://wmatatest.s3-website-us-east-1.amazonaws.com/bus64_17Sep2014.txt', sep='|')
head(df)

                        time Minutes VehicleID             DirectionText RouteID  TripID
1 2014-09-17 08:30:43.537000       7      2145 South to Federal Triangle      64 6783581
2 2014-09-17 08:30:43.537000      20      7209 South to Federal Triangle      64 6783582
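
A rough Python 3 counterpart to the `read.csv(..., sep='|')` call, using only the standard library. The sample text is a hypothetical excerpt: it assumes the file has a header row and `|` separators, with the columns shown in the `head(df)` output above.

```python
import csv
import io

# Hypothetical excerpt of the pipe-delimited file (layout is an assumption).
sample_text = (
    'time|Minutes|VehicleID|DirectionText|RouteID|TripID\n'
    '2014-09-17 08:30:43.537000|7|2145|South to Federal Triangle|64|6783581\n'
    '2014-09-17 08:30:43.537000|20|7209|South to Federal Triangle|64|6783582\n'
)

def read_predictions(handle):
    """Parse pipe-delimited rows into dicts, converting Minutes to int."""
    rows = []
    for row in csv.DictReader(handle, delimiter='|'):
        row['Minutes'] = int(row['Minutes'])
        rows.append(row)
    return rows

rows = read_predictions(io.StringIO(sample_text))
print(rows[0]['VehicleID'], rows[0]['Minutes'])
```

For the real file, the `StringIO` object would be replaced by a file handle or by the decoded response from the S3 URL.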


 

How to do it

S3: move files to/from S3 and EC2/laptop

### at the windows command prompt (or EC2 instance command line)
aws s3 cp mtcars.csv s3://yourbucket/mtcars.csv

### R code to configure RStudio Ubuntu instance to enable system calls from within R.
# only need to do this once
Sys.setenv(AWS_ACCESS_KEY_ID = 'xxXXyyYYzzZZYourAWSaccessKEY')
Sys.setenv(AWS_SECRET_ACCESS_KEY = 'xxxxYYYYYYYZZZZZZourAWSprivateKEY')

# move data to S3 storage
system('aws s3 cp mtcars.csv s3://wmatatest/mtcars.csv --region us-east-1')
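
The same copy can be driven from Python. This sketch only wraps the `aws s3 cp` command and the two environment variables shown above. The function names are hypothetical, and `copy_to_s3` needs the AWS CLI installed and on the PATH, so it is defined here but not called.

```python
import os
import subprocess

def s3_copy_command(source, destination, region='us-east-1'):
    """Return the argument list for `aws s3 cp`, mirroring the command above."""
    return ['aws', 's3', 'cp', source, destination, '--region', region]

def aws_environment(access_key, secret_key, base=None):
    """Return a copy of the environment with the two AWS credential variables set."""
    env = dict(os.environ if base is None else base)
    env['AWS_ACCESS_KEY_ID'] = access_key
    env['AWS_SECRET_ACCESS_KEY'] = secret_key
    return env

def copy_to_s3(source, destination, access_key, secret_key):
    """Run the copy; raises CalledProcessError if the CLI reports failure."""
    subprocess.check_call(s3_copy_command(source, destination),
                          env=aws_environment(access_key, secret_key))

command = s3_copy_command('mtcars.csv', 's3://wmatatest/mtcars.csv')
print(' '.join(command))
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.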


 

How to do it

S3: access your data/files from R 

  • connecting directly to S3 from R is less supported
    • RAmazonS3 - shaky, sparse documentation
    • RS3 - only works on Linux 
    • s3.r - took some finagling: reads data in as a single string that then needs parsing
  • ... but copying files (data) from S3 to your local computer is easy... but kind of defeats the purpose of S3
  • ability to access data directly from URL
  • let me know if you find better methods!


 

How to do it

SimpleDB: getting started


 

How to do it

SimpleDB : access with Python

  • install boto Python package
  • boto simpledb tutorial
    • create domains (tables)
    • add items (attributes/rows/records)
    • retrieving data
  • easy to write into
  • clunky to extract from in bulk
import boto.sdb

conn = boto.sdb.connect_to_region(
    'us-east-1',
    aws_access_key_id='xxxxyyyzzzzYourAWSaccessKEY',
    aws_secret_access_key='xxxxyyyyzzzzYourAWSsecretKEY'
)

## establish connection to domain
dom = conn.get_domain('wmata2')

## select all items/rows from domain (table)
query = 'select * from `wmata2`'
rs = dom.select(query)

## rs is a cursor.  need to iterate over every element to extract, which is slow.  45 seconds for 27,000 items.
mins = []
for i in rs: mins.append(i['Minutes'])
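
Since every pass over the result set is slow, one option is to collect all the columns of interest in a single pass. This is a sketch under the assumption that each item behaves like a dict. `items_to_columns` and the stand-in result set are hypothetical.

```python
def items_to_columns(items, columns):
    """Collect an iterable of dict-like items into one list per column.

    Missing attributes become None, since items in a SimpleDB domain
    need not share the same attributes.
    """
    table = {name: [] for name in columns}
    for item in items:
        for name in columns:
            table[name].append(item.get(name))
    return table

# Stand-in for the result set `rs`; real items would come from dom.select(query).
fake_rs = iter([
    {'VehicleID': '2145', 'Minutes': '7', 'RouteID': '64'},
    {'VehicleID': '7209', 'Minutes': '20'},
])

table = items_to_columns(fake_rs, ['VehicleID', 'Minutes', 'RouteID'])
mins = [int(m) for m in table['Minutes']]
print(mins)
```

The resulting dict of equal-length lists is a shape that table libraries generally accept as input.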


 

How to do it

SimpleDB: access with R 

  • R support is limited:
  • Could access API directly without package... a non-trivial task


 

Documentation

  • official Amazon documentation is comprehensive and helpful
  • Stack Overflow and blogs found through Googling helped fill gaps in the documentation specific to me and my computer

 

  • HTML slides made with the reveal.js online editor at slides.com ... coincidentally also hosted on AWS S3