Task:
Collect data from the WMATA API every 10 seconds for a long stretch... months
Requirements:
problem | solution | what | why | how
## query the WMATA API for arrival predictions at one bus stop
from python_wmata import Wmata
apikey = 'xxxxxxxxxyyyyyyyyzzzzzzz'
api = Wmata(apikey)
stopid = '1003043'
buspred = api.bus_prediction(stopid)
>>> buspred['StopName']
u'New Hampshire Ave + 7th St'
>>> buspred['Predictions'][1]
{u'TripID': u'6783517', u'VehicleID': u'6493', u'DirectionNum': u'1', u'RouteID': u'64', u'DirectionText': u'South to Federal Triangle', u'Minutes': 10}
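The response is a plain dictionary, so the fields of interest can be pulled out with ordinary Python. A minimal sketch, assuming the response has the shape shown above: `sample_response` is a hand-copied stand-in for a live API call (it holds only the one prediction printed above), and `summarize_predictions` is a hypothetical helper, not part of python_wmata.

```python
# Stand-in for api.bus_prediction(stopid); values copied from the output above.
sample_response = {
    'StopName': 'New Hampshire Ave + 7th St',
    'Predictions': [
        {'TripID': '6783517', 'VehicleID': '6493', 'DirectionNum': '1',
         'RouteID': '64', 'DirectionText': 'South to Federal Triangle',
         'Minutes': 10},
    ],
}


def summarize_predictions(response):
    """Return one (route, vehicle, minutes) tuple per predicted arrival."""
    return [(p['RouteID'], p['VehicleID'], p['Minutes'])
            for p in response.get('Predictions', [])]


for route, vehicle, minutes in summarize_predictions(sample_response):
    print('Route %s, bus %s: %d min' % (route, vehicle, minutes))
```

Using `.get('Predictions', [])` means a stop with no buses on the way yields an empty list rather than an error.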
laptop | shiny | aws
runs 24/7 on EC2 Linux instance and writes to SimpleDB
doesn't burn up or crash my laptop
can run multiple scrapers using the same instance
can be more dangerous ... in a good way
scale
maintain one version of code that runs on EC2 and locally
configure EC2 instance w/ any software or packages
free, for a while
overhead
## Python code deployed to EC2 to collect and store WMATA bus data
import datetime
import time
import boto.sdb
from pytz import timezone
from python_wmata import Wmata

## note: 'EST' is a fixed UTC-5 offset with no daylight saving; use 'US/Eastern' for local DC time
TZ = timezone('EST')

def buspred2simpledb(wmata, stopID, dom, freq=10, mins=10, wordy=True):
    ## poll the stop every freq seconds for mins minutes, writing each prediction to dom
    stime = datetime.datetime.now(TZ)
    while datetime.datetime.now(TZ) < stime + datetime.timedelta(minutes=mins):
        time.sleep(freq)
        try:
            buspred = wmata.bus_prediction(stopID)
            for items in buspred['Predictions']:
                thetime = str(datetime.datetime.now(TZ))
                items.update({'time': thetime, 'stop': stopID})
                itemname = items['VehicleID'] + '_' + str(items['Minutes']) + '_' + thetime
                dom.put_attributes(itemname, items)  ## actually writing to simpledb domain
                if wordy:
                    print(items)
        except Exception as e:
            ## log the failure and keep scraping
            print([str(datetime.datetime.now()), 'some error...', str(e)])

## establish connection to WMATA API
api = Wmata('xxxyyyzzzmyWMATAkey')
## establish connection to AWS SimpleDB
conn = boto.sdb.connect_to_region(
    'us-east-1',
    aws_access_key_id='xxxyyyzzzzTHISisYOURaccessKEY',
    aws_secret_access_key='xxxyyyzzzzTHISisYOURsecretKEY'
)
## connect to simpleDB domain wmata2
mydom = conn.get_domain('wmata2')
## Letting it rip...scrape
buspred2simpledb(api, stopID='1003043', dom=mydom, freq=5, mins=60*24, wordy=True)
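The core of the scraper is a poll-until-deadline loop, and it can be tried out without touching WMATA or AWS. A rough sketch, not the deployed code: `poll`, `fetch`, `store`, `clock` and `sleep` are hypothetical names introduced here so the API call, the database write and the timer can each be swapped for a stand-in.

```python
import time


def poll(fetch, store, freq=10, mins=10, clock=time.time, sleep=time.sleep):
    """Call fetch() every `freq` seconds for `mins` minutes.

    Each record returned by fetch() is passed to store(). An error in one
    round is recorded and the loop carries on with the next round.
    Returns (number_of_records_stored, list_of_error_messages).
    """
    deadline = clock() + mins * 60
    stored, errors = 0, []
    while clock() < deadline:
        sleep(freq)
        try:
            for record in fetch():
                store(record)
                stored += 1
        except Exception as exc:  # keep scraping after a failed round
            errors.append(str(exc))
    return stored, errors


# Dry run with a fake clock, so nothing actually sleeps.
now = [0.0]


def fake_sleep(seconds):
    now[0] += seconds


rows = []
stored, errors = poll(lambda: [{'Minutes': 7}], rows.append,
                      freq=5, mins=1, clock=lambda: now[0], sleep=fake_sleep)
```

In a real run, `fetch` would wrap `api.bus_prediction(stopID)['Predictions']` and `store` would wrap `dom.put_attributes`; passing them in as arguments is just one way to keep the loop testable on a laptop.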
Simple DB
...a lot of things
everything | EC2 | S3 | SimpleDB | RDS
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
## move data from AWS S3 to laptop using AWS CLI
aws s3 cp s3://yourbucket/scraped_data.csv C:/Users/abrooks/Documents/scraped_data.csv
## read data from file in R
df <- read.csv('http://wmatatest.s3-website-us-east-1.amazonaws.com/bus64_17Sep2014.txt', sep='|')
## an image from this presentation hosted on S3
https://s3.amazonaws.com/media-p.slid.es/uploads/ajb073/images/793870/awsservices.png
SimpleDB
Relational Database Service (RDS)
easy
easy
hard
not possible
standard data science configurations
standardization & deployment | personal projects
personal projects
first things first
start | EC2 | S3 | SimpleDB
EC2: getting started
EC2: move code and data
## move your python script from laptop to EC2.
## directories with spaces need to be quoted with ""
## at the Windows command prompt...
pscp -r -i C:\yourfolder\yourAWSkey.ppk C:\yourfolder\yourscript.py ec2-user@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:/home/ec2-user/yourscript.py
EC2: using screen
## in your EC2 Linux instance...
## create a new screen and run script
screen
python noisy_script1.py
## press ctrl+a, then d, to detach and leave the script running in the screen
## list screens
screen -list
## get back into a screen (1001 is the session id shown by screen -list)
screen -r 1001
EC2: configure instance with Python packages
## at the EC2 Linux command line...
## get pip
wget https://bootstrap.pypa.io/get-pip.py
## run pip install script
sudo python get-pip.py
## install pytz package
sudo pip install pytz
EC2: configure instance with your favorite tools
# this tutorial shows how to install RStudio on an Ubuntu server
# http://randyzwitch.com/r-amazon-ec2/
# at the EC2 Ubuntu instance command line...
#Create a user, home directory and set password
sudo useradd rstudio
sudo mkdir /home/rstudio
sudo passwd rstudio
sudo chmod -R 0777 /home/rstudio
#Update all files from the default state
sudo apt-get update
sudo apt-get upgrade
#Add CRAN mirror to custom sources.list file using vi
sudo vi /etc/apt/sources.list.d/sources.list
#Add following line (or your favorite CRAN mirror).
# press i to enter insert mode so vi lets you type
# press esc, then type ZZ to save the file and exit
deb http://cran.rstudio.com/bin/linux/ubuntu precise/
#Update files to use CRAN mirror
#Don't worry about error message
sudo apt-get update
#Install latest version of R
#Install without verification
sudo apt-get install r-base
S3: getting started
## R code
df <- read.csv('http://wmatatest.s3-website-us-east-1.amazonaws.com/bus64_17Sep2014.txt', sep='|')
head(df)
                        time Minutes VehicleID             DirectionText RouteID  TripID
1 2014-09-17 08:30:43.537000       7      2145 South to Federal Triangle      64 6783581
2 2014-09-17 08:30:43.537000      20      7209 South to Federal Triangle      64 6783582
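For anyone working in Python rather than R, the same pipe-delimited layout can be read with the standard csv module. A sketch only: `SAMPLE` reproduces the two rows printed by head(df) instead of downloading the file, the presence of a header row and the column order are assumptions, and `read_predictions` is a hypothetical helper.

```python
import csv

# Two rows copied from the head(df) output above, laid out with '|' separators.
SAMPLE = (
    "time|Minutes|VehicleID|DirectionText|RouteID|TripID\n"
    "2014-09-17 08:30:43.537000|7|2145|South to Federal Triangle|64|6783581\n"
    "2014-09-17 08:30:43.537000|20|7209|South to Federal Triangle|64|6783582\n"
)


def read_predictions(text):
    """Parse pipe-delimited text into a list of dicts, with Minutes as int."""
    rows = []
    for row in csv.DictReader(text.splitlines(), delimiter='|'):
        row['Minutes'] = int(row['Minutes'])
        rows.append(row)
    return rows


rows = read_predictions(SAMPLE)
print([r['Minutes'] for r in rows])
```

Every other column stays a string here; converting IDs or timestamps is left to whatever analysis comes next.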
S3: move files to/from S3 and EC2/laptop
### at the Windows command prompt (or EC2 instance command line)
aws s3 cp mtcars.csv s3://yourbucket/mtcars.csv
### R code to configure RStudio Ubuntu instance to enable system calls from within R.
# only need to do this once
Sys.setenv(AWS_ACCESS_KEY_ID = 'xxXXyyYYzzZZYourAWSaccessKEY')
Sys.setenv(AWS_SECRET_ACCESS_KEY = 'xxxxYYYYYYYZZZZZZourAWSprivateKEY')
# move data to S3 storage
system('aws s3 cp mtcars.csv s3://wmatatest/mtcars.csv --region us-east-1')
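The same shell-out trick works from Python. A minimal sketch, assuming the AWS CLI is installed and the two AWS key environment variables are already set; `s3_cp_command` and `s3_cp` are hypothetical helper names, and only the command builder is exercised here so nothing is uploaded.

```python
import subprocess


def s3_cp_command(src, dst, region=None):
    """Build the argument list for `aws s3 cp SRC DST [--region REGION]`."""
    cmd = ['aws', 's3', 'cp', src, dst]
    if region:
        cmd += ['--region', region]
    return cmd


def s3_cp(src, dst, region=None):
    """Run the copy; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.check_call(s3_cp_command(src, dst, region))


# Show the command that would run, without running it.
print(' '.join(s3_cp_command('mtcars.csv', 's3://wmatatest/mtcars.csv',
                             'us-east-1')))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.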
S3: access your data/files from R
SimpleDB: getting started
SimpleDB : access with Python
import boto.sdb
conn = boto.sdb.connect_to_region(
    'us-east-1',
    aws_access_key_id='xxxxyyyzzzzYourAWSaccessKEY',
    aws_secret_access_key='xxxxyyyyzzzzYourAWSsecretKEY'
)
## establish connection to domain
dom = conn.get_domain('wmata2')
## select all items/rows from domain (table)
query = 'select * from `wmata2`'
rs = dom.select(query)
## rs is a cursor: items are fetched as you iterate, so extracting every element is slow (45 seconds for 27,000 items)
mins = [i['Minutes'] for i in rs]
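SimpleDB stores every attribute value as a string, so numeric fields such as Minutes need converting once the items are pulled down. A sketch of one way to reshape the results into columns in a single pass: `items_to_columns` is a hypothetical helper, plain dicts stand in for the items returned by `dom.select`, and it assumes every item carries the same attributes.

```python
def items_to_columns(items, int_fields=('Minutes',)):
    """Turn an iterable of dict-like items into a dict of column lists.

    Values of the attributes named in `int_fields` are converted to int;
    everything else is kept as returned.
    """
    columns = {}
    for item in items:
        for name, value in item.items():
            if name in int_fields:
                value = int(value)
            columns.setdefault(name, []).append(value)
    return columns


# Stand-ins for SimpleDB items; a real run would pass `rs` instead.
fake_items = [
    {'Minutes': '7', 'VehicleID': '2145', 'RouteID': '64'},
    {'Minutes': '20', 'VehicleID': '7209', 'RouteID': '64'},
]
cols = items_to_columns(fake_items)
print(cols['Minutes'])
```

Collecting all columns in one pass matters because each pass over `rs` repeats the slow fetch.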
SimpleDB: access with R