Make life less miserable

Sort of...

Ben

@QUASIBEN

CONTINUUM ANALYTICS

Misery...

Converting Data

Moving Data

Converting then Moving Data

Data Analytics Entangled Bank

And Growing...

"Pretty Much my life..."

--D. Rodriguez

STOP THE INSANITY!!!

Actually... lots of good tools

Still -- There are problems

CSV->Python

import csv

f = open("file.csv")
f_csv = csv.reader(f)

header = next(f_csv)     # skip the header row
for line in f_csv:
    munge(line)          # clean/transform each row
get_money()

Data->DataFrame

df = pd.read_csv("/path/to/file.csv")
Input

pd.read_clipboard
pd.read_csv
pd.read_excel
pd.read_fwf
pd.read_gbq
pd.read_hdf
pd.read_html
pd.read_json
pd.read_msgpack
pd.read_pickle
pd.read_sql
pd.read_sql_query
pd.read_sql_table
pd.read_stata
pd.read_table

Output

df.to_clipboard
df.to_csv
df.to_dense
df.to_dict
df.to_excel
df.to_gbq
df.to_hdf
df.to_html
df.to_json
df.to_latex
df.to_msgpack
df.to_panel
df.to_period
df.to_pickle
df.to_records
df.to_sparse
df.to_sql
df.to_stata
df.to_string
df.to_wide

Data->DataFrame

How about a pile of CSVs?

import pandas as pd
from glob import glob

df_list = []
csv_list = glob("iris*.csv")
for csv in csv_list:
    df_list.append(pd.read_csv(csv))
df = pd.concat(df_list)

CSV->DataFrame

from odo import odo
import pandas as pd
from glob import glob

df = odo("*.csv", pd.DataFrame)

# the hand-rolled equivalent
def read_csv_dir(pattern):
    df_list = []
    csv_list = glob(pattern)
    for csv in csv_list:
        df_list.append(pd.read_csv(csv))
    return pd.concat(df_list)

Codify Function in a Framework

Let's Talk about Databases

DBA

MS SQL

MySQL

PostgreSQL

MongoDB

Oracle

Can be Crazy?

Done everything?

But you're not a DBA...

CSV->DB

conn = get_connect()
cursor = conn.cursor()
sql = "CREATE TABLE iris(s_length FLOAT, s_width FLOAT, ..., species TEXT);"
cursor.execute(sql)

data = read_csv(...)
sql = "INSERT INTO iris(s_length, s_width, ..., species) " \
      "VALUES(?, ?, ..., ?)"
cursor.executemany(sql, data)
conn.commit()

conn = get_connect()

df = pd.read_csv(...)
df.to_sql(name=tbl_name, con=conn, ...)

## pile of csvs

df = odo("*.csv", pd.DataFrame)
df.to_sql(...)

Wait, what?!

CSV->DB

What is `to_sql` doing?

Type Conversion

sqlalchemy.types

Table Building

Columns and Types

Chunked Insert
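Those steps map onto arguments you can see on the call itself; a minimal sketch (the engine URI, table name, and dtype override below are made up for illustration):

import pandas as pd
import sqlalchemy
from sqlalchemy.types import Float, Text

engine = sqlalchemy.create_engine("postgresql://localhost/demo")  # hypothetical database
df = pd.read_csv("iris.csv")

# pandas maps column dtypes to sqlalchemy.types, builds the table,
# then inserts the rows in chunks
df.to_sql("iris", engine, if_exists="replace", chunksize=10000,
          dtype={"species": Text(), "s_length": Float()})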

CSV->DB

I knew a DBA -- she said,

"Bulk Loading is a Thing"

COPY FROM # PostgreSQL

LOAD DATA # MySQL

BULK INSERT # MS SQL

BCP

BULK COLLECT # ORACLE

.import # SQLITE
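Outside of a framework you can drive one of these yourself; a minimal sketch using PostgreSQL's COPY through psycopg2 (the connection string is a placeholder and the iris table is assumed to already exist):

import psycopg2

conn = psycopg2.connect("dbname=demo")   # placeholder connection string
cur = conn.cursor()
with open("iris.csv") as f:
    # stream the file straight into the existing iris table
    cur.copy_expert("COPY iris FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.commit()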

CSV->DB

COPY {tblname} FROM '{abspath}'
    (FORMAT {format_str},
     DELIMITER E'{delimiter}',
     NULL '{na_value}',
     QUOTE '{quotechar}',
     ESCAPE '{escapechar}',
     HEADER {header},
     ENCODING '{encoding}');

odo("iris.csv", "sqlite:///iris.db")
odo("iris.csv", "postgresql://localhost::iris")

Odo: A Framework for leveraging other people's misery...

If you let people go to crazytown, they will every time

odo("*.csv", db://hostname::tablename)
odo("iris.json", mongodb://hostname/db::collection")
odo("*.json", mongodb://hostname/db::collection")

## Full URI
sqldb://username:password@hostname:port/database::tablename
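Filled in with concrete (made-up) values, that schema plugs straight into an odo call:

from odo import odo

# hypothetical credentials, host, database, and table name
odo("iris.csv", "postgresql://analyst:secret@localhost:5432/warehouse::iris")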

That looks familiar...

# scp filename username@hostname:/path/to/file
# paramiko is an SSH client in Python
import paramiko

ssh = paramiko.SSHClient()
ssh.connect(...)
ftp = ssh.open_sftp()
ftp.put(local_file, remote_file)

odo('iris.csv', 'ssh://hostname:myfile.csv',
     username='user', key_filename='.ssh/id_rsa', port=22)

Let's Talk about HDFS/Cloud

DE?

HDFS

S3

Redshift

Hive

Impala

Data Engineer?

Emoji Algebra is Weird

Cloud/HDFS++ <-> Data

 

oDO("IRIS.CSV", "HDFS://HDFS_IP:/TMP/IRIS/IRIS.CSV", ...)

from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host=host, port=str(port),
                       user_name=user)

hdfs.read_file(...)
hdfs.append_file(...)

Cloud/HDFS++ <-> Data

 

odo("iris.csv", "s3://bucket/iris.csv"...) <->

odo("iris.csv", "hdfs://HDFS_IP:/tmp/iris/iris.csv", ...) <->

odo("hdfs://.../iris/", "hive://database-name::table-name", ...)

odo("iris.csv", "redshift://user:pass@host:port/database::table", ...) <->?

Let's Talk about Binary Stores

Binary Stores

HDF5

Portable

tabular

compressible

Chunked

Columnar

Bcolz

Parquet

pytables

h5py
HDFStore

No Var Len Str

No Python missing data

some rough edges

kind of new

Binary Stores <-> Data

ODO("TRIP.CSV, "TRIP.BCOLZ" )

odo(hdfstore://...)

ODO(PYTABLES://...)

ODO(h5py://...)
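For comparison, a minimal sketch of writing and reading the same kind of table by hand with pandas' HDFStore (the file and key names here are made up):

import pandas as pd

df = pd.read_csv("trip.csv")

# HDFStore: chunked, compressible HDF5 storage, keyed like a dict
with pd.HDFStore("trip.h5", complib="blosc", complevel=9) as store:
    store.put("trip", df, format="table")   # "table" format supports appends and queries

df2 = pd.read_hdf("trip.h5", "trip")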

Does this work?

odo('s3://mybucket/path/to/data.json',
    'postgresql://user:passwd@localhost:port/db::data')

Yes!

Uh, How?

Outlinks

Inlinks

Network solves conversions...
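Conceptually, each format is a node and each small conversion function is an edge; odo then walks a cheap path through that network. A toy illustration of the idea with networkx (not odo's actual code; the formats and costs are invented):

import networkx as nx

g = nx.DiGraph()
# each edge is one registered conversion with a relative cost
g.add_edge("csv", "DataFrame", cost=1.0)
g.add_edge("DataFrame", "sql", cost=1.0)
g.add_edge("csv", "sql", cost=10.0)   # a direct edge exists but is "expensive"

path = nx.shortest_path(g, "csv", "sql", weight="cost")
print(path)   # ['csv', 'DataFrame', 'sql']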

URL -> Data

http://raw.githubusercontent.com...../file.csv

Load to Pandas

Load to HDFS

....

 

Not a new idea:
import pandas.io.data as web
web.DataReader("IBM", 'yahoo', start, end)
web.DataReader("IBM", 'google', start, end)

$ FRED
$ Yahoo! Option
$ Fama/French
...

Four Functions

discover - Return the DataShape of an object

convert - Convert data to a new type

append - Append data onto an existing data source

resource - Identify data by a string URI

drop - Delete a resource (optional)
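All of these are importable straight from odo; a quick sketch of using them directly (assuming an iris.csv sitting on disk):

from odo import resource, discover, convert
import pandas as pd

csv = resource("iris.csv")        # identify the data from its path/URI
print(discover(csv))              # its DataShape, e.g. var * {sepal_length: float64, ...}

df = convert(pd.DataFrame, csv)   # one hop through the network: CSV -> DataFrame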

URL -> Data

@append.register(CSV, URL(CSV))
def append_urlX_to_X(target, source, **kwargs):
    # closing, urlopen, and curry come from contextlib, urllib, and toolz
    with closing(urlopen(source.url)) as r:
        chunk_size = 16 * source.chunk_size
        with open(target.path, 'wb') as fp:
            for chunk in iter(curry(r.read, chunk_size), ''):
                fp.write(chunk)
    return target

Append

URL -> Data

@discover.register((URL(CSV), URL(JSONLines)))
def discover_url_line_delimited(c, lines=5, encoding='utf-8', **kwargs):
    """Discover CSV and JSONLines files from URL."""
    with sample(c, lines=lines, encoding=encoding) as fn:
        return discover(c.subtype(fn, **kwargs), **kwargs)

Discover

Sample: sip a few lines of data

URL -> Data

@convert.register(Temp(CSV), (Temp(URL(CSV)), URL(CSV)))
def url_file_to_temp_file(data, **kwargs):
    fn = '.%s' % uuid.uuid1()
    target = Temp(data.subtype)(fn, **kwargs)
    return append(target, data, **kwargs)

Convert

data: URL(CSV)
Temp(CSV): CSV with garbage collection

URL -> Data

@resource.register('ftp://.+', priority=16)
@resource.register('http://.+', priority=16)
@resource.register('https://.+', priority=16)
def resource_url(uri, **kwargs):
    path = os.path.basename(urlparse(uri).path)
    try:
        subtype = types_by_extension[ext(path)]
    except KeyError:
        subtype = type(resource(path))

    return URL(subtype)(uri, **kwargs)

Resource

URL -> Data

from odo import odo
import pandas as pd

url = "http://foo.com/file.csv"
df = odo(url, pd.DataFrame)

auth = {'user': 'hdfs','port': '14000'}
odo(url, 'hdfs://hdfs_ip:/tmp/iris/iris.txt', **auth)

Example

Some Thoughts

 

  • Frameworks are good

    • Small functions to move and convert data

  • Verify folk tales

    • Databases require DBAs

    • Mongo doesn't suck -- try it

    • What's bcolz? -- try it

  • Odo -- it's for happy people

Thanks!

Install

conda install odo

pip install odo

Docs

SRC

odo

By Benjamin Zaitlen