PyGotham 2018
Molly Leen
Senior Software Engineer @ Kyruus
At Kyruus we help health systems match patients with the right providers and enhance patient access enterprise-wide
"BIG" data
raw data
API server
POST
{
"first_name": "Molly",
"last_name": "Leen",
"conferences": ["PyGotham"]
}
ingestion script
INSERT
Processing one by one is SLOW
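That one-at-a-time path looks roughly like this (a sketch: the file and table names are placeholders, and cursor is assumed to be an ordinary psycopg2 cursor):

import json

with open('raw_data.json') as f:
    for line in f:
        record = json.loads(line)
        # one INSERT (and one round trip to the database) per record
        cursor.execute(
            'INSERT INTO people (first_name, last_name) VALUES (%s, %s)',
            (record['first_name'], record['last_name']),
        )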
From the PostgreSQL docs:
"Use COPY to load all the rows in one command, instead of using a series of INSERT commands. The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but incurs significantly less overhead for large data loads."
API server
ONE request
raw data
ingestion script
COPY
https://www.postgresql.org/docs/9.4/static/sql-copy.html
COPY takes data from a file which is formatted to match the structure of the table
"Spongebob","Squarepants","sponge"
"Patrick","Star","starfish"
"Squidward","Tentacles","squid"
"Eugene","Krabs","crab"
"Sandy","Cheeks","squirrel"
"Sheldon","Plankton","plankton"
"Gary","The Snail","snail"
fname | lname | type
------+-------+------
      |       |
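If this file already lived on local disk, a single copy_from call could load every row at once (a sketch: the file name and table are placeholders, and cursor is a psycopg2 cursor as before):

with open('characters.csv') as f:
    # copy_from streams the whole file through COPY ... FROM STDIN;
    # sep=',' treats each line as plain comma-separated text
    # (for fully quoted CSV, copy_expert with FORMAT csv is an option)
    cursor.copy_from(f, 'spongebob_characters',
                     columns=('fname', 'lname', 'type'), sep=',')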
...but this file could be very large....
...we don't want to have to download the file to our API server....
...and the structure of the file needs to exactly match the table....
...let's define some requirements....
1. Do not download the full file
2. Bulk load data with COPY
According to the python docs:
We can use a file-like object.
With a Pre-Signed s3 URL and python requests, we can iterate over the data line by line using a generator
with requests.get(url, stream=True) as response:
    data = response.iter_lines()
    # data is now a generator we can access either
    # in a loop or by calling next(data)
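The pre-signed URL itself can come from boto3; a minimal sketch (bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'raw/characters.json'},
    ExpiresIn=3600,  # URL is valid for one hour
)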
We have a generator...
...We need a file-like object with read() and readline() methods...
...Let's build one!
class MyFileLikeObject(object):
    def __init__(self, s3_data):
        self._data = s3_data  # generator of s3 data

    def readline(self):
        try:
            return next(self._data)
        except StopIteration:
            # Generators end with a StopIteration
            # Files end with an empty string
            return ''

    def read(self, size=-1):
        # A file-like object's read() takes a size param;
        # we accept it but ignore it, because the copy
        # reads line by line
        return self.readline()
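As a quick sanity check, the wrapper behaves like a file over any iterator of lines (a toy example, not real S3 data):

lines = iter(['"Spongebob","Squarepants","sponge"\n',
              '"Patrick","Star","starfish"\n'])
fake_file = MyFileLikeObject(s3_data=lines)
print(fake_file.readline())  # '"Spongebob","Squarepants","sponge"\n'
print(fake_file.read())      # '"Patrick","Star","starfish"\n'
print(fake_file.read())      # '' -- end of data, just like a real file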
Create each record as JSON, just as if we were creating only one record. Example data:
{"fname": "Spongebob", "lname": "Squarepants", "type": "sponge"}
{"fname": "Patrick", "lname": "Star", "type": "starfish"}
{"fname": "Squidward", "lname": "Tentacles", "type": "squid"}
{"fname": "Eugene", "lname": "Krabs", "type": "crab"}
{"fname": "Sandy", "lname": "Cheeks", "type": "squirrel"}
{"fname": "Sheldon", "lname": "Plankton", "type": "plankton"}
{"fname": "Gary", "lname": "The Snail", "type": "snail"}
File for COPY:
"Spongebob","Squarepants","sponge"
"Patrick","Star","starfish"
"Squidward","Tentacles","squid"
"Eugene","Krabs","crab"
"Sandy","Cheeks","squirrel"
"Sheldon","Plankton","plankton"
"Gary","The Snail","snail"

Table:
fname | lname | type
------+-------+------
      |       |
https://docs.python.org/3/library/csv.html#csv.writer
CSV writer writes many lines to a single file...
...time for another file-like object..?
...we don't want to write all the data to a file on disk (Req. #1)...
...COPY expects data one line at a time...
from queue import Queue

class FileQueue(Queue):
    # Looks like a file: write() puts a row in, read() takes it out
    def read(self):
        return self.get()

    def write(self, data):
        self.put(data)
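csv.writer is happy to treat it as a file, since it only ever calls write(); a small illustration:

import csv

queue = FileQueue()
writer = csv.writer(queue)
writer.writerow(['Spongebob', 'Squarepants', 'sponge'])
print(queue.read())  # 'Spongebob,Squarepants,sponge\r\n'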
The full class, with read() converting each JSON line to CSV on the fly:

class MyFileLikeObject(object):
    def __init__(self, s3_data):
        self._data = s3_data  # generator of s3 data
        self._file_queue = FileQueue()
        self._writer = csv.writer(self._file_queue)

    def readline(self):
        try:
            return next(self._data)
        except StopIteration:
            # Generators end with a StopIteration
            # Files end with an empty string
            return ''

    def read(self, size=-1):
        data = self.readline()
        if not data:
            # End of the S3 data -- signal EOF to COPY
            return ''
        # Assume we have a function that simply returns the
        # values from the json data in order in a list
        data_for_csv = prep_for_csv(data)
        self._writer.writerow(data_for_csv)
        return self._file_queue.read()
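prep_for_csv is assumed here; a minimal sketch, assuming each line is a JSON object with the column names shown earlier:

import json

COLUMNS = ['fname', 'lname', 'type']

def prep_for_csv(json_line):
    # Parse one JSON record and return its values
    # in the same order as the table's columns
    record = json.loads(json_line)
    return [record[column] for column in COLUMNS]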
with requests.get(url, stream=True) as response:
    lines = response.iter_lines()
    file_like_data = MyFileLikeObject(s3_data=lines)
    table = 'spongebob_characters'
    columns = ['fname', 'lname', 'type']
    # Assume we have a psycopg2 cursor, which has a
    # connection to the database and executes SQL
    # commands such as copy_from
    cursor.copy_from(file_like_data, table,
                     columns=columns, sep=',')
The copy_from command reads from our file
and all processing happens when data is read!
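For completeness, the cursor used throughout would come from an ordinary psycopg2 connection (connection details here are placeholders):

import psycopg2

conn = psycopg2.connect(host='localhost', dbname='characters',
                        user='ingest', password='secret')
cursor = conn.cursor()
# ... stream the data with copy_from as shown above ...
conn.commit()
conn.close()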