PyGotham 2018
Molly Leen
"BIG" data
{"fname": "Spongebob", "lname": "Squarepants", "type": "sponge"}
{"fname": "Patrick", "lname": "Star", "type": "starfish"}
{"fname": "Squidward", "lname": "Tentacles", "type": "squid"}
{"fname": "Eugene", "lname": "Krabs", "type": "crab"}
{"fname": "Sandy", "lname": "Cheeks", "type": "squirrel"}
{"fname": "Sheldon", "lname": "Plankton", "type": "plankton"}
{"fname": "Gary", "lname": "The Snail", "type": "snail"}
raw data
API server
POST
{
"fname": "Spongebob",
"lname": "Squarepants",
"type": "Sponge"
}
ingestion script
INSERT
Processing one by one is SLOW
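For contrast, a minimal sketch of the row-by-row approach (the DSN and the records iterable are assumptions, not from the talk):

import json
import psycopg2

conn = psycopg2.connect("dbname=bikini_bottom")  # assumed DSN
cursor = conn.cursor()

# records: an iterable of JSON lines from the API (assumed)
for line in records:
    row = json.loads(line)
    # one round trip to the database per row -- this is the slow part
    cursor.execute(
        "INSERT INTO spongebob_characters (fname, lname, type) "
        "VALUES (%s, %s, %s)",
        (row["fname"], row["lname"], row["type"]),
    )
conn.commit()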
From the PostgreSQL docs:
Use COPY to load all the rows in one command, instead of using a series of INSERT commands. The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but incurs significantly less overhead for large data loads.
COPY takes data from a file which is formatted to match the structure of the table
"Spongebob,Squarepants,sponge"
"Patrick,Star,starfish"
"Squidward,Tentacles,squid"
"Eugene,Krabs,crab"
"Sandy,Cheeks,squirrel"
"Sheldon,Plankton,plankton"
"Gary,The\ Snail,snail"
fname | lname | type
------+-------+-----
      |       |
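If the CSV already lived on disk next to the ingestion script, COPY is one call; a minimal sketch with psycopg2 (file name, table, and DSN are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=bikini_bottom")  # assumed DSN
cursor = conn.cursor()

# a single COPY loads every row from the file
with open("characters.csv") as f:
    cursor.copy_from(f, "spongebob_characters",
                     columns=("fname", "lname", "type"), sep=",")
conn.commit()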
COPY imports data from a file
The file is not on the server
raw data
API server
with requests.get(url, stream=True) as s3_data:
    # iter_lines() gives us a generator of lines we can
    # consume in a loop or by calling next()
    lines = s3_data.iter_lines(decode_unicode=True)
    print(next(lines))
{"fname": "Spongebob", "lname": "Squarepants", "type": "sponge"}
prints...
Can we make our generator LOOK like a file?
According to the Python docs:
Generator | File
--- | ---
next() to get next line | readline() to get next line
raises StopIteration when done | returns an empty string when done
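A quick illustration of the two behaviors, using an in-memory file:

import io

gen = iter(["a", "b"])
next(gen)  # 'a'
next(gen)  # 'b'
# one more next(gen) raises StopIteration

f = io.StringIO("a\nb\n")
f.readline()  # 'a\n'
f.readline()  # 'b\n'
f.readline()  # '' -- empty string at EOF, no exception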
class FileLike(object):
    def __init__(self, s3_data):
        self.data = s3_data  # generator

    def readline(self):
        try:
            return next(self.data)
        except StopIteration:
            # file objects signal EOF with an empty string
            return ''

    def read(self, size=-1):
        # some consumers call read() instead of readline()
        return self.readline()
with requests.get(url, stream=True) as s3_data:
    file_like_data = FileLike(s3_data.iter_lines(decode_unicode=True))
    print(file_like_data.readline())
{"fname": "Spongebob", "lname": "Squarepants", "type": "sponge"}
prints...
The file must be a CSV that matches the format of the table
But the file is JSON Lines
{"fname": "Spongebob", "lname": "Squarepants", "type": "sponge"}
{"fname": "Patrick", "lname": "Star", "type": "starfish"}
{"fname": "Squidward", "lname": "Tentacles", "type": "squid"}
{"fname": "Eugene", "lname": "Krabs", "type": "crab"}
{"fname": "Sandy", "lname": "Cheeks", "type": "squirrel"}
{"fname": "Sheldon", "lname": "Plankton", "type": "plankton"}
{"fname": "Gary", "lname": "The Snail", "type": "snail"}
{"fname": "Spongebob", "lname": "Squarepants", "type": "sponge"}
"Spongebob,Squarepants,sponge"
["Spongebob", "Squarepants", "sponge"]
"Spongebob,Squarepants,sponge"
["Gary", "The Snail", "snail"]
"Gary,The\ Snail,snail"
from queue import Queue

class FileQueue(Queue):
    def write(self, data):
        # csv.writer calls write() with a formatted row; stash it
        self.put(data)

    def read(self):
        # hand the formatted row back out
        return self.get()
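csv.writer only needs an object with a write() method, so it happily "writes" formatted rows into the queue. For example:

import csv

fq = FileQueue()
writer = csv.writer(fq)
writer.writerow(["Gary", "The Snail", "snail"])
print(fq.read())  # 'Gary,The Snail,snail\r\n'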
JSON
{"fname": "Spongebob", "lname": "Squarepants", "type": "sponge"}
List (prep_for_csv step)
["Spongebob", "Squarepants", "sponge"]
Write to CSV (formats and "puts" to FileQueue)
Read from FileQueue and return
"Spongebob,Squarepants,sponge"
import csv

class FileLike(object):
    def __init__(self, s3_data):
        self.data = s3_data  # generator
        self.file_queue = FileQueue()
        self.writer = csv.writer(self.file_queue)

    def readline(self):
        try:
            data = next(self.data)
        except StopIteration:
            return ''
        # converts JSON to list of values
        data_for_csv = prep_for_csv(data)
        # csv.writer formats the row and "puts" it on the FileQueue
        self.writer.writerow(data_for_csv)
        return self.file_queue.read()

    def read(self, size=-1):
        return self.readline()
with requests.get(url, stream=True) as s3_data:
    file_like_data = FileLike(s3_data.iter_lines(decode_unicode=True))
    print(file_like_data.readline())
"Spongebob,Squarepants,sponge"
prints...
with requests.get(url, stream=True) as s3_data:
    file_like_data = FileLike(s3_data.iter_lines(decode_unicode=True))
    table = 'spongebob_characters'
    columns = ['fname', 'lname', 'type']
    cursor.copy_from(file_like_data, table,
                     columns=columns, sep=',')
The copy_from command reads from our "file"
and all processing happens when data is read!
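Putting the pieces together, a condensed end-to-end sketch under the same assumptions (DSN and URL are placeholders):

import psycopg2
import requests

conn = psycopg2.connect("dbname=bikini_bottom")  # assumed DSN
cursor = conn.cursor()

url = "https://s3.amazonaws.com/spongebob/characters.jsonl"  # assumed
with requests.get(url, stream=True) as s3_data:
    file_like_data = FileLike(s3_data.iter_lines(decode_unicode=True))
    cursor.copy_from(file_like_data, 'spongebob_characters',
                     columns=['fname', 'lname', 'type'], sep=',')

conn.commit()  # the COPY is part of the transaction until we commit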
API server
raw data
ingestion script
COPY
{
"file": "s3://spongebob"
}