Recordify-ing the Graph

What / Why?

Custom records generalized the product model so it could store additional types of records beyond products (e.g. colors, recipes)
We needed to generalize the graph sooner rather than later, so that it would be easier to publicize the graph before more product fields were added

Before:

type Organization {
  product(id: Identifier!): Product
}

type Product {
  adjacentProducts(filter: String = "=", numAdjacent: Int = 10, sort: String): AdjacentProducts
  children(pagination: PaginationInput): ProductPaginatedList!
  parent: Product
}

After:

type Organization {
  record(id: Identifier!): Record
}

type Record {
  adjacentRecords(filter: String = "=", numAdjacent: Int = 10, sort: String): AdjacentRecords
  children(pagination: PaginationInput): RecordPaginatedList!
  parent: Record
}

Talk Overview

LETS ADD THIS LATER

The Problem

Large amount of data
2x the number of rows currently in our system

"BIG" data

Current Infrastructure

raw data

API server

Post each record 1 at a time
Estimated time: ~48 hours

POST
{
  "first_name": "Molly",
  "last_name": "Leen",
  "conferences": ["PyGotham"]
}

ingestion script

INSERT

Why is this so slow?

Network request for each record
Single threaded
API commits to db on EACH request
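
The three points above can be pictured with a minimal sketch of a one-record-at-a-time ingestion loop. The function name and the `post` callable are hypothetical stand-ins (in a real script `post` might wrap an HTTP client call); this is not the actual ingestion script.

```python
import json


def ingest_one_at_a_time(records, post):
    """Send every record in its own request, sequentially.

    `post` is a hypothetical callable that performs one HTTP POST with the
    given JSON body. Each call is a full network round trip, and nothing
    runs in parallel because the loop is single threaded.
    """
    requests_made = 0
    for record in records:
        post(json.dumps(record))  # one request (and one server-side commit) per record
        requests_made += 1
    return requests_made
```

The number of round trips grows linearly with the number of records, which is why this approach scales poorly for a large load.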

PostgreSQL - COPY

From the PostgreSQL docs:

Use COPY to load all the rows in one command, instead of using a series of INSERT commands. The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but incurs significantly less overhead for large data loads.

The Plan

API server

ONE request

raw data

ingestion script

COPY

What is COPY?

https://www.postgresql.org/docs/9.4/static/sql-copy.html

COPY reads from a file or file-like object which is formatted to match the structure of the table

COPY needs a structured file...

...but this file would be very large....

...we don't want to have to download a large file to disk....

...and the structure of the file is very important....

...let's define some requirements....

Requirements

Do not download file to disk
Create records as JSON as if creating only one record
Add API specific metadata

Requirement #1

Do not download file to disk

What is a file?

According to Google:

a collection of data, programs, etc., stored in a computer's memory or on a storage device under a single identifying name.

What is a file-like object?

According to the Python docs:

An object exposing a file-oriented API (with methods such as read() or write()) to an underlying resource. Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example standard input/output, in-memory buffers, sockets, pipes, etc.).
https://docs.python.org/3/glossary.html#term-file-object
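
As a small illustration of the definition above, an in-memory buffer from the standard library behaves like a file without touching the disk. The helper name and the sample rows are made up for this sketch.

```python
import io


def first_line_and_rest(file_like):
    """Works with any object that exposes readline() and read()."""
    first = file_like.readline()  # reads up to and including the first newline
    rest = file_like.read()       # reads everything that is left
    return first, rest


# io.StringIO is an in-memory text buffer: file-like, but never written to disk.
buffer = io.StringIO("first_name\tlast_name\nMolly\tLeen\n")
first, rest = first_line_and_rest(buffer)
```

The helper never checks the type of its argument. It only relies on the two methods, which is what "file-like" means in practice.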

Psycopg2

PostgreSQL adapter for use in Python apps/scripts/etc.

reads data from a file-like object
object must have read() and readline() methods
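
A minimal sketch of handing a file-like object to psycopg2, assuming an already-open cursor and a hypothetical table named `people` with `first_name` and `last_name` columns. It uses `cursor.copy_from()`, which defaults to tab-separated columns. The sketch does no escaping, so it assumes values contain no tabs, newlines, or backslashes.

```python
import io


def copy_rows(cursor, table, rows, columns):
    """Load `rows` (an iterable of tuples of strings) into `table` using COPY.

    `cursor` is expected to be a psycopg2 cursor. The rows are written to an
    in-memory buffer as tab-separated lines, the default text format that
    copy_from() reads.
    """
    buffer = io.StringIO()
    for row in rows:
        buffer.write("\t".join(row) + "\n")
    buffer.seek(0)  # rewind so copy_from() reads from the start
    cursor.copy_from(buffer, table, columns=columns)
```

Usage might look like `copy_rows(cursor, "people", [("Molly", "Leen")], ("first_name", "last_name"))` followed by a commit on the connection. Building the whole buffer in memory is only reasonable for small inputs.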

Extracting data from S3

With a pre-signed S3 URL and Python requests, we can iterate over the data line by line using a generator

Definitions:

Pre-signed S3 URL: Authenticated URL to access an S3 file via HTTP requests
Python requests: Python library for HTTP requests
Generator: A Python object that produces values lazily, one at a time, and behaves like an iterator

import requests

with requests.get(url, stream=True) as response:
    # iter_lines() returns a generator over the response body; consume it
    # in a loop or one line at a time with next(lines)
    lines = response.iter_lines()
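
To show the generator behaviour without a network call, here is a minimal sketch that uses an ordinary iterator of strings as a stand-in for the streamed response. It assumes the data is one JSON record per line; `parse_records` is a hypothetical helper, not part of requests.

```python
import json


def parse_records(lines):
    """Generator: lazily yield one parsed JSON record per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# Stand-in for the streamed lines of the S3 file.
raw_lines = iter([
    '{"first_name": "Molly", "last_name": "Leen"}',
    "",
    '{"conferences": ["PyGotham"]}',
])

records = parse_records(raw_lines)
first = next(records)      # pulls and parses only the first line
remaining = list(records)  # consumes whatever is left
```

Nothing is parsed until a value is requested, so only one line needs to be held in memory at a time.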

Generator -> File

We have a generator...

...We need a file-like object with read() and readline() methods...

...Let's build one!

Generator -> File
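
One possible shape for such a wrapper, as a minimal sketch and not necessarily the implementation used in the talk. The class name `GeneratorFile` is hypothetical, and it assumes the generator yields text lines without trailing newlines.

```python
class GeneratorFile:
    """Read-only file-like wrapper around an iterator of text lines."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._buffer = ""  # text pulled from the iterator but not yet returned

    def _pull(self):
        """Append the next line (plus newline) to the buffer; False when exhausted."""
        try:
            self._buffer += next(self._lines) + "\n"
            return True
        except StopIteration:
            return False

    def read(self, size=-1):
        """Return up to `size` characters, or everything left if size < 0."""
        while size < 0 or len(self._buffer) < size:
            if not self._pull():
                break
        if size < 0:
            data, self._buffer = self._buffer, ""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def readline(self, size=-1):
        """Return the next line including its newline, or "" at end of data."""
        if "\n" not in self._buffer:
            self._pull()
        end = self._buffer.find("\n") + 1 or len(self._buffer)
        if size >= 0:
            end = min(end, size)
        data, self._buffer = self._buffer[:end], self._buffer[end:]
        return data
```

An instance could then be passed wherever a file object is expected, for example as the first argument to psycopg2's `cursor.copy_from()`. Calling `read()` with no size pulls the entire stream into memory, so callers that care about memory should pass an explicit size.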

Copy of pygotham

By Molly Leen
