Doc Brown
Because every array deserves a loving home.
Brian Davis
image copyright © Kristen Frenzel
The Problem
We have lots of array data.
one-dimensional
two-dimensional
three-dimensional
four-dimensional
etc...
"Big Data" usually looks like this:
{"foo": 1, "bar": 1, "baz": 1}
{"foo": 2, "bar": 2, "baz": 2}
{"foo": 3, "bar": 3, "baz": 3}
{"foo": 4, "bar": 4, "baz": 4}
{"foo": 5, "bar": 5, "baz": 5}
...
So we did this:
[[1, 2, 3, ...], ...]
⬇
{"latitude": 0, "longitude": 0, "value": 1}
{"latitude": 1, "longitude": 0, "value": 2}
{"latitude": 2, "longitude": 0, "value": 3}
...
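(A sketch of that explosion, in Python; the grid and field names here are illustrative, not our actual ETL code.)

import json
import numpy as np

# A modest 2-D grid: 1000x1000 float values.
grid = np.random.rand(1000, 1000)

# Every cell becomes its own document, repeating the coordinates
# alongside each 8-byte value.
docs = [
    json.dumps({"latitude": lat, "longitude": lon, "value": float(grid[lat, lon])})
    for lat in range(grid.shape[0])
    for lon in range(grid.shape[1])
]

# One million documents for one million values, most of every
# document being coordinate overhead.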
But...
storage = expensive
querying = expensive
scalability = poor
new dataset = new code
scientific tools = not supported
Array Databases!
storage = cheap
querying = cheap
scientific tools = supported
But...
operationalization = hard
scalability = vertical
versioning = nope
Our Design
Doc Brown is a library, not a service.
This means:
no servers
=
no bottlenecks*
=
cheap scalability
=
low operational cost
* except for the backing store, of course
Our Design
Doc Brown stores data in chunked,
content-addressable,
binary arrays.
This means:
[three figures: different slices of the same array]
... all have identical performance.
At a fixed cost.
This also means:
... small updates are cheap.
This also means:
... versioning is cheap.
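(A toy model of "chunked + content-addressable," in Python. The names are hypothetical; this is a sketch of the idea, not Doc Brown's storage layer.)

import hashlib
import numpy as np

store = {}  # chunk bytes, keyed by their own hash

def put_chunk(chunk):
    key = hashlib.sha256(chunk.tobytes()).hexdigest()
    store.setdefault(key, chunk.tobytes())  # identical chunks stored once
    return key

def write_version(array, chunk_size=100):
    # A "version" is just a manifest of chunk hashes.
    return [put_chunk(array[i:i + chunk_size])
            for i in range(0, len(array), chunk_size)]

a = np.zeros(1000)
v1 = write_version(a)

a[42] = 3.14           # touch one element...
v2 = write_version(a)  # ...and only one new chunk hits the store

assert sum(h1 != h2 for h1, h2 in zip(v1, v2)) == 1
# v1 still resolves every one of its chunks, so the old version
# stays readable for free.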
Interfaces
Doc Brown supports several different points of entry.
Clojure
(ns docbrown-demo
  (:require [com.climate.docbrown :as db]
            [com.climate.docbrown.slice :as slice]))

(def store-spec "ddb://integration.docbrown.climate.com/demo")

(let [reader (-> store-spec
                 db/dataset-reader
                 db/on-last-version
                 (db/variable-reader :foo))]
  ;; fetch a 1000x1000x1000 element cube
  (db/get-slice reader (slice/mk-slice [0 0 0] [1000 1000 1000])))
DAP
from netCDF4 import Dataset

url = "http://docbrown.testing.net/foo"
dataset = Dataset(url)

# fetch a 1000x1000x1000 element cube
bar = dataset.variables['bar'][0:1000, 0:1000, 0:1000]
REST + JSON
$ curl -XGET -H "Accept: application/json" \
"http://docbrown.testing.net/foo.dods?bar\[0:1:1000\]\[0:1:1000\]\[0:1:1000\]"
TODO
documentation
garbage collection
open source
Credits
Arthur Silva
Alice Liang
Brian Davis
Brian Zimmer
Jeffrey Gerard
Mahsa Eshraghi
Rama Shenai
Sebastian Galkin
Tim Chagnon
So what is Doc Brown?
It's a
standards-compliant,
distributed,
versioning,
N-dimensional
array database.
Wha?
Think of it like a bunch of Rubik's cubes,
wrapped in Git,
sitting in The Cloud...
Why N-Dimensional?
Our prior schema struggled to support queries like this:
"Give me the precipitation over the entire Midwest
for the last 48 hours."
Why N-Dimensional?
Queries like this are simple in Doc Brown.
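(For example, the precipitation query above is a single index slice through the DAP interface. The index ranges below are made up for illustration.)

from pydap.client import open_url

dataset = open_url("http://docbrown.testing.net/foo")

# Hypothetical indices: hours 0-47 over a lat/lon box covering
# the Midwest. One slice, no schema gymnastics.
precip = dataset['bar'][0:48, 2000:2300, 2500:2900]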
Why Versioning?
Data providers change their data all the time.
Existing systems can't easily track these changes.
Why Versioning?
Doc Brown tracks these changes.
Why Versioning?
"Get December's daily temperature at [x, y]."
"Get December's daily temperature at [x, y]
as reported two weeks ago."
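(One way that could work, continuing the toy model above; entirely hypothetical, not the real API. A dataset keeps a log of (commit time, manifest) pairs, and an "as of" read just resolves an older manifest.)

import datetime

now = datetime.datetime.now()
versions = [
    (now - datetime.timedelta(weeks=3), ["hash-a", "hash-b"]),  # old commit
    (now - datetime.timedelta(days=1), ["hash-a", "hash-c"]),   # recent correction
]

def manifest_as_of(when):
    # Newest manifest committed at or before `when`.
    return max((t, m) for t, m in versions if t <= when)[1]

# "...as reported two weeks ago" reads through the old manifest,
# with the same chunk fetches as a current query.
assert manifest_as_of(now - datetime.timedelta(weeks=2)) == ["hash-a", "hash-b"]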
Why Distributed?
The data we currently use is small.
Why Distributed?
The data we want to use is BIG.
Why Standards-Compliance?
Writing custom software all the time is wasteful, costly, unsustainable, silly.
Why Standards-Compliance?
Cool analysis tools already exist.
Doc Brown's DAP interface lets us use them.
(And we plan to support more.)
Demo
import numpy as np
from pydap.client import open_url
from videosink import VideoSink
# ^^^ We wrote none of those libraries up there.
# Get 72 hours of precip data over a ~400x300 km chunk of the midwest.
d = open_url("http://docbrown.testing.net/foo")
data = d['bar'][0:72,2000:2300,2500:2900]
# Rescale the data for 8-bit video.
scaled = (255.0/(data.max()-data.min())*(data-data.min())).astype(np.uint8)
# Aaaaaaaaand... ACTION!
video = VideoSink((300, 400), "test", rate=8, byteorder="Y8")
for frame in xrange(72):
    video.run(scaled[frame, :, :])
video.close()