Building a database in Clojure

And what it taught me

@robashton

Disclaimer

I've not done Transducers yet

A BIT OF HISTORY (ME)

A decade of enterprise .NET experience

a decade of enterprise JS experience

three years of Clojure fiddling

a year of enterprise Erlang experience

Now a certified Haskell pusher

I hated functional programming at university

But (apparently) it's the future

"TEACH ME FP OH WISE ONE"

So I wrote space invaders in ClojureScript

I needed something easier

So I wrote a database in Clojure

CravenDB

github.com/robashton/cravendb

(demo)

Lesson #1

Parens are your friend

Paredit

An editor without Paredit is not an editor

(Vim and Emacs users rejoice o/)

Rainbow Braces

(So Pretty!)

(Demo)

Lesson #2

The REPL is Boss

Editor integration

Emacs has a gazillion attempts at this

Vim has vim-fireplace/redl

LightTable has InstaREPL

If your editor can't REPL your editor is broken

(Demo)

Lesson #3

(Inside out, Bottom Up)

Example - Conway

Any live cell with fewer than two live neighbours dies, as if caused by under-population.
Any live cell with two or three live neighbours lives on to the next generation.
Any live cell with more than three live neighbours dies, as if by overcrowding.
Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.

"Write me the function that returns the next world state"

(defn next-world-state [world]
     ; Code goes here
  )

No No No!

"Any live cell with fewer than two live neighbours dies

as if caused by under-population."

Break it down

Just the data we need

(defn starves? [is-alive neighbour-count]
  (and is-alive (< neighbour-count 2)))

Composition


(defn live-neighbours [cell])
(defn is-alive? [cell])

(defn dies? [cell]
  (or
    (starves?
      (is-alive? cell) (live-neighbours cell))
    (overcrowds?
      (is-alive? cell) (live-neighbours cell))))

Lots of small expressions aid reasoning

(defn is-alive? [cell]
  (= (:current-state cell) :alive))

Any live cell

(defn neighbour-count [cell grid]
  ; whatever
)

How many neighbours?
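A minimal sketch of how those small predicates might compose into the next world state. It assumes the world is a set of live [x y] coordinates, which is not necessarily how the talk models a cell; neighbours, reproduces? and lives-next? are names made up for illustration.

```clojure
;; Assumption: the world is a set of live cells, each cell an [x y] vector.
(defn neighbours [[x y]]
  (for [dx [-1 0 1] dy [-1 0 1] :when (not= [dx dy] [0 0])]
    [(+ x dx) (+ y dy)]))

(defn live-neighbour-count [world cell]
  ;; A set acts as a predicate: it returns the cell if present, nil otherwise.
  (count (filter world (neighbours cell))))

;; Each rule only gets the data it needs.
(defn starves? [alive? n] (and alive? (< n 2)))
(defn overcrowds? [alive? n] (and alive? (> n 3)))
(defn reproduces? [alive? n] (and (not alive?) (= n 3)))
(defn dies? [alive? n] (or (starves? alive? n) (overcrowds? alive? n)))

(defn lives-next? [world cell]
  (let [alive? (contains? world cell)
        n (live-neighbour-count world cell)]
    (or (reproduces? alive? n)
        (and alive? (not (dies? alive? n))))))

(defn next-world-state [world]
  ;; Only live cells and their neighbours can be alive in the next generation.
  (let [candidates (into world (mapcat neighbours world))]
    (set (filter #(lives-next? world %) candidates))))

;; A vertical "blinker" becomes a horizontal one.
(next-world-state #{[1 0] [1 1] [1 2]}) ; => #{[0 1] [1 1] [2 1]}
```

The top-level function ends up as a thin composition; every rule can be tried at the REPL on its own.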

Lesson #4

Keep it flat (if you can)

You can build arbitrarily complex maps

{
  :indexes [
             {
               :path "/indexes/ponies"
               :type :in-memory
               :lucene { :handle ... }
               :pending [ { :id 2 :paths [ "/bar" "/foo" ] } ]
             }
           ]
  ; etc
}

And manipulate them

(assoc-in my-map [ :indexes 0 :path] "new-path")
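A small runnable sketch of manipulating a nested map by path. The map below is a cut-down, hypothetical version of the one on the previous slide (the Lucene handle is left out); assoc-in sets a value, update-in applies a function to the existing value.

```clojure
;; Hypothetical, simplified application state.
(def my-map
  {:indexes [{:path "/indexes/ponies"
              :type :in-memory
              :pending [{:id 2 :paths ["/bar" "/foo"]}]}]})

;; assoc-in replaces the value found at the end of the path.
(def renamed
  (assoc-in my-map [:indexes 0 :path] "new-path"))

;; update-in applies a function (plus extra args) to the value at the path.
(def queued
  (update-in my-map [:indexes 0 :pending] conj {:id 3 :paths ["/baz"]}))

(get-in renamed [:indexes 0 :path]) ; => "new-path"
(get-in my-map [:indexes 0 :path])  ; => "/indexes/ponies" (the original is untouched)
```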

Passing it around leads to confusion

(defn add-index [db-state])

What's in db-state??

Better this

(indexes/update-path new-path id indexes)

Build modules around each flat data structure in application state
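A sketch of what such a module might look like, assuming the indexes are kept as a flat map of id to index description. Only update-path appears on the slide; add-index, path-of and the map shape are assumptions for illustration, and in a real project these would live in their own namespace.

```clojure
;; Hypothetical 'indexes' module: every function takes and returns the
;; flat indexes map, never the whole application state.
(defn add-index [id path indexes]
  (assoc indexes id {:path path :type :in-memory}))

(defn update-path [new-path id indexes]
  (assoc-in indexes [id :path] new-path))

(defn path-of [id indexes]
  (get-in indexes [id :path]))

;; The data goes last, so calls thread nicely with ->>
(def indexes
  (->> {}
       (add-index "ponies" "/indexes/ponies")
       (update-path "/indexes/horses" "ponies")))

(path-of "ponies" indexes) ; => "/indexes/horses"
```

A caller of update-path can see exactly which piece of state it needs, which is the point of keeping things flat.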

Lesson #5

Let somebody else do the hard work

Clojars + Lein

lein deps

maven
(no XML though!)

project.clj

(defproject cravendb "0.1.0-SNAPSHOT"
  :description "A clojure-oriented document-oriented database"
  :url "http://robashton.github.io/cravendb"
  :min-lein-version "2.2.0"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 ; etc
                 ])

HTTP-KIT

(run-server my-handler { :port 8080 })

Liberator

Defining handlers for http-kit

It's all about the resources

It's all about http correctness

A route from Craven

(ANY "/document/:id" [id]
 (resource
  :allowed-methods [:put :get :delete :head]
  :etag (fn [ctx] (etag-from-metadata ctx))
  :put! (fn [ctx] (db/put-document instance id (read-body ctx)))
  :delete! (fn [_] (db/delete-document instance id))
  :handle-ok (fn [_] (db/load-document instance id))))

All the deps

    :dependencies [[org.clojure/clojure "1.5.1"]
                   [org.clojure/core.async "0.1.256.0-1bf8cf-alpha"]
                   [ring/ring-core "1.1.7"]
                   [org.clojure/data.csv "0.1.2"] ;; For load purposes
                   [com.cemerick/url "0.1.0"]
                   [liberator "0.9.0"]
                   [instaparse "1.2.2"]
                   [http-kit "2.1.12"]
                   [compojure "1.1.5"]
                   [serializable-fn "1.1.3"]
                   [clojurewerkz/vclock "1.0.0"]
                   [clj-time "0.6.0"]
                   [org.fusesource.leveldbjni/leveldbjni-all "1.7"]
                   [me.raynes/fs "1.4.4"]
                   [http.async.client "0.5.2"]
                   [org.clojure/tools.logging "0.2.6"]
                   [org.slf4j/slf4j-log4j12 "1.6.6"]
                   [org.clojure/core.incubator "0.1.3"]
                   [org.apache.lucene/lucene-core "4.4.0"]
                   [org.apache.lucene/lucene-queryparser "4.4.0"]
                   [org.apache.lucene/lucene-analyzers-common "4.4.0"]
                   [org.clojure/data.codec "0.1.0"]
                   [org.apache.lucene/lucene-highlighter "4.4.0"]]

Lesson #6

Interop with legacy Java is GREAT

Java OSS is plentiful

But Java sucks (so does Scala, before any of you get started)

Lucene

(Classic Java, one of the best indexing systems around)

Added like any other dependency

[org.apache.lucene/lucene-core "4.4.0"]
[org.apache.lucene/lucene-queryparser "4.4.0"]
[org.apache.lucene/lucene-analyzers-common "4.4.0"]

Import it

(Lol @ Java namespaces)

(:import
           (org.apache.lucene.analysis.standard StandardAnalyzer)
           (org.apache.lucene.store FSDirectory RAMDirectory)
           (org.apache.lucene.util Version)
           (org.apache.lucene.index IndexWriterConfig IndexWriter DirectoryReader)
           (org.apache.lucene.search IndexSearcher Sort SortField SortField$Type)
           (org.apache.lucene.queryparser.classic QueryParser)
           (org.apache.lucene.document Document Field Field$Store Field$Index
                                      TextField IntField FloatField StringField)))

Use it

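; Create a RAM directory called 'dir'
(def dir (RAMDirectory.))

; Create an index writer over that dir (it needs an analyzer and a config)
(def analyzer (StandardAnalyzer. Version/LUCENE_CURRENT))
(def writer (IndexWriter. dir (IndexWriterConfig. Version/LUCENE_CURRENT analyzer)))
(.commit writer) ; write an (empty) index so a reader can open it

; Create an index reader over that dir
(def reader (DirectoryReader/open dir))

; Query that reader for everything ("content" is a placeholder default field)
(def parser (QueryParser. Version/LUCENE_CURRENT "content" analyzer))
(.search (IndexSearcher. reader) (.parse parser "*:*") 10)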

Lesson #7

Interop with legacy Java SUCKS

Classes + Interfaces + FactoryFactoryProvider

Maps, Vectors, Lists

Java in Clojure

(defn create-index [file]
  (let [analyzer (StandardAnalyzer. Version/LUCENE_CURRENT)
        directory (FSDirectory/open file)
        config (IndexWriterConfig. Version/LUCENE_CURRENT analyzer) ]
    (LuceneIndex. analyzer directory config)))

A pervasive legacy

Lucene abstracts to "Objects"
Clojure operates on transparent "data"
(not (= :java :clojure-best-practices))

Hide it

Convert *everything* into maps and lists

(defn index-result-to-map [index-result]
  {
     :name (.getName index-result)
     :total-count (.getTotalCount index-result)
     :items (map index-item-to-map (.getItems index-result))
  })
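The same idea in a form that runs without Lucene on the classpath. java.io.File stands in for the Java result object here purely because it ships with the JVM; file-to-map and its keys are made up for illustration.

```clojure
(import '(java.io File))

;; Turn an opaque Java object into a plain map at the boundary,
;; so the rest of the code only ever sees data.
(defn file-to-map [^File file]
  {:name (.getName file)
   :path (.getPath file)
   :exists? (.exists file)})

(def described (file-to-map (File. "ponies.txt")))

(:name described)          ; => "ponies.txt"
(assoc described :tag :db) ; plain maps work with every core function
```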

Paging native crap

;; Naive implementation: keep asking Lucene for more until we have enough

(defn get-results [amount skip index query]
  (let [real-count-to-request (+ amount skip)
        results (lucene/query index query real-count-to-request)
        still-needed-count (- amount (count results))]
    (if (pos? still-needed-count)
      (concat results
              (get-results still-needed-count real-count-to-request index query))
      results)))

Laziness

(defn lucene-producer [tx reader opts]
  (fn [offset amount]
    (->>
      (lucene/query reader
                    (:filter opts)
                    (+ offset amount)
                    (:sort-by opts)
                    (:sort-order opts))
      (drop offset)
      (valid-documents tx))))

A producer function

Laziness

(defn lucene-page
  ([producer page-size] (lucene-page producer 0 page-size))
  ([producer current-offset page-size]
   {
    :results (producer current-offset page-size)
    :next (fn [] (lucene-page producer (+ current-offset page-size) page-size))
   }))

State per page

Laziness

And a recursive generator function

(defn lucene-seq
  ([page] (lucene-seq page (:results page)))
  ([page src]
   (cond
     (empty? (:results page)) ()
     (empty? src) (lucene-seq ((:next page)))
     :else (cons (first src) (lazy-seq (lucene-seq page (rest src)))))))

Favour lazy sequences over crappy paging code (etc)
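A self-contained sketch of the same producer / page / sequence shape, with an in-memory vector standing in for Lucene so it can be run at the REPL. The names page-of, paged-seq and counting-producer are hypothetical, and the calls atom exists only to show how many pages were actually fetched.

```clojure
;; Counts how often the producer is asked for a page.
(def calls (atom 0))

(defn counting-producer [items]
  (fn [offset amount]
    (swap! calls inc)
    (->> items (drop offset) (take amount))))

;; A page holds its results and a function that fetches the next page.
(defn page-of [producer offset page-size]
  {:results (producer offset page-size)
   :next (fn [] (page-of producer (+ offset page-size) page-size))})

;; Walk the current page, fetching the next one only when it runs out.
(defn paged-seq
  ([page] (paged-seq page (:results page)))
  ([page src]
   (lazy-seq
     (cond
       (empty? (:results page)) nil
       (empty? src) (paged-seq ((:next page)))
       :else (cons (first src) (paged-seq page (rest src)))))))

(def numbers
  (paged-seq (page-of (counting-producer (vec (range 10))) 0 3)))

(def first-four (doall (take 4 numbers))) ; => (0 1 2 3)
@calls ; => 2, only the first two pages have been fetched
```

The consumer just sees a sequence; the paging arithmetic lives in one place.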

Lesson #8

Native resources are a pain

What's wrong with this?

(let [handle (open-file "foo.txt")]
   (map to-user (read-lines handle)))

Clojure is Lazy

(def results
  (let [handle (open-file "foo.txt")]
    (map to-user (read-lines handle))))


(println results) ; CRASH

Laziness without Purity

(Haskell doesn't have this problem)

How do you build an API around this?

(get-all-the-lines-from "foo.txt")

Well now it's not lazy....

Don't try to hide resources

(with-open [handle (open-resource "foo.txt")]
  (do-stuff-with-resource handle))

Resources

Don't try to hide resource usage from the end user
Give them an 'open' method
Give them an API to operate over that resource
Make them responsible for closing it
Deal with it.
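One way those points might look in practice. This is a sketch under assumptions: a StringReader stands in for a real file so it runs anywhere, and open-lines and read-users are hypothetical names.

```clojure
(import '(java.io BufferedReader StringReader))

;; The 'open' function: the caller gets the resource and owns it.
(defn open-lines [text]
  (BufferedReader. (StringReader. text)))

;; The API over the resource: still lazy, and honest about needing a handle.
(defn read-users [rdr]
  (map (fn [line] {:name line}) (line-seq rdr)))

;; The caller decides how much to realise before the resource is closed.
(def users
  (with-open [rdr (open-lines "alice\nbob")]
    (doall (read-users rdr))))

users ; => ({:name "alice"} {:name "bob"})
```

Without the doall, the unrealised part of the sequence would try to read from a reader that with-open has already closed.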

Lesson #9

Concurrency is something you still need to be aware of

A Problem

Databases have multiple clients

HTTP GET
HTTP PUT
HTTP POST
HTTP GET
HTTP POST

A problem

Shared state

(A collection of in-memory indexes for example)

A solution

; Define an atom called x with an initial value of 1
(def x (atom 1))

(println x)  ; prints the atom itself (a reference wrapping 1)
(println @x) ; de-reference the atom, prints 1

; Increase whatever is in x by 1
(swap! x inc)

A solution

Stuff shared state into atoms and agents
Hide this behind a suitable interface
Give a transparent API over the top of this
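A minimal sketch of that approach for the in-memory indexes example. The registry and every function name here are hypothetical; the only point is that callers never touch the atom directly.

```clojure
;; The shared state: one atom holding a map of id -> index description.
(defn create-registry [] (atom {}))

;; Writes go through swap!, so concurrent callers cannot lose updates.
(defn register-index! [registry id index]
  (swap! registry assoc id index)
  registry)

(defn remove-index! [registry id]
  (swap! registry dissoc id)
  registry)

;; Reads de-reference once and work on an immutable snapshot.
(defn find-index [registry id]
  (get @registry id))

(defn index-ids [registry]
  (set (keys @registry)))

(def registry (create-registry))
(register-index! registry "ponies" {:path "/indexes/ponies" :type :in-memory})
(register-index! registry "unicorns" {:path "/indexes/unicorns" :type :in-memory})
(remove-index! registry "unicorns")

(index-ids registry) ; => #{"ponies"}
```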

A problem

That's a bloody mess

core.async

Channels and Processes (CSP)

An event loop

(defn event-loop [initial-state input]
  (go
    (loop [state initial-state]
      (if-let [event (<! input)]
        (recur (dispatch-event state event))))))

Event loops

Can look after some private local state
Can look after a collection of resources
Can coordinate multi-threaded access over that state
Look a lot like actors
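A runnable sketch of such a loop, assuming core.async is on the classpath. The dispatch-event function and the :add / :remove commands are invented for the example; the state here is just a set of index names.

```clojure
(require '[clojure.core.async :refer [chan go <! >!! <!! close!]])

;; Pure function: current state + event -> next state.
(defn dispatch-event [state {:keys [cmd data]}]
  (case cmd
    :add    (conj state data)
    :remove (disj state data)
    state)) ; unknown commands leave the state alone

;; The loop owns the state; the only way in is the input channel.
;; When the channel is closed, <! yields nil and the final state is returned.
(defn event-loop [initial-state input]
  (go
    (loop [state initial-state]
      (if-let [event (<! input)]
        (recur (dispatch-event state event))
        state))))

(def input (chan 10))
(def result (event-loop #{} input))

(>!! input {:cmd :add :data "ponies"})
(>!! input {:cmd :add :data "unicorns"})
(>!! input {:cmd :remove :data "ponies"})
(close! input)

(def final-state (<!! result)) ; => #{"unicorns"}
```

Events are handled one at a time in arrival order, so the state never needs a lock.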

Lesson #10

(Should have used Erlang)

I ended up with a Clojure-based actor system

  (go
    (loop [state (initial-state engine)]
      (if-let [{:keys [cmd data]} (<! command-channel)]
        (do
          (debug "handling index loop command" cmd)
          (recur (case cmd
                   :schedule-indexing (main-indexing-process state)
                   :notify-finished-indexing (main-indexing-process-ended state)
                   :removed-index state ;; WUH OH
                   :new-index (add-chaser state data)
                   :chaser-finished (finish-chaser state data)
                   :storage-request (storage-request state data))))
        (do
          (debug "being asked to shut down")
          (wait-for-main-indexing state)
          (wait-for-chasers state)
          (close-open-indexes state)))))

Looks a lot like OTP

 handle_info(schedule_indexing, State) ->
 handle_info(finished_indexing, State) ->
 handle_info(removed_index, State) ->

Only without

A good supervision structure
Nice IO list support for binary message passing
Inter-process communication
A good native interop story

Bonus Lesson #11

Spyscope, polymorphism, records and transparent state

If we have time.

Lesson #0

Share and Learn