Building a database in Clojure
And what it taught me
@robashton
Disclaimer
I've not done Transducers yet
A BIT OF HISTORY (ME)
A decade of enterprise .NET experience
A decade of enterprise JS experience
Three years of Clojure fiddling
A year of enterprise Erlang experience
Now a certified Haskell pusher
I hated functional programming at university
But (apparently) it's the future
"TEACH ME FP OH WISE ONE"
So I wrote space invaders in ClojureScript
I needed something easier
So I wrote a database in Clojure
CravenDB
github.com/robashton/cravendb
(demo)
Lesson #1
Parens are your friend
Paredit
An editor without Paredit is not an editor
(Vim and Emacs users rejoice o/)
Rainbow Braces
(So Pretty!)
(Demo)
Lesson #2
The REPL is Boss
Editor integration
Emacs has a gazillion attempts at this
Vim has vim-fireplace/redl
LightTable has InstaREPL
If your editor can't REPL your editor is broken
(Demo)
Lesson #3
(Inside out, Bottom Up)
Example - Conway
- Any live cell with fewer than two live neighbours dies, as if caused by under-population.
- Any live cell with two or three live neighbours lives on to the next generation.
- Any live cell with more than three live neighbours dies, as if by overcrowding.
- Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.
"Write me the function that returns the next world state"
(defn next-world-state [world]
; Code goes here
)
No No No!
"Any live cell with fewer than two live neighbours dies
as if caused by under-population."
Break it down
Just the data we need
(defn starves? [is-alive neighbour-count]
  (and is-alive (< neighbour-count 2)))
Composition
(defn live-neighbours [cell])
(defn is-alive? [cell])
(defn dies? [cell]
  (or
    (starves?
      (is-alive? cell) (live-neighbours cell))
    (overcrowds?
      (is-alive? cell) (live-neighbours cell))))
Lots of small expressions aid reasoning
(defn is-alive? [cell]
(= (:current-state cell) :alive))
Any live cell
(defn neighbour-count [cell grid]
; whatever
)
How many neighbours?
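Putting the small pieces together might look like the sketch below. It assumes the world is a set of [x y] coordinates of live cells, which is a choice made for this example rather than anything from the talk, and every function name here is illustrative.

```clojure
;; Sketch only: the world is assumed to be a set of [x y] live cells.

(defn neighbours [[x y]]
  ;; The eight coordinates surrounding a cell
  (for [dx [-1 0 1]
        dy [-1 0 1]
        :when (not= [dx dy] [0 0])]
    [(+ x dx) (+ y dy)]))

(defn neighbour-count [world cell]
  ;; A set is a function of its members, so it works as a filter predicate
  (count (filter world (neighbours cell))))

;; One tiny predicate per rule
(defn starves? [alive? n] (and alive? (< n 2)))
(defn overcrowds? [alive? n] (and alive? (> n 3)))
(defn reproduces? [alive? n] (and (not alive?) (= n 3)))

(defn lives-on? [alive? n]
  (or (reproduces? alive? n)
      (and alive?
           (not (starves? alive? n))
           (not (overcrowds? alive? n)))))

;; Only now, at the very end, the function we were asked for
(defn next-world-state [world]
  (let [candidates (into world (mapcat neighbours world))]
    (set (filter #(lives-on? (contains? world %) (neighbour-count world %))
                 candidates))))

;; A horizontal "blinker" becomes a vertical one
(next-world-state #{[0 1] [1 1] [2 1]}) ; => #{[1 0] [1 1] [1 2]}
```

Each predicate can be tried at the REPL on its own before next-world-state exists at all.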
Lesson #4
Keep it flat (if you can)
You can build arbitrarily complex maps
{
 :indexes [
   {:path "/indexes/ponies"
    :type :in-memory
    :lucene { :handle ... }
    :pending [ { :id 2 :paths [ "/bar" "/foo" ] } ]}
 ]
 ; etc
}
And manipulate them
(assoc-in my-map [:indexes 0 :path] "new-path")
Passing it around leads to confusion
(defn add-index [db-state])
What's in db-state??
Better this
(indexes/update-path new-path id indexes)
Build modules around each flat data structure in application state
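A minimal sketch of such a module, assuming the indexes are kept as a flat map of id to index description. Only update-path and its argument order come from the slide above; add-index, path-of and the shape of each entry are made up for illustration. In a real project these would live in their own indexes namespace.

```clojure
;; Sketch only: `indexes` is assumed to be a flat map of id -> description.

(defn add-index [indexes id path]
  ;; Hypothetical helper: register a new in-memory index under `id`
  (assoc indexes id {:path path :type :in-memory :pending []}))

(defn update-path [new-path id indexes]
  ;; Same argument order as the call on the slide
  (assoc-in indexes [id :path] new-path))

(defn path-of [indexes id]
  (get-in indexes [id :path]))

(->> (add-index {} 1 "/indexes/ponies")
     (update-path "new-path" 1))
;; => {1 {:path "new-path", :type :in-memory, :pending []}}
```

Callers never need to know how deep the path lives, only which functions the module offers.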
Lesson #5
Let somebody else do the hard work
Clojars + Lein
lein deps
maven
(no XML though!)
project.clj
(defproject cravendb "0.1.0-SNAPSHOT"
:description "A clojure-oriented document-oriented database"
:url "http://robashton.github.io/cravendb"
:min-lein-version "2.2.0"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 ; etc
                 ])
HTTP-KIT
(run-server my-handler { :port 8080 })
Liberator
Defining handlers for http-kit
It's all about the resources
It's all about http correctness
A route from Craven
(ANY "/document/:id" [id]
(resource
:allowed-methods [:put :get :delete :head]
:etag (fn [ctx] (etag-from-metadata ctx))
:put! (fn [ctx] (db/put-document instance id (read-body ctx)))
:delete! (fn [_] (db/delete-document instance id))
:handle-ok (fn [_] (db/load-document instance id))))
All the deps
:dependencies [[org.clojure/clojure "1.5.1"]
[org.clojure/core.async "0.1.256.0-1bf8cf-alpha"]
[ring/ring-core "1.1.7"]
[org.clojure/data.csv "0.1.2"] ;; For load purposes
[com.cemerick/url "0.1.0"]
[liberator "0.9.0"]
[instaparse "1.2.2"]
[http-kit "2.1.12"]
[compojure "1.1.5"]
[serializable-fn "1.1.3"]
[clojurewerkz/vclock "1.0.0"]
[clj-time "0.6.0"]
[org.fusesource.leveldbjni/leveldbjni-all "1.7"]
[me.raynes/fs "1.4.4"]
[http.async.client "0.5.2"]
[org.clojure/tools.logging "0.2.6"]
[org.slf4j/slf4j-log4j12 "1.6.6"]
[org.clojure/core.incubator "0.1.3"]
[org.apache.lucene/lucene-core "4.4.0"]
[org.apache.lucene/lucene-queryparser "4.4.0"]
[org.apache.lucene/lucene-analyzers-common "4.4.0"]
[org.clojure/data.codec "0.1.0"]
[org.apache.lucene/lucene-highlighter "4.4.0"]]
Lesson #6
Interop with legacy Java is GREAT
Java OSS is plentiful
But Java sucks (so does Scala, before any of you get started)
Lucene
(Classic Java, one of the best indexing systems around)
Added like any other dependency
[org.apache.lucene/lucene-core "4.4.0"]
[org.apache.lucene/lucene-queryparser "4.4.0"]
[org.apache.lucene/lucene-analyzers-common "4.4.0"]
Import it
(Lol @ Java namespaces)
(:import
(org.apache.lucene.analysis.standard StandardAnalyzer)
(org.apache.lucene.store FSDirectory RAMDirectory)
(org.apache.lucene.util Version)
(org.apache.lucene.index IndexWriterConfig IndexWriter DirectoryReader)
(org.apache.lucene.search IndexSearcher Sort SortField SortField$Type)
(org.apache.lucene.queryparser.classic QueryParser)
(org.apache.lucene.document Document Field Field$Store Field$Index
TextField IntField FloatField StringField)))
Use it
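; Create a RAM directory called 'dir'
(def dir (RAMDirectory.))
; Create an index writer over that dir (Lucene 4.x needs an analyzer and a config)
(def analyzer (StandardAnalyzer. Version/LUCENE_CURRENT))
(def writer (IndexWriter. dir (IndexWriterConfig. Version/LUCENE_CURRENT analyzer)))
(.commit writer) ; a reader can't open a directory that has no commit yet
; Create an index reader over that dir
(def reader (DirectoryReader/open dir))
; Query that reader ("content" is a made-up default field name)
(def parser (QueryParser. Version/LUCENE_CURRENT "content" analyzer))
(.search (IndexSearcher. reader) (.parse parser "*:*") 10)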
Lesson #7
Interop with legacy Java SUCKS
Classes + Interfaces + FactoryFactoryProvider
vs
Maps, Vectors, Lists
Java in Clojure
(defn create-index [file]
(let [analyzer (StandardAnalyzer. Version/LUCENE_CURRENT)
directory (FSDirectory/open file)
config (IndexWriterConfig. Version/LUCENE_CURRENT analyzer) ]
(LuceneIndex. analyzer directory config)))
A pervasive legacy
- Lucene abstracts to "Objects"
- Clojure operates on transparent "data"
- (not (= :java :clojure-best-practices))
Hide it
Convert *everything* into maps and lists
(defn index-result-to-map [index-result]
  {
   :name (.getName index-result)
   :total-count (.getTotalCount index-result)
   :items (map index-item-to-map (.getItems index-result))
  })
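The same idea can be tried at the REPL without Lucene on the classpath. In this sketch java.io.File stands in for the Lucene result object, and the choice of keys is purely illustrative.

```clojure
(import '(java.io File))

;; Sketch only: java.io.File is a stand-in for any opaque Java object.
(defn file-to-map [^File file]
  ;; Call the getters once, at the boundary, and hand back plain data
  {:name       (.getName file)
   :path       (.getPath file)
   :directory? (.isDirectory file)})

(def readme (file-to-map (File. "docs/readme.txt")))

(:name readme) ; => "readme.txt"
(keys readme)  ; plain keywords, no getters in sight
```

Everything past this point can use get, assoc, map and filter instead of interop calls.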
Paging native crap
;; Naive implementation
(defn get-results [amount skip index query]
  (let [real-count-to-request (+ amount skip)
        results (lucene/query index query real-count-to-request)
        still-needed-count (- amount (count results))]
    (if (> still-needed-count 0)
      (concat results
              (get-results still-needed-count real-count-to-request index query))
      results)))
Laziness
(defn lucene-producer [tx reader opts]
(fn [offset amount]
(->>
(lucene/query reader
(:filter opts)
(+ offset amount)
(:sort-by opts)
(:sort-order opts))
(drop offset)
(valid-documents tx))))
A producer function
Laziness
(defn lucene-page
([producer page-size] (lucene-page producer 0 page-size))
([producer current-offset page-size]
{
:results (producer current-offset page-size)
:next (fn [] (lucene-page producer (+ current-offset page-size) page-size))
}))
State per page
Laziness
And a recursive generator function
(defn lucene-seq
([page] (lucene-seq page (:results page)))
([page src]
(cond
(empty? (:results page)) ()
(empty? src) (lucene-seq ((:next page)))
:else (cons (first src) (lazy-seq (lucene-seq page (rest src)))))))
Favour lazy sequences over crappy paging code (etc)
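To play with the pattern without Lucene, here is a self-contained sketch. fake-producer is an assumption that pages over an in-memory vector in place of the real index, and page-seq differs slightly from the slides by wrapping its whole body in lazy-seq.

```clojure
;; Sketch only: an in-memory producer stands in for the Lucene-backed one.
(defn fake-producer [items]
  (fn [offset amount]
    (->> items (drop offset) (take amount))))

;; State per page: the results plus a thunk that fetches the next page
(defn page
  ([producer page-size] (page producer 0 page-size))
  ([producer offset page-size]
   {:results (producer offset page-size)
    :next    (fn [] (page producer (+ offset page-size) page-size))}))

;; Walk the current page, then ask for the next one only when it is needed
(defn page-seq
  ([pg] (page-seq pg (:results pg)))
  ([pg src]
   (lazy-seq
     (cond
       (empty? (:results pg)) nil
       (empty? src)           (page-seq ((:next pg)))
       :else                  (cons (first src) (page-seq pg (rest src)))))))

(def all-items (page-seq (page (fake-producer (vec (range 10))) 3)))

(take 4 all-items) ; => (0 1 2 3), only the first two pages are requested
```

The consumer sees one ordinary sequence and can take, filter or drop without knowing pages exist.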
Lesson #8
Native resources are a pain
What's wrong with this?
(let [handle (open-file "foo.txt")]
(map to-user (read-lines handle)))
Clojure is Lazy
(def results
(let [handle (open-file "foo.txt")]
(map to-user (read-lines handle))))
(println results) ; CRASH
Laziness without Purity
(Haskell doesn't have this problem)
How do you build an API around this?
(get-all-the-lines-from "foo.txt")
Well now it's not lazy....
Don't try to hide resources
(with-open [handle (open-resource "foo.txt")]
  (do-stuff-with-resource handle))
Resources
- Don't try to hide resource usage from end-user
- Give them an 'open' method
- Give them an API to operate over that resource
- Make them responsible for closing it
- Deal with it.
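The points above can be sketched as follows. A StringReader stands in for a real file or index handle, and open-lines and read-upper are hypothetical names for the 'open' function and the API over the resource.

```clojure
(require '[clojure.string :as str])
(import '(java.io BufferedReader StringReader))

;; Sketch only: an in-memory reader stands in for a real file handle.
(defn open-lines [text]
  ;; The 'open' function: hands the caller something closeable
  (BufferedReader. (StringReader. text)))

(defn read-upper [reader]
  ;; The API over the resource: lazy, so it must be consumed while open
  (map str/upper-case (line-seq reader)))

;; The caller owns the handle and realises the results before it closes
(with-open [r (open-lines "a\nb\nc")]
  (doall (read-upper r)))
;; => ("A" "B" "C")
```

Because the caller holds the handle, the caller also decides how much to read before closing it; the doall here is that decision made explicit.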
Lesson #9
Concurrency is something you still need to be aware of
A Problem
Databases have multiple clients
HTTP GET
HTTP PUT
HTTP POST
HTTP GET
HTTP POST
A problem
Shared state
(A collection of in-memory indexes for example)
A solution
; Atom called x with value of 1
(def x (atom 1))
(println x) ; prints the atom itself, which wraps the value 1
(println @x) ; de-reference atom, get 1
; Increase whatever is in x by '1'
(swap! x inc)
A solution
- Stuff shared state into atoms and agents
- Hide this behind a suitable interface
- Give a transparent API over the top of this
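As a rough sketch of those three points, an index registry might look like this. The registry shape and every function name are assumptions made for the example, not CravenDB's actual code.

```clojure
;; Sketch only: shared state in an atom, hidden behind a few functions.
(defn create-registry []
  (atom {}))

(defn register-index! [registry id index]
  ;; swap! applies the update atomically, retrying on contention
  (swap! registry assoc id index)
  registry)

(defn remove-index! [registry id]
  (swap! registry dissoc id)
  registry)

(defn lookup-index [registry id]
  ;; Readers get an immutable snapshot
  (get @registry id))

(def registry (create-registry))
(register-index! registry :ponies {:path "/indexes/ponies"})
(lookup-index registry :ponies) ; => {:path "/indexes/ponies"}
```

Callers never touch the atom directly, so the storage could change later without touching them.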
A problem
That's a bloody mess
core.async
Channels and Processes (CSP)
An event loop
(defn event-loop [initial-state input]
  (go
    (loop [state initial-state]
      (if-let [event (<! input)]
        (recur (dispatch-event state event))))))
Event loops
- Can look after some private local state
- Can look after a collection of resources
- Can coordinate multi-threaded access over that state
- Look a lot like actors
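A runnable sketch of such a loop, assuming core.async is on the classpath. The event shape (maps with a :type key) and the handlers in dispatch-event are invented for illustration, and unlike the slide this version returns the final state when the input channel closes.

```clojure
(require '[clojure.core.async :refer [chan go <! >!! <!! close!]])

;; Sketch only: events are assumed to be maps with a :type key.
(defn dispatch-event [state event]
  (case (:type event)
    :add   (update-in state [:total] + (:amount event))
    :reset (assoc state :total 0)
    state)) ; unknown events leave the state alone

(defn event-loop [initial-state input]
  ;; The state is private to the loop; the only way in is the channel
  (go
    (loop [state initial-state]
      (if-let [event (<! input)]
        (recur (dispatch-event state event))
        state)))) ; channel closed: hand back the final state

(let [input  (chan 10)
      result (event-loop {:total 0} input)]
  (>!! input {:type :add :amount 2})
  (>!! input {:type :add :amount 3})
  (close! input)
  (<!! result))
;; => {:total 5}
```

Events are handled one at a time in arrival order, so the state needs no locks or atoms.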
Lesson #10
(Should have used Erlang)
I ended up with a Clojure-based actor system
(go
(loop [state (initial-state engine)]
(if-let [{:keys [cmd data]} (<! command-channel)]
(do
(debug "handling index loop command" cmd)
(recur (case cmd
:schedule-indexing (main-indexing-process state)
:notify-finished-indexing (main-indexing-process-ended state)
:removed-index state ;; WUH OH
:new-index (add-chaser state data)
:chaser-finished (finish-chaser state data)
:storage-request (storage-request state data))))
(do
(debug "being asked to shut down")
(wait-for-main-indexing state)
(wait-for-chasers state)
(close-open-indexes state)))))
Looks a lot like OTP
handle_info(schedule_indexing, State) ->
handle_info(finished_indexing, State) ->
handle_info(removed_index, State) ->
Only without
- A good supervision structure
- Nice IO list support for binary message passing
- Inter-process communication
- A good native interop story
Bonus Lesson #11
Spyscope, polymorphism, records and transparent state
If we have time.
Lesson #0
Share and Learn
Building a database in Clojure
By Rob Ashton