Datalog

Beyond SQL

Relational DB

Data stored in relations (tables)

Fixed Schema

Foreign-key relationships

Columns or combination of columns as index

Document DB

Data stored as documents

Mostly schema-free

Some foreign-key relationships

Document properties as index

Graph DB

Data as Graph structures with nodes and annotated edges

Fixed Schema (everything is a graph)

Query-by-navigation

Node-based index

And many more...

Datalog

Declarative logic programming language

Subset of Prolog (1972)

Based on first-order logic

Not turing complete

Implementations in C/C++, Clojure, Erlang, Haskell, Java, Lua, OCaml, Prolog, Python, Racket, Ruby, Rust, Tcl

Datalog Example

jedi(luke).
sith_lord(anakin).
sith_lord(ben).

good(X) :- jedi(X).

mother(padme,luke).
mother(padme,leia).
mother(leia,ben).
mother(shmi,anakin).

father(anakin,luke).
father(anakin,leia).

parent(X,Y) :- mother(X,Y).
parent(X,Y) :- father(X,Y).

sibling(X,Y) :- parent(Z,X), parent(Z,Y), X \== Y.

grand_parent(X,Y) :- parent(X,Z), parent(Z,Y).

force_sensitive(X) :- parent(shmi,X).
force_sensitive(X) :- grand_parent(shmi,X).

ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).

% force_sensitive(X) :- ancestor(shmi,X).

*Atoms are lower-case, variables are upper-case, % indicates a comment

Demo-Time!

Datalog DB

(in 6 easy steps)

Represent entities as sets of facts:

[attribute value]

[father "Vader"]

[name "Luke"]

[job "Jedi"]

Combine facts to entity, give it an ID:

[entity-id attribute value]

[12 father "Vader"]

[12 name "Luke"]

[12 job "Jedi"]

Datalog DB

Make it transactional:

[entity-id attribute value trx-id]

 

[11 father "Vader" 4300]

[11 name "Luke"    4300]

[11 job     "Jedi"      4300]

[12 father "Han"    4300]

[12 name "Ben"     4300]

[12 job "Sith Lord" 4300]

[12 alias "Kylo"       4300]

Datalog DB

Add/revoke flag

[entity-id attribute value trx-id add?]

 

[12 job "Sith Lord"               4301 false]

[12 job "Supreme Leader" 4301 true]

(* true is default and can be omitted)

Datalog DB

Refer to other entities via their id.

 

[13 name "Han"       4302]

[13 job "Smuggler"  4302]

[12 father "Han"      4302 false]

[12 father 13            4302]

 

[14 name "Anakin"  4302]

[14 job "Sith Lord"   4302]

[11 father "Vader"   4302 false]

[11 father 14            4302]

Aside: Data Structure Literals

;; <- semicolon introduces a comment

"foo" ;; a string (duh)

:bar  ;; a keyword (i.e. interned string)

:db/id ;; keyword id with a namespace

[ ] ;; vector

{ } ;; map from key->value

{ "key1" "value1"
  :key2 "value2"
}

#{ } ;; set

()   ;; list

What are our entities now?

As of Trx 4302

;; datoms
[11 :father 14]
[11 :name "Luke"]
[11 :job "Jedi"]


[12 :father 13]
[12 :name "Ben"]
[12 :job "Supreme Leader"]
[12 :alias "Kylo"]


[13 :name "Han"]
[13 :job "Smuggler"]

[14 :name "Anakin"]
[14 :job "Sith Lord"]

Let's add a DB system and some data...

;; Map form
{ :db/id 11 
  :father 14
  :name "Luke"
  :job "Jedi" }

{ :db/id 12
  :father 13
  :name "Ben"
  :job "Supreme Leader"
  :alias "Kylo" }

{:db/id 13
 :name "Han" :job "Smuggler"}

{:db/id 14
 :name "Anakin":job "Sith Lord"
}

Datahike: A datalog DB

(def uri "datahike:file:///tmp/tutorial")
(d/create-database uri)
(def conn (d/connect uri))
;; set up schema, which are also datoms. 
;; We use some helper functions to generate them
;; example: (schema/str-attr :test)
;; => #:db{:ident :test, :valueType :db.type/string, :cardinality :db.cardinality/one}
;; 
(d/transact conn [(schema/str-attr :name)
                  (schema/one-to-one :father)
                  (schema/str-attr :job)
                  (schema/many-str-attr :alias)
                  ])
;; add some data, propose Entity IDs via :db/id
(d/transact conn [{:db/id 11 :name "Luke"   :job "Jedi" :father 14}
                  {:db/id 12 :name "Ben"    :job "Supreme Leader" :father 13
                             :alias "Kylo" }
                  {:db/id 13 :name "Han"    :job "Smuggler"}
                  {:db/id 14 :name "Anakin" :job "Sith Lord"}
                  ])

Querying via Datalog

Find entity-ids, attributes, values by unification:

[:find ?name           ;; ?name is returned
 :where
   [?e :job "Jedi"]    ;; ?e is a free variable
   [?e :name ?name]    ;; "Jedi" is a constant
                       ;; unify statements to
] 
=> #{["Luke"]}         ;; find match

Querying

Navigating references

[:find ?name ?father
         :where
         [?e :name ?name]
         [?e :father ?f]
         [?f :name ?father]
]
> #{["Luke" "Anakin"] ["Ben" "Han"]}

Changing Schema

And adding data

#:db{:ident :born, :valueType :db.type/long, 
     :cardinality :db.cardinality/one}

;; birthdate relative to the Battle of Yavin (SW:ANH)
[{:db/id 11 :born  -19}    ;; Luke
 {:db/id 12 :born    5}    ;; Ben Solo
 {:db/id 13 :born  -29}    ;; Han Solo
 {:db/id 14 :born  -41}]   ;; Anakin

Querying with predicates

[:find ?name ?b
 :where
 [?e :name ?name]
 [?e :born ?b]
 [(< 0 ?b)]
]
=> #{["Ben" 5]}

(d/q 
'[:find ?name ?b
  :in $ ?year
  :where
  [?e :name ?name]
  [?e :born ?b]
  [(< ?year ?b)]] 
   (d/db conn) 0)

Aggregates

[:find (count ?e)
 :where
 [?e :name _]]
 => 4
 
[:find (min ?b)
 :where
 [_ :born ?b]]
 => [-41]
 
(min ?xs)
(max ?xs)
(count ?xs)
(count-distinct ?xs)
(sum ?xs)
(avg ?xs)
(median ?xs)
(variance ?xs)
(stddev ?xs)

Aggregates 2

[:find (sample 2 ?name)
 :where
 [_ :name ?name]]
 
[:find (min 2 ?born)
 :where
 [?e :name ?name]
 [?e :born ?born]]
 => [-41 -29]

 
(distinct ?xs)
(min n ?xs)
(max n ?xs)
(rand n ?xs)
(sample n ?xs)

 

Or use your own functions!

Pulling

[*] 11               ;; select all attributes
=> {:db/id 11, 
    :born -19, 
    :job "Jedi",
    :name "Luke",
    :father #:db{:id 14}}
    
[:name :job] 11           ;; one attribute
=> {:name "Luke", :job "Jedi"}


[:name {:father [:name]}] 11  ;; pull referenced entity
=> {:name "Luke", :father {:name "Anakin"}}

Retrieve entities (and referenced entities) via selector and entity-id

Pulling 2

[* {:_father [*]}] 14
=>
{:db/id 14,
 :born -41,
 :job "Sith Lord",
 :name "Anakin",
 :_father [{:db/id 11, 
           :born -19, 
           :father #:db{:id 14}, 
           :job "Jedi", 
           :name "Luke"}]}

References are bi-directional

Querying and Pulling

(d/q '[:find [pull ?e [:name :born]]
         :where
         [?e :job ?j]
         [(not= ?j "Jedi")]
       ] (d/db conn))
=> ([{:name "Han", :born -29}] 
    [{:name "Ben", :born 5}] 
    [{:name "Anakin", :born -41}])

Let's do both!

and much more (rules, computed values etc.)

=> queries themselves are just data

Immutable Datalog DBs

Append-only, mono or bi-temporal

  • Data is only appended, not deleted* or updated
  • Facts can be retracted, which is recorded as fact
  • Transactions are also facts
  • Transaction-metadata like timestamp is stored as well

=> an immutable series of timestamped transactions

 

=> the database can be viewed in different points in time

*Most vendors still provide excision to hard-delete facts

Database as value

Remember (d/db conn)?

Database at the current time!

 

Let's go back in time!

(d/as-of db timestamp!)

(def current-db (d/db conn))
(def db-of-2000 (d/as-of current-db #inst "2000-01-01"))

(d/pull current-db
 '[:name] 11
)
=> {:name "Luke"}

(d/pull db-of-2000
 '[:name] 11
)
=> nil

Transactions as value

Transactions are facts too

;; structure of a fact/datom
[entity attribute value trx]
;; added with every transaction
[trx :db/txInstant timestamp]

;;; we can add
;;; other metadata
[trx :tx/changed-by "beders"]

Transactions in queries

Find when and how facts where added or retracted

;; all transactions
[:find [[pull ?tx [*]] ...]
 :where [?tx :db/txInstant _]]
 =>
 [#:db{:id 536870918, :txInstant #inst"2020-01-09T17:48:25.381-00:00"}
 #:db{:id 536870916, :txInstant #inst"2020-01-09T09:53:50.720-00:00"}
 ....
 ]
 
 ;; all trxs where :born attribute was added
 [:find ?tx ?time
  :where [?tx :db/txInstant ?time]
         [_ :born _ ?tx true]
         ]
 =>
 #{[536870916 #inst"2020-01-09T09:53:50.720-00:00"] 
   [536870917 #inst"2020-01-09T10:00:17.160-00:00"]}
 
 ;; all facts added in a trx
 [:find ?e ?a ?v
  :in $ ?tx
  :where [?e ?a ?v ?tx]
 ] (d/db conn) 536870914
 =>
 {[11 :job "Jedi"] ....}

Implementations

Datomic

(Cognitect - 2012)

Storage backend:

DynamoDB, Cassandra, ORA, PostgreSQL

Used by:

ca, Netflix, FB, TimeInc, NuBank (80TB data, 2100 transactors, 13 shards)

 

 

Crux

(Juxt - 2019)

Bi-temporal, trx-time and valid-time

Storage backends:

RocksDB, Kafka (configurable)

Used by:

Oscaro (eCommerce)

Avisi (AtlasCRM)

 

 

Implementations

Datahike

(@Replikativ - 2019)

Uses hitchhiker trees (fractal, snapshottable, on-disk)

Good for single machines, millions of facts

DataScript

(@Tonsky - 2014)

In-memory, in-browser

Used by:

PrecursorApp (collab editing)

Cognician (coaching platform)

 

Introduction to Datalog for DBs

By Jochen Bedersdorfer

Introduction to Datalog for DBs

Overview of Datalog (the prolog subset) and how it's used in databases like Datahike or Datomic

  • 887