How to write a search engine in 15 lines of code

Paul Chiusano

@pchiusano

(an introduction to Unison)

@unisonweb

Programming: FUN!!!1!!

A troubling observation...

vast computing resources of civilization

A single OS process

The gap

Docker

Kubernetes

Terraform

Kafka

DynamoDB

S3

EC2

ElasticSearch

Kibana

Prometheus

Grafana

PagerDuty

etcd

ELB

Route 53

Consul

iptables

systemd

Flannel

Weave

Lambda

App Engine

rkt

CoreOS

Zookeeper

Redis

memcached

Initechinize

(okay, that one I made up)

Protobufs

Thrift

A better model

factorial : Number -> Number
factorial n =
  Vector.fold-left (*) 1 (Vector.range 1 (n + 1))

-- Evaluate factorial at another node
factorial-at : Node -> Number -> Remote Number
factorial-at alice n =
  do Remote
    Remote.transfer alice
    pure (factorial n)

-- apply a function f to two arguments
f x y
f (x + 1) y

-- type signature
sort : forall a . Order a -> Vector a -> Vector a

-- Create an empty Index
Index.empty : forall k v . Remote (Index k v)

-- Insert a key-value pair into the index
-- can use '∀' instead of 'forall'
Index.insert : ∀ k v . k -> v -> Index k v -> Remote Unit

-- Look up a key in an index. May return None
Index.lookup : ∀ k v . k -> Index k v -> Remote (Optional v)

Persistent key-value storage

-- There's just a single value of type Unit, Unit!
Unit : Unit

-- Optional
Some 42 : Optional Number
None : Optional Number

index-example : Node -> Node -> Remote Text
index-example alice bob = do Remote
  Remote.transfer alice
  ind := Index.empty -- create the index on alice
  Index.insert "Alice" "Jones" ind
  Index.insert "Bob" "Smith" ind
  Remote.transfer bob
  Index.lookup "Alice" ind

Key-value storage usage

A search engine in 15 lines of code

A search index

Keyword          Set of urls containing the keyword
programming      {haskell.org, lambda-the-ultimate.org, unisonweb.org ...}
unison           {2016.fullstackfest.com/speakers, unisonweb.org, ... }
scala            {scala-lang.org, scala.epfl.ch, ... }
2016 olympics    {olympic.org/rio-2016, ... }
...

search for:

"unison programming"

Keyword          Set of urls containing the keyword
programming      {haskell.org, lambda-the-ultimate.org, unisonweb.org ...}
unison           {2016.fullstackfest.com/speakers, unisonweb.org, ... }

alias Url = Text
alias Keyword = Text
alias Set v = Index v Unit
alias SearchIndex = DIndex Keyword (Set Url)

search : Number -> Vector Keyword -> SearchIndex
      -> Remote (Vector Url)
search limit query ind = do Remote
  url-sets := Remote.traverse (k -> DIndex.lookup k ind) query
  zero = IndexedTraversal.empty
  url-sets := Remote.map (Optional.fold zero Index.traversal) url-sets
  merge = IndexedTraversal.intersect (Order.by-2nd Hash.Order)
  urls? = Vector.fold-balanced1 merge url-sets
  -- urls : Vector (Url, Hash Url)
  urls := IndexedTraversal.take-keys limit (Optional.get-or zero urls)
  pure (Vector.map 1st urls)
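
The same idea can be sketched outside Unison. The following is a minimal Python sketch, not the talk's implementation: it assumes an ordinary in-memory dict in place of the distributed `SearchIndex`, and the names `url_hash` and `index` are hypothetical. As in `search` above, a keyword that is missing from the index contributes an empty set, the per-keyword sets are intersected, and results are ordered by a hash of the URL before the limit is applied.

```python
import hashlib
from functools import reduce


def url_hash(url):
    # Stand-in for ordering by Hash.Order; SHA-256 is an assumption of this sketch.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def search(limit, query, index):
    # A keyword missing from the index contributes an empty set.
    url_sets = [index.get(keyword, set()) for keyword in query]
    if not url_sets:
        return []
    # Keep only the urls that appear under every keyword in the query.
    common = reduce(set.intersection, url_sets)
    # Order by hash so the result is deterministic, then apply the limit.
    return sorted(common, key=url_hash)[:limit]


index = {
    "programming": {"haskell.org", "lambda-the-ultimate.org", "unisonweb.org"},
    "unison": {"2016.fullstackfest.com/speakers", "unisonweb.org"},
}
print(search(10, ["unison", "programming"], index))  # ['unisonweb.org']
```

Unlike the Unison version, this sketch materializes whole sets in memory rather than streaming through them.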

-- Pick the nodes responsible for a key, using rendezvous hashing
DIndex.nodes-for-key : ∀ k v . k -> DIndex k v -> Remote (Vector Node)
DIndex.nodes-for-key k ind = do Remote
  nodes := Index.keys ind
  hashes := Remote.traverse (node -> hash! (node, k)) nodes
  (nodes `Vector.zip` hashes)
  |> Vector.sort Hash.Order 2nd
  |> Vector.take DIndex.Replication-Factor
  |> Vector.map 1st
  |> pure

alias DIndex k v = Index Node (Index k v)

For key "Alice", cluster: node1, node2, node3

hash (node1, "Alice"), hash (node2, "Alice") ...

choose the node(s) whose hash value is highest
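
Rendezvous hashing as described above can be sketched in a few lines of Python. This is a sketch under assumptions, not the DIndex code: SHA-256 stands in for the hash function, and `score`, `nodes_for_key` and `replication_factor` are hypothetical names. Each node is scored against the key and the highest-scoring nodes are chosen.

```python
import hashlib


def score(node, key):
    # Hash the (node, key) pair; SHA-256 is an assumption of this sketch.
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def nodes_for_key(key, nodes, replication_factor=2):
    # Rank every node by its score for this key, highest first.
    ranked = sorted(nodes, key=lambda node: score(node, key), reverse=True)
    return ranked[:replication_factor]


cluster = ["node1", "node2", "node3"]
owners = nodes_for_key("Alice", cluster)
print(owners)
```

A useful property of this scheme: removing a node that is not among the owners of a key leaves that key's owners unchanged, because the relative ranking of the remaining nodes does not move.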

Remote.spawn : Remote Node

-- spawn a node, transfer control there
-- then continue computation
do Remote
  n := Remote.spawn
  Remote.transfer n
  ...

Creating nodes

us-east : Node
eu-central : Node
...

Remote.spawn : Remote Node
Remote.spawn-at : Node -> Remote Node

-- Create 10,000 nodes and add them to a DIndex cluster
do Remote
  ind := DIndex.empty
  -- could also spawn at eu-central, or both regions!
  cluster := Remote.replicate 10000 (Remote.spawn-at us-east)
  Remote.traverse (n -> Remote.at' n (DIndex.join ind)) cluster
  ...

Creating nodes (cont)

How???

factorial-at alice n =
  do Remote
    Remote.transfer alice
    pure (factorial n)

factorial n =
  Vector.fold-left (*) 1 (Vector.range 1 (n + 1))

blah z =
  Vector.fold-left (*) 1 (Vector.range 1 (z + 1))

Using hashes for identity
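
One way to see why `factorial` and `blah` can share an identity is the Python sketch below. It illustrates the idea only and is not how Unison computes hashes: terms are modeled as nested lists, and `normalize` and `definition_hash` are hypothetical helpers. Parameter names are replaced by their positions before hashing, and the definition's own name is never part of the hashed content.

```python
import hashlib
import json


def normalize(term, env):
    # Replace each bound parameter name with a positional marker.
    if isinstance(term, str):
        return env.get(term, term)
    if isinstance(term, (list, tuple)):
        return [normalize(t, env) for t in term]
    return term


def definition_hash(params, body):
    env = {name: {"var": i} for i, name in enumerate(params)}
    canonical = json.dumps([len(params), normalize(body, env)], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# factorial n = Vector.fold-left (*) 1 (Vector.range 1 (n + 1))
factorial = (["n"], ["Vector.fold-left", "*", 1, ["Vector.range", 1, ["+", "n", 1]]])
# blah z = Vector.fold-left (*) 1 (Vector.range 1 (z + 1))
blah = (["z"], ["Vector.fold-left", "*", 1, ["Vector.range", 1, ["+", "z", 1]]])

print(definition_hash(*factorial) == definition_hash(*blah))  # True
```

The two definitions differ only in their names, so they normalize to the same structure and hash to the same value.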

#Q82jfkasdf823jbc192
factorial-at alice n =
  do Remote
    Remote.transfer alice
    pure (factorial n)
factorial-at alice n =
  do Remote
    Remote.transfer alice
    pure (#Q82jfkasdf823jbc192 n)

factorial n =
  Vector.fold-left (*) 1 (Vector.range 1 (n + 1))

Implications: an immutable codebase

#Q82jfkasdf823jbc192
factorial-at alice n =
  do Remote
    Remote.transfer alice
    pure (#Q82jfkasdf823jbc192 n)
#zzzzzyyyl8as9dfasdl
factorial n = 43
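
The consequence can be sketched with a small content-addressed store in Python. This is an assumption-laden illustration rather than Unison's codebase format: the `Codebase` class is hypothetical, and it hashes raw definition text for brevity. Definitions are stored under their hash and never overwritten; only the mapping from names to hashes changes.

```python
import hashlib


class Codebase:
    def __init__(self):
        self.definitions = {}  # hash -> definition text, never overwritten
        self.names = {}        # name -> hash, the only mutable part

    def add(self, name, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self.definitions.setdefault(digest, text)
        self.names[name] = digest
        return digest


codebase = Codebase()
old = codebase.add("factorial", "Vector.fold-left (*) 1 (Vector.range 1 (n + 1))")
# The caller refers to factorial by hash, not by name.
caller = codebase.add("factorial-at", f"Remote.transfer alice; pure (#{old} n)")
# "Redefining" factorial adds a new definition under a new hash.
new = codebase.add("factorial", "43")

print(old != new)                             # True
print(f"#{old}" in codebase.definitions[caller])  # True
```

After the redefinition, the name `factorial` points at the new hash, while `factorial-at` still refers to the original definition, which remains in the store.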

unisonweb.org

Contributors / advisors: Dan Doel, Sam Griffin, Ed Kmett, Arya Irani, Michael Pilquist ...

@unisonweb

Questions?

alias Html = Text

Http.get-url : Url -> Remote (Either Text Html)
Html.get-links : Html -> Vector Html.Link
Html.plain-textify : Html -> Text

Text.words : Text -> Vector Text

Web.crawl : Vector Url -> DIndex Keyword (Set Url) -> Remote Unit
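
The signatures above suggest the shape of a crawler: fetch a page, split its text into words, record the url under each word, then follow the links. A minimal Python sketch of that loop follows. It makes several assumptions: pages come from an in-memory dict instead of HTTP, the index is a local dict of sets rather than a `DIndex`, and `crawl`, `pages` and `max_pages` are hypothetical names.

```python
from collections import defaultdict, deque


def crawl(seeds, pages, index, max_pages=100):
    # pages maps url -> (plain text, list of linked urls); it stands in for
    # fetching the page and extracting its text and links.
    queue = deque(seeds)
    seen = set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen or url not in pages:
            continue
        seen.add(url)
        text, links = pages[url]
        # Record this url under every keyword that appears on the page.
        for word in text.lower().split():
            index[word].add(url)
        queue.extend(links)
    return seen


pages = {
    "a": ("unison programming", ["b"]),
    "b": ("scala programming", ["a", "c"]),
}
index = defaultdict(set)
visited = crawl(["a"], pages, index)
print(sorted(index["programming"]))  # ['a', 'b']
```

Urls that cannot be fetched (here, `c`) are skipped, and the `seen` set keeps the crawl from looping on pages that link to each other.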

A1: crawler, IndexedTraversal

alias IndexedTraversal k v = 
   ( Remote (Optional k) -- first key
   , k -> Remote (Optional v) -- lookup
   , k -> Remote (Optional k)); -- next valid key

  

A2: Fast streaming intersection

[ 1  2  3  16  45  48  65  100  109]
[ 1  13  14  109]

[ 2  3  16  45  48  65  100  109]
[ 13  14  109]
[ 1 ]

[ 16  45  48  65  100  109]
[ 13  14  109]
[ 1 ]

[ 16  45  48  65  100  109]
[ 109 ]
[ 1 ]

[ 109 ]
[ 109 ]
[ 1 ]

= [ 1  109 ]
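
The frames above can be reproduced with a short Python sketch. It assumes two sorted in-memory lists of distinct keys rather than remote `IndexedTraversal` values, and uses binary search as a stand-in for the "next valid key" operation; `intersect_sorted` is a hypothetical name. Whenever the two current keys differ, the side with the smaller key skips ahead to the first key that is at least as large as the other side's.

```python
from bisect import bisect_left


def intersect_sorted(a, b):
    # Intersect two sorted lists of distinct keys, skipping ahead on mismatch.
    out = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Jump to the first key in a that is >= b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            # Jump to the first key in b that is >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return out


print(intersect_sorted([1, 2, 3, 16, 45, 48, 65, 100, 109], [1, 13, 14, 109]))
# [1, 109]
```

On this input the first list jumps from 2 straight to 16 and then from 16 to 109, so most of its keys are never compared one by one.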

A3: A better DIndex

- minimize # hashes per lookup

- replicate based on demand for key

- more advanced load-balancing

- decentralize cluster state

- Paxos / Raft as a Unison lib

A4: Eliminating diamond dependency problem

- Depend on two libs, alice and bob

- alice library depends on carol-v1

- bob library depends on carol-v2

- the problem is 95% artificial

- alice depends on carol-v1 only for 'factorial', bob depends on carol-v2 for 'quicksort' - NO CONFLICT!!

A4(a): Eliminating diamond dependency problem

- alice library depends on carol-v1

- bob library depends on carol-v2

- alice depends on carol-v1 only for 'factorial', bob depends on carol-v2 for (improved) 'factorial'

- why can't we allow both versions to be used??