Data Pipelines

IN CORE.ASYNC & SQS

priyatam mudivarti

WHY

compose async pipelines

what if

we stop writing functions and start defining a process?

concepts

climbing the abstraction ladder

Function

 limiting?

synchronous

complects error handling with logic

hard to pass shared state

structure of input is often desirable

...  too many functions!

If you can express your process in terms of a reducing function, you are in business. — Rich Hickey

Transducer

compose better


This creation of logic via composition of sequence-iterating functions complects the created logic with the machinery of sequences. — Rich Hickey

 ;; a lot faster than you think because transducers make it so!
 (defn letter-frequency [input]
    (transduce (comp (filter #(Character/isLetter %))
                     (map #(Character/toLowerCase %))
                     (map (fn [v] {v 1})))
               (partial merge-with +) 
               input))

 ;; Beautiful. Also slow. This is slideware!
 (def max-value
    (comp first first reverse (partial sort-by last)))


 (defn transync [filenames]
   (let [result (map (comp max-value
                           letter-frequency
                           slurp) 
                      filenames)]
     ;; No cheating. Actually do the work.
     (doall result)
     result))
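
A small sketch of why transducers "compose better": the transformation is defined once, with no mention of its source or destination, and is then reused in three contexts. The name `xf` is only illustrative.

```clojure
;; One transformation, defined without reference to any sequence or channel.
(def xf
  (comp (filter odd?)
        (map inc)
        (take 3)))

(into [] xf (range 10))        ;; => [2 4 6]
(sequence xf (range 10))       ;; => (2 4 6)
(transduce xf + 0 (range 10))  ;; => 12
```

The same `xf` can also be handed to `chan` or `async/pipeline` unchanged.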

Channels

you need to manage them!

Channels are oriented towards the flow aspects of a system. It is oblivious to the nature of the threads which use it.

 ;; Channels play nicely with transducers!
  (let [channel (chan 1 (map #(.toUpperCase %)))]
    (async/onto-chan channel ["hello" "clojure" "remote"] true)
    (async/go-loop [cur (<! channel)]
      (println cur)
      (when-let [n (<! channel)] 
        (recur n))))

async/pipeline

connect channels

;; Connecting 2 channels with an async process that includes
;; a transducer.
(let [input1  (chan)
      output1 (chan)]
  (async/pipeline 1 output1 (map inc) input1)
  (async/onto-chan input1 (range 0 10))
  (print-all-channel! output1))


(let [input0  (chan)
      middle1 (chan)
      middle2 (chan)
      output3 (chan)]
    (async/pipeline 1 middle1 (map inc) input0)
    (async/pipeline 1 middle2 (map (partial * 10)) middle1)
    (async/pipeline 1 output3
                    (map #(str % " zillion channel declarations!"))
                    middle2)
    (async/onto-chan input0 (range 0 5))
    (print-all-channel! output3))
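
The slides call `print-all-channel!` without showing it. A minimal sketch of what such a helper might look like, assuming it simply prints every value until the channel closes (the body here is a guess, not the author's code):

```clojure
(require '[clojure.core.async :as async :refer [chan go-loop <! <!!]])

(defn print-all-channel!
  "Hypothetical helper: prints each value taken from ch until ch closes.
   Returns a channel that closes once ch has been drained."
  [ch]
  (go-loop []
    (when-some [v (<! ch)]
      (println v)
      (recur))))

(let [input  (chan)
      output (chan)]
  (async/pipeline 1 output (map inc) input)
  (async/onto-chan input (range 0 3))   ; onto-chan! in newer core.async
  ;; Block the calling thread until everything has been printed.
  (<!! (print-all-channel! output)))
```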

this can get tedious

  • channel management (create, close)
  • error handling
  • messages are not durable
  • can't compose
  • harder inter-pipeline communication
  • can't short circuit
  • hard to mix and match
    side-effects with pure
    functions

 

what if channel management was free?

(let [input (chan)]
    (print-all-channel! (==> [input]
                             (map inc)
                             (map (partial * 10))
                             (map #(str % " zillion fewer channel decs!"))))
    (async/onto-chan input (range 0 3)))
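
For the transducer-only case, the channel bookkeeping that `==>` hides can be approximated with a plain function. This is a sketch under that assumption; `chain-pipelines` is a hypothetical name and it ignores `:block`, `:async`, `:drive` and error channels.

```clojure
(require '[clojure.core.async :as async :refer [chan <!!]])

(defn chain-pipelines
  "Hypothetical helper: runs input through one async/pipeline per
   transducer, creating the intermediate channels itself.
   Returns the final output channel."
  [input & xforms]
  (reduce (fn [from xf]
            (let [to (chan)]
              (async/pipeline 1 to xf from)
              to))
          input
          xforms))

(let [input  (chan)
      output (chain-pipelines input
                              (map inc)
                              (map (partial * 10))
                              (map str))]
  (async/onto-chan input (range 0 3))
  (<!! (async/into [] output)))
;; => ["10" "20" "30"]
```

Because `async/pipeline` closes its output when its input closes, closing `input` shuts down the whole chain.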

introducing  ==>

process

connecting async pipelines in a stack, each frame transitioning from one state to another

[diagram: push original message {...}, then three {:from :to} frames, then pop original message {...}]

async.task

async task = core.async-pipeline + (magic)

an idempotent process

[diagram: a process built from four async.task stages, marked async, block or drive]

10 = number of concurrent threads

(defn mapduce
  "Make a map transducer from a bare function. It extracts
   a value from fromkey and inserts the value generated into tokey.
   If no tokey is provided, the function will update the value
   in fromkey."
  ([fromkey f] (mapduce fromkey fromkey f))
  ([fromkey tokey f]
   (map (fn [m] (assoc m tokey (f (fromkey m))))))
  ([context fromkey tokey f]
   (map (fn [m] (assoc m tokey (f (fromkey m) context))))))
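
A quick usage sketch of the three arities. `mapduce` is repeated from the slide so the snippet runs on its own; the sample maps and keys are made up for illustration.

```clojure
(defn mapduce
  ([fromkey f] (mapduce fromkey fromkey f))
  ([fromkey tokey f]
   (map (fn [m] (assoc m tokey (f (fromkey m))))))
  ([context fromkey tokey f]
   (map (fn [m] (assoc m tokey (f (fromkey m) context))))))

;; Two arguments: update the value in place.
(into [] (mapduce :body count) [{:body "abc"}])
;; => [{:body 3}]

;; Three arguments: read from one key, write to another.
(into [] (mapduce :body :length count) [{:body "abc"}])
;; => [{:body "abc", :length 3}]

;; Four arguments: f also receives the context as its second argument.
(into [] (mapduce {:factor 10} :n :scaled (fn [n ctx] (* n (:factor ctx))))
      [{:n 2}])
;; => [{:n 2, :scaled 20}]
```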

mapduce

Make a map transducer from a bare function.

(defn process-events
  [context in-chan out-chan signal-chan broadcast-chan err-chan]
  (==> [in-chan err-chan]
       (map (partial push-task ::original))
       (mapduce-in [::original :body] [:raw-bytes-saved] parse-msg)
       :block (mapduce :raw-bytes-saved :raw-bytes-loaded #(load-raw-bytes [context %]))
       (add-side-effect #(log/info "LOADED RAW BYTES" (count (:raw-bytes-loaded %))))
       :block (mapduce :raw-bytes-loaded :report-data-uploaded upload-report-data)
       :block (mapduce :report-data-uploaded :report-validated validate-report)
       (add-side-effect (partial dynamo/append))
       (map (partial pop-task ::original))
       :drive out-chan))
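
`push-task`, `pop-task` and `mapduce-in` are used above but not shown. One plausible reading, given the "push original message / pop original message" diagram, is sketched below. All three bodies are assumptions, not the library's code.

```clojure
(require '[clojure.string :as str])

;; Assumed: wrap the incoming message under key k so later stages can add
;; their own keys next to it without touching the original.
(defn push-task [k msg] {k msg})

;; Assumed: discard the working frame and hand back the original message.
(defn pop-task [k task] (get task k))

;; Assumed: like mapduce, but with nested paths instead of single keys.
(defn mapduce-in [from-path to-path f]
  (map (fn [m] (assoc-in m to-path (f (get-in m from-path))))))

(def xf
  (comp (map (partial push-task ::original))
        (mapduce-in [::original :body] [:parsed] str/upper-case)
        (map (partial pop-task ::original))))

(into [] xf [{:id 1 :body "hello"}])
;; => [{:id 1, :body "hello"}]
```

Popping the untouched original at the end is what would let a final `:drive` stage acknowledge the exact message that was received.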

a process transitioning states

Why SQS?

Durable messages

Eventual Consistency

At-least-once, exactly-once semantics

Timeout, Retention, Delay

Maps nicely to N Processes with M channels

Visually inspect messages in queue

Logging

Monitoring

Dead Letter Queues

SDKs across languages (Java SDK is async)

Provisioning

I don't want to write this!

manage processes

interprocess communication

signals and broadcasts

[diagram labels: sqs, dispatcher, process 1, process 2; channels in, out, err, sig, bro on each process; stages load data, validate data, process data, store datalog and load user info, validate user actions, process events, persist into db; stage types async, block, drive with thread counts 10 and 20; a "signal?" decision (y/n) leading to "done"]

==>

compose async pipelines

  • creates and manages intermediate chans
  • immutable stack
  • autoclose sqs messages after successful
    process across the stack with :drive

  • retry logic after runtime exceptions
  • turn on concurrency knobs
  • auto close channels
  • simpler dsl

scheduler

schedule async pipelines


(defn ->scheduler-config-chan
  "Create a scheduler chan based on the pre configured scheduler. Returns a channel that
   responds to the 'scheduled times'."
  [{:keys [broadcast-chan scheduler-config] :as context}]
  (log/local "Evaluating scheduler config" scheduler-config broadcast-chan)
  (s/validate Config scheduler-config)
  (when-let [batch-type (:batch-type scheduler-config)]
    (condp = batch-type
      :broadcast  (async/chan) ;; will be managed by app.core/broadcast
      :continuous (arm/schedule
                    (periodic/periodic-seq (time/now) (time/seconds 1)))
      :polling    (arm/schedule
                    (periodic/periodic-seq (time/now)
                                           (time/minutes (:poll-in-minutes scheduler-config))))
      :cron       (arm/schedule
                    (end-of-business-day (t/number->date (:schedule-at scheduler-config))))
      :manual     (async/chan))))

(defn schedule-task
  "Start a periodic batch job based on a scheduler config."
  [{:keys [db-config scheduler-config-chan signal-chan] :as context}]
  (go-loop [_ (<! scheduler-config-chan)]
    (start-batch context)
    (when-let [next (<! scheduler-config-chan)]
      (recur next))))

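(let [context { ... }
      scheduler-config-chan (->scheduler-config-chan context)]
  ;; schedule-task reads :scheduler-config-chan from the context
  (schedule-task (assoc context :scheduler-config-chan scheduler-config-chan)))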
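
`arm/schedule` is not shown in the slides. To make the "channel that responds to scheduled times" idea concrete, here is a sketch using only core.async timeouts. `ticks` and `run-on-each-tick` are hypothetical stand-ins, not the armature API.

```clojure
(require '[clojure.core.async :as async :refer [chan go-loop <! >! <!!]])

(defn ticks
  "Hypothetical stand-in for a scheduler channel: emits 0..n-1,
   one value every interval-ms, then closes."
  [interval-ms n]
  (let [out (chan)]
    (go-loop [i 0]
      (if (< i n)
        (do (<! (async/timeout interval-ms))
            (>! out i)
            (recur (inc i)))
        (async/close! out)))
    out))

(defn run-on-each-tick
  "Calls f once per tick, like the go-loop in schedule-task.
   Returns a channel that yields the number of ticks handled."
  [tick-chan f]
  (go-loop [handled 0]
    (if-some [t (<! tick-chan)]
      (do (f t)
          (recur (inc handled)))
      handled)))

(<!! (run-on-each-tick (ticks 10 3) #(println "start batch" %)))
;; => 3
```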

idempotency

given a message, you can process it multiple times and end up in the same state as processing it once

works great with at-least-once delivery in SQS!

(def PipelineMessage
  {:task-type Str
   :timestamp Num
   :payload   {:id Num
               :amount Num
               :uid Num}
   :meta      {Any Any}
   (s/optional-key :errors) [PipelineError]})

use unique ids for dedup
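
One way to sketch dedup by unique id is a stateful transducer that drops messages whose id has already been seen. `distinct-by` is a hypothetical helper, and the seen-set lives in memory only; a real pipeline would presumably need a durable store to survive restarts.

```clojure
(defn distinct-by
  "Hypothetical transducer: passes a message through only the first
   time (keyfn message) is seen."
  [keyfn]
  (fn [rf]
    (let [seen (volatile! #{})]
      (fn
        ([] (rf))
        ([result] (rf result))
        ([result input]
         (let [k (keyfn input)]
           (if (contains? @seen k)
             result                         ; duplicate delivery: skip it
             (do (vswap! seen conj k)
                 (rf result input)))))))))

(def messages
  [{:payload {:id 1 :amount 10}}
   {:payload {:id 2 :amount 5}}
   {:payload {:id 1 :amount 10}}])   ; redelivered by the queue

(into [] (comp (distinct-by #(get-in % [:payload :id]))
               (map #(get-in % [:payload :id])))
      messages)
;; => [1 2]
```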

IMPLEMENTATION



(defmacro ==> [[input-channel error-channel] & raw-command-forms]
  (when (empty? raw-command-forms)
    (extract-command-exception ["EMPTY-START"]))
  (let [grouped-command-forms (extract-commands raw-command-forms)
        chan-seq-name         (gensym "pipeline-channel-source")
        first-drive-cmd       (first (filter #(= (get-cmd-type %) :drive)
                                             grouped-command-forms))
        is-driving?           (not (nil? first-drive-cmd))
        has-error-channel?    (not (nil? error-channel))
        error-channel-name    (gensym "pipeline-error-channel")]
    `(let [~chan-seq-name (concat (list ~input-channel) (repeatedly async/chan))
           ~error-channel-name ~(or error-channel `(async/chan))]
       ~@(commands->pipeline-forms chan-seq-name 
                                   has-error-channel? 
                                   error-channel-name 
                                   grouped-command-forms)
       ~(if is-driving?
          (last first-drive-cmd)
          `(nth ~chan-seq-name ~(count grouped-command-forms))))))

(defn- command->pipeline-form 
  [channel-seq-name error-channel? error-channel-name idx cmd-forms]
  (let [cmd-type    (get-cmd-type cmd-forms)
        parallelism (or (some (fn [x] (when (number? x) x)) cmd-forms) 1)
        dispatch    (case cmd-type
                      :xduce 'async/pipeline
                      :block 'async/pipeline-blocking
                      :drive 'armature.core/drive-helper
                      :async 'armature.core/pipeline-async-helper)
        input-chan-form `(nth ~channel-seq-name ~idx)
        output-chan-form `(nth ~channel-seq-name ~(inc idx))
        cmd (last cmd-forms)]
    (if error-channel?
      `(~dispatch ~parallelism ~output-chan-form ~cmd ~input-chan-form true
        ;; This is the error handler, a bit ugly but helpful
        (fn [e#] (async/go (async/>! ~error-channel-name 
                                     {:error e# :index ~idx :command (quote ~cmd)})) nil))
      `(~dispatch ~parallelism ~output-chan-form ~cmd ~input-chan-form))))

(defn- commands->pipeline-forms 
  [channel-seq-name error-channel? error-channel-name cmd-forms-list]
  (let [first-drive-cmd (first (filter #(= (get-cmd-type %) :drive)
                                       cmd-forms-list))]
    (when (and first-drive-cmd (not (= first-drive-cmd (last cmd-forms-list))))
      (throw (Exception. ":drive directive makes no sense except in the tail of ==>")))
    (map-indexed (partial command->pipeline-form channel-seq-name error-channel? 
                          error-channel-name) 
                  cmd-forms-list)))
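
The error handler the macro generates relies on the six-argument form of `async/pipeline`, whose last argument is called with the exception thrown by the transducer. A standalone sketch of that mechanism follows; `divide-all` and the atom used to collect errors are illustrative, where the macro instead puts a map on the error channel.

```clojure
(require '[clojure.core.async :as async :refer [chan <!!]])

(defn divide-all
  "Divides 10 by each x on a pipeline. Failures are recorded
   instead of killing the pipeline."
  [xs]
  (let [in     (chan)
        out    (chan)
        errors (atom [])]
    ;; Arguments: parallelism, to, transducer, from, close?, ex-handler.
    ;; The handler returns nil, so nothing is put on `out` for a failed item.
    (async/pipeline 1 out (map #(/ 10 %)) in true
                    (fn [e] (swap! errors conj e) nil))
    (async/onto-chan in xs)
    (let [results (<!! (async/into [] out))]
      {:results results
       :error-count (count @errors)})))

(divide-all [1 0 5])
;; => {:results [10 2], :error-count 1}
```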

Component

setup consumer + producer

(defn start-queue-consumer!
  "Start a SQS consumer and return event, error, and finalize channels."
  [connection queue-url {:keys [max-consumption-window
                                long-poll-duration
                                stop-check-fn]
                         :or   {max-consumption-window 20
                                long-poll-duration 20
                                stop-check-fn (fn [] false)}}]
  (let [^AmazonSQSAsyncClient instance (:instance connection)
        msg-request (receive-msg-request queue-url long-poll-duration)
        raw-result-chan (chan max-consumption-window)
        error-chan (chan)
        events-chan (chan (* 10 max-consumption-window))
        finalizer-chan (chan (* 10 max-consumption-window))
        handler (arm-aws/respond-with-channels raw-result-chan error-chan)
        rescheduler (partial reschedule-consumption
                             instance
                             handler
                             msg-request
                             stop-check-fn)
        channel-scrubber-xf (comp (map rescheduler)
                                  (mapcat unbundle-message-result)
                                  (map (partial uncrack-message queue-url)))]

    ;; Schedule async processing
    (async/pipeline 1 events-chan channel-scrubber-xf raw-result-chan)
    (arm/sink! (partial delete-message! instance) finalizer-chan error-chan)

    ;; Call once to "kickstart" the pipeline
    (rescheduler ::nonce)

    {:event-channel events-chan
     :error-channel error-chan
     :finalize-channel finalizer-chan}))

(defn start-queue-writer!
  "Start a SQS writer and return write, error, and result channels. The incoming
   message should be a map whose value is in another map with a keyword 'armature'"
  [{:keys [instance] :as connection}, queue-url
   {:keys [parallelism
           max-consumption-window]
    :or   {parallelism 1
           max-consumption-window 20}}]
  (let [input-chan (chan max-consumption-window)
        output-chan (chan)
        error-chan (chan)
        handler (arm-aws/respond-with-error-channel error-chan)
        writer (partial write-to-queue instance queue-url handler)
        writer-xf (mapduce :armature :armature-task writer)]
    (async/pipeline parallelism output-chan writer-xf input-chan)
    (arm/sink! identity output-chan)
    {:write-channel input-chan
     :error-channel error-chan
     :result-channel output-chan}))

component/Lifecycle
(start [self]
    (if-not queue-url
      (throw+ {:type ::bad-config
               :message "Could not connect to queue"})
      (let [global-lock (when writes-enabled?
                          (let [lock (zookeeper/interprocess-write-lock "/app/service-lock")]
                            ;; block until a lock can be acquired, (we want only one consumer)
                            (while (not (zookeeper/acquire-lock-with-millis-timeout lock 250))
                              (log/info "Tried to acquire lock but failed. Trying again in 30s")
                              (Thread/sleep (* 30 1000)))
                            lock))
            context {:access-key access-key
                     :secret-key secret-key
                     :queue-url queue-url
                     :broadcast-queue-url broadcast-queue-url
                     :scheduler-config scheduler-config
                     :consumer? consumer?
                     :global-lock global-lock}
            {:keys [service-chan
                    signal-chan
                    broadcast-chan
                    scheduler-config-chan]} (reconciler/start-service! context)
            full-config (assoc self
                               :context context
                               :service-chan service-chan
                               :signal-chan signal-chan
                               :broadcast-chan broadcast-chan
                               :scheduler-config-chan scheduler-config-chan)]
        (log/info "Component is listening at" queue-url)
        full-config)))

LIMITATIONS

  • Currently designed for 1 consumer + N producers
  • Zookeeper lock for failover to another service
  • Error-handling at scale still untested
  • Need tools for channel monitoring
  • When things fail, I just restart
  • Hard to unit test without ceremony
  • No Centralized Supervisor (wip)

things to remember

  • Avoid any kind of design that depends on knowing how many channels are open vs closed
  • Don't do IO inside a go block, it will block the thread executing the go block
  • When you design asynchronous libraries, provide interfaces that seem synchronous in some way
  • Don't use bandalore; wrap the newer async Java api
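
To illustrate the "no IO inside a go block" rule: blocking work can run on a real thread via `async/thread` or `async/pipeline-blocking`, and the go block only parks on the result. `blocking-fetch` is a made-up stand-in for an SQS or database call.

```clojure
(require '[clojure.core.async :as async :refer [chan go <! <!!]])

(defn blocking-fetch
  "Stand-in for blocking IO (network call, file read, ...)."
  [id]
  (Thread/sleep 50)
  {:id id :status :loaded})

;; async/thread runs its body on a separate thread and returns a channel
;; carrying the result, so the go block parks instead of blocking.
(<!! (go (:status (<! (async/thread (blocking-fetch 42))))))
;; => :loaded

;; For a stream of items, pipeline-blocking does the same with n threads
;; and keeps the outputs in input order.
(let [in  (chan)
      out (chan)]
  (async/pipeline-blocking 4 out (map blocking-fetch) in)
  (async/onto-chan in [1 2 3])
  (mapv :id (<!! (async/into [] out))))
;; => [1 2 3]
```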

a distributed async pipeline stack

[diagram labels: SQS, dispatcher, component, scheduler, repl, process-1, process-2, process-3, >signal, >broadcast]

each process accepts five chans: in, out, err, sig, bro
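
A rough sketch of how a process might watch its signal channel alongside its input. `start-process` and its stop-on-any-signal behaviour are assumptions for illustration; only `in`, `out` and `sig` are used here, and err and bro are left out for brevity.

```clojure
(require '[clojure.core.async :as async :refer [chan go-loop >! <!!]])

(defn start-process
  "Hypothetical process loop: applies f to each value from in and puts
   the result on out. Any value on sig (or in closing) stops the loop
   and closes out."
  [{:keys [in out sig]} f]
  (go-loop []
    ;; :priority true makes alts! check sig before in.
    (let [[v port] (async/alts! [sig in] :priority true)]
      (cond
        (= port sig) (async/close! out)
        (nil? v)     (async/close! out)
        :else        (do (>! out (f v))
                         (recur))))))

(let [chans {:in (chan) :out (chan) :sig (chan)}]
  (start-process chans inc)
  (async/onto-chan (:in chans) [1 2 3])
  (<!! (async/into [] (:out chans))))
;; => [2 3 4]
```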

wrapping up

- manage channels over async pipelines

- use signals to compose pipelines

- use broadcast for external communication 

- map 1 SQS to N Processes

- each process takes 5 channels: in, out, err, sig, bro

- manage errors via global error-chan

- build idempotent pipelines with unique msg id

- separate side-effects from pure async tasks

- learn stack-oriented programming! 

 

communicating sequential processes

REFERENCES

Introduction by Rich Hickey
https://www.infoq.com/presentations/clojure-core-async

Timothy Baldridge's core.async walkthrough
https://www.youtube.com/watch?v=enwIIGzhahw

David Nolen on core.async in cljs
https://www.youtube.com/watch?v=AhxcGGeh5ho

Clojure for the Brave and True, core.async chapter
http://www.braveclojure.com/core-async/

 

credits

Thanks, Dave Fayram for helping me build this library.

armature and robot parts images from adafruit.com

other images sourced from google image search, copyright by respective owners:
https://www.upuno.com/upuno-web2/wp-content/uploads/2015/03/lisa-L-armature-product-img-1.jpg
http://sculptures.website/wp-content/uploads/2016/12/armature-sculpture.jpg
http://electriciantraining.tpub.com/14177/img/14177_60_2.jpg
http://www.kineticarmatures.com/images/custom%20sleepy%20bear.jpg

 

 

@priyatam | priyatam.com

Async Data Pipelines in Clojure

By Priyatam Mudivarti
