Data Pipelines
IN CORE.ASYNC & SQS
priyatam mudivarti
WHY
compose async pipelines
what if
we stop writing functions and start defining a process?
concepts
climbing the abstraction ladder
Function
limiting?
synchronous
complects error handling with logic
hard to pass shared state
structure of input is often desirable
... too many functions!
If you can express your process in terms of a reducing function, you are in business. — Rich Hickey
Transducer
compose better
This creation of logic via composition of sequence-iterating functions complects the created logic with the machinery of sequences. — Rich Hickey
;; a lot faster than you think because transducers make it so!
(defn letter-frequency [input]
  (transduce (comp (filter #(Character/isLetter %))
                   (map #(Character/toLowerCase %))
                   (map (fn [v] {v 1})))
             (partial merge-with +)
             input))
;; Beautiful. Also slow. This is slideware!
(def max-value
  (comp first first reverse (partial sort-by last)))
(defn transync [filenames]
  (let [result (map (comp max-value
                          letter-frequency
                          slurp)
                    filenames)]
    ;; No cheating. Actually do the work.
    (doall result)))
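A minimal sketch, not from the slides, of why a transducer composes well: the same transformation is defined once and reused by both `into` and `transduce`. The names `letters-xf` and `count-char` are illustrative only.

```clojure
;; Illustrative names; only clojure.core is used.
(def letters-xf
  (comp (filter #(Character/isLetter (char %)))
        (map #(Character/toLowerCase (char %)))))

;; Reducing function with a completion arity, as transduce expects.
(defn count-char
  ([acc] acc)
  ([acc c] (update acc c (fnil inc 0))))

;; Same transducer, two different consumers.
(def letters (into [] letters-xf "Hi, Bob!"))
(def freqs (transduce letters-xf count-char {} "Hi, Bob!"))
;; letters => [\h \i \b \o \b]
;; freqs   => {\h 1, \i 1, \b 2, \o 1}
```

The transducer knows nothing about vectors or maps; only the reducing function decides what is built.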
Channels
you need to manage them!
Channels are oriented towards the flow aspects of a system. It is oblivious to the nature of the threads which use it.
;; Channels play nicely with transducers!
(let [channel (chan 1 (map #(.toUpperCase %)))]
  (async/onto-chan channel ["hello" "clojure" "remote"] true)
  (async/go-loop [cur (<! channel)]
    (println cur)
    (when-let [n (<! channel)]
      (recur n))))
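A self-contained variant of the same idea, as a sketch: it collects the transformed values into a vector instead of printing them. It assumes a core.async version that provides `onto-chan!` (the newer name for the `onto-chan` used on the slide).

```clojure
(require '[clojure.core.async :as async]
         '[clojure.string :as str])

;; The transducer is attached to the channel itself, so every value
;; put on the channel is upper-cased before a consumer sees it.
(def shouted
  (let [channel (async/chan 1 (map str/upper-case))]
    ;; onto-chan! closes the channel once the collection is exhausted.
    (async/onto-chan! channel ["hello" "clojure" "remote"])
    ;; async/into returns a channel that yields one vector at close.
    (async/<!! (async/into [] channel))))
;; shouted => ["HELLO" "CLOJURE" "REMOTE"]
```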
async/pipeline
connect channels
;; Connecting 2 channels with an async process that includes
;; a transducer.
(let [input1 (chan)
      output1 (chan)]
  (async/pipeline 1 output1 (map inc) input1)
  (async/onto-chan input1 (range 0 10))
  (print-all-channel! output1))
(let [input0 (chan)
      middle1 (chan)
      middle2 (chan)
      output3 (chan)]
  (async/pipeline 1 middle1 (map inc) input0)
  (async/pipeline 1 middle2 (map (partial * 10)) middle1)
  (async/pipeline 1 output3
                  (map #(str % " zillion channel declarations!"))
                  middle2)
  (async/onto-chan input0 (range 0 5))
  (print-all-channel! output3))
this can get tedious
- channel management (create, close)
- error handling
- messages are not durable
- can't compose
- harder inter-pipeline communication
- can't short circuit
- hard to mix and match side-effects with pure functions
what if channel management was free?
(let [input (chan)]
  (print-all-channel! (==> [input]
                           (map inc)
                           (map (partial * 10))
                           (map #(str % " zillion fewer channel decs!"))))
  (async/onto-chan input (range 0 3)))
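To make the idea concrete, here is a rough sketch of hiding the intermediate channels behind a plain function. `chain-pipelines` is a hypothetical helper written for this example; it is not the `==>` macro and has none of its error handling, `:block`, or `:drive` support.

```clojure
(require '[clojure.core.async :as async])

(defn chain-pipelines
  "Hypothetical helper: connects `input` through one async/pipeline per
  transducer, creating each intermediate channel internally.
  Returns the final output channel."
  [input & xforms]
  (reduce (fn [in xf]
            (let [out (async/chan 1)]
              ;; pipeline closes `out` when `in` closes, so closing the
              ;; input propagates through the whole chain.
              (async/pipeline 1 out xf in)
              out))
          input
          xforms))

(def chained
  (let [input  (async/chan)
        output (chain-pipelines input
                                (map inc)
                                (map (partial * 10))
                                (map str))]
    (async/onto-chan! input (range 0 3))
    (async/<!! (async/into [] output))))
;; chained => ["10" "20" "30"]
```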
introducing ==>
process
connecting async pipelines
in a stack, each frame
transitioning from one state to another
push original message {...}
pop original message
{...}
{:from :to}
{:from :to}
{:from :to}
async.task
async task = core.async-pipeline + (magic)
an idempotent process
[Diagram: a process made of several async.task stages labelled async, block, and drive; the number on a stage (here 10) is its number of concurrent threads.]
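The concurrency number on a blocking stage corresponds to the parallelism argument of `async/pipeline-blocking`. A small sketch, where the `Thread/sleep` merely stands in for blocking I/O:

```clojure
(require '[clojure.core.async :as async])

(def squares
  (let [in  (async/chan)
        out (async/chan)]
    ;; Up to 10 threads run the (blocking) transformation concurrently;
    ;; results still arrive on `out` in input order.
    (async/pipeline-blocking 10 out
                             (map (fn [n]
                                    (Thread/sleep 50) ; stand-in for blocking I/O
                                    (* n n)))
                             in)
    (async/onto-chan! in (range 5))
    (async/<!! (async/into [] out))))
;; squares => [0 1 4 9 16]
```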
(defn mapduce
  "Make a map transducer from a bare function. It extracts
  a value from fromkey and inserts the value generated into tokey.
  If no tokey is provided, the function will update the value
  in fromkey."
  ([fromkey f] (mapduce fromkey fromkey f))
  ([fromkey tokey f]
   (map (fn [m] (assoc m tokey (f (fromkey m))))))
  ([context fromkey tokey f]
   (map (fn [m] (assoc m tokey (f (fromkey m) context))))))
mapduce
Make a map transducer from a bare function.
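A short usage sketch of the three arities of `mapduce`, run over plain vectors rather than channels. The definition is repeated from the slide so the example stands alone; the sample maps and functions are made up.

```clojure
(defn mapduce
  ([fromkey f] (mapduce fromkey fromkey f))
  ([fromkey tokey f]
   (map (fn [m] (assoc m tokey (f (fromkey m))))))
  ([context fromkey tokey f]
   (map (fn [m] (assoc m tokey (f (fromkey m) context))))))

;; Two arguments: update the value in place.
(def updated (into [] (mapduce :n inc) [{:n 1} {:n 2}]))
;; updated => [{:n 2} {:n 3}]

;; Three arguments: read from one key, write to another.
(def doubled (into [] (mapduce :n :double #(* 2 %)) [{:n 1}]))
;; doubled => [{:n 1, :double 2}]

;; Four arguments: the function also receives the context.
(def scaled
  (into [] (mapduce {:rate 3} :n :scaled (fn [v ctx] (* v (:rate ctx))))
        [{:n 2}]))
;; scaled => [{:n 2, :scaled 6}]
```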
(defn process-events
  [context in-chan out-chan signal-chan broadcast-chan err-chan]
  (==> [in-chan err-chan]
       (map (partial push-task ::original))
       (mapduce-in [::original :body] [:raw-bytes-saved] parse-msg)
       :block (mapduce :raw-bytes-saved :raw-bytes-loaded #(load-raw-bytes [context %]))
       (add-side-effect #(log/info "LOADED RAW BYTES" (count (:raw-bytes-loaded %))))
       :block (mapduce :raw-bytes-loaded :report-data-uploaded upload-report-data)
       :block (mapduce :report-data-uploaded :report-validated validate-report)
       (add-side-effect (partial dynamo/append))
       (map (partial pop-task ::original))
       :drive out-chan))
a process transitioning states
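The slides do not show `push-task`, `pop-task`, or `mapduce-in`, so the following is only a guess at their shape, inferred from how they are called above: the original message is stashed under a key, intermediate state accumulates beside it, and the original is recovered at the end. All three definitions here are assumptions made for illustration.

```clojure
(require '[clojure.string :as str])

;; Assumed shapes, not the library's actual implementations.
(defn push-task [k msg] {k msg})          ; wrap the message in a frame
(defn pop-task [k frame] (get frame k))   ; recover the original message

(defn mapduce-in
  "Like mapduce, but reads and writes nested paths."
  [from-path to-path f]
  (map (fn [m] (assoc-in m to-path (f (get-in m from-path))))))

(def work-xf
  (comp (map (partial push-task ::original))
        (mapduce-in [::original :body] [:parsed] str/upper-case)))

;; Intermediate frame: original message plus derived state.
(def frames (into [] work-xf [{:id 1 :body "hello"}]))

;; After popping, only the original message remains.
(def popped (into [] (map (partial pop-task ::original)) frames))
;; popped => [{:id 1, :body "hello"}]
```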
Durable messages
Eventual Consistency
At-least-once, exactly-once semantics
Timeout, Retention, Delay
Maps nicely to N Processes with M channels
Visually inspect messages in queue
Logging
Monitoring
Dead Letter Queues
SDKs across languages (java sdk is async)
Provisioning
Why sqs?
I don't want to write this!
manage processes
interprocess communication
signals and broadcasts
[Diagram: sqs, dispatcher, process 1, process 2. Stage labels: async, block, drive (10, 20), signal? (y/n), done. Step labels: load data, validate data, process data, store datalog; load user info, validate user actions, process events, persist into db. Channel labels on each process: in, out, err, sig, bro.]
==>
compose async pipelines
- creates and manages intermediate chans
- immutable stack
- autoclose sqs messages after successful process across the stack with :drive
- retry logic after runtime exceptions
- turn on concurrency knobs
- auto close channels
- simpler dsl
scheduler
schedule async pipelines
(defn ->scheduler-config-chan
  "Create a scheduler chan based on the pre configured scheduler. Returns a channel that
  responds to the 'scheduled times'."
  [{:keys [broadcast-chan scheduler-config] :as context}]
  (log/local "Evaluating scheduler config" scheduler-config relay-chan)
  (s/validate Config scheduler-config)
  (when-let [batch-type (:batch-type scheduler-config)]
    (condp = batch-type
      :broadcast (async/chan) ;; will be managed by app.core/broadcast
      :continuous (arm/schedule
                   (periodic/periodic-seq (time/now) (time/seconds 1)))
      :polling (arm/schedule
                (periodic/periodic-seq (time/now)
                                       (time/minutes (:poll-in-minutes scheduler-config))))
      :cron (arm/schedule
             (end-of-business-day (t/number->date (:schedule-at scheduler-config))))
      :manual (async/chan))))
(defn schedule-task
  "Start a periodic batch job based on a scheduler config."
  [{:keys [db-config scheduler-config-chan signal-chan] :as context}]
  ;; Each value arriving on the scheduler channel triggers one batch run;
  ;; the loop ends when the channel is closed.
  (go-loop [_ (<! scheduler-config-chan)]
    (start-batch context)
    (when-let [next (<! scheduler-config-chan)]
      (recur next))))
(let [context { ... }]
(->scheduler-config-chan context)
(schedule-task context))
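The scheduler boils down to a channel that emits a value whenever work should start. As a rough stand-in for `arm/schedule` (whose implementation is not shown), the sketch below builds such a channel from `async/timeout`. The name `tick-chan` and the bounded tick count are assumptions chosen so the example terminates.

```clojure
(require '[clojure.core.async :as async])

(defn tick-chan
  "Hypothetical helper: emits 0, 1, ... (n - 1), one value every
  interval-ms milliseconds, then closes."
  [interval-ms n]
  (let [out (async/chan)]
    (async/go-loop [i 0]
      (if (< i n)
        (do (async/<! (async/timeout interval-ms))
            ;; >! returns false if `out` was closed by the consumer.
            (when (async/>! out i)
              (recur (inc i))))
        (async/close! out)))
    out))

;; A consumer in the style of schedule-task would read each tick and
;; start a batch; here the ticks are simply collected.
(def ticks (async/<!! (async/into [] (tick-chan 10 3))))
;; ticks => [0 1 2]
```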
idempotency
given a message, you can process it multiple times without side effects
works great with "at-least-once" semantics in sqs!
(def PipelineMessage
  {:task-type Str
   :timestamp Num
   :payload {:id Num
             :amount Num
             :uid Num}
   :meta {Any Any}
   (s/optional-key :errors) [PipelineError]})
use unique ids for dedup
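One way to act on unique ids, sketched as a stateful transducer that drops messages whose id has already been seen. `distinct-by` is an illustrative name, and the in-memory set is an assumption: a long-running or multi-process consumer would need bounded or durable storage for seen ids instead.

```clojure
(defn distinct-by
  "Illustrative transducer: passes a message only the first time
  (keyfn message) is seen."
  [keyfn]
  (fn [rf]
    (let [seen (volatile! #{})]
      (fn
        ([] (rf))
        ([result] (rf result))
        ([result input]
         (let [k (keyfn input)]
           (if (contains? @seen k)
             result                      ; duplicate delivery: skip it
             (do (vswap! seen conj k)
                 (rf result input)))))))))

(def messages
  [{:task-type "pay" :payload {:id 1 :amount 10}}
   {:task-type "pay" :payload {:id 2 :amount 20}}
   {:task-type "pay" :payload {:id 1 :amount 10}}]) ; redelivered

(def deduped
  (into [] (distinct-by #(get-in % [:payload :id])) messages))
;; deduped keeps the messages with ids 1 and 2, once each
```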
IMPLEMENTATION
(defmacro ==> [[input-channel error-channel] & raw-command-forms]
  (when (empty? raw-command-forms)
    (extract-command-exception ["EMPTY-START"]))
  (let [grouped-command-forms (extract-commands raw-command-forms)
        chan-seq-name (gensym "pipeline-channel-source")
        first-drive-cmd (first (filter #(= (get-cmd-type %) :drive)
                                       grouped-command-forms))
        is-driving? (not (nil? first-drive-cmd))
        has-error-channel? (not (nil? error-channel))
        error-channel-name (gensym "pipeline-error-channel")]
    `(let [~chan-seq-name (concat (list ~input-channel) (repeatedly async/chan))
           ~error-channel-name ~(or error-channel `(async/chan))]
       ~@(commands->pipeline-forms chan-seq-name
                                   has-error-channel?
                                   error-channel-name
                                   grouped-command-forms)
       ~(if is-driving?
          (last first-drive-cmd)
          `(nth ~chan-seq-name ~(count grouped-command-forms))))))
(defn- command->pipeline-form
  [channel-seq-name error-channel? error-channel-name idx cmd-forms]
  (let [cmd-type (get-cmd-type cmd-forms)
        parallelism (or (some (fn [x] (when (number? x) x)) cmd-forms) 1)
        dispatch (case cmd-type
                   :xduce 'async/pipeline
                   :block 'async/pipeline-blocking
                   :drive 'armature.core/drive-helper
                   :async 'armature.core/pipeline-async-helper)
        input-chan-form `(nth ~channel-seq-name ~idx)
        output-chan-form `(nth ~channel-seq-name ~(inc idx))
        cmd (last cmd-forms)]
    (if error-channel?
      `(~dispatch ~parallelism ~output-chan-form ~cmd ~input-chan-form true
                  ;; This is the error handler, a bit ugly but helpful
                  (fn [e#] (async/go (async/>! ~error-channel-name
                                               {:error e# :index ~idx :command (quote ~cmd)})) nil))
      `(~dispatch ~parallelism ~output-chan-form ~cmd ~input-chan-form))))
(defn- commands->pipeline-forms
  [channel-seq-name error-channel? error-channel-name cmd-forms-list]
  (let [first-drive-cmd (first (filter #(= (get-cmd-type %) :drive)
                                       cmd-forms-list))]
    (when (and first-drive-cmd (not (= first-drive-cmd (last cmd-forms-list))))
      (throw (Exception. ":drive directive makes no sense except in the tail of ==>")))
    (map-indexed (partial command->pipeline-form channel-seq-name error-channel?
                          error-channel-name)
                 cmd-forms-list)))
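The generated forms pass an error handler as the last argument of `async/pipeline`. A standalone sketch of that mechanism, without the macro: an exception thrown inside the transducer is routed to a separate error channel, and because the handler returns nil nothing is placed on the output for that input. The channel names and the failing computation are made up for the example.

```clojure
(require '[clojure.core.async :as async])

(def err-chan (async/chan 10))

(def results
  (let [in  (async/chan)
        out (async/chan)]
    (async/pipeline 1 out
                    (map #(quot 10 %))   ; throws on 0
                    in
                    true                 ; close `out` when `in` closes
                    (fn [e]
                      ;; Report the failure elsewhere and return nil so
                      ;; the pipeline keeps going with the next input.
                      (async/put! err-chan {:error (.getMessage ^Throwable e)})
                      nil))
    (async/onto-chan! in [1 0 5])
    (async/<!! (async/into [] out))))
;; results => [10 2]

(def first-error (async/<!! err-chan))
;; first-error is a map with an :error key describing the failure
```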
Component
setup consumer + producer
(defn start-queue-consumer!
  "Start a SQS consumer and return event, error, and finalize channels."
  [connection queue-url {:keys [max-consumption-window
                                long-poll-duration
                                stop-check-fn]
                         :or {max-consumption-window 20
                              long-poll-duration 20
                              stop-check-fn (fn [] false)}}]
  (let [^AmazonSQSAsyncClient instance (:instance connection)
        msg-request (receive-msg-request queue-url long-poll-duration)
        raw-result-chan (chan max-consumption-window)
        error-chan (chan)
        events-chan (chan (* 10 max-consumption-window))
        finalizer-chan (chan (* 10 max-consumption-window))
        handler (arm-aws/respond-with-channels raw-result-chan error-chan)
        rescheduler (partial reschedule-consumption
                             instance
                             handler
                             msg-request
                             stop-check-fn)
        channel-scrubber-xf (comp (map rescheduler)
                                  (mapcat unbundle-message-result)
                                  (map (partial uncrack-message queue-url)))]
    ;; Schedule async processing
    (async/pipeline 1 events-chan channel-scrubber-xf raw-result-chan)
    (arm/sink! (partial delete-message! instance) finalizer-chan error-chan)
    ;; Call once to "kickstart" the pipeline
    (rescheduler ::nonce)
    {:event-channel events-chan
     :error-channel error-chan
     :finalize-channel finalizer-chan}))
(defn start-queue-writer!
  "Start a SQS writer and return write, error, and result channels. The incoming
  message should be a map whose value is in another map with a keyword 'armature'"
  [{:keys [instance] :as connection}, queue-url
   {:keys [parallelism
           max-consumption-window]
    :or {parallelism 1
         max-consumption-window 20}}]
  (let [input-chan (chan max-consumption-window)
        output-chan (chan)
        error-chan (chan)
        handler (arm-aws/respond-with-error-channel error-chan)
        writer (partial write-to-queue instance queue-url handler)
        writer-xf (mapduce :armature :armature-task writer)]
    (async/pipeline parallelism output-chan writer-xf input-chan)
    (arm/sink! identity output-chan)
    {:write-channel input-chan
     :error-channel error-chan
     :result-channel output-chan}))
component/Lifecycle
(start [self]
  (if-not queue-url
    (throw+ {:type ::bad-config
             :message "Could not connect to queue"})
    (let [global-lock (when writes-enabled?
                        (let [lock (zookeeper/interprocess-write-lock "/app/service-lock")]
                          ;; block until a lock can be acquired, (we want only one consumer)
                          (while (not (zookeeper/acquire-lock-with-millis-timeout lock 250))
                            (log/info "Tried to acquire lock but failed. Trying again in 30s")
                            (Thread/sleep (* 30 1000)))
                          lock))
          context {:access-key access-key
                   :secret-key secret-key
                   :queue-url queue-url
                   :broadcast-queue-url broadcast-queue-url
                   :scheduler-config scheduler-config
                   :consumer? consumer?
                   :global-lock global-lock}
          {:keys [service-chan
                  signal-chan
                  broadcast-chan
                  scheduler-config-chan]} (reconciler/start-service! context)
          full-config (assoc self
                             :context context
                             :service-chan service-chan
                             :signal-chan signal-chan
                             :broadcast-chan broadcast-chan
                             :scheduler-config-chan scheduler-config-chan)]
      (log/info "Component is listening at" queue-url)
      full-config)))
- Currently designed for 1 consumer + N producers
- Zookeeper lock for failover to another service
- Error-handling at scale still untested
- Need tools for channel monitoring
- When things fail, I just restart
- Hard to unit test without ceremony
- No Centralized Supervisor (wip)
LIMITATIONS
- Avoid any kind of design that depends on knowing how many channels are open vs closed
- Don't do IO inside a go block, it will block the thread executing the go block
- When you design asynchronous libraries, provide interfaces that seem synchronous in some way
- Don't use bandalore—wrap the newer async Java api
things to remember
a distributed async pipeline stack
[Diagram labels: SQS, >signal, >broadcast, repl, component, scheduler, dispatcher, process-1, process-2, process-3]
each process accepts five chans: in, out, err, sig, bro
wrapping up
- manage channels over async pipelines
- use signals to compose pipelines
- use broadcast for external communication
- map 1 SQS to N Processes
- each process takes 5 channels: in, out, err, sig, bro
- manage errors via global error-chan
- build idempotent pipelines with unique msg id
- use side-effects vs pure async tasks
- learn stack-oriented programming!
communicating sequential processes
REFERENCES
- Introduction by Rich Hickey: https://www.infoq.com/presentations/clojure-core-async
- Timothy Baldridge's core.async walkthrough: https://www.youtube.com/watch?v=enwIIGzhahw
- David Nolen on core.async in cljs: https://www.youtube.com/watch?v=AhxcGGeh5ho
- http://www.braveclojure.com/core-async/
credits
Thanks to Dave Fayram for helping me build this library.
armature and robot parts images from adafruit.com
other images sourced from google image search, copyright by respective owners:
https://www.upuno.com/upuno-web2/wp-content/uploads/2015/03/lisa-L-armature-product-img-1.jpg
http://sculptures.website/wp-content/uploads/2016/12/armature-sculpture.jpg
http://electriciantraining.tpub.com/14177/img/14177_60_2.jpg
http://www.kineticarmatures.com/images/custom%20sleepy%20bear.jpg
@priyatam | priyatam.com
Async Data Pipelines in Clojure
By Priyatam Mudivarti
Building Data Pipelines in core.async and SQS