Fast, Simple Concurrency with Scala Native
This talk is about:
- bare-metal concurrency on scala native
- native as a platform
- scala as a platform
- sustainable libraries and communities
and it's about native-loop, a new concurrency package for Scala Native 0.4
- an extensible event loop and IO system
- backed by the libuv event loop
- works with other C libraries (libcurl, etc)
http://github.com/scala-native/scala-native-loop
Talk Outline
- Background:
- Concurrency in Scala
- Scala Native
- Why LibUV?
- API overview
- Implementation deep dive:
- libuv-based ExecutionContext
- The future
About me:
- @RichardWhaling
- Lead Data Engineer at M1Finance
- Author of:
Background: Scala
- Scala is a programming language created by Martin Odersky in 2001 at EPFL
- Odersky previously wrote Sun's Java compiler
- Scala builds on Java by running on a Java Virtual Machine (JVM), anywhere Java runs
- Scala gained industry prominence in 2009, when Twitter ported their backend from Ruby
- Scala has since influenced Kotlin, Swift, and Rust
- There are two alternate Scala backends:
- Scala.js - compiles to JS
- Scala Native - compiles to machine code or WASM
Background: Concurrency
- Programs that do one thing at a time are called synchronous
- Programs that do more than one thing at a time are called asynchronous or concurrent
- Synchronous programs are easier to write correctly but bottleneck for certain workloads
- Concurrent programs can perform better but can be fiendishly complex and buggy
- Java and C++ basically got concurrency wrong
- Javascript was the first mainstream language to move in the right direction
Concurrency in Scala
def delay(duration:Duration)(implicit ec:ExecutionContext):Future[Unit]
delay(3 seconds).flatMap { _ =>
println("beep")
delay(2 seconds)
}.flatMap { _ =>
println("boop")
delay(1 second)
}.onComplete { _ =>
println("done")
}
- Values are passed in Futures
- No mutable state
- ExecutionContext abstracts over backends
Common Concurrency Issues
- Blocking code starves the ExecutionContext
- Race conditions with mutable state
So we introduce a higher-level model: Actors, Streaming, etc...
Yet problems persist.
Hot Take: the JVM's threading and memory model is hostile to closure-heavy patterns.
Not At All Hot Take: a huge proportion of JVM libraries will happily block a thread
Design Questions
In Scala Native we are not constrained by the JVM,
nor can we piggyback on its successes;
whatever we build, we build from scratch.
Can Scala Native provide a model of IO and concurrency that is true to Scala?
Can we provide a model of IO and concurrency that is a better fit than the JVM?
Scala Native is:
- Scala!
- a scalac plugin
- targets LLVM rather than JVM bytecode
- no JVM, good coverage of JDK classes
- produces compact, optimized native binaries
- most ordinary Scala code "just works"
- full control over memory allocation
- struct and array layout
- C FFI
- An embedded scala DSL with the capabilities of C
But that's not all!
Caveat
- this low-level functionality is powerful but dangerous
- you don't need unsafe functionality to use scala native
- ideally: idiomatic Scala API on top of low-level code
- Scala Native is (for now) single-threaded
- No JDK - essential classes re-implemented in Scala
- C FFI fills in the gaps in capabilities
But what is "idiomatic Scala"?
What are the essential capabilities we need for an ecosystem?
Introducing libuv
- Cross-platform C event loop
- Used by node.js, julia, neovim, many others
- Comprehensive non-blocking IO capabilities
- Supports Windows, Mac, BSD
libuv and Scala Native
- Scala Native code runs on a single thread
- use system primitives for high-performance IO:
- epoll
- kqueue
- IOCP
- consistent callback API
- battle-hardened, production-grade performance
- extensible with other C libraries (curl, postgres, redis)
How will we avoid node's
"callback heck"/"pyramid of doom"
anti-pattern?
A Tour of Native-Loop
High Level Design Goals:
- ExecutionContext and Future support
- Comprehensive IO
- HTTP/HTTPS client/server
- Minimalism
- Sustainability
Meta-goal: an unopinionated base for other idioms: streams, actor model, IO monad, etc
EventLoop
trait EventLoopLike extends ExecutionContextExecutor {
def addExtension(e:LoopExtension):Unit
def run(mode:Int = UV_RUN_DEFAULT):Unit
}
- Fully functional ExecutionContext
- LoopExtension mechanism for adding additional modules and C libraries
- Exposes low-level loop for libuv calls
- Loop.run() to invoke by hand
- SN issue: allow the runtime to invoke the loop automatically prior to exiting
Timer
object Timer {
def delay(duration:Duration)(implicit ec:ExecutionContext):Future[Unit]
}
object Main {
implicit val ec:ExecutionContext = EventLoop
def main(args:Array[String]):Unit = {
println("hello!")
Timer.delay(3 seconds).onComplete { _ =>
println("goodbye!")
}
EventLoop.run()
}
}
- Returns a () after duration
- Can also do repeating schedules
- Chainable with map/flatMap/etc
Streams (take 1)
type Pipe[I,O]
def main(args:Array[String]):Unit = {
val p = FilePipe(c"./data.txt")
.map { d =>
println(s"consumed $d")
d
}.addDestination(Tokenizer("\n"))
.addDestination(Tokenizer(" "))
.map { d => d + "\n" }
.addDestination(FileOutputPipe(c"./output.txt", false))
println("running")
uv_run(EventLoop.loop,UV_RUN_DEFAULT)
}
- First attempt at IO relied on Pipe[I,O] as a primitive
- Reactive streams don't model raw sockets or files well
- Redesign goal: provide a base for high-level IO
- Reactive Streams, Monix, Cats-effect, ZIO, etc.
Streams, take 2
trait Stream {
def open(fd:Int): Stream
def stream(fd:Int)(f:String => Unit):Future[Unit]
def write(s:String):Future[Unit]
def pause():Unit
def close():Future[Unit]
}
def main(args:Array[String]):Unit = {
val stdout = open(STDOUT)
Pipe(STDIN).stream { line =>
println(s"read line $line")
val split = line.trim().split(" ")
for (word <- split) {
stdout.write(word + "\n")
}
}
uv_run(EventLoop.loop,UV_RUN_DEFAULT)
}
- Exposes all LibUV capabilities, including pausing
- No type parameters, strings only
- Open question: Future vs Callbacks?
Curl
Curl.get(c"https://www.example.com").map { response =>
Response(200,"OK",Map(),response.body)
}
Curl gives us a highly capable service client, basically for free
- Massively scalable
- Fully integrated with LibUV polls/timers
- HTTPS support
- FTP, SCP, many other protocols
- Goal - STTP and requests-scala support
- Hard problem - is there a "primitive" low-level API we could provide?
Imperative Server API
trait Server {
def serve(port:Int)(handler:(Request,Conection) => Unit):Unit
}
def main(args:Array[String]):Unit = {
Server.serve(9999) { request, connection =>
val response = makeResponse(request)
connectionHandle.sendResponse(response)
}
EventLoop.run()
}
- an imperative-style callback lets us pass a request and a capability to respond to a handler
- this lets us abstract over sync/async/streaming
- is there a way to avoid defining a Request and Response type?
- It would be great if Scala had something like Rack/WSGI. Lightweight Servlets?
- Sca-lets?
Servers
def main(args:Array[String]):Unit = {
Service()
.getAsync("/async") { r => Future {
s"got (async routed) request $r"
}.map { message => OK(
Map("asyncMessage" -> message)
)
}
}
.get("/") { r => OK {
Map("default_message" -> s"got (default routed) request $r")
}
}
.run(9999)
uv_run(EventLoop.loop, UV_RUN_DEFAULT)
}
- We're shipping a high-level service API as well
- Async support, JSON support
- Router is rudimentary
- Next steps: port Cask, play.api.routing.sird, others?
All Of the Above
def main(args:Array[String]):Unit = {
Service()
.getAsync("/async") { r => Future {
s"got (async routed) request $r"
}.map { message => OK(
Map("asyncMessage" -> message)
)
}
}
.getAsync("/fetch/example") { r =>
Curl.get(c"https://www.example.com").map { response =>
Response(200,"OK",Map(),response.body)
}
}
.get("/") { r => OK {
Map("default_message" -> s"got (default routed) request $r")
}
}
.run(9999)
uv_run(EventLoop.loop, UV_RUN_DEFAULT)
}
Because LibUV coordinates everything, all of these capabilities can be mixed and matched
Performance
- Raw HTTP request throughput metrics aren't representative
- Aiming for high hundreds/low thousands of requests/second on benchmarks with good QOS
- Real impact is service density:
- < 1 CPU/instance
- 100-200 MB RAM
- < 10 MB binary/container
- Versus realistic JVM Scala usage:
- 2-4x+ CPU
- 5x+ RAM / instance
- 10x+ smaller footprint
- Real scenario: 8 services, 3 - 5 instances/service
Scala Native is:
- Scala!
- a scalac plugin
- targets LLVM rather than JVM bytecode
- no JVM, good coverage of JDK classes
- produces compact, optimized native binaries
- most ordinary Scala code "just works"
- full control over memory allocation
- struct and array layout
- C FFI
- An embedded scala DSL with the capabilities of C
But that's not all!
Caveat
- this low-level functionality is powerful but dangerous
- you don't need unsafe functionality to use scala native
- ideally: idiomatic Scala API on top of low-level code
- Scala Native is (for now) single-threaded
- No JDK - essential classes re-implemented in Scala
- C FFI fills in the gaps in capabilities
But what is "idiomatic Scala"?
Example: Memory Allocation
val raw_data:Ptr[Byte] = malloc(sizeof[Long])
val int_ptr:Ptr[Int] = raw_data.asInstanceOf[Int]
// the ! operator updates a pointer's contents on the left-hand side
!int_ptr = 0
// on the right-hand side, it dereferences the pointer to read its value
printf(c"pointer at %p has value %d\n", int_ptr, !int_ptr)
!int_ptr = 1
printf(c"pointer at %p has value %d\n", int_ptr, !int_ptr)
//this will segfault
free(raw_data)
printf(c"pointer at %p has value %d\n", int_ptr, !int_ptr)
not shown: arrays and pointer arithmetic
Example: C FFI
@extern
object Quicksort {
type Comparator = CFuncPtr2[Ptr[Byte],Ptr[Byte],Int]
def qsort(array:Ptr[Byte],num:CSize,size:CSize,cmp:Comparator):Unit = extern
}
object App {
type MyStruct = CStruct3[CString,CString,Int]
val myStructComp = new Comparator {
def apply(left:Ptr[Byte],right:[Byte]):Int = {
val l = left.asInstanceOf[Ptr[MyStruct]]
val r = right.asInstanceOf[Ptr[MyStruct]]
if (l._3 < r._3) {
-1
} else if (l._3 == l._3) {
0
else {
1
}
}
}
}
Example: C FFI
@extern
object Quicksort {
type Comparator = CFuncPtr2[Ptr[Byte],Ptr[Byte],Int]
def qsort(array:Ptr[Byte],num:CSize,size:CSize,cmp:Comparator):Unit = extern
}
object App {
type MyStruct = CStruct3[CString,CString,Int]
val myStructComp = new Comparator {
def apply(left:Ptr[Byte],right:[Byte]):Int = {
val l = left.asInstanceOf[Ptr[MyStruct]]
val r = right.asInstanceOf[Ptr[MyStruct]]
l - r
}
}
def main(args:Array[String]):Unit = {
// ...
val data:Ptr[MyStruct] = ???
val data_size = ???
qsort(data,data_size,sizeof[MyStruct],myStructComp)
// ...
}
}
Greenspun's tenth rule
- Void pointers, function arguments, and casts are C's affordances for generic programming
- This technique feels closer to a dynamic language than a type-safe one, but it suffices
- LibUV's callbacks will let us insert enough Scala Native code into the event loop to provide a full concurrent environment (without leaking unsafe details)
- So how does it work?
"Any sufficiently complicated C or Fortran program contains an ad hoc implementation of half of Common Lisp"
ExecutionContext from Scratch
- Scala Native includes an EC already
- The catch - it runs after main() returns
object ExecutionContext {
def global: ExecutionContextExecutor = QueueExecutionContext
private object QueueExecutionContext extends ExecutionContextExecutor {
def execute(runnable: Runnable): Unit = queue += runnable
def reportFailure(t: Throwable): Unit = t.printStackTrace()
}
private val queue: ListBuffer[Runnable] = new ListBuffer
private[runtime] def loop(): Unit = { // this runs after main() returns
while (queue.nonEmpty) {
val runnable = queue.remove(0)
try {
runnable.run()
} catch {
case t: Throwable =>
QueueExecutionContext.reportFailure(t)
}
}
}
}
LibUV's IO system
libuv abstracts over different operating systems
and different kinds of IO
Consistent model of callbacks attached to handles
LibUV's event loop
We just need to adapt a queue-based EC to libuv's lifecycle of callbacks
- We queue up work
- A prepare handle run immediately prior to IO
- It runs tasks until the queue is exhausted
- When there are no more tasks and no more IO, we are done!
- The catch - how do we track IO that isn't a Future?
EventLoop and LoopExtensions
trait EventLoopLike extends ExecutionContextExecutor {
def addExtension(e:LoopExtension):Unit
def run(mode:Int = UV_RUN_DEFAULT):Unit
}
trait LoopExtension {
def activeRequests():Int
}
The LoopExtension trait lets us coordinate Future execution with other IO tasks on the same loop, and modularize our code.
Our EventLoop
object EventLoop extends EventLoopLike {
val loop = uv_default_loop()
private val taskQueue = ListBuffer[Runnable]()
def execute(runnable: Runnable): Unit = taskQueue += runnable
def reportFailure(t: Throwable): Unit = {
println(s"Future failed with Throwable $t:")
t.printStackTrace()
}
// ...
execute() is invoked as soon as a Future is ready to start running, but we can defer it until a callback fires
Our EventLoop callback
// ...
private def dispatchStep(handle:PrepareHandle) = {
while (taskQueue.nonEmpty) {
val runnable = taskQueue.remove(0)
try {
runnable.run()
} catch {
case t: Throwable => reportFailure(t)
}
}
if (taskQueue.isEmpty && !extensionsWorking) {
println("stopping dispatcher")
LibUV.uv_prepare_stop(handle)
}
}
private val dispatcher_cb = CFunctionPtr.fromFunction1(dispatchStep)
private def initDispatcher(loop:LibUV.Loop):PrepareHandle = {
val handle = stdlib.malloc(uv_handle_size(UV_PREPARE_T))
check(uv_prepare_init(loop, handle), "uv_prepare_init")
check(uv_prepare_start(handle, dispatcher_cb), "uv_prepare_start")
return handle
}
private val dispatcher = initDispatcher(loop)
// ...
LoopExtensions
private val extensions = ListBuffer[LoopExtension]()
private def extensionsWorking():Boolean = {
extensions.exists( _.activeRequests > 0)
}
def addExtension(e:LoopExtension):Unit = {
extensions.append(e)
}
An ExecutionContext is useless without meaningful async capabilities.
We'll implement the simplest one, a delay.
Timer
object Timer extends LoopExtension {
EventLoop.addExtension(this)
var serial = 0L
var timers = mutable.HashMap[Long,Promise[Unit]]() // the secret sauce
override def activeRequests():Int =
timers.size
def delay(dur:Duration):Future[Unit] = ???
val timerCB:TimerCB = ???
}
@extern
object TimerImpl {
type Timer: Ptr[Long] // why long and not byte?
def uv_timer_init(loop:Loop, handle:TimerHandle):Int = extern
def uv_timer_start(handle:TimerHandle, cb:TimerCB,
timeout:Long, repeat:Long):Int = extern
def uv_timer_stop(handle:TimerHandle):Int = extern
}
How is it safe to treat Timer as Ptr[Long]?
Working with Opaque Structs
Sometimes it's necessary to work with a data structure without knowing its internal layout.
- Allocation and initialization is hard, we'll need help
- Once initialized, we can pass the address as a Ptr[Byte]
- Ptr[CStruct1[T]] can be cast to Ptr[T]
- Data can be safely cast to types that have the same leading fields:
Ptr[CStruct3[Long,Float,CString]]
Ptr[CStruct1[Long]]
Ptr[Long]
Timer
def delay(dur:Duration):Future[Unit] = {
val millis = dur.toMillis
val promise = Promise[Unit]()
serial += 1
val timer_id = serial
timers(timer_id) = promise
val timer_handle = stdlib.malloc(uv_handle_size(UV_TIMER_T))
uv_timer_init(EventLoop.loop,timer_handle)
val timer_data = timer_handle.asInstanceOf[Ptr[Long]]
!timer_data = timer_id
uv_timer_start(timer_handle, timerCB, millis, 0)
promise.future
}
We can store an 8-byte serial number in the TimerHandle, and retrieve it in our callback.
Timer
val timerCB = new TimerCB {
def apply(handle:TimerHandle):Unit = {
println("callback fired!")
val timer_data = handle.asInstanceOf[Ptr[Long]]
val timer_id = !timer_data
val timer_promise = timers(timer_id)
timers.remove(timer_id)
println(s"completing promise ${timer_id}")
timer_promise.success(())
}
}
We can dereference the TimerHandle safely - the compiler thinks it's a Ptr[Long] so it only reads the first 8 bytes.
Then we use the serial number for a map lookup to retrieve our state.
Timer
object Main {
implicit val ec:ExecutionContext = EventLoop
def main(args:Array[String]):Unit = {
println("hello!")
Timer.delay(3 seconds).onComplete { _ =>
println("goodbye!")
}
EventLoop.run()
}
}
Streaming IO with LibUV
Next, we'll design a minimalistic wrapper around some of libuv's streaming IO capabilities.
Scope:
- Streaming from standard input only
- String data only, no parsing
- Imperative style
- Wrap C FFI and pointer manipulation
Hot take: Scala has a lot of great streaming/monadic IO libraries. Let's build a toolkit to support them, rather than yet another variant.
Streaming IO: C Functions
def uv_pipe_init(loop:Loop, handle:PipeHandle, ipc:Int):Int = extern
def uv_pipe_open(handle:PipeHandle, fd:Int):Int = extern
type PipeHandle = Ptr[Long]
def uv_read_start(client:PipeHandle, allocCB:AllocCB, readCB:ReadCB): Int = extern
type AllocCB = CFuncPtr3[TCPHandle,CSize,Ptr[Buffer],Unit]
type ReadCB = CFuncPtr3[TCPHandle,CSSize,Ptr[Buffer],Unit]
- We'll again store a serial # in the PipeHandle
- uv_pipe_open() wants an open file descriptor
- AllocCB is called prior to performing IO
- ReadCB is called immediately after IO is done
Streaming IO: API Sketch
object PipeIO extends LoopExtension {
type ItemHandler = ((String,PipeHandle,Long) => Unit)
type DoneHandler = ((PipeHandle,Long) => Unit)
type Handlers = (ItemHandler, DoneHandler)
var streams = mutable.HashMap[Long,Handlers]()
var serial = 0L
override def activeRequests:Int = {
streams.size
}
def stream(fd:Int)(itemHandler:ItemHandler,
doneHandler:DoneHandler):Long = ???
def streamUntilDone(fd:Int)(handler:ItemHandler):Future[Long] = ???
}
We can actually implement streamUntilDone() on top of stream()
Initialization and Setup
var initialized = false
def checkInit() = if (!initialized) {
println("initializing PipeIO")
EventLoop.addExtension(this)
initialized = true
}
def stream(fd:Int)(itemHandler:ItemHandler,
doneHandler:DoneHandler):Long = {
checkInit
val pipeId = serial
serial += 1
val handle = stdlib.malloc(uv_handle_size(UV_PIPE_T))
uv_pipe_init(EventLoop.loop,handle,0)
val pipe_data = handle.asInstanceOf[Ptr[Long]]
!pipe_data = serial
streams(serial) = (itemHandler,doneHandler)
uv_pipe_open(handle,fd)
uv_read_start(handle,allocCB,readCB)
pipeId
}
Memory Allocation
val allocCB = new AllocCB {
def apply(client:PipeHandle, size:CSize, buffer:Ptr[Buffer]):Unit = {
val buf = stdlib.malloc(4096)
buffer._1 = buf
buffer._2 = 4096
}
}
Read Callback
val readCB = new ReadCB {
def apply(handle:PipeHandle,size:CSize,buffer:Ptr[Buffer]):Unit = {
val pipe_data = handle.asInstanceOf[Ptr[Int]]
val pipeId = !pipe_data
if (size < 0) {
val doneHandler = streams(pipeId)._2
doneHandler(handle,pipeId)
streams.remove(pipeId)
} else {
val data = bytesToString(buffer._1, size)
stdlib.free(buffer._1)
val itemHandler = streams(pipeId)._1
itemHandler(data,handle,pipeId)
}
}
}
def bytesToString(data:Ptr[Byte],len:Long):String = {
val bytes = new Array[Byte](len.toInt)
var c = 0
while (c < len) {
bytes(c) = !(data + c)
c += 1
}
new String(bytes)
}
A More Usable Wrapper
val defaultDone:DoneHandler = (handle, id) =>
()
val promises = mutable.HashMap[Long,Promise[Long]]()
def streamUntilDone(fd:Int)(handler:ItemHandler):Future[Long] = {
val promise = Promise[Long]
val itemHandler:ItemHandler = (data,handle,id) => try {
handler(data,handle,id)
} catch {
case t:Throwable => promise.failure(t)
}
val doneHandler:DoneHandler = (handle,id) =>
promise.success(id)
stream(fd)(itemHandler, doneHandler)
promise.future
}
Putting it all together:
object Main {
implicit val ec:ExecutionContext = EventLoop
def main(args:Array[String]):Unit = {
PipeIO.streamUntilDone(0) { (line,handle,id) =>
println(s"read line ${line.trim()}")
}
EventLoop.run()
}
Putting it all together:
object Main {
implicit val ec:ExecutionContext = EventLoop
def main(args:Array[String]):Unit = {
PipeIO.streamUntilDone(0) { (line,handle,id) =>
println(s"read line ${line.trim()}")
}.onComplete { status =>
println(s"stream completed with $status")
}
EventLoop.run()
}
Putting it all together:
object Main {
implicit val ec:ExecutionContext = EventLoop
def main(args:Array[String]):Unit = {
PipeIO.streamUntilDone(0) { (line,handle,id) =>
println(s"read line ${line.trim()}")
Timer.delay(5 seconds).onComplete { _ =>
println(s"delay fired for ${line.trim()}")
}
}.onComplete { status =>
println(s"stream completed with $status")
}
EventLoop.run()
}
What Else?
- Socket IO
- File IO
- HTTP Client (via libcurl)
- HTTP Server (via nodejs/http-parser)
How to find the right balance of features?
HTTP Client
val resp = Zone { implicit z =>
for (arg <- args) {
val url = toCString(arg)
val resp = Curl.get(url)
resp.onComplete {
case Success(data) =>
println(s"got back response for ${arg} - body of length ${data.body.size}")
println(s"headers:")
for (h <- data.headers) {
println(s"request header: $h")
}
println(s"body: ${data.body}")
case Failure(f) =>
println("request failed",f)
}
}
}
loop.run()
There is great (blocking) Scala Native support in STTP already - consolidate ASAP
HTTP Server
def main(args:Array[String]):Unit = {
Service()
.getAsync("/async/") { r => Future(OK(
Map("asyncMessage" -> s"got (async routed) request $r")
))}
.get("/") { r => OK(
Map("message" -> s"got (routed) request $r")
)}
.run(9999)
println("done")
}
- Layered design - think Rack/WSGI/Servlets
- Analogous to our imperative PipeIO
- Already support async requests, JSON
- Coming soon: FastCGI for nginx integration
Setting a Course
- Improve support in SN for user-supplied event loop
- STTP has great (blocking) curl/native support
- Plan to spin out server API
- Get feedback on streaming IO API
Building a Community
- All of this needs contributors to be sustainable
- There are a LOT of low-hanging fruit out there
- Weekend-scale side projects can have a big impact!
- Huge shout out to Scala Native's contributors
- Get involved!
Thanks!
Copy of Fast, Simple Concurrency with Scala Native
By Richard Whaling
Copy of Fast, Simple Concurrency with Scala Native
- 431