Fast, Simple Concurrency with Scala Native

Richard Whaling

 

@RichardWhaling

Scala Days 2019

This talk is about:

  • bare-metal concurrency on scala native
  • native as a platform
  • scala as a platform
  • sustainable libraries and communities

and it's about native-loop, a new concurrency library for Scala Native 0.4

  • an extensible event loop and IO system
  • backed by the libuv event loop
  • works with other C libraries (libcurl, etc)

http://github.com/scala-native/scala-native-loop

Talk Outline

  1. Background:
    • Concurrency in Scala
    • Scala Native
    • Why LibUV?
  2. API overview
  3. Implementation deep dive:
    • libuv-based ExecutionContext
  4. The future

About me:

  • @RichardWhaling
  • Lead Data Engineer at M1Finance
  • Author of: 

Background: Concurrency

  • Programs that do one thing at a time are called synchronous
  • Programs that do more than one thing at a time are called asynchronous or concurrent
  • Synchronous programs are easier to write correctly but bottleneck for certain workloads
  • Concurrent programs can perform better but can be fiendishly complex and buggy
  • Java and C++ basically got concurrency wrong
  • Javascript was the first mainstream language to move in the right direction

Concurrency in Scala

def delay(duration:Duration)(implicit ec:ExecutionContext):Future[Unit]

delay(3 seconds).flatMap { _ =>
  println("beep")
  delay(2 seconds)
}.flatMap { _ =>
  println("boop")
  delay(1 second)
}.onComplete { _ =>
  println("done")
}
  • Asynchronous results are passed in Futures
  • No mutable state
  • ExecutionContext abstracts over backends
  • Map, flatmap, etc.

Common Concurrency Issues

  • Blocking code starves the ExecutionContext
  • Race conditions with mutable state

So we introduce a higher-level model: Actors, Streaming, etc...

Yet problems persist.

Hot Take: the JVM's threading and memory model is hostile to closure-heavy patterns.

Not At All Hot Take: a huge proportion of JVM libraries will happily block a thread

Design Questions

In Scala Native we are not constrained by the JVM,

nor can we piggyback on its successes;

whatever we build, we build from scratch.

Can Scala Native provide a model of IO and concurrency that is true to Scala?

Can we provide a model of IO and concurrency that is a better fit than the JVM?

Let's talk about Scala Native

Scala Native is:

  • Scala!
  • a scalac plugin
  • targets LLVM rather than JVM bytecode
  • no JVM, good coverage of JDK classes
  • produces compact, optimized native binaries
  • most ordinary Scala code "just works"
  • full control over memory allocation
  • struct and array layout
  • C FFI
  • An embedded scala DSL with the capabilities of C

But that's not all!

Caveat

  • this low-level functionality is powerful but dangerous
  • you don't need unsafe functionality to use scala native
  • ideally: idiomatic Scala API on top of low-level code
  • Scala Native is (for now) single-threaded
  • No JDK - essential classes re-implemented in Scala
  • C FFI fills in the gaps

But what is "idiomatic Scala"?

How can we provide the essential capabilities for the ecosystem?

Introducing libuv

  • Cross-platform C event loop
  • Used by node.js, julia, neovim, many others
  • Comprehensive non-blocking IO capabilities
  • Supports Windows, Mac, BSD

LibUV's IO system

libuv abstracts over different operating systems

and different kinds of IO

 

Consistent model of callbacks attached to handles

libuv and Scala Native

  • Scala Native code runs on a single thread
  • use system primitives for high-performance IO:
    • epoll
    • kqueue
    • IOCP
  • consistent callback API
  • battle-hardened, production-grade performance
  • extensible with other C libraries (curl, postgres, redis)

Can we design our API avoid the "callback hell"

or "pyramid of doom" anti-pattern?

A Tour of Native-Loop

High Level Design Goals:

 

  • ExecutionContext and Future support
  • Comprehensive IO
  • HTTP/HTTPS client/server
  • Minimalism
  • Sustainability

Meta-goal: an unopinionated base for other idioms: streams, actor model, IO monad, etc

EventLoop

trait EventLoopLike extends ExecutionContextExecutor {
  def addExtension(e:LoopExtension):Unit 
  def run(mode:Int = UV_RUN_DEFAULT):Unit
}
  • Fully functional ExecutionContext
  • LoopExtension mechanism for adding additional modules and C libraries
  • Exposes low-level loop for libuv calls
  • Loop.run() to invoke by hand
  • SN issue #1626: override runtime.loop()

Timer

object Timer {
  def delay(duration:Duration)(implicit ec:ExecutionContext):Future[Unit]
}

object Main {
  implicit val ec:ExecutionContext = EventLoop

  def main(args:Array[String]):Unit = {
    println("hello!")
    Timer.delay(3 seconds).onComplete { _ =>
      println("goodbye!")
    }

    EventLoop.run()
  }
}
  • Returns a () after duration
  • Can also do repeating schedules
  • Chainable with map/flatMap/etc

Streams (take 1)

  trait Pipe[I,O] {}

  def main(args:Array[String]):Unit = {
    val p = FilePipe(c"./data.txt")
    .map { d => 
      println(s"consumed $d") 
      d
    }.addDestination(Tokenizer("\n"))
    .addDestination(Tokenizer(" "))
    .map { d => d + "\n" }
    .addDestination(FileOutputPipe(c"./output.txt", false))
    EventLoop.run()
  }
  • First attempt at IO relied on Pipe[I,O] as a primitive
  • Reactive streams don't model raw sockets or files well
  • Redesign goal: provide a base for high-level IO
  • Reactive Streams, Monix, Cats-effect, ZIO, etc.

Streams, take 2

trait Stream {
  def open(fd:Int): Stream
  def stream(fd:Int)(f:String => Unit):Future[Unit]
  def write(s:String):Future[Unit]
  def pause():Unit
  def close():Future[Unit]
}

  def main(args:Array[String]):Unit = {
    val stdout = open(STDOUT)
    Pipe(STDIN).stream { line =>
        println(s"read line $line")
        val split = line.trim().split(" ")
        for (word <- split) {
            stdout.write(word + "\n")
        }
    }
    EventLoop.run()
  }
  • Exposes all LibUV capabilities, including pausing
  • No type parameters, strings only
  • Open question: Future vs Callbacks?

Curl

Curl.get(c"https://www.example.com").map { response =>
          Response(200,"OK",Map(),response.body)
}

Curl gives us a highly capable service client, basically for free

  • Massively scalable
  • Fully integrated with LibUV polls/timers
  • HTTPS support
  • FTP, SCP, many other protocols
  • Goal - STTP and requests-scala support
  • Hard problem - is there a "primitive" low-level API we could provide?

Imperative Server API

trait Server {
  def serve(port:Int)(handler:(Request,Conection) => Unit):Unit
}

def main(args:Array[String]):Unit = {
  Server.serve(9999) { request, connection =>
    val response = makeResponse(request)
    connectionHandle.sendResponse(response)
  }
  EventLoop.run()
}
  • an imperative-style callback lets us pass a request and a capability to respond to a handler
  • this lets us abstract over sync/async/streaming
  • is there a way to avoid defining a Request and Response type?
  • It would be great if Scala had something like Rack/WSGI.  Lightweight Servlets?
  • Sca-lets?

Server DSL(s)

  def main(args:Array[String]):Unit = {
    Service()
      .getAsync("/async") { r => Future { 
          s"got (async routed) request $r"
        }.map { message => OK(
            Map("asyncMessage" -> message)      
          )
        }
      }      
      .get("/") { r => OK {
          Map("default_message" -> s"got (default routed) request $r")
        }
      }
      .run(9999)
  }
  • We're shipping a high-level service API as well
  • Async support, JSON support
  • Router is rudimentary
  • Next steps: port Cask, play.api.routing.sird, others?

All Of the Above

  def main(args:Array[String]):Unit = {
    Service()
      .getAsync("/async") { r => Future { 
          s"got (async routed) request $r"
        }.map { message => OK(
            Map("asyncMessage" -> message)      
          )
        }
      }      
      .getAsync("/fetch/example") { r => 
        Curl.get(c"https://www.example.com").map { response =>
          Response(200,"OK",Map(),response.body)
        }
      }
      .get("/") { r => OK {
          Map("default_message" -> s"got (default routed) request $r")
        }
      }
      .run(9999)
  }

Because LibUV coordinates everything, all of these capabilities can be mixed and matched

Performance

  • Raw HTTP request throughput metrics aren't representative
  • Aiming for high hundreds/low thousands of requests/second on benchmarks with good QOS
  • Real impact is service density:
    • < 1 CPU/instance
    • 100-200 MB RAM
    • < 10 MB binary/container
  • Versus realistic JVM Scala usage:
    • 2-4x+ CPU 
    • 5x+ RAM / instance
    • 10x+ larger footprint
  • Real scenario: 8 clustered services, 3 - 5 instances/service

So how do we implement it?

Systems Programming 101

To implement what we've seen, we will need two techniques from systems programming:

  • pointers
  • unsafe casts

A pointer is a representation of the location of a piece of data.  A Ptr[T] stores the location of a T, as a 8-byte unsigned integer.

An unsafe cast lets us treat a Ptr[T] as any other type, typically another Ptr[X] or just Long

Data structures designed for generic programming in C often have a "void pointer" field for custom data structures.

In Scala Native we represent these as Ptr[Byte]

Example: Memory Allocation

val raw_data:Ptr[Byte] = malloc(sizeof[Long])

val long_ptr:Ptr[Long] = raw_data.asInstanceOf[Ptr[Long]]

// the ! operator updates a pointer's contents on the left-hand side
!long_ptr:Ptr = 0

// on the right-hand side, it dereferences the pointer to read its value
printf(c"pointer at %p has value %d\n", long_ptr:Ptr, !long_ptr:Ptr)

!int_ptr = 1
printf(c"pointer at %p has value %d\n", long_ptr:Ptr, !long_ptr:Ptr)

//this will segfault
free(raw_data)
printf(c"pointer at %p has value %d\n", long_ptr:Ptr, !long_ptr:Ptr)

not shown: arrays and pointer arithmetic

Example: C FFI

@extern
object Quicksort {
  type Comparator = CFuncPtr2[Ptr[Byte],Ptr[Byte],Int]
  def qsort(array:Ptr[Byte],num:CSize,size:CSize,cmp:Comparator):Unit = extern
}
  • Scala Native code can call C functions
  • C functions can call SN code too!
  • But C functions aren't like Scala functions: they are static
  • This constrains some design patterns

Example: C FFI

@extern
object Quicksort {
  type Comparator = CFuncPtr2[Ptr[Byte],Ptr[Byte],Int]
  def qsort(array:Ptr[Byte],num:CSize,size:CSize,cmp:Comparator):Unit = extern
}
object App {
  type MyStruct = CStruct3[CString,CString,Int]
  val myStructComp = new Comparator {
    def apply(left:Ptr[Byte],right:[Byte]):Int = {
      val l = left.asInstanceOf[Ptr[MyStruct]]
      val r = right.asInstanceOf[Ptr[MyStruct]]
      l - r
    }
  }

  def main(args:Array[String]):Unit = {
    // ...
    val data:Ptr[MyStruct] = ???
    val data_size = ???
    qsort(data,data_size,sizeof[MyStruct],myStructComp)
    // ...
  }
}

Greenspun's tenth rule

  • Void pointers, function arguments, and casts are C's affordances for generic programming
  • This technique feels closer to a dynamic language than a type-safe one, but it suffices
  • LibUV's callbacks will let us insert enough Scala Native code into the event loop to provide a full concurrent environment (without leaking unsafe details)
  • So how does it work?

"Any sufficiently complicated C or Fortran program contains an ad hoc, informally specified, bug ridden, and slow implementation of half of Common Lisp"

ExecutionContext from Scratch

  • Scala Native includes an EC already
  • The catch - it runs after main() returns
object ExecutionContext {
  def global: ExecutionContextExecutor = QueueExecutionContext

  private object QueueExecutionContext extends ExecutionContextExecutor {
    def execute(runnable: Runnable): Unit = queue += runnable
    def reportFailure(t: Throwable): Unit = t.printStackTrace()
  }

  private val queue: ListBuffer[Runnable] = new ListBuffer

  private[runtime] def loop(): Unit = {  // this runs after main() returns
    while (queue.nonEmpty) {
      val runnable = queue.remove(0)
      try {
        runnable.run()
      } catch {
        case t: Throwable =>
          QueueExecutionContext.reportFailure(t)
      }
    }
  }
}

LibUV's event loop

We just need to adapt a queue-based EC to libuv's lifecycle of callbacks
 

  • We queue up work
  • A prepare handle run immediately prior to IO
  • It runs tasks until the queue is exhausted
  • When there are no more tasks and no more IO, we are done!
  • The catch - how do we track IO that isn't a Future?

EventLoop and LoopExtensions

trait EventLoopLike extends ExecutionContextExecutor {
  def addExtension(e:LoopExtension):Unit 
  def run(mode:Int = UV_RUN_DEFAULT):Unit
}

trait LoopExtension {
  def activeRequests():Int
}

The LoopExtension trait lets us coordinate Future execution with other IO tasks on the same loop, and modularize our code.

Our EventLoop

object EventLoop extends EventLoopLike {
  val loop = uv_default_loop()

  private val taskQueue = ListBuffer[Runnable]()
  def execute(runnable: Runnable): Unit = taskQueue += runnable
  def reportFailure(t: Throwable): Unit = {
    println(s"Future failed with Throwable $t:")
    t.printStackTrace()
  }
  // ...

execute() is invoked as soon as a Future is ready to start running, but we can defer it until a callback fires

Our EventLoop callback

  // ...
  private def dispatchStep(handle:PrepareHandle) = {
    while (taskQueue.nonEmpty) {
      val runnable = taskQueue.remove(0)
      try {
        runnable.run()
      } catch {
        case t: Throwable => reportFailure(t)
      }
    }
    if (taskQueue.isEmpty && !extensionsWorking) {
      println("stopping dispatcher")
      LibUV.uv_prepare_stop(handle)
    }
  }

  private val dispatcher_cb = CFunctionPtr.fromFunction1(dispatchStep)

  private def initDispatcher(loop:LibUV.Loop):PrepareHandle = {
    val handle = stdlib.malloc(uv_handle_size(UV_PREPARE_T))
    check(uv_prepare_init(loop, handle), "uv_prepare_init")
    check(uv_prepare_start(handle, dispatcher_cb), "uv_prepare_start")
    return handle
  }

  private val dispatcher = initDispatcher(loop)
  // ...

LoopExtensions

  private val extensions = ListBuffer[LoopExtension]()

  private def extensionsWorking():Boolean = {
    extensions.exists( _.activeRequests > 0)
  }

  def addExtension(e:LoopExtension):Unit = {
    extensions.append(e)
  }

All the actual IO work will be implemented as LoopExtensions

 

We'll implement the simplest one, a delay.

Timer

object Timer extends LoopExtension {
  EventLoop.addExtension(this)

  var serial = 0L
  var timers = mutable.HashMap[Long,Promise[Unit]]() // the secret sauce

  override def activeRequests():Int = 
    timers.size

  def delay(dur:Duration):Future[Unit] = ???

  val timerCB:TimerCB = ???
}

@extern
object TimerImpl {
  type Timer: Ptr[Long] // why long and not byte?
  type TimerCB = CFuncPtr1[TimerHandle,Unit]
  def uv_timer_init(loop:Loop, handle:TimerHandle):Int = extern
  def uv_timer_start(handle:TimerHandle, cb:TimerCB, 
                     timeout:Long, repeat:Long):Int = extern
}

How is it safe to treat Timer as Ptr[Long]?

Working with Opaque Structs

Sometimes it's necessary to work with a data structure without knowing its internal layout.

Ptr[CStruct3[Ptr[Byte],Float,CString]]

Ptr[CStruct1[Ptr[Byte]]

Ptr[CStruct1[Long]]

Ptr[Long]

(No safety guarantees)

  • Allocation and initialization is hard, we'll need help
  • "type puns" allow us to cast data between unrelated types:

Timer

  def delay(dur:Duration):Future[Unit] = {
    val millis = dur.toMillis

    val promise = Promise[Unit]()
    serial += 1
    val timer_id = serial
    timers(timer_id) = promise

    val timer_handle = stdlib.malloc(uv_handle_size(UV_TIMER_T))
    uv_timer_init(EventLoop.loop,timer_handle)
    !timer_handle = timer_id
    uv_timer_start(timer_handle, timerCB, millis, 0)

    promise.future
  }

We can store an 8-byte serial number in the TimerHandle, and retrieve it in our callback.

Timer

  val timerCB = new TimerCB { 
    def apply(timer_handle:TimerHandle):Unit = {
      println("callback fired!")
      val timer_id = !timer_handle
      val timer_promise = timers(timer_id)
      timers.remove(timer_id)
      println(s"completing promise ${timer_id}")
      timer_promise.success(())
    }
  }

We can dereference the TimerHandle safely - the compiler thinks it's a Ptr[Long] so it only reads the first 8 bytes.

 

Then we use the serial number for a map lookup to retrieve our state.

Timer

object Main {
  implicit val ec:ExecutionContext = EventLoop

  def main(args:Array[String]):Unit = {
    Timer.delay(3 seconds).flatMap { _ =>
      println("beep")
      Timer.delay(2 seconds)
    }.flatMap { _ =>
      println("boop")
      Timer.delay(1 second)
    }.onComplete { _ =>
      println("done")
    }

    EventLoop.run()
  }
}

It just works!

What's Next?

  • Improve support in SN for user-supplied event loop
  • STTP has great (blocking) curl/native support
  • Plan to spin out server API
  • Get feedback on streaming IO API

Building a Community

  • All of this needs contributors to be sustainable
  • There are a LOT of low-hanging fruit out there
  • Weekend-scale side projects can have a big impact!
  • Huge shout out to Scala Native's contributors
  • Get involved!

Thanks!

@RichardWhaling

Fast, Simple Concurrency with Scala Native

By Richard Whaling

Fast, Simple Concurrency with Scala Native

  • 914