Bootstrapping the Web

with Scala Native

Richard Whaling

Spantree Technology Group

This talk is about:

  • Scala Native
  • Systems programming
  • Server programming

but also:

  • Working with emerging technology
  • Improvised solutions
  • OS as platform
  • Language as platform

(or how to get things done without the JVM)

Talk Outline

  1. Introduction to Scala Native
  2. Introduction to Server Programming
  3. Minimal Viable Server
  4. Multiplexed Protocols
  5. Multiplexed I/O
  6. Reflections

About Me

Twitter: @RichardWhaling

https://spantree.net/blog/

Scala Native is:

  1. Scala!
  2. A scalac/sbt plugin
  3. An LLVM-based AOT compiler
  4. Great for command-line tools
  5. No JVM
  6. Includes implementations of some JDK classes
  7. Types and Annotations for C interop

The Basics

object Hello {
    def main(args: Array[String]):Unit = {
        println("Hello, CASE!")
    }
}

This just works!

Structs and Pointers

type Vec = CStruct3[Double, Double, Double]

val vec:Ptr[Vec] = stackalloc[Vec]
!vec._1 = 10.0              // initialize fields
!vec._2 = 20.0
!vec._3 = 30.0
length(vec)                 // pass by reference

Interop

@extern object stdlib {
  def malloc(size: CSize): Ptr[Byte] = extern
  def free(ptr: Ptr[Byte]): Unit = extern  // C's free() returns void
}

val ptr = stdlib.malloc(32)
stdlib.free(ptr)

Included Libraries

  1. Implementations of hundreds of JDK classes
  2. Partial ANSI C Bindings
  3. Partial POSIX C Bindings

What can we do?

What can't we do?

What does a server do?

A typical server:

  1. Listens on a known port
  2. Accepts incoming connections
  3. Reads requests from clients
  4. Writes back responses

The catch: it has to do all of these at the same time...

with (traditionally) blocking system calls.

TCP Socket System Calls

socket()  -- initializes a new socket and selects protocol
bind()    -- assigns an address and port to a socket
listen()  -- begins accepting incoming connections on a bound socket
accept()  -- takes an incoming connection off the OS backlog

connect() -- initiates an outgoing connection on an unbound socket

read()/recv()/recvmsg()   -- reads bytes from a connected socket
write()/send()/sendmsg()  -- writes bytes to a connected socket
close()                   -- closes a connected socket
ioctl()/setsockopt()/fcntl() -- evil

Berkeley Socket Dance

Server: socket() → bind() → listen() → accept() → read() → write() → close()
Client: socket() → connect() → write() → read() → close()

Berkeley Socket Dance

    def serve(port:UShort): Unit = {
        // Allocate and initialize address struct
        val addr_size = sizeof[sockaddr_in]
        val server_address = malloc(addr_size).cast[Ptr[sockaddr_in]] 
        !server_address._1 = AF_INET.toUShort  // IP Socket
        !server_address._2 = htons(port)       // port
        !server_address._3._1 = INADDR_ANY     // bind to 0.0.0.0

        // Bind and listen on socket
        val sock_fd = socket(AF_INET, SOCK_STREAM, 0) // SOCK_STREAM indicates TCP and not UDP
        val bind_result = bind(sock_fd, server_address.cast[Ptr[sockaddr]], addr_size.toUInt)
        println(s"bind returned $bind_result")
        val listen_result = listen(sock_fd, 128)     
        println(s"listen returned $listen_result")

        // Allocate and initialize client address struct
        val client_address = malloc(addr_size).cast[Ptr[sockaddr_in]]
        val client_addr_size = stackalloc[UInt]
        !client_addr_size = addr_size.toUInt
   
        // Main accept() loop
        while (true) {
            val conn_fd = accept(sock_fd, client_address.cast[Ptr[sockaddr]], client_addr_size)
            println(s"accept returned fd $conn_fd")
            handleConnection(conn_fd)
        }
        close(sock_fd)
    }

Berkeley Socket Dance

    def handleConnection(conn_fd:Int, max_size:Int = 1024): Unit = {
        val line_buffer = malloc(max_size)
        while (true) {
            // Leave one byte free for the string-end marker
            val read_result = read(conn_fd, line_buffer, max_size - 1)
            println(s"read $read_result bytes")
            if (read_result <= 0) { // 0 is EOF, negative is an error
                free(line_buffer)
                close(conn_fd)
                return
            }
            line_buffer(read_result) = 0 // Append a string-end marker
            val write_result = write(conn_fd, line_buffer, read_result)
            println(s"wrote $write_result bytes")
        }
    }

What happens when a new connection comes in?

Introducing fork()

  • fork() clones a process in-place

  • one process calls, two return

  • parent-child relationship

  • parent is responsible for supervising the child

  • if a child exits, it stays as a "zombie" until "reaped"

Introducing fork()

Server: socket() → bind() → listen() → accept() → fork() → read() → write() → close()
Client: socket() → connect() → write() → read() → close()

Introducing fork()

    def handleConnection(conn_fd:Int, max_size:Int = 1024): Unit = {
        val pid = fork()
        if (pid != 0) { 
            // In parent process
            println(s"forked pid $pid to handle connection")
            close(conn_fd)
            return
        } else {
            // In child process
            println(s"fork returned $pid, in child process")
            val line_buffer = malloc(max_size)
            while (true) {
                // Leave one byte free for the string-end marker
                val read_result = read(conn_fd, line_buffer, max_size - 1)
                println(s"read $read_result bytes")
                if (read_result <= 0) { // 0 is EOF, negative is an error
                    // Cleanup
                    close(conn_fd)
                    sys.exit()
                }
                line_buffer(read_result) = 0
                val write_result = write(conn_fd, line_buffer, read_result)
                println(s"wrote $write_result bytes")
            }
        }
    }

Downsides

  • Testing
  • Robustness
  • Portability
  • General sanity

How can we avoid writing our own socket code?

The Unix Philosophy

  • Write programs that do one thing and do it well.
  • Write programs to work together.
  • Write programs to handle text streams, because that is a universal interface.

Peter H. Salus,

from A Quarter Century of UNIX

The Unix Philosophy

Conjecture: HTTP is a solved problem.

What is the simplest way to use a stable HTTP server for a SN app?

How does it perform?

Introducing exec()

  • Actually a family of 6 very similar functions

  • Executes a brand-new program; returns only if the exec fails

  • Can set arguments and environment variables

  • New program inherits open file descriptors

Introducing exec()

Server: socket() → bind() → listen() → accept() → fork() → exec() → ?
Client: socket() → connect() → write() → read() → close()

Introducing exec()

    def handleConnectionExec(conn_fd:Int, path:CString, args:Ptr[CString]): Unit = {
        val pid = fork()
        if (pid != 0) {
            println(s"forked pid $pid to handle connection")
            close(conn_fd)
            return
        } else {
            println(s"fork returned $pid, in child process")
            execv(path, args) // only returns if the exec failed
        }
    }

Almost there!

  • fork()/exec() is enough for stream-oriented services.
  • HTTP adds a request/response protocol
  • HTTP introduces "resources" and other metadata
  • RFC 2616 (HTTP/1.1) is 176 pages long

What we need:

  • Generic handling of concurrent HTTP connections
  • Flexible routing of requests to various programs
  • Simple request/response protocol for handlers

Apache httpd

  • Traditional prefork based web server*
  • Directly descended from NCSA httpd
  • May or may not pun on "a patchy" web server

CGI

  • Isolated processes per request
  • All communication over standard file IO
  • Headers and params in environment
  • Can be implemented in bash, perl, awk, C...
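
As a rough sketch of "headers and params in environment": CGI (RFC 3875) hands request metadata to the handler in variables such as REQUEST_METHOD, PATH_INFO and QUERY_STRING. The names `CgiRequest` and `fromEnv` below are hypothetical, and the environment is passed in as a plain Map (an assumption made for testability) rather than read through getenv.

```scala
// Hypothetical helper: collects the CGI metadata a handler usually needs.
final case class CgiRequest(method: String, pathInfo: String, queryString: String)

object CgiRequest {
  // `env` is passed in explicitly so the function can be tried without a web server.
  // Missing variables fall back to defaults instead of failing.
  def fromEnv(env: Map[String, String]): CgiRequest =
    CgiRequest(
      method      = env.getOrElse("REQUEST_METHOD", "GET"),
      pathInfo    = env.getOrElse("PATH_INFO", ""),
      queryString = env.getOrElse("QUERY_STRING", "")
    )
}
```

A handler would call something like `CgiRequest.fromEnv(sys.env)` once at startup, since each CGI request gets its own process.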

A Minimal CGI Handler

object Main {
    def main(args: Array[String]): Unit = {
        // print, not println: the header block already ends with a blank line
        print("Content-type: text/html\r\n\r\n")
        println("Hello, Strangeloop!")
    }
}

Building the app

# notice the FROM - AS structure
FROM scala-native-base-build AS build 

# Set up the directory structure for our project
RUN mkdir -p /root/project-build/project
WORKDIR /root/project-build

# Resolve all our dependencies and plugins to speed up future compilations
ADD ./project/plugins.sbt project/
ADD ./project/build.properties project/
ADD build.sbt .
RUN sbt update

# Add and compile our actual application source code
ADD . /root/project-build/
RUN sbt clean nativeLink

# Copy the binary executable to a consistent location
RUN cp ./target/scala-2.11/*-out ./dinosaur-build-out

Packaging the app

# Start over from a clean Alpine image, in the same Dockerfile
FROM alpine:3.3

# Copy in C libraries from previous build
COPY --from=build \
   /usr/lib/libunwind.so.8 \
   /usr/lib/libunwind-x86_64.so.8 \
   /usr/lib/libgc.so.1 \
   /usr/lib/libstdc++.so.6 \
   /usr/lib/libgcc_s.so.1 \
   /usr/lib/
COPY --from=build \
   /usr/local/lib/libre2.so.0 \
   /usr/local/lib/libre2.so.0

# Copy in the executable
COPY --from=build \
   /root/project-build/dinosaur-build-out /var/www/localhost/cgi-bin/app

COPY httpd.conf /etc/apache2/httpd.conf
COPY mpm.conf /etc/apache2/mpm.conf

RUN apk --update add apache2 apache2-utils

RUN mkdir -p /run/apache2
ADD apache.entrypoint.sh /root/

ENTRYPOINT "/root/apache.entrypoint.sh"

Does it work?

A CGI Micro-framework

object main {
  def main(args: Array[String]): Unit = {
    Router.init()
          .get("/")("<H1>Welcome to Dinosaur!</H1>")
          .get("/hello") { request =>
            "Hello World!"
          }
          .get("/who")( request =>
            request.pathInfo() match {
              case Seq("who") => "Who's there?"
              case Seq("who",x) => "Hello, " + x
              case Seq("who",x,y) => "Hello both of you"
              case _ => "Hello y'all!"
            }
          )
          .get("/bye")( request =>
            request.params("who")
                   .map { x => "Bye, " + x }
                   .mkString(". ")
          )
          .dispatch()
  }
}

A CGI Micro-framework

trait Router {
  def handle(method: Method, path:String)(f: Request => Response):Router
  def get(path:String)(f: Request => Response):Router = handle(GET, path)(f)
  def post(path:String)(f: Request => Response):Router = handle(POST, path)(f)
  def put(path:String)(f: Request => Response):Router = handle(PUT, path)(f)
  def delete(path:String)(f: Request => Response):Router = handle(DELETE, path)(f)
  def dispatch(): Unit
}

case class Request(
  method: Function0[Method],
  pathInfo: Function0[Seq[String]],
  params: Function1[String, Seq[String]]
)

case class Response(
  body: ResponseBody,
  statusCode: Int = 200,
  headers: Map[String, String] = Map("Content-type" -> "text/html; charset=utf-8")
)

A CGI Micro-framework

object CgiUtils {
  def env(key: CString): String = {
    val lookup = stdlib.getenv(key)
    if (lookup == null) {
      ""
    } else {
      fromCString(lookup)
    }
  }

  def parsePathInfo(pathInfo: String): Seq[String] = {
    pathInfo.split("/").filter( _ != "" )
  }

  def parseQueryString(queryString: String): Function1[String, Seq[String]] = {
    // Skip empty segments so an empty query string yields no pairs
    val pairs = queryString.split("&").filter(_.nonEmpty).map( pair =>
      pair.split("=", 2) match {
        case Array(key, value) => (key,value)
        case Array(key) => (key,"") // key with no "=value" part
      }
    ).groupBy(_._1).toSeq
    val groupedValues = for ( (k,v) <- pairs;
                               values = v.toSeq.map(_._2) )
                        yield (k -> values)
    groupedValues.toMap.getOrElse(_,Seq.empty)
  }
}

A CGI Micro-framework

case class CGIRouter(handlers:Seq[Handler]) extends Router {
  def dispatch(): Unit = {
    val request = Router.parseRequest()
    val matches = for ( h @ Handler(method, pattern, handler) <- this.handlers
                        if request.method() == method
                        if request.pathInfo().startsWith(pattern)) yield h
    val bestHandler = matches.maxBy( _.pattern.size )
    val response = bestHandler.handler(request)
    for ( (k,v) <- response.inferHeaders ) {
      System.out.println(k + ": " + v)
    }
    System.out.println()
    System.out.println(response.bodyToString)
  }
}
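
One detail worth noting in the dispatch logic: `maxBy` throws on an empty collection, so a request that matches no handler would crash the process. Below is a minimal sketch of the same longest-prefix selection returning an Option instead. `Route` and `RouteMatch` are hypothetical stand-ins, not the framework's actual types.

```scala
// Hypothetical stand-in for the Handler type used by the router.
final case class Route(method: String, pattern: Seq[String], name: String)

object RouteMatch {
  // Returns the route with the longest pattern that is a prefix of `path`,
  // or None when no route matches (instead of throwing).
  def best(routes: Seq[Route], method: String, path: Seq[String]): Option[Route] = {
    val matches = routes.filter(r => r.method == method && path.startsWith(r.pattern))
    if (matches.isEmpty) None else Some(matches.maxBy(_.pattern.size))
  }
}
```

The caller can then map None to a 404 response rather than letting the handler process die.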

Performance

40 ms mean response with 10 users

99th percentile response goes over 1s at 150 users

mean response plateaus around 500 ms at 300 users

peaks around 400 requests/sec

Compared to a python-based CGI app, which exhibits:

136 ms mean response with 10 users

99th percentile response goes over 1s at 75 users

mean response plateaus around 5s at 250 users

peaks around 200 requests/sec

Performance

But compared to a trivial node.js/Express app:

 

median response 7 ms with 10 users

99th percentile stays under 1s up to 2000 users

error rate approaches 15% around 500 users

peaks around 2000 requests/sec

 

 

 

Performance

How can we do better?

 

 

  1. Multiplexed Protocols
  2. Multiplexed I/O

How can we do better without spending years of our lives?

Multiplexed Protocols

  • Technique for combining streams onto a single connection
  • Relies on a proxy server to handle raw HTTP
  • All requests and responses are "framed" with an identifier
  • Proxy is responsible for routing responses to correct client.

Two prominent examples:

  1. FastCGI
  2. HTTP/2

 

Diagram: HTTP Client (x3) ↔ HTTP ↔ Web Server ↔ FastCGI ↔ FastCGI Application

FastCGI

What makes FastCGI different from regular CGI?

 

  1. Persistent processes
  2. Persistent connections
  3. Multiplexed requests
  4. Framed strings + metadata

One catch -- we need a socket.

But do we need concurrency?

FastCGI

Parsing algorithm:

Read 8-byte header from socket
Extract type, Request ID, length, padding from header
Read (length + padding) bytes from socket
if (type == FCGI_STDIN and length == 0):
    request is complete, invoke handler and write response
else:
    append to pending buffers for Request ID
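
The steps above can be sketched in plain Scala over an in-memory byte array. This is an illustration only: `FcgiScan` and `scan` are hypothetical names, the input is assumed to contain whole records, and pending data is keyed by (request ID, record type) as one possible reading of "pending buffers".

```scala
object FcgiScan {
  val FCGI_STDIN = 5 // record type for request body data in the FastCGI spec

  // Walks `input` record by record. Returns the IDs of requests whose STDIN
  // stream has ended (an empty FCGI_STDIN record) and the content collected
  // so far, keyed by (request ID, record type).
  def scan(input: Array[Byte]): (Vector[Int], Map[(Int, Int), Vector[Byte]]) = {
    var pos = 0
    var complete = Vector.empty[Int]
    var pending = Map.empty[(Int, Int), Vector[Byte]]
    while (pos + 8 <= input.length) {
      val recType = input(pos + 1) & 0xFF
      val reqId   = ((input(pos + 2) & 0xFF) << 8) | (input(pos + 3) & 0xFF)
      val length  = ((input(pos + 4) & 0xFF) << 8) | (input(pos + 5) & 0xFF)
      val padding = input(pos + 6) & 0xFF
      val content = input.slice(pos + 8, pos + 8 + length).toVector
      if (recType == FCGI_STDIN && length == 0) {
        complete = complete :+ reqId // request is complete
      } else {
        val key = (reqId, recType)
        pending = pending.updated(key, pending.getOrElse(key, Vector.empty[Byte]) ++ content)
      }
      pos += 8 + length + padding // skip header, content and padding
    }
    (complete, pending)
  }
}
```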

FastCGI

  def readHeader(input: Ptr[Byte], offset:Long): RecordHeader = {
    val version = input(0 + offset) & 0xFF
    val rec_type = (input(1 + offset) & 0xFF) match {
      case 0 => FCGI_UNKNOWN_TYPE
      case 1 => FCGI_BEGIN_REQUEST
      case 2 => FCGI_ABORT_REQUEST
      case 3 => FCGI_END_REQUEST
      case 4 => FCGI_PARAMS
      case 5 => FCGI_STDIN
      case 6 => FCGI_STDOUT
      case 7 => FCGI_STDERR
      case 8 => FCGI_DATA
      case 9 => FCGI_GET_VALUES
      case 10 => FCGI_GET_VALUES_RESULT
      case _ => FCGI_UNKNOWN_TYPE
    }
    val req_id_b1 = (input(2 + offset) & 0xFF)
    val req_id_b0 = (input(3 + offset) & 0xFF)
    val req_id = (req_id_b1 << 8) + (req_id_b0 & 0xFF)
    val length = ((input(4 + offset) & 0xFF) << 8) + (input(5 + offset) & 0xFF)
    val padding = input(6 + offset) & 0xFF
    RecordHeader(version,rec_type,req_id,length,padding)
  }
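
For the response side, the same 8-byte layout has to be written back. A minimal sketch of the inverse of readHeader, assuming plain Scala arrays rather than Ptr[Byte]; `FcgiHeader.encode` is a hypothetical name.

```scala
object FcgiHeader {
  // Builds the 8-byte FastCGI record header: version, type, request ID
  // (2 bytes, big-endian), content length (2 bytes, big-endian),
  // padding length, and one reserved byte.
  def encode(recType: Int, reqId: Int, length: Int, padding: Int = 0): Array[Byte] =
    Array(
      1,                                    // FCGI_VERSION_1
      recType,
      (reqId >> 8) & 0xFF, reqId & 0xFF,
      (length >> 8) & 0xFF, length & 0xFF,
      padding,
      0                                     // reserved
    ).map(_.toByte)
}
```

For example, `FcgiHeader.encode(6, 1, 300)` describes an FCGI_STDOUT record for request 1 carrying 300 bytes of content.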

FastCGI

  def readParam(byteArray: Ptr[Byte], arr_offset:Long, length:Long)
               : (Ptr[Byte], Ptr[Byte], Long) = {
    val name_len_offset = arr_offset + 0
    // A length is 1 byte if its high bit is clear, otherwise 4 bytes (high bit masked off)
    val (name_len:Long, val_len_offset:Long) = 
      if ((byteArray(name_len_offset) & 0x80) == 0) {
        val len = byteArray(name_len_offset).toLong
        (len, arr_offset + 1)
      } else {
        val len = ((byteArray(name_len_offset) & 0x7F) << 24) +
                  ((byteArray(name_len_offset + 1) & 0xFF) << 16) +
                  ((byteArray(name_len_offset + 2) & 0xFF) << 8) +
                  (byteArray(name_len_offset + 3) & 0xFF)
        (len.toLong, arr_offset + 4)
      }

    val (val_len:Long, content_offset:Long) = 
      if ((byteArray(val_len_offset) & 0x80) == 0) {
        val len = byteArray(val_len_offset).toLong
        (len, val_len_offset + 1)
      } else {
        val len = ((byteArray(val_len_offset) & 0x7F) << 24) +
                  ((byteArray(val_len_offset + 1) & 0xFF) << 16) +
                  ((byteArray(val_len_offset + 2) & 0xFF) << 8) +
                  (byteArray(val_len_offset + 3) & 0xFF)
        (len.toLong, val_len_offset + 4)
      }
    val name = byteArray + content_offset
    val value = byteArray + content_offset + name_len
    val next_param_offset = content_offset + name_len + val_len
    (name, value, next_param_offset)
  }
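
The variable-width length encoding used for parameter names and values can be isolated into a small round-trip sketch. `FcgiLength` is a hypothetical helper over plain arrays, and it assumes non-negative lengths.

```scala
object FcgiLength {
  // Lengths up to 127 use one byte; larger ones use four bytes with the high bit set.
  def encode(len: Int): Array[Byte] =
    if (len < 128) Array(len.toByte)
    else Array(
      ((len >> 24) & 0x7F) | 0x80,
      (len >> 16) & 0xFF,
      (len >> 8) & 0xFF,
      len & 0xFF
    ).map(_.toByte)

  // Returns (decoded length, number of bytes consumed) starting at `offset`.
  def decode(bytes: Array[Byte], offset: Int): (Int, Int) =
    if ((bytes(offset) & 0x80) == 0) (bytes(offset) & 0xFF, 1)
    else {
      val len = ((bytes(offset) & 0x7F) << 24) |
                ((bytes(offset + 1) & 0xFF) << 16) |
                ((bytes(offset + 2) & 0xFF) << 8) |
                (bytes(offset + 3) & 0xFF)
      (len, 4)
    }
}
```

Returning the number of bytes consumed keeps the caller's offset arithmetic in one place instead of repeating the 1-or-4 decision for both name and value.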

Improvising a socket

#!/bin/bash
rm -f /tmp/app.socket
rm -f /tmp/app.fifo
mkfifo /tmp/app.fifo
nginx -g "daemon off;" &
export ROUTER_MODE=FCGI
nc -l -U /tmp/app.socket < /tmp/app.fifo | /var/www/localhost/cgi-bin/dinosaur-build-out > /tmp/app.fifo

Diagram: Nginx ↔ socket ↔ nc → app → fifo → nc

(better option: write a proxy in ~80 lines of Go)

Performance

  • mean response in 4ms under light load
  • 500 users -- .1% error rate, 283ms mean response
  • Backlog starts to overflow around 1000 users
  • Overflows register as fast refusals rather than timeouts
  • Peaks around 1500 requests/sec

Multiplexed I/O

  • Traditional options: select() and poll()
  • Non-standard options: epoll, kqueue, iocp*
  • All provide ways to poll the state of many sockets
  • Polls listening and connection sockets at once
  • "Quirky"
  • Tends to require use of ioctl(), setsockopt(), fcntl()
  • Not especially portable

Multiplexed I/O

listener = setUpListeningSocket()
pollSet = set(listener)
while true:
    readySockets = poll(pollSet)
    for socket in readySockets:
        if socket == listener:
            newConnection = accept(listener)
            pollSet.add(newConnection)
        else:
            if socket.readyToRead:
                read(socket)
            else if socket.readyToWrite:
                write(socket)

LibUV

LibUV, The node.js event loop:

  • cross-platform C library (Linux, BSD, Windows)
  • multiplexed IO on a single thread/single process.
  • backed by native async primitives: epoll/kqueue/iocp
  • callback-oriented API
  • strict memory management requirements

LibUV

@link("uv")
@extern
object LibUV {
  type PipeHandle = Ptr[Byte]
  type Loop = Ptr[Byte]
  type Buffer = CStruct2[Ptr[Byte],CSize]
  type WriteReq = Ptr[Ptr[Byte]]
  type ShutdownReq = Ptr[Ptr[Byte]]
  type Connection = Ptr[Byte]
  type ConnectionCB = CFunctionPtr2[PipeHandle,Int,Unit]
  type AllocCB = CFunctionPtr3[PipeHandle,CSize,Ptr[Buffer],Unit]
  type ReadCB = CFunctionPtr3[PipeHandle,CSSize,Ptr[Buffer],Unit]
  type WriteCB = CFunctionPtr2[WriteReq,Int,Unit]
  type ShutdownCB = CFunctionPtr2[ShutdownReq,Int,Unit]
  type CloseCB = CFunctionPtr1[PipeHandle,Unit]

  def uv_default_loop(): Loop = extern
  def uv_loop_size(): CSize = extern
  def uv_handle_size(h_type:Int): CSize = extern
  def uv_req_size(r_type:Int): CSize = extern
  def uv_pipe_init(loop:Loop, handle:PipeHandle, ipcFlag:Int ): Int = extern
  def uv_pipe_bind(handle:PipeHandle, socketName:CString): Int = extern
  def uv_listen(handle:PipeHandle, backlog:Int, callback:ConnectionCB): Int = extern
  def uv_accept(server:PipeHandle, client:PipeHandle): Int = extern
  def uv_read_start(client:PipeHandle, allocCB:AllocCB, readCB:ReadCB): Int = extern
  def uv_write(writeReq:WriteReq, client:PipeHandle, bufs: Ptr[Buffer], numBufs: Int, writeCB:WriteCB): Int = extern
  def uv_read_stop(client:PipeHandle): Int = extern
  def uv_shutdown(shutdownReq:ShutdownReq, client:PipeHandle, shutdownCB:ShutdownCB): Int = extern
  def uv_close(handle:PipeHandle, closeCB: CloseCB): Unit = extern
  def uv_run(loop:Loop, runMode:Int): Int = extern
}

LibUV

  def dispatch(): Unit = {
    val loop:Loop = uv_default_loop()
    val pipe_size = uv_handle_size(7)
    val pipe:PipeHandle = stackalloc[Byte](pipe_size)
    uv_pipe_init(loop, pipe, 0)

    var r = uv_pipe_bind(pipe, c"/tmp/app.socket")
    println(s"uv_pipe_bind returned $r")

    r = uv_listen(pipe, 4096, onConnectCB)
    println(s"uv_listen returned $r")

    r = uv_run(loop, 0)
    println(s"uv_run returned $r")
  }

  def onConnect(server:PipeHandle, status:Int): Unit = {
    println("connection received!")
    val client:PipeHandle = stdlib.malloc(pipe_size)
    uv_pipe_init(loop, client, 0)
    var r = uv_accept(server, client)
    println(s"uv_accept returned $r")
    uv_read_start(client, onAllocCB, onReadCB)
  }
  val onConnectCB = CFunctionPtr.fromFunction2(onConnect)

LibUV

  def onRead(pipe:PipeHandle, size:CSSize, buffer:Ptr[Buffer]): Unit = {
    if (size >= 0) {
      var position = 0
      // We are going to store the positions of the CGI parameter and STDIN frames
      var params:(Int,RecordHeader) = (0,null)
      var stdin:(Int,RecordHeader) = (0,null)
      // Scan the input buffer for the positions of useful metadata
      while (position < size) {
        val header = readHeader(!buffer._1,position)
        reqId = header.reqId
        if (header.rec_type == FCGI_PARAMS & header.length > 0)
          params = (position,header)
        else if (header.rec_type == FCGI_STDIN & header.length > 0)
          stdin = (position, header)
        position += (8 + header.length + header.padding)
      }
      // Generate a response and enqueue it to the pipe (re-use the input buffer for output)
      val write_req:WriteReq = stdlib.malloc(write_req_size).cast[WriteReq]
      !write_req = !buffer._1
      !buffer._2 = makeResponse(reqId, params, stdin, !write_req)
      uv_write(write_req, pipe, buffer, 1, onWriteCB)
    } else {
      // a negative size signals EOF or a read error, so close the connection
      uv_read_stop(pipe)
      val shutdownReq = stdlib.malloc(shutdown_req_size).cast[ShutdownReq]
      !shutdownReq = pipe
      uv_shutdown(shutdownReq, pipe, myShutdownCB)
      stdlib.free(!buffer._1)
    }
  }
  val onReadCB = CFunctionPtr.fromFunction3(onRead)

Performance

When deployed on a UNIX socket behind nginx:

  • mean response 4ms under light load
  • with 1000 users:
    • mean response 140ms
    • error rates about 1/2 of node
    • no timeouts

🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉

Reflections

What do our languages really need to provide?

 

Does serving up HTTP belong in our app or in infrastructure?

What can we expect from our OS?

What can we expect from our cluster?

Things are about to change.
