Bootstrapping the Web with Scala Native

Richard Whaling

Spantree Technology Group

This talk is about:

Scala Native
Systems programming
Server programming

but also:

Working with emerging technology
Improvised solutions
OS as platform
Language as platform

(or how to get things done without the JVM)

Talk Outline

Introduction to Scala Native
Introduction to Server Programming
Minimal Viable Server
Multiplexed Protocols
Multiplexed I/O
Reflections

About Me

Twitter: @RichardWhaling

https://spantree.net/blog/

Scala Native is:

Scala!
A scalac/sbt plugin
An LLVM-based AOT compiler
Great for command-line tools
No JVM
Includes implementations of some JDK classes
Types and Annotations for C interop

The Basics

object Hello {
    def main(args: Array[String]):Unit = {
        println("Hello, CASE!")
    }
}

This just works!

Structs and Pointers

type Vec = CStruct3[Double, Double, Double]

val vec:Ptr[Vec] = stackalloc[Vec]
!vec._1 = 10.0              // initialize fields
!vec._2 = 20.0
!vec._3 = 30.0
length(vec)                 // pass by reference

Interop

@extern object stdlib {
  def malloc(size: CSize): Ptr[Byte] = extern
  def free(ptr: Ptr[Byte]): Unit = extern  // C's free() returns void
}

val ptr = stdlib.malloc(32)
stdlib.free(ptr)

Included Libraries

Implementations of hundreds of JDK classes
Partial ANSI C Bindings
Partial POSIX C Bindings

What can we do?

What can't we do?

What does a server do?

Listens on a known port
Accepts incoming connections
Reads requests from clients
Writes back responses

A typical server:

The catch: it has to do all of these at the same time...

with (traditionally) blocking system calls.

TCP Socket System Calls

socket()  -- initializes a new socket and selects protocol
bind()    -- assigns an address and port to a socket
listen()  -- begins accepting incoming connections on a bound socket
accept()  -- takes an incoming connection off the OS backlog

connect() -- initiates an outgoing connection on an unbound socket

read()/recv()/recvmsg()   -- reads bytes from a connected socket
write()/send()/sendmsg()  -- writes bytes to a connected socket
close()                   -- closes a connected socket

ioctl()/setsockopt()/fcntl() -- evil

Berkeley Socket Dance

Server: socket() -> bind() -> listen() -> accept() -> read() -> write() -> close()
Client: socket() -> connect() -> write() -> read() -> close()

Berkeley Socket Dance

    def serve(port:UShort): Unit = {
        // Allocate and initialize address struct
        val addr_size = sizeof[sockaddr_in]
        val server_address = malloc(addr_size).cast[Ptr[sockaddr_in]] 
        !server_address._1 = AF_INET.toUShort  // IP Socket
        !server_address._2 = htons(port)       // port
        !server_address._3._1 = INADDR_ANY     // bind to 0.0.0.0

        // Bind and listen on socket
        val sock_fd = socket(AF_INET, SOCK_STREAM, 0) // SOCK_STREAM indicates TCP and not UDP
        val bind_result = bind(sock_fd, server_address.cast[Ptr[sockaddr]], addr_size.toUInt)
        println(s"bind returned $bind_result")
        val listen_result = listen(sock_fd, 128)     
        println(s"listen returned $listen_result")

        // Allocate and initialize client address struct
        val client_address = malloc(addr_size).cast[Ptr[sockaddr_in]]
        val client_addr_size = stackalloc[UInt]
        !client_addr_size = addr_size.toUInt
   
        // Main accept() loop
        while (true) {
            val conn_fd = accept(sock_fd, client_address.cast[Ptr[sockaddr]], client_addr_size)
            println(s"accept returned fd $conn_fd")
            handleConnection(conn_fd)
        }
        close(sock_fd)
    }

Berkeley Socket Dance

    def handleConnection(conn_fd:Int, max_size:Int = 1024): Unit = {
        val line_buffer = malloc(max_size)
        while (true) {
            // Leave room for the string-end marker appended below
            val read_result = read(conn_fd, line_buffer, max_size - 1)
            println(s"read $read_result bytes")
            if (read_result <= 0) { // 0 is EOF, negative is an error
                free(line_buffer)
                close(conn_fd)
                return
            }
            line_buffer(read_result) = 0 // Append a string-end marker
            val write_result = write(conn_fd, line_buffer, read_result)
            println(s"wrote $write_result bytes")
        }
    }

What happens when a new connection comes in?

Introducing fork()

fork() clones a process in-place
one process calls, two return
parent-child relationship
parent is responsible for supervising the child
if a child exits, it stays as a "zombie" until "reaped"

Server: socket() -> bind() -> listen() -> accept() -> fork() -> read() -> write() -> close()
Client: socket() -> connect() -> write() -> read() -> close()

Introducing fork()

    def handleConnection(conn_fd:Int, max_size:Int = 1024): Unit = {
        val pid = fork()
        if (pid != 0) {
            // In parent process
            println(s"forked pid $pid to handle connection")
            close(conn_fd)
            return
        } else {
            // In child process
            println(s"fork returned $pid, in child process")
            val line_buffer = malloc(max_size)
            while (true) {
                // Leave room for the string-end marker appended below
                val read_result = read(conn_fd, line_buffer, max_size - 1)
                println(s"read $read_result bytes")
                if (read_result <= 0) { // 0 is EOF, negative is an error
                    // Cleanup
                    close(conn_fd)
                    sys.exit()
                }
                line_buffer(read_result) = 0
                val write_result = write(conn_fd, line_buffer, read_result)
                println(s"wrote $write_result bytes")
            }
        }
    }

Downsides

Testing
Robustness
Portability
General sanity

How can we avoid writing our own socket code?

The Unix Philosophy

Write programs that do one thing and do it well.
Write programs to work together.
Write programs to handle text streams, because that is a universal interface.

Peter H. Salus,

from A Quarter Century of UNIX

The Unix Philosophy

Conjecture: HTTP is a solved problem.

What is the simplest way to use a stable HTTP server for a SN app?

How does it perform?

Introducing exec()

Actually a family of 6 very similar functions
Executes a brand-new program - cannot return
Can set arguments and environment variables
New program inherits open file descriptors

Introducing exec()

Server: socket() -> bind() -> listen() -> accept() -> fork() -> exec()
Client: socket() -> connect() -> write() -> read() -> close()

Introducing exec()

    def handleConnectionExec(conn_fd:Int, path:CString, args:Ptr[CString]): Unit = {
        val pid = fork()
        if (pid != 0) {
            println(s"forked pid $pid to handle connection")
            close(conn_fd)
            return
        } else {
            println(s"fork returned $pid, in child process")
            execv(path, args) // only returns if the exec itself fails
        }
    }

Almost there!

fork()/exec() is enough for stream-oriented services.
HTTP adds a request/response protocol
HTTP introduces "resources" and other metadata
RFC 2616 (HTTP/1.1) is 176 pages long

What we need:

Generic handling of concurrent HTTP connections
Flexible routing of requests to various programs
Simple request/response protocol for handlers

Apache httpd

Traditional prefork-based web server*
Directly descended from NCSA httpd
May or may not pun on "a patchy" web server

CGI

Isolated processes per request
All communication over standard file IO
Headers and params in environment
Can be implemented in bash, perl, awk, C...
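
A rough sketch of "headers and params in environment": the web server sets variables such as REQUEST_METHOD, PATH_INFO and QUERY_STRING (standard CGI names), and the handler writes headers, a blank line, and a body to standard output. The CgiRequest and CgiEnv names below are hypothetical, and the "GET" default is an assumption; the environment is passed in as a Map only so the code can be tried without a web server.

```scala
// Hypothetical helper types, for illustration only.
final case class CgiRequest(method: String, pathInfo: String, queryString: String)

object CgiEnv {
  // Builds a request description from CGI environment variables.
  // Missing variables fall back to defaults (the "GET" default is an assumption).
  def fromEnv(env: Map[String, String]): CgiRequest =
    CgiRequest(
      method      = env.getOrElse("REQUEST_METHOD", "GET"),
      pathInfo    = env.getOrElse("PATH_INFO", ""),
      queryString = env.getOrElse("QUERY_STRING", "")
    )

  // Renders a complete CGI response: headers, a blank line, then the body.
  def render(req: CgiRequest): String =
    "Content-type: text/plain\r\n\r\n" +
      s"${req.method} ${req.pathInfo}?${req.queryString}\n"
}
```

In a real handler this would be wired up as print(CgiEnv.render(CgiEnv.fromEnv(sys.env))).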

A Minimal CGI Handler

object Main {
    def main(args: Array[String]): Unit = {
        println("Content-type: text/html\r\n\r\n")
        println("Hello, Strangeloop!")
    }
}

Building the app

# notice the FROM - AS structure
FROM scala-native-base-build AS build 

# Set up the directory structure for our project
RUN mkdir -p /root/project-build/project
WORKDIR /root/project-build

# Resolve all our dependencies and plugins to speed up future compilations
ADD ./project/plugins.sbt project/
ADD ./project/build.properties project/
ADD build.sbt .
RUN sbt update

# Add and compile our actual application source code
ADD . /root/project-build/
RUN sbt clean nativeLink

# Copy the binary executable to a consistent location
RUN cp ./target/scala-2.11/*-out ./dinosaur-build-out

Packaging the app

# Start over from a clean Alpine image, in the same Dockerfile
FROM alpine:3.3

# Copy in C libraries from previous build
COPY --from=build \
   /usr/lib/libunwind.so.8 \
   /usr/lib/libunwind-x86_64.so.8 \
   /usr/lib/libgc.so.1 \
   /usr/lib/libstdc++.so.6 \
   /usr/lib/libgcc_s.so.1 \
   /usr/lib/
COPY --from=build \
   /usr/local/lib/libre2.so.0 \
   /usr/local/lib/libre2.so.0

# Copy in the executable
COPY --from=build \
   /root/project-build/dinosaur-build-out /var/www/localhost/cgi-bin/app

COPY httpd.conf /etc/apache2/httpd.conf
COPY mpm.conf /etc/apache2/mpm.conf

RUN apk --update add apache2 apache2-utils

RUN mkdir -p /run/apache2
ADD apache.entrypoint.sh /root/

ENTRYPOINT "/root/apache.entrypoint.sh"

Does it work?

A CGI Micro-framework

object main {
  def main(args: Array[String]): Unit = {
    Router.init()
          .get("/")("<H1>Welcome to Dinosaur!</H1>")
          .get("/hello") { request =>
            "Hello World!"
          }
          .get("/who")( request =>
            request.pathInfo() match {
              case Seq("who") => "Who's there?"
              case Seq("who",x) => "Hello, " + x
              case Seq("who",x,y) => "Hello both of you"
              case _ => "Hello y'all!"
            }
          )
          .get("/bye")( request =>
            request.params("who")
                   .map { x => "Bye, " + x }
                   .mkString(". ")
          )
          .dispatch()
  }
}

A CGI Micro-framework

trait Router {
  def handle(method: Method, path:String)(f: Request => Response):Router
  def get(path:String)(f: Request => Response):Router = handle(GET, path)(f)
  def post(path:String)(f: Request => Response):Router = handle(POST, path)(f)
  def put(path:String)(f: Request => Response):Router = handle(PUT, path)(f)
  def delete(path:String)(f: Request => Response):Router = handle(DELETE, path)(f)
  def dispatch(): Unit
}

case class Request(
  method: Function0[Method],
  pathInfo: Function0[Seq[String]],
  params: Function1[String, Seq[String]]
)

case class Response(
  body: ResponseBody,
  statusCode: Int = 200,
  headers: Map[String, String] = Map("Content-type" -> "text/html; charset=utf-8")
)

A CGI Micro-framework

object CgiUtils {
  def env(key: CString): String = {
    val lookup = stdlib.getenv(key)
    if (lookup == null) {
      ""
    } else {
      fromCString(lookup)
    }
  }

  def parsePathInfo(pathInfo: String): Seq[String] = {
    pathInfo.split("/").filter( _ != "" )
  }

  def parseQueryString(queryString: String): Function1[String, Seq[String]] = {
    val pairs = queryString.split("&").map(_.split("=", 2)).collect {
      // skip empty or malformed pairs instead of failing with a MatchError
      case Array(key, value) => (key, value)
    }
    val groupedValues = pairs.groupBy(_._1).map { case (k, v) => k -> v.map(_._2).toSeq }
    groupedValues.getOrElse(_, Seq.empty)
  }
}

A CGI Micro-framework

case class CGIRouter(handlers:Seq[Handler]) extends Router {
  def dispatch(): Unit = {
    val request = Router.parseRequest()
    val matches = for ( h @ Handler(method, pattern, handler) <- this.handlers
                        if request.method() == method
                        if request.pathInfo().startsWith(pattern)) yield h
    val bestHandler = matches.maxBy( _.pattern.size )
    val response = bestHandler.handler(request)
    for ( (k,v) <- response.inferHeaders ) {
      System.out.println(k + ": " + v)
    }
    System.out.println()
    System.out.println(response.bodyToString)
  }
}
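
The dispatch logic above picks the handler whose pattern is the longest prefix of the request path. A minimal standalone sketch of that selection follows. Route and PrefixRouter are hypothetical, simplified stand-ins for the deck's Handler and Router types. Unlike maxBy on an empty list, which throws, this version returns None when nothing matches, so the caller could answer with a 404.

```scala
// Hypothetical, simplified stand-in for the deck's Handler type.
final case class Route(method: String, pattern: Seq[String], respond: Seq[String] => String)

object PrefixRouter {
  // Picks the route with the same method whose pattern is the longest prefix of the path.
  // Returns None when no route matches.
  def select(routes: Seq[Route], method: String, path: Seq[String]): Option[Route] = {
    val matches = routes.filter(r => r.method == method && path.startsWith(r.pattern))
    if (matches.isEmpty) None else Some(matches.maxBy(_.pattern.size))
  }
}
```

An empty pattern matches every path, so a route registered for "/" acts as the fallback for its method.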

Performance

40 ms mean response with 10 users

99th percentile response goes over 1s at 150 users

mean response plateaus around 500 ms at 300 users

peaks around 400 requests/sec

Compared to a Python-based CGI app, which exhibits:

136 ms mean response with 10 users

99th percentile response goes over 1s at 75 users

mean response plateaus around 5s at 250 users

peaks around 200 requests/sec

Performance

But compared to a trivial Node.js/Express app:

median response 7 ms with 10 users

99th percentile stays under 1s up to 2000 users

error rate approaches 15% around 500 users

peaks around 2000 requests/sec

Performance

How can we do better?

Multiplexed Protocols
Multiplexed I/O

How can we do better without spending years of our lives?

Multiplexed Protocols

Technique for combining streams onto a single connection
Relies on a proxy server to handle raw HTTP
All requests and responses are "framed" with an identifier
Proxy is responsible for routing responses to correct client.

Two prominent examples:

FastCGI
HTTP/2

HTTP Client <-- HTTP --> Web Server <-- FastCGI --> FastCGI Application

What makes FastCGI different from regular CGI?

Persistent processes
Persistent connections
Multiplexed requests
Framed strings + metadata
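
To make "framed strings + metadata" concrete, here is a sketch of building a single FastCGI record: an 8-byte header (version, type, request ID, content length, padding length, reserved) followed by the content. The layout and the constants 1 (protocol version) and 6 (FCGI_STDOUT) follow the FastCGI specification; the FcgiFrame object and its record function are hypothetical names, and padding is always zero here for simplicity.

```scala
object FcgiFrame {
  val Version = 1     // FCGI_VERSION_1
  val StdoutType = 6  // FCGI_STDOUT

  // Builds one record: an 8-byte header followed by the content, with no padding.
  def record(recType: Int, requestId: Int, content: Array[Byte]): Array[Byte] = {
    require(requestId >= 0 && requestId <= 0xFFFF, "request ID must fit in 16 bits")
    require(content.length <= 0xFFFF, "content length must fit in 16 bits")
    val header = Array[Byte](
      Version.toByte,
      recType.toByte,
      (requestId >> 8).toByte, requestId.toByte,           // request ID, big-endian
      (content.length >> 8).toByte, content.length.toByte, // content length, big-endian
      0,                                                   // padding length
      0                                                    // reserved
    )
    header ++ content
  }
}
```

Because every record carries its request ID, the web server can interleave records from many requests on one connection and still route each response to the right client.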

One catch -- we need a socket.

But do we need concurrency?

FastCGI

Parsing algorithm:

Read 8 byte header from socket
Extract type, Request ID, length, padding from header
Read (length + padding bytes) from socket
if (type == FCGI_STDIN & length == 0):
    request is complete, invoke handler and write response
else:
    append to pending buffers for Request ID
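
The parsing steps above can be sketched in plain Scala over a byte array instead of a socket. This is a simplified illustration, not the talk's implementation: FcgiParser, Header and parse are hypothetical names, the buffer is assumed to hold whole records, and only FCGI_STDIN content is buffered (a real handler would also keep FCGI_PARAMS).

```scala
import scala.collection.mutable

object FcgiParser {
  val Stdin = 5 // FCGI_STDIN

  final case class Header(recType: Int, requestId: Int, length: Int, padding: Int)

  // Decodes the 8-byte record header that starts at `offset`.
  def readHeader(bytes: Array[Byte], offset: Int): Header = {
    def u(i: Int): Int = bytes(offset + i) & 0xFF // read one byte as unsigned
    Header(u(1), (u(2) << 8) | u(3), (u(4) << 8) | u(5), u(6))
  }

  // Walks the records in `bytes`. STDIN content is appended to a per-request buffer;
  // an empty STDIN record marks the request complete and hands the buffer to `handler`.
  def parse(bytes: Array[Byte])(handler: (Int, Array[Byte]) => Unit): Unit = {
    val pending = mutable.Map.empty[Int, Array[Byte]]
    var pos = 0
    while (pos + 8 <= bytes.length) {
      val h = readHeader(bytes, pos)
      val body = bytes.slice(pos + 8, pos + 8 + h.length)
      if (h.recType == Stdin && h.length == 0)
        handler(h.requestId, pending.remove(h.requestId).getOrElse(Array.empty[Byte]))
      else if (h.recType == Stdin)
        pending(h.requestId) = pending.getOrElse(h.requestId, Array.empty[Byte]) ++ body
      pos += 8 + h.length + h.padding // skip header, content, and padding
    }
  }
}
```

Keeping a separate pending buffer per request ID is what lets records from different requests arrive interleaved.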

FastCGI

  def readHeader(input: Ptr[Byte], offset:Long): RecordHeader = {
    val version = input(0 + offset) & 0xFF
    val rec_type = (input(1 + offset) & 0xFF) match {
      case 0 => FCGI_UNKNOWN_TYPE
      case 1 => FCGI_BEGIN_REQUEST
      case 2 => FCGI_ABORT_REQUEST
      case 3 => FCGI_END_REQUEST
      case 4 => FCGI_PARAMS
      case 5 => FCGI_STDIN
      case 6 => FCGI_STDOUT
      case 7 => FCGI_STDERR
      case 8 => FCGI_DATA
      case 9 => FCGI_GET_VALUES
      case 10 => FCGI_GET_VALUES_RESULT
      case _ => FCGI_UNKNOWN_TYPE
    }
    val req_id_b1 = (input(2 + offset) & 0xFF)
    val req_id_b0 = (input(3 + offset) & 0xFF)
    val req_id = (req_id_b1 << 8) + (req_id_b0 & 0xFF)
    val length = ((input(4 + offset) & 0xFF) << 8) + (input(5 + offset) & 0xFF)
    val padding = input(6 + offset) & 0xFF
    RecordHeader(version,rec_type,req_id,length,padding)
  }

FastCGI

  def readParam(byteArray: Ptr[Byte], arr_offset:Long, length:Long)
               : (Ptr[Byte], Ptr[Byte], Long) = {
    val name_len_offset = arr_offset + 0
    val (name_len:Long, val_len_offset:Long) = 
      if ((byteArray(name_len_offset) & 0x80) == 0) {
        // high bit clear: the length fits in one byte
        val len = byteArray(name_len_offset).toLong
        (len, arr_offset + 1)
      } else {
        // high bit set: 31-bit big-endian length in four bytes
        val len = (((byteArray(name_len_offset) & 0x7F) << 24) +
                   ((byteArray(name_len_offset + 1) & 0xFF) << 16) +
                   ((byteArray(name_len_offset + 2) & 0xFF) << 8) +
                   (byteArray(name_len_offset + 3) & 0xFF)).toLong
        (len, arr_offset + 4)
      }

    val (val_len:Long, content_offset:Long) = 
      if ((byteArray(val_len_offset) & 0x80) == 0) {
        val len = byteArray(val_len_offset).toLong
        (len, val_len_offset + 1)
      } else {
        val len = (((byteArray(val_len_offset) & 0x7F) << 24) +
                   ((byteArray(val_len_offset + 1) & 0xFF) << 16) +
                   ((byteArray(val_len_offset + 2) & 0xFF) << 8) +
                   (byteArray(val_len_offset + 3) & 0xFF)).toLong
        (len, val_len_offset + 4)
      }
    val name = byteArray + content_offset
    val value = byteArray + content_offset + name_len
    val next_param_offset = content_offset + name_len + val_len
    (name, value, next_param_offset)
  }
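
The length encoding used by readParam is easier to see in isolation. In FastCGI name-value pairs, a length under 128 takes one byte; otherwise it takes four bytes with the high bit of the first byte set. The sketch below works on a plain Array[Byte]; FcgiLength and decode are hypothetical names. Masking with 0xFF matters because Scala's Byte is signed.

```scala
object FcgiLength {
  // Decodes one FastCGI name-value length starting at `offset`.
  // Returns the decoded length and the number of bytes it occupied (1 or 4).
  def decode(bytes: Array[Byte], offset: Int): (Int, Int) = {
    val first = bytes(offset) & 0xFF
    if ((first & 0x80) == 0) {
      (first, 1) // high bit clear: this byte is the length
    } else {
      // high bit set: 31-bit big-endian length spread over four bytes
      val len = ((first & 0x7F) << 24) |
                ((bytes(offset + 1) & 0xFF) << 16) |
                ((bytes(offset + 2) & 0xFF) << 8) |
                (bytes(offset + 3) & 0xFF)
      (len, 4)
    }
  }
}
```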

Improvising a socket

#!/bin/bash
rm -f /tmp/app.socket
rm -f /tmp/app.fifo
mkfifo /tmp/app.fifo
nginx -g "daemon off;" &
export ROUTER_MODE=FCGI
nc -l -U /tmp/app.socket < /tmp/app.fifo | /var/www/localhost/cgi-bin/dinosaur-build-out > /tmp/app.fifo

Nginx <-> socket <-> nc -> app -> fifo -> nc

(better option: write a proxy in ~80 lines of Go)

Performance

mean response in 4ms under light load
500 users -- .1% error rate, 283ms mean response
Backlog starts to overflow around 1000 users
Overflows register as fast refusals rather than timeouts
Peaks around 1500 requests/sec

Multiplexed I/O

Traditional options: select() and poll()
Non-standard options: epoll, kqueue, iocp*
All provide ways to poll the state of many sockets
Polls listening and connection sockets at once
"Quirky"
Tends to require use of ioctl(), setsockopt(), fcntl()
Not especially portable

Multiplexed I/O

listener = setUpListeningSocket()
pollSet = set(listener)
while true:
    readySockets = poll(pollSet)
    for socket in readySockets:
        if socket == listener:
            newConnection = accept(listener)
            pollSet.add(newConnection)
        else:
            if socket.readyToRead:
                read(socket)
            else if socket.readyToWrite:
                write(socket)

LibUV

LibUV, the Node.js event loop:

cross-platform C library (Linux, BSD, Windows)
multiplexed IO on a single thread/single process.
backed by native async primitives: epoll/kqueue/iocp
callback-oriented API
strict memory management requirements

LibUV

@link("uv")
@extern
object LibUV {
  type PipeHandle = Ptr[Byte]
  type Loop = Ptr[Byte]
  type Buffer = CStruct2[Ptr[Byte],CSize]
  type WriteReq = Ptr[Ptr[Byte]]
  type ShutdownReq = Ptr[Ptr[Byte]]
  type Connection = Ptr[Byte]
  type ConnectionCB = CFunctionPtr2[PipeHandle,Int,Unit]
  type AllocCB = CFunctionPtr3[PipeHandle,CSize,Ptr[Buffer],Unit]
  type ReadCB = CFunctionPtr3[PipeHandle,CSSize,Ptr[Buffer],Unit]
  type WriteCB = CFunctionPtr2[WriteReq,Int,Unit]
  type ShutdownCB = CFunctionPtr2[ShutdownReq,Int,Unit]
  type CloseCB = CFunctionPtr1[PipeHandle,Unit]

  def uv_default_loop(): Loop = extern
  def uv_loop_size(): CSize = extern
  def uv_handle_size(h_type:Int): CSize = extern
  def uv_req_size(r_type:Int): CSize = extern
  def uv_pipe_init(loop:Loop, handle:PipeHandle, ipcFlag:Int ): Unit = extern
  def uv_pipe_bind(handle:PipeHandle, socketName:CString): Int = extern
  def uv_listen(handle:PipeHandle, backlog:Int, callback:ConnectionCB): Int = extern
  def uv_accept(server:PipeHandle, client:PipeHandle): Int = extern
  def uv_read_start(client:PipeHandle, allocCB:AllocCB, readCB:ReadCB): Int = extern
  def uv_write(writeReq:WriteReq, client:PipeHandle, bufs: Ptr[Buffer], numBufs: Int, writeCB:WriteCB): Int = extern
  def uv_read_stop(client:PipeHandle): Int = extern
  def uv_shutdown(shutdownReq:ShutdownReq, client:PipeHandle, shutdownCB:ShutdownCB): Int = extern
  def uv_close(handle:PipeHandle, closeCB: CloseCB): Unit = extern
  def uv_run(loop:Loop, runMode:Int): Int = extern
}

LibUV

  // Shared by dispatch() and onConnect()
  val loop:Loop = uv_default_loop()
  val pipe_size = uv_handle_size(7) // 7 is UV_NAMED_PIPE in libuv's uv_handle_type enum

  def dispatch(): Unit = {
    val pipe:PipeHandle = stackalloc[Byte](pipe_size)
    uv_pipe_init(loop, pipe, 0)

    var r = uv_pipe_bind(pipe, c"/tmp/app.socket")
    println(s"uv_pipe_bind returned $r")

    r = uv_listen(pipe, 4096, onConnectCB)
    println(s"uv_listen returned $r")

    r = uv_run(loop, 0)
    println(s"uv_run returned $r")
  }

  def onConnect(server:PipeHandle, status:Int): Unit = {
    println("connection received!")
    val client:PipeHandle = stdlib.malloc(pipe_size)
    uv_pipe_init(loop, client, 0)
    var r = uv_accept(server, client)
    println(s"uv_accept returned $r")
    uv_read_start(client, onAllocCB, onReadCB)
  }
  val onConnectCB = CFunctionPtr.fromFunction2(onConnect)

LibUV

  def onRead(pipe:PipeHandle, size:CSSize, buffer:Ptr[Buffer]): Unit = {
    if (size >= 0) {
      var position = 0
      // We are going to store the positions of the CGI parameter and STDIN frames
      var params:(Int,RecordHeader) = (0,null)
      var stdin:(Int,RecordHeader) = (0,null)
      // Scan the input buffer for the positions of useful metadata
      while (position < size) {
        val header = readHeader(!buffer._1,position)
        reqId = header.reqId
        if (header.rec_type == FCGI_PARAMS & header.length > 0)
          params = (position,header)
        else if (header.rec_type == FCGI_STDIN & header.length > 0)
          stdin = (position, header)
        position += (8 + header.length + header.padding)
      }
      // Generate a response and enqueue it to the pipe (re-use the input buffer for output)
      val write_req:WriteReq = stdlib.malloc(write_req_size).cast[WriteReq]
      !write_req = !buffer._1
      !buffer._2 = makeResponse(reqId, params, stdin, !write_req)
      uv_write(write_req, pipe, buffer, 1, onWriteCB)
    } else {
      // a negative size signals EOF or an error, so we can close the connection
      uv_read_stop(pipe)
      val shutdownReq = stdlib.malloc(shutdown_req_size).cast[ShutdownReq]
      !shutdownReq = pipe
      uv_shutdown(shutdownReq, pipe, myShutdownCB)
      stdlib.free(!buffer._1)
    }
  }
  val onReadCB = CFunctionPtr.fromFunction3(onRead)

Performance

When deployed on a UNIX socket behind nginx:

mean response 4ms under light load
with 1000 users:
- mean response 140ms
- error rates about 1/2 of node
- no timeouts

🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉

Reflections

What do our languages really need to provide?

Does serving up HTTP belong in our app or in infrastructure?

What can we expect from our OS?

What can we expect from our cluster?

Things are about to change.
