Starting from Scratch with Scala Native

Richard Whaling

M1 Finance

 

Scala Days Europe 2018

This talk is about:

  • Scala Native
  • Systems progamming

but also:

  • Working with emerging technology
  • Improvised solutions
  • OS as platform

(or how to get things done without the JVM)

Talk Outline

  1. Introduction to Scala Native
  2. Case Study: Google NGrams
    • Find Maximum
    • Sort
    • Aggregate and Sort
    • Performance vs JVM
  3. Reflections: When, Where, Why?

About Me

Twitter: @RichardWhaling

Scala Native contributor, but speaking only for myself

Author of "Modern Systems Programming in Scala", coming soon from Pragmatic

Software Engineer at M1 Finance

Scala Native is:

  1. Scala!
  2. A scalac/sbt plugin
  3. An LLVM-based AOT compiler
  4. Great for command-line tools
  5. No JVM
  6. Includes implementations of some JDK classes
  7. Types and operators for C interop

The Basics

object Hello {
    def main(args: Array[String]):Unit = {
        println("Hello, Scala Days!")
    }
}

This just works!

The Basics

import scalanative.native._, stdio._
object Hello {
    def main(args: Array[String]):Unit = {
        printf(c"Hello, Scala Days!\n")
    }
}

This just works!

The Basics

import scalanative.native._, stdio._
object Hello {
    def main(args: Array[String]):Unit = {
        val who:CString = c"Scala Days"
        stdio.printf(c"Hello, %s!\n", who)
    }
}

it really is the glibc printf()

What's a CString?

+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| Offset | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | A  | B  | C  | D  |
+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| Char   | H  | e  | l  | l  | o  | ,  |    | w  | o  | r  | l  | d  | !  |    |
| Hex    | 48 | 65 | 6C | 6C | 6F | 2C | 20 | 77 | 6F | 72 | 6C | 64 | 21 | 00 |
+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  • Just like in C, strings are zero-terminated arrays of characters
  • The length of a CString is not stored as metadata
  • The value of a CString itself is the address of the first character
  • Spoiler alert: CString is just an alias for Ptr[CChar]
  • And CChar is just an alias for Byte
  • Like other Ptr types in Scala Native, CStrings are mutable.

What's a CString?

val str:CString = c"hello, world"
val str_len = strlen(str)
printf(c"the string '%s' at %p is %d bytes long\n", str, str, str_len)
printf(c"the CString value 'str' is %d bytes long\n", sizeof[CString])

for (offset <- 0L to str_len) {
  val chr:CChar = str(offset)
  printf(c"'%c' is %d bytes long and has binary value %d\n",
    chr, sizeof[CChar], chr)
}
  • We scan for the end of the string with strlen()
  • The length from strlen() does NOT include the null terminator
  • We can retrieve individual characters by str(offset)
  • Array lookup by offset is a constant time operation

What's a CString?

/project/path> ./target/scala-2.11/cstring_experiment_1-out
the string 'hello, world' at address 0x55e525a2c944 is 12 bytes long
the CString value 'str' itself is 8 bytes long
'h' is 1 bytes long and has binary value 104
'e' is 1 bytes long and has binary value 101
'l' is 1 bytes long and has binary value 108
'l' is 1 bytes long and has binary value 108
'o' is 1 bytes long and has binary value 111
',' is 1 bytes long and has binary value 44
' ' is 1 bytes long and has binary value 32
'w' is 1 bytes long and has binary value 119
'o' is 1 bytes long and has binary value 111
'r' is 1 bytes long and has binary value 114
'l' is 1 bytes long and has binary value 108
'd' is 1 bytes long and has binary value 100
'' is 1 bytes long and has binary value 0

As in C, pointers (addresses) are first-class values.

The value of a CString is its address, not its content

What's a CString?

val str = c"hello, world"
val str_len = strlen(str)
printf(c"the string '%s' at %p is %d bytes long\n", str, str, str_len)
printf(c"the value 'str' itself is %d bytes long\n", sizeof[CString])

for (offset <- 0L to str_len) {
  val chr_addr = str + offset // pointer address arithmetic
  val chr = !chr_addr         // pointer address dereference
  stdio.printf(c"'%c'\t(%d) at address %p is %d bytes long\n",
    chr, chr, chr_addr, sizeof[CChar])
}
  • addition (+)
  • dereference (!)

A CString is a Ptr[Byte], so we can re-implement array lookup with two basic pointer operators:

What's a CString?

the string 'hello, world' at address 0x5653b7aa0974 is 12 bytes long
the Ptr[Byte] value 'str' itself is 8 bytes long
'h'	(104) at address 0x5653b7aa0974 is 1 bytes long
'e'	(101) at address 0x5653b7aa0975 is 1 bytes long
'l'	(108) at address 0x5653b7aa0976 is 1 bytes long
'l'	(108) at address 0x5653b7aa0977 is 1 bytes long
'o'	(111) at address 0x5653b7aa0978 is 1 bytes long
','	(44) at address 0x5653b7aa0979 is 1 bytes long
' '	(32) at address 0x5653b7aa097a is 1 bytes long
'w'	(119) at address 0x5653b7aa097b is 1 bytes long
'o'	(111) at address 0x5653b7aa097c is 1 bytes long
'r'	(114) at address 0x5653b7aa097d is 1 bytes long
'l'	(108) at address 0x5653b7aa097e is 1 bytes long
'd'	(100) at address 0x5653b7aa097f is 1 bytes long
''	(0) at address 0x5653b7aa0980 is 1 bytes long

What's a Pointer?

+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| Offset | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | A  | B  | C  | D  |
+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| Char   | H  | e  | l  | l  | o  | ,  |    | w  | o  | r  | l  | d  | !  |    |
| Hex    | 48 | 65 | 6C | 6C | 6F | 2C | 20 | 77 | 6F | 72 | 6C | 64 | 21 | 00 |
+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  • A pointer is the numeric address of a byte of memory
  • Pointers are generally the same size as the machine word: 64 bits
  • Imagine the address space as an indexed sequence of bytes
  • Pointers act like bare integers, like mutable cells, and like arrays
  • Pointer arithmetic is typed: ptr_T + 1 increments by sizeof[T]

Stack Allocation

  • Pointers can refer to stack or heap addresses (or Zones)
  • Stack pointers are valid for the duration of the calling function
  • We'll talk about heap pointer later
  • Zones are awesome but not covered in this talk
val short_lived_int:Ptr[Int] = stackalloc[Int]
val three_short_lived_ints:Ptr[Int] = stackalloc[Int](3)
val uninitialized_string_buffer:CString = stackalloc[CChar](16)

C FFI

@extern object mystdio {
  def fgetc(stream: Ptr[FILE]): CInt = extern
  def fgets(str: CString, count: CInt, stream: Ptr[FILE]): CString = extern
  def fputc(ch: CInt, stream: Ptr[FILE]): CInt = extern

  @name("scalanative_libc_stdin")
  def stdin: Ptr[FILE] = extern
  @name("scalanative_libc_stdout")
  def stdout: Ptr[FILE] = extern
}

val buffer = stackalloc[Byte](1024)
val line = mystdio.fgets(buffer, 1023, mystdio.stdin)
  • Scala Native's C-style types make C interop easy
  • Simple to create a binding for standard or third-party C functions
  • Without the JVM, we rely on ANSI/POSIX C functions
  • Scala can often improve on C in terms of safety and clarity

A few C functions

def fprintf(stream: Ptr[FILE], format: CString, args: CVararg*): CInt
def fgets(str: CString, count: CInt, stream: Ptr[FILE]): CString
def sscanf(buffer: CString, format: CString, args: CVararg*): CInt

def strcmp(lhs: CString, rhs: CString): CInt
def strncpy(dest: CString, src: CString, count: CSize): CString

Let's be real: the C stdlib's string facilities are badly broken.

We'll be ensuring safety in three ways:

  1. Relying on a few less-broken functions like fgets()
  2. Statically limiting the maximum size of our strings
  3. Wrapping a few particularly problematic functions

Fixing strncpy()

def strncpy(dest: CString, src: CString, count: CSize): CString
def safer_strncpy(src:Ptr[Byte], dest:Ptr[Byte], dest_size:Ptr[Byte]):Int = {
  val src_size = strlen(src)
  strncpy(dest, src, src_size)
  if (src_size >= dest_size) {
    dest(dest_size - 1) = 0
    return dest_size - 1
  } else {
    return src_size
  }
}
  • strncpy() is a little better than strcpy(), but not enough
  • It won't overrun the bounds of dest
  • If dest is <= strlen(src), it will NOT zero-terminate dest
  • If dest is not zero-terminated, later operations on dest will break
  • So let's fix that.

Google NGrams

  • Tab-delimited text files
  • Each line has word - year - count - doc_count
  • Separated by 1st letter
  • "A" ~2GB

Word counts for the entire Google Books corpus, ~50GB total

This is big enough data to ask some interesting questions:

  • What's the most frequent word?
  • What are the top 20 words?
  • What are the top 20 words aggregated
    over all years?

Google NGrams

A'Aang_NOUN	1879	45	5
A'Aang_NOUN	1882	5	4
A'Aang_NOUN	1885	1	1
A'Aang_NOUN	1891	1	1
A'Aang_NOUN	1899	20	4
A'Aang_NOUN	1927	3	1
A'Aang_NOUN	1959	5	2
A'Aang_NOUN	1962	2	2
A'Aang_NOUN	1963	1	1
A'Aang_NOUN	1966	45	13
A'Aang_NOUN	1967	6	4
A'Aang_NOUN	1968	5	4
A'Aang_NOUN	1970	6	2
A'Aang_NOUN	1975	4	1
A'Aang_NOUN	2001	1	1
A'Aang_NOUN	2004	3	1
A'que_ADJ	1808	1	1
A'que_ADJ	1849	2	1
A'que_ADJ	1850	1	1
A'que_ADJ	1852	4	3

Maximum Count (JVM)

  var max = 0
  var max_word = ""
  var max_year = 0

  for (line <- scala.io.Source.stdin.getLines) {
    val split_fields = line.split("\\s+")

    val word = split_fields(0)
    val year = split_fields(1)
    val count = split_fields(2).toInt

    if (count > max) {
      max = count
      max_word = word
      max_year = year
    }
  }
  println(s"max count: ${max_word}, ${max_year}; ${max} occurrences")

Maximum Count (Native)

val linebuffer = stackalloc[Byte](1024)
val max_count = stackalloc[Int]
val max_word = stackalloc[Byte](1024)
val max_year = stackalloc[Int]

while (fgets(stdin, linebuffer, 1023) != null) }
    scan_and_compare(linebuffer, 1023, max_count, max_word, max_year)
}

printf(c"maximum word count: %d %s %d\n", max_count, max_word, max_year)

Our strategy:

  • Allocate storage for the current max and a single line of input
  • While we can read lines from stdin into the line_buffer:
  • pass the line_buffer and pointers to the current max into a scan_and_compare() function

Find Maximum Count

def scan_and_compare(buffer:Ptr[Byte], max_count:Ptr[Int],        
                     max_word:Ptr[Byte], max_year:Ptr[Int]):Unit = {
    val tmp_count = stackalloc[Int]
    val tmp_word = stackalloc[Byte](1024)
    val tmp_year = stackalloc[Int]
    val tmp_doc_count = stackalloc[Int]
    val scan_result = sscanf(buffer, c"%1023s %d %d %d\n", 
           tmp_word,tmp_count, tmp_year, tmp_doc_count)
    if (scan_result != 4) { 
        throw new Exception("Bad sscanf result")
    }
    if (!tmp_count > !max_count) {
        val word_length = strlen(temp_word)
        safer_strncpy(temp_word, max_word, 1024)
        !max_count = !temp_count
        !max_year = !temp_year        
    }
}
  • stack allocation is basically free
  • checking scanf results is a PAIN
  • the dereference operator ! lets us treat pointers as mutable cells

Performance

  • The performance differential starts small but diverges rapidly
  • This program should not generate a large heap -
  • The JVM implementation allocates heavily, however
  • Whereas our native implementation only allocates on the stack

Sorting NGrams

  • Tab-delimited text files
  • Each line has word - year - count - doc_count
  • Separated by 1st letter
  • "A" ~2GB

Word counts for the entire Google Books corpus, ~50GB total

For our next trick:

  • Sort the entire "A" file
  • This will use heap much more intensively
  • Little garbage - most memory allocated stays around
  • (This is the inverse of the last program we wrote)

Sorting: JVM

case class NGram(word:String, count:Int, year:Int, doc_count:Int)

def read_input(input:Source):ArrayBuffer[NGram] = {
    val data = ArrayBuffer[NGram]()
    var lines_read = 0
    for (line <- scala.io.Source.stdin.getLines) {
      val split_fields = line.split("\\s+")

      val word = split_fields(0)
      val year = split_fields(1).toInt
      val count = split_fields(2).toInt
      val doc_count = split_fields(3).toInt
      val new_item = NGram(word, year, count, doc_count)
      data += new_item
    }
    return data
}

Sorting: JVM

def main(args:Array[String]):Unit = {
    val data:ArrayBuffer[NGram] = read_input(scala.io.Source.stdin)

    val by_count_ascending = Ordering.by { n:NGram => n.count }.reverse
    val sorted = data.sorted(by_count_ascending)

    val show_count = if (lines_read < 20) lines_read else 20
    for (i <- 0 until show_count) {
      println(s"${sorted(i).word} ${sorted(i).count}")
    }
}

Sorting: Native

How do we do this in a Native idiom?

  • Model NGramData as a Struct(?)
  • Read input into an array of NGramData
  • Resize array as necessary
  • Sort array when done.

This will require some new techniques and new syscalls

  • structs
  • malloc/realloc
  • qsort

What's a CStruct?

type StructPoint = CStruct2[Int, Int]
val point = stackalloc[StructPoint]

point._1 = 5
point._2 = 12

+--------+----+----+----+----+----+----+----+----+
| Offset | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
+--------+----+----+----+----+----+----+----+----+
| Value  | 5                 | 12                | 
+--------+----+----+----+----+----+----+----+----+
| Hex    | 05 | 00 | 00 | 00 | 0C | 00 | 00 | 00 |
+--------+----+----+----+----+----+----+----+----+
  • A struct is a composite data type in contiguous memory
  • The fields of a struct have a static layout at compile time
  • Arrays of structs are extraordinarily efficient
  • Structs are allocated on the stack or heap explicitly
  • Structs are (currently) treated like tuples
  • Scala-Native 0.4 will add named fields

Modeling an NGram

case class NGram(word:String, count:Int, year:Int, doc_count:Int)

type NGramData = CStruct4[CString, Int, Int, Int]
  • Basically, an NGram is a Tuple4
  • Strings in structs are tricky
  • The Struct contains the address of the string, not the content
  • We will need to separately allocate space for string content
  • These strings will need to outlive our line-input function
  • So we need something longer-lived than stackalloc

Pointers: Stack and Heap

val short_lived_int:Ptr[Int] = stackalloc[Int]
val three_short_lived_ints:Ptr[Int] = stackalloc[Int](3)
val uninitialized_string_buffer:CString = stackalloc[CChar](16)

val uninitialized_buffer:Ptr[Byte] = malloc(1024)
val three_ints:Ptr[Int] = malloc(3 * sizeof[Int]).cast[Ptr[Int]]

val six_ints:Ptr[Int] = realloc(three_ints.cast[Ptr[Byte], 6 * sizeof[Int])
                               .cast[Ptr[Int]]

  • malloc gives us a Ptr[Byte] of the requested size
  • in Scala Native we cast the result of malloc
  • realloc resizes a malloc-managed pointer
  • realloc may invalidate old pointers - use sparingly!
  • no GC - malloc leaks unless you call free(ptr)

Sorting: Native

final case class WrappedArray[T](var data:Ptr[T], var used:Int, var capacity:Int)

def makeWrappedArray[T](size:Int):WrappedArray[T] = {
  val data = malloc(size * sizeof[T]).cast[Ptr[T]]
  return WrappedArray(data, 0, size)
}

def growWrappedArray[T](array:WrappedArray[T], size:Int):Unit = {
  val new_capacity = array.capacity + size
  val new_size = new_capacity * sizeof[T]
  val new_data = realloc(array.data.cast[Ptr[Byte]], new_size)
  wa.data = new_data 
  wa.capacity = new_capacity
}
  • realloc() is potentially expensive
  • in the worst case it can require copying the entire array to a new location
  • we can manage this cost by adjusting the "chunk size" that we allocate and grow by

Sorting: Native

def qsort(data:Ptr[Byte],
          num:Int,
          size:Long,
          comparator:CFunctionPtr2[Ptr[Byte], Ptr[Byte], Int]):Unit = extern

def sort_alphabetically(a:Ptr[Byte], b:Ptr[Byte]):Int = {
  val a_string_pointer = a.cast[Ptr[CString]]
  val b_string_pointer = b.cast[Ptr[CString]]
  return string.strcmp(!a_string_pointer, !b_string_pointer)
}

def sort_by_count(p1:Ptr[Byte], p2:Ptr[Byte]):Int = {
  val ngram_ptr_1 = p1.cast[Ptr[NGramData]]
  val ngram_ptr_2 = p2.cast[Ptr[NGramData]]
  val count_1 = !ngram_ptr_1._2
  val count_2 = !ngram_ptr_2._2
  return count_2 - count_1
}
  • Once we have an array of structs, we can sort it with qsort()
  • qsort takes a function pointer for its comparator
  • Allows for "generic" objects via unsafe casting!

Sorting: Native

  val block_size = 65536 * 16 // ~ 1 million items - too big?
  val linebuffer = stackalloc[Byte](1024)
  var array = makeWrappedArray[NGramData](block_size)

  while (stdio.fgets(line_buffer, 1023, stdin) != null) {
    if (array.used == array.capacity) {
        growWrappedArray(array, block_size)
    }
    parseLine(line_buffer, array.data + array.used)
    array.used += 1
  }

  qsort.qsort(array.data.cast[Ptr[Byte]], array.used, 
              sizeof[NGramData], by_count)

  val to_show = if (array.used <= 20) array.used else 20

  for (i <- 0 until to_show) {
    stdio.printf(c"word n: %s %d\n", !(array.data + i)._1, !(array.data + i)._2)    
  }

Sorting: Native

def parseLine(line_buffer:Ptr[Byte], data:Ptr[NGramData]):Unit = {
  val word = data._1
  val count = data._2
  val year = data._3
  val doc_count = data._4

  val sscanf_result = stdio.sscanf(line_buffer, c"%ms %d %d %d\n", word, year, count, doc_count)
  if (sscanf_result < 4) {
    throw new Exception("input error")
  }
}
  • Because we pass in a valid Ptr[NGramData] we don't have to use stackalloc
  • We do have to allocate space for the string but here I'm letting sscanf do it for me with the "%ms" format 

Performance

  • Equivalent performance at start.
  • JVM has faster sort, Native has faster IO
  • JVM maintains faster sort until 0.7GB
  • After 1.1GB JVM cannot complete on a 7.0GB Docker VM

Aggregating NGrams

  • Tab-delimited text files
  • Each line has word - year - count - doc_count
  • Separated by 1st letter
  • "A" ~2GB

Word counts for the entire Google Books corpus, ~50GB total

Final use case:

  • Group together occurrences of a word for all years
  • Then sort total counts
  • This should use less total heap because we store less
  • We can avoid storing the whole dataset because input is sorted by word

Aggregation: JVM

def read_input(input:Source):ArrayBuffer[NGram] = {
    val data = ArrayBuffer[NGram]()
    var prev_word = ""
    for (line <- scala.io.Source.stdin.getLines) {
      val split_fields = line.split("\\s+")
      // ... check for errors
      val word = split_fields(0)
      val year = split_fields(1).toInt
      val count = split_fields(2).toInt
      val doc_count = split_fields(3).toInt
      if (word == prev_word) {
        data.last.count += count
      } else {
        val new_item = NGram(word, year, count, doc_count)        
        data += new_item
        prev_word = word
      }
    }
    return data
}

Aggregation: Native

Our strategy:

  • Almost identical code to the previous exercise
  • Read input into an array of NGramData
  • Resize as necessary
  • Sort at the end

 

  • Check to see if the word we have is new or not
  • If not a new word, add its count to previous item
  • Return a true/false flag from parseLine
  • if parseLine returns true, then increment a.used

What we'll change:

Aggregation: Native

  var prev_item:Ptr[NGramData] = null
  while (stdio.fgets(line_buffer, 1023, stdin) != null) {
    if (array.used == array.capacity) {
        growWrappedArray(array, block_size)
    }
    val is_new_word = parseLine(line_buffer, array.data + array.used, prev_item)
    if (is_new_word) {
        prev_item = array.data + array.used
        array.used += 1
    }
  }
  • We track a pointer to the last item we read
  • We always read into array.data + array.used
  • We return a flag from parseLine to indicate a new word
  • We bump array.used and prev_item only if the word is new

Aggregation: Native

  def parseLine(line_buffer:CString, prev_item:Ptr[NGramData], 
                current_item:Ptr[NGramData]):Boolean = {
    val temp_word = stackalloc[Byte](1024)
    val temp_count = current_item._2
    val temp_year = current_item._3
    val temp_doc_count = current_item._4

    sscanf(line_buffer, c"%1023s %d %d %d\n", temp_word, temp_year, temp_count, temp_doc_count)
    val new_word_length = strlen(temp_word)

    if (prev_item == null) {
      val new_word_buffer = malloc(new_word_length + 1)
      safer_strncpy(temp_word, new_word_buffer, 1023)
      !current_item._1 = new_word_buffer
      return true
    }
    else if (strcmp(temp_word, !prev_item._1) == 0) {
      !current_item._2 = !current_item._2 + !temp_count
      return false
    } else {
      val new_word_buffer = malloc(new_word_length + 1)
      safer_strncpy(temp_word, new_word_buffer, 1023)
      current_item._1 = new_word_buffer
      return true
    }
  }

Performance

  • Native wins again, initial factor of 2x
  • Native consistently has faster sort by 2x
  • Sort remains under 1s for Native, diverges for JVM
  • Hypothesis: previous scenario exhibited JVM small-string magic?

Reflections

I hope that I've demonstrated that:

 

  • There are substantial benefits to using a C-style memory model within the context of a larger, garbage-collected programming environment
  • The legacy JVM memory model seems to perform unusually poorly for heavy IO/large heap situations

If you accept this, it raises the question:

where could it be appropriate to use Scala Native?

Reflections

I suggest that two of the highest profile applications of Scala fall into this big-heap/heavy-io domain:

 

  • Functional-Relational data pipelines (Spark)
  • Event-based, low latency services (Akka, Finagle, etc)

Incumbent Scala projects in these areas are outstanding software.  But there are hardware changes on the horizon.

Reflections

Caveats:

  • Performance isn't everything
  • Unsafe memory is unsafe
  • C programs are best below 10000 loc
  • Working on large shared C codebases is hard

However:

  • We are getting better at decomposing systems into parts
  • More and more meaningful problems can be solved by small, elegant programs
  • The community is getting very good at metaprogramming
  • This is persistent memory (NVRAM)
  • Large, heterogeneous heaps
  • RDMA - read and write over the network

This will break every assumption about the storage/memory/cache hierarchy made by legacy systems, including Java, including Rust

Hardware is Changing

Reflections

To the extent that a hard break with the past is necessary, Scala Native is not a step back, but a step forward.

Thank You!

Starting from Scratch in Scala Native

By Richard Whaling

Starting from Scratch in Scala Native

  • 824