FPGA Coprocessor

And a dedicated data-processing language

Agenda

  • Motivation
  • Idea overview
  • Language syntax overview
  • System architecture

Motivation

GPGPU is a terrific technology but it has a very steep learning curve and lots of boilerplate code.

 

It's not really different with Xeon Phi or Parallella.

Motivation

Our goal is to create a software/hardware combination that makes processing vector-like data easy and efficient.

 

We want to combine Ruby-like syntax with CUDA-like power.

Idea overview

The project comprises two parts:

  • software part: a simple-yet-expressive language designed with vector operations in mind and hence supporting only numerical operations.

 

  • hardware part: an FPGA-based coprocessor capable of performing efficient vector operations, such as basic vector arithmetic and cumulative functions like min, max, and average.

Idea overview

Model workflow:

  1. The programmer writes a program on the host computer.
  2. The program is compiled and passed alongside the data to the FPGA board.
  3. The coprocessor on the FPGA executes the instructions on the data.
  4. The data is sent back to the computer, where it is returned to the programmer in some output form.

Idea overview

What do we gain?

 

A VERY specialized tool that is fast and easy to use (we sacrifice generality and gain speed and ease of use in return).

 

There are many excellent processors that can do EVERYTHING. We will be able to do just one thing, but fast.

Syntax overview

A rather informal specification

Type system

The language is statically typed, which means that all type information is known at compile-time. This allows the compiler to report errors that would otherwise have to be caught by tests. Moreover, dimensionality information is checked at compile-time as well. This is a very prominent feature of the language.

 

As of now, everything has to be explicitly typed, i.e. every declaration needs to be accompanied by a type signature; nothing is inferred or implicit (except for the dimensionality of non-scalar values).
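
A small illustration of what this buys us, using the declaration syntax introduced on the following slides (a hypothetical program; the exact error message is an assumption, but the mismatch is rejected at compile-time):

var v: IntVector[3] = [1, 2, 3]
var w: IntVector[4] = [1, 2, 3, 4]
var u: IntVector[3] = v + w   # compile-time error: lengths 3 and 4 do not match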

Variables

Declaring a variable is very simple:

you write the "var" keyword, then the variable name, a type signature and an initial value.

var x : Int = 5
var v : IntVector[5] = [1, 2, 3, 4, 5]

Note that it is not allowed to declare a variable without giving it an initial value. Uninitialized variables are a common source of errors and give the programmer no additional expressive power, so they were eliminated.

Data types

The language supports four basic datatypes:

  • a boolean type with values: true and false
  • a scalar type Int: 32-bit integer signed numbers
  • a vector type IntVector: an ordered sequence of Ints
  • an atom (a bare identifier that carries no value of its own)

 

Additionally, there is a convenient matrix wrapper, which is internally stored as a vector:

  • IntMatrix: a wrapper for IntVector

When creating a non-scalar value, you need to specify its size.

Data types

Creating scalar values:

var x: Int = 42

Creating vectors:

var vec: IntVector[4] = [42, 24, 22, 44]

Accessing vector elements (indexing is zero-based):

print vec[2]
# => 22

Creating matrices:

var mat: IntMatrix[2, 2] = [42, 24, 22, 44]

Accessing matrix elements:

print mat[1,1]
# => 44

Data types

Like in most scientific libraries, the matrix order is column-major. This entails the following:

1. Vectors are "vertical". This means that, given two vectors x and y, x * y' is the outer product and x' * y is the inner product.

2. Matrices are in essence a layer of logic above vectors. The index M[row, col] is calculated as M[row + col*N] where N is the number of rows.

This is roughly how software like BLAS or LAPACK handles this.
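
For example, for the 2 x 2 matrix mat = [42, 24, 22, 44] from the previous slide, the indexing rule above maps the elements to flat storage as follows (a worked example):

mat[0, 0]   # index 0 + 0*2 = 0  => 42
mat[1, 0]   # index 1 + 0*2 = 1  => 24
mat[0, 1]   # index 0 + 1*2 = 2  => 22
mat[1, 1]   # index 1 + 1*2 = 3  => 44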

Atoms

Atoms are very light (like atoms in Erlang) and their sole purpose is to serve as unique identifiers. This means that, by themselves, they do not carry any additional information and are used only to distinguish things.

 

We reckon that in a data-processing language strings are an overly heavy means of labelling things, and something much lighter is in order. A quick example:

var x: Atom = Hello
var y: Atom = World
var z: Atom = Hello

x == y  # => false
y == z  # => false
x == z  # => true

You can compare atoms and pass them around. They take up as much space as an Int.

If Expression

The well-known "if", but it is an expression, not a statement. This means that the whole expression returns a value that you can, for instance, assign to a variable.

var y: Int = if x > 10 then
                 11
             else
                 x + 1
             end
    

The language is whitespace-insensitive, which means that we could rewrite the above code as follows:

var y: Int = if x > 10 then 11 else x + 1 end

For loop

This is the only loop available in the language, but it is really expressive. It allows you to iterate over vectors:

for (x: Int) in [1, 2, 3, 4, 5] do
    print (x + 5)
end

And to create new vectors (like list comprehensions):

var vec: IntVector = for (x: Int) in [1, 2, 3, 4] do x + 5 end   # returns [6, 7, 8, 9]

In essence, every for loop is a list comprehension. In the first example the result is simply discarded.

IO

IO is a really tough problem when the language runs on a separate device (the same problem CUDA had with IO a while ago). In general, all IO is done on the host (that is, the computer), so if you want to read/write to a console or file, it all happens on the host.

 

However, if you want to print something during the execution of the program, it is buffered and sent back to the computer, which handles IO after the program has finished on the device.

IO

IO is quite unusual: the language can only handle numerical data, so you can only read/write structured numerical data such as CSV.

var data: IntVector = readVector(Console)             # read a vector from Console
var mat: IntMatrix = readMatrix(File("example.csv")) # read a matrix from File

var data2: IntVector = for elem in data do elem + 5 end
writeVector(Console, data2)                           # write a vector to Console

var mat2: IntMatrix = for x in mat do x * 2 end
writeMatrix(File("example-processed.csv"), mat2)      # write a matrix to File

The parsing and writing to CSV is done automatically.

Functions

The function syntax is similar to the one known from Ruby. The keyword for defining a function is "def", and the definition ends with "end". You can separate statements either with newlines or with semicolons.

def square(x: Int): Int do
    x * x
end

def axpy(a: Int, x: IntVector[s], y: IntVector[s]): IntVector do
    var ax = a * x  # you can declare variables inside functions
    y + ax
end

Note how the type IntVector is parametrized with its length. This allows us to statically check for errors such as wrong dimensionality of vectors and matrices.

Also, you can use s in the function body like any other Int.

Dimensionality

As mentioned before, when creating a vector/matrix, you need to specify its dimensionality. With function parameters the issue is more subtle: you specify dimensions alongside the types, so you can enforce, for instance, that two vectors have the same length, and that is checked at compile-time.

# we want x and y to be of the same length
def axpy(a: Int, x: IntVector[s], y: IntVector[s]): IntVector do
    var ax = a * x  # you can declare variables inside functions
    y + ax
end

Dimensionality

What about the return type? It is inferred by the compiler like so:

def axpy(a: Int, x: IntVector[s], y: IntVector[s]): IntVector do  # the inferred length is s
    var ax = a * x  # you can declare variables inside functions
    y + ax
end

def outer(x: IntVector[dimx], y: IntVector[dimy]): IntMatrix do # the inferred type is IntMatrix[dimx, dimy]
    x * transpose(y)
end

def inner(x: IntVector[s], y: IntVector[s]): IntMatrix[1, 1] do # the explicit return dimensions are checked (and the check succeeds)
    transpose(x) * y
end

You can also add the explicit dimensionality to the result. In that case, it will be checked by the compiler.

Architecture overview

A bird's-eye view

Coprocessor

The goal is to make the system on the FPGA targeted specifically at vector processing. Hence, a lot of operations need to be executed in parallel, in a SIMD fashion (just like on GPGPUs).

 

The system decodes each assembly instruction and the execution unit performs the operations (which are, for the most part, some operations on vectors, so they fit SIMD well).

Assembly

The language is compiled to an intermediate form, which is easy to translate into machine code and send to the device.

 

The assembly mnemonics comprise a range of basic instructions, such as store and load, but there are also some vector-specific operations like vector addition, vector-times-scalar multiplication, dot product, and others.

 

Each assembly mnemonic corresponds to a certain opcode that can be executed by the system on the FPGA board.
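
To give a feel for this intermediate form, a simple scalar expression such as var z: Int = x + y could translate to something along these lines (a sketch: the addresses are made up, and the register convention is explained on the Registers slides):

LD   R0  0       // load x from scalar memory address 0 into R0
LD   R1  1       // load y from scalar memory address 1 into R1
ADD              // R0 <- R0 + R1
ST   2   R0      // store the result (z) at scalar memory address 2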

Assembly

Since we support both scalar and vector operations, we use a rather convenient convention when assigning opcodes to the mnemonics: every scalar operation has an odd opcode, and adding one to it gives the opcode of its vector counterpart (if one exists), and vice versa.

 

For example:

LD     --> 1
LDVEC  --> 2
//...
ADD    --> 5
ADDVEC --> 6

Assembly

The instruction set is still subject to change due to the iterative nature of the development process, but one can outline the basic instructions that are bound to be supported in the final design:

LD     reg    addr
LDVEC  regvec addr   len
ST     addr   reg
STVEC  addr   regvec len
ADD
ADDVEC
SUB
SUBVEC
MULE
MULEVEC
DIV
DIVVEC
BRDC   val    len

Memory

The memory addressing scheme is as simple as possible, with linear, contiguous, direct indexing (i.e. natural numbers).

 

There are three separate memory units:

  • code
  • scalar data
  • vector data

This allows us to use a very uniform addressing scheme and simplifies many operations.

Registers

There are two types of registers: scalar and vector. Currently the architecture contains two scalar and two vector registers.

 

When an operation (like addition) is performed, it ALWAYS operates on the registers. If it is a vector addition, vector registers are used (RV0 and RV1) and the result is stored in RV0. For scalar operations it is R0 and R1, respectively.

 

Unary operations use only one register, R(V)0.
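
For example, adding two 4-element vectors could look like this (a sketch with made-up addresses in vector data memory):

LDVEC  RV0  0   4    // load 4 elements starting at address 0 into RV0
LDVEC  RV1  4   4    // load 4 elements starting at address 4 into RV1
ADDVEC               // RV0 <- RV0 + RV1, element-wise
STVEC  8    RV0 4    // store the 4-element result at address 8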

Registers

Copying values to and from registers is possible through the LD(VEC) and ST(VEC) instructions (LOAD and STORE, respectively). 

LD     reg    addr        // loads value into reg from addr
LDVEC  regvec addr   len  // loads len elements into regvec from addr

Please note that there is no MOVE instruction. This is a well-justified design decision that contributes to the simplicity of the processor: a move can easily be expressed as a composition of LD and ST, and as such can be left out of the instruction set.
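
For instance, copying a scalar value from one memory location to another (what a MOVE would do) is simply the following, where src and dst are placeholder addresses:

LD  R0   src   // load the value at src into R0
ST  dst  R0    // store R0 at dst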

Communication

Every program is first compiled: the compiler checks for type errors (which in our case also means dimensionality errors) and if none are found, translates the source code to assembly and then to machine code.

 

Machine code and data are then passed to the coprocessor through a bus and the processing takes place.

 

When all is done, the data is returned to the host.

Compiler

Parsing

Parsing is based on the idea that everything is an expression, which simplifies the Abstract Syntax Tree and the processing in general.

 

The parsing is done using parser combinators, specifically Haskell's Parsec library.

 

This allows for both monadic and applicative parsing. Parsing with combinators is much more intuitive and corresponds more closely to the language's grammar than the well-known LR parsers.

Parsing

The (simplified) AST looks like this:

type Module = [Expr]

data Expr = Lit Integer	
          | VecLit [Integer]
          | VarE Var
          | BinOp Op Expr Expr
          | If Expr [Expr] [Expr]
          | Assign Var Expr
          | Decl Var Type Expr
          deriving (Show, Eq, Ord)

type Var = String
data Type = Scalar | Vector Integer deriving (Show, Eq, Ord)


data Op = Add | Sub | Mul | Div deriving (Show, Eq, Ord)

Parsing

Firstly, one would use a lexer to extract tokens from the program text. This is essentially a "labelling" of atomic code pieces that is later used by the parser.

languageDef =
  emptyDef { Token.commentStart    = "{#"
           , Token.commentEnd      = "#}"
           , Token.commentLine     = "#"
           , Token.identStart      = letter
           , Token.identLetter     = alphaNum
           , Token.reservedNames   = [ "if"
                                     , "then"
                                     , "else"
                                     , "end"
                                     , "Int"
                                     , "IntVector"
                                     ]
           , Token.reservedOpNames = ["+", "-", "*", "/", "=", ":", "[", "]", ","]
           }
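
The definition above is turned into concrete token parsers with Parsec's makeTokenParser. The L.reserved helpers used by the parser on the next slide could be derived roughly like this (a sketch; the names are assumptions):

lexer = Token.makeTokenParser languageDef

reserved   = Token.reserved   lexer   -- parses a reserved word such as "if"
identifier = Token.identifier lexer   -- parses a legal identifier
integer    = Token.integer    lexer   -- parses an integer literal
reservedOp = Token.reservedOp lexer   -- parses an operator such as "+"
whiteSpace = Token.whiteSpace lexer   -- skips whitespace and comments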

Parsing

Parsing itself is done by 'matching' appropriate token groups to definitions. For example, this is how one could parse a conditional expression:

ifStmt :: Parser A.Expr
ifStmt =
  do L.reserved "if"
     cond  <- aExpression
     L.reserved "then"
     stmt1 <- statement
     L.reserved "else"
     stmt2 <- statement
     L.reserved "end"
     return $ A.If cond stmt1 stmt2

We match the parsed tokens to the definition, binding some to variables and discarding others.

Code generation

Generating not-so-clumsy assembly code is a non-trivial task. One has to manage register allocation and memory access with great care and some ingenuity.

 

Most modern languages use some form of Intermediate Representation to simplify this task. We generate assembly in two phases:

 

1. abstract assembly

2. real assembly that can be executed by FPGA

Code generation

The first phase of assembly generation uses an infinite number of "abstract" registers. This means we don't (yet) have to worry about swapping registers and memory accesses.

 

Later on, the second phase rewrites the generated assembly so that it does not exceed the number of registers actually available on the processor.

 

This kind of separation means not only easier code generation but also increased modularity (you can easily change the number of registers).
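
A minimal sketch of what the two representations could look like (the names and constructors are assumptions, not the actual implementation):

-- phase 1: instructions over an unbounded supply of virtual registers
newtype VirtReg = VirtReg Int

data AbstractInstr = ALoad  VirtReg Int              -- load from an address into a virtual register
                   | AAdd   VirtReg VirtReg VirtReg  -- dst <- src1 + src2
                   | AStore Int VirtReg              -- store a virtual register at an address

-- phase 2: only the registers the processor actually has
data PhysReg = R0 | R1

data RealInstr = Load  PhysReg Int
               | Add                 -- implicitly R0 <- R0 + R1
               | Store Int PhysReg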

FPGA Coprocessor

By ambrozyk
