FPGA Coprocessor
And a dedicated data-processing language
Agenda
- Motivation
- Idea overview
- Language syntax overview
- System architecture
Motivation
GPGPU is a terrific technology, but it has a very steep learning curve and requires lots of boilerplate code.
The situation is not really different with Xeon Phi or Parallella.

Motivation
Our goal is to create a software/hardware combination that makes processing vector-like data easy and efficient.
We want to combine Ruby-like syntax with CUDA-like power.

Idea overview
The project comprises two parts:
- software part: a simple-yet-expressive language designed with vector operations in mind, and hence supporting only numerical operations.
- hardware part: an FPGA-based coprocessor capable of performing efficient vector operations, such as basic vector arithmetic and reductions like min, max, average, etc.
Idea overview
Model workflow:
- The programmer writes a program on a computer
- The program is compiled and passed, alongside the data, to the FPGA board
- The coprocessor on the FPGA executes the instructions on the data
- The data is sent back to the computer, where it is returned to the programmer in some output form
Idea overview
What do we gain?
A VERY specialized tool, which makes it fast and easy to use (we are sacrificing generality and gaining the above in return).
There are many excellent processors that can do EVERYTHING. We will be able to do just one thing, but fast.
Syntax overview
A rather informal specification
Type system
The language is statically typed, which means that all of the type information is known at compile-time. This allows the compiler to report errors that would otherwise have to be tested against. Moreover, the information about the dimensionality is checked at compile-time as well. This is a very prominent feature of the language.
As of now, everything has to be explicitly typed, i.e. every declaration needs to be accompanied by a type signature, and nothing is inferred (except for the dimensionality of non-scalar values) or implicit.
Variables
Declaring a variable is very simple:
you write the "var" keyword, then the variable name, a type signature and an initial value.
var x : Int = 5
var v : IntVector[5] = [1, 2, 3, 4, 5]
Note that it is not allowed to declare a variable without giving it an initial value. This is a common source of errors and gives the programmer no additional expressive power, so it was eliminated.
Data types
The language supports four basic datatypes:
- a boolean type with values: true and false
- a scalar type Int: 32-bit signed integers
- a vector type IntVector: an ordered sequence of Ints
- an atom: a bare identifier that carries no value of its own
Additionally, there is a convenient wrapper for matrices, which are internally stored as vectors:
- IntMatrix: a wrapper around IntVector
When creating a non-scalar value, you need to specify its size.
Data types
Creating scalar values:
var x: Int = 42
Creating vectors:
var vec: IntVector[4] = [42, 24, 22, 44]
Accessing vector elements (indexing is zero-based):
print vec[2]
# => 22
Creating matrices:
var mat: IntMatrix[2, 2] = [42, 24, 22, 44]
Accessing matrix elements:
print mat[1,1]
# => 44
Data types
Like in most scientific libraries, the matrix order is column-major. This entails the following:
1. Vectors are "vertical". This means that, given two vectors x and y, x * y' is the outer product and x' * y is the inner product.
2. Matrices are in essence a layer of logic above vectors. The index M[row, col] maps to the underlying vector index row + col*N, where N is the number of rows.
This is roughly how software like BLAS or LAPACK handles it.
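To make the mapping concrete, here is the index calculation as a tiny Haskell helper (a sketch of the convention above, not part of the compiler):

-- Element M[row, col] of a matrix with nRows rows lives at flat
-- index row + col * nRows (all indices zero-based).
colMajorIndex :: Int -> Int -> Int -> Int
colMajorIndex nRows row col = row + col * nRows

-- For mat = [42, 24, 22, 44] seen as a 2x2 matrix:
-- colMajorIndex 2 1 1 == 3, and element 3 is 44.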
Atoms
Atoms are very lightweight (like atoms in Erlang) and their sole purpose is to serve as unique identifiers. This means that, by themselves, they do not carry any additional information and are used only to distinguish things.
We reckon that in a data-processing language strings are an overly heavy means of labelling things and something much lighter is in order. A quick example:
var x: Atom = Hello
var y: Atom = World
var z: Atom = Hello
x == y # => false
y == z # => false
x == z # => true
You can compare atoms and pass them around. They take up as much space as an int.
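One plausible way to get these properties (an implementation sketch, not a confirmed detail of the project) is to intern every atom name into an Int at compile time and compare the Ints afterwards:

import qualified Data.Map as M

-- Each distinct name gets a small Int; a repeated name reuses the
-- existing one, so equal names compare equal and an atom occupies
-- a single machine word.
intern :: String -> M.Map String Int -> (Int, M.Map String Int)
intern name table =
  case M.lookup name table of
    Just i  -> (i, table)
    Nothing -> let i = M.size table
               in (i, M.insert name i table)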
If Expression
The well-known "if", but it is an expression, not a statement. This means that the whole expression returns a value that you can, for instance, assign to a variable.
var y: Int = if x > 10 then
  11
else
  x + 1
end
The language is whitespace-insensitive, which means that we could rewrite the above code as follows:
var y: Int = if x > 10 then 11 else x + 1 end
For loop
This is the only loop available in the language, but it is really expressive. It allows you to iterate over vectors:
for (x: Int) in [1, 2, 3, 4, 5] do
  print (x + 5)
end
And create new vectors (like list comprehensions).
var vec: IntVector = for (x: Int) in [1, 2, 3, 4] do x + 5 end # returns [6, 7, 8, 9]
In essence, every for loop is a list comprehension. In the first example the result is simply discarded.
IO
This is a really tough problem when the language runs on a separate device (the same problem that CUDA had a while ago with IO). In general, all IO is done on the host (that is, the computer), so if you want to read/write to console or file, it all happens on the host.
However, if you want to print something during the execution of the program, it is buffered and sent back to the computer, which handles IO after the program has finished on the device.
IO
IO is quite unusual, because it can only handle numerical data and thus you can only read/write structured data like CSV.
var data: IntVector = readVector(Console) # read a vector from Console
var mat: IntMatrix = readMatrix(File("example.csv")) # read a matrix from File
var data2: IntVector = for (elem: Int) in data do elem + 5 end
writeVector(Console, data2) # write a vector to Console
var mat2: IntMatrix = for (x: Int) in mat do x * 2 end
writeMatrix(File("example-processed.csv"), mat2) # write a matrix to File
The parsing and writing of CSV files is done automatically.
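Since the data is purely numerical, the host-side CSV handling can stay very small. A minimal sketch of how the host might parse one row (a hypothetical helper using the split package, not the shipped API):

import Data.List.Split (splitOn) -- from the `split` package

-- One CSV row of integers becomes the host-side representation of
-- an IntVector.
parseCsvRow :: String -> [Int]
parseCsvRow = map read . splitOn ","

-- parseCsvRow "42,24,22,44" == [42, 24, 22, 44]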
Functions
The syntax in functions is similar to the one known from Ruby. The keyword for defining a function is "def", and the definition ends with "end". You can either separate statements with newlines or with semicolons.
def square(x: Int): Int do
  x * x
end
def axpy(a: Int, x: IntVector[s], y: IntVector[s]): IntVector do
  var ax = a * x # you can declare variables inside functions
  y + ax
end
Note how the type IntVector is parametrized with its length. This allows us to statically check for errors such as wrong dimensionality of vectors and matrices.
Also, you can use s in the function body like any other Int.
Dimensionality
As mentioned before, when creating a vector/matrix, you need to specify its dimensionality. However, with function parameters the issue is more complex. You specify them alongside the types, so that you can enforce, for instance, the same lengths for two vectors and that is checked at compile-time.
# we want x and y to be of the same length
def axpy(a: Int, x: IntVector[s], y: IntVector[s]): IntVector do
  var ax = a * x # you can declare variables inside functions
  y + ax
end
Dimensionality
What about the return type? It is inferred by the compiler like so:
def axpy(a: Int, x: IntVector[s], y: IntVector[s]): IntVector do # the inferred length is s
  var ax = a * x # you can declare variables inside functions
  y + ax
end
def outer(x: IntVector[dimx], y: IntVector[dimy]): IntMatrix do # inferred type is IntMatrix[dimx, dimy]
  x * transpose(y)
end
def inner(x: IntVector[s], y: IntVector[s]): IntMatrix[1, 1] do # this will be checked (and it will succeed)
  transpose(x) * y
end
You can also add the explicit dimensionality to the result. In that case, it will be checked by the compiler.
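Under the hood, this check boils down to unifying symbolic dimensions like s, dimx and dimy with the concrete sizes at each call site. A minimal sketch of the idea (names and types are assumptions, not the project's actual checker):

import qualified Data.Map as M

data Dim = DimVar String -- a symbolic size such as s or dimx
         | DimLit Int    -- a concrete size such as 5
         deriving (Show, Eq)

-- Bind every dimension variable consistently across all arguments;
-- Nothing signals a compile-time dimensionality error.
unifyDims :: [(Dim, Int)] -> Maybe (M.Map String Int)
unifyDims = foldl step (Just M.empty)
  where
    step Nothing _ = Nothing
    step (Just env) (DimLit n, actual)
      | n == actual = Just env
      | otherwise   = Nothing
    step (Just env) (DimVar v, actual) =
      case M.lookup v env of
        Nothing -> Just (M.insert v actual env)
        Just n  -> if n == actual then Just env else Nothing

-- Calling axpy with a 5-element x and a 4-element y fails:
-- unifyDims [(DimVar "s", 5), (DimVar "s", 4)] == Nothing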
Architecture overview
A bird's-eye view
Coprocessor
The goal is to make the system on FPGA targeted specifically at vector processing. Hence, a lot of operations need to be executed in parallel, in a SIMD sense (just like GPGPUs).
The system decodes each assembly instruction and the execution unit performs the operations (which are, for the most part, some operations on vectors, so they fit SIMD well).
Assembly
The language is compiled to an intermediate form, which is easy to translate into machine code and send to the device.
The assembly mnemonics comprise a range of basic instructions, such as load and store, but there are also some vector-specific operations like vector addition, vector-scalar multiplication, dot product and others.
Each assembly mnemonic corresponds to a certain opcode that can be executed by the system on FPGA board.
Assembly
Since we support both scalar and vector operations, we introduce a rather convenient convention when assigning opcodes to mnemonics: every scalar operation has an odd opcode, and adding one to it gives its vector counterpart (where one exists), and vice versa.
For example:
LD --> 1
LDVEC --> 2
//...
ADD --> 5
ADDVEC --> 6
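The convention is trivial to encode and to test. A Haskell sketch (the numbers match the examples above, assuming ST and STVEC occupy opcodes 3 and 4; the real table is still in flux):

data Mnemonic = LD | LDVEC | ST | STVEC | ADD | ADDVEC
  deriving (Show, Eq, Enum)

-- Scalar opcodes are odd; the vector variant is always scalar + 1.
opcode :: Mnemonic -> Int
opcode m = fromEnum m + 1 -- LD = 1, LDVEC = 2, ..., ADDVEC = 6

isVectorOp :: Mnemonic -> Bool
isVectorOp = even . opcode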
Assembly
The instruction set is still subject to change, due to the iterative nature of the development process, but one can outline the basic instructions that are bound to be supported in the final design:
LD reg addr
LDVEC regvec addr len
ST addr reg
STVEC addr regvec len
ADD
ADDVEC
SUB
SUBVEC
MUL
MULVEC
DIV
DIVVEC
BRDC val len
Memory
The memory addressing scheme is as simple as possible: linear, contiguous, direct indexing (i.e. natural numbers).
There are three separate memory units:
- code
- scalar data
- vector data
This allows us to use a very uniform addressing scheme and simplifies many operations.
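A sketch of the three address spaces as a host-side model or simulator might represent them (the Array choice and sizes are assumptions):

import Data.Array

-- Three independent, linearly addressed memories; an address is just
-- a natural number into the relevant unit.
data Memory = Memory
  { codeMem   :: Array Int Int -- machine-code words
  , scalarMem :: Array Int Int -- scalar data
  , vectorMem :: Array Int Int -- vector data, stored flat
  }

emptyMemory :: Int -> Memory
emptyMemory size = Memory blank blank blank
  where blank = listArray (0, size - 1) (replicate size 0)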
Registers
There are two types of registers: scalar and vector. Currently the architecture contains two scalar and two vector registers.
When an operation (like addition) is performed, it ALWAYS operates on the registers. If it is a vector addition, vector registers are used (RV0 and RV1) and the result is stored in RV0. For scalar operations it is R0 and R1, respectively.
Unary operations use only one register, R(V)0.
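The fixed-register convention keeps instruction semantics uniform. A small model of it (names are assumptions; this is a sketch, not the RTL):

-- The four architectural registers described above.
data Regs = Regs
  { r0, r1   :: Int   -- scalar registers
  , rv0, rv1 :: [Int] -- vector registers
  }

-- ADD always reads R0 and R1 and writes its result to R0.
addScalar :: Regs -> Regs
addScalar regs = regs { r0 = r0 regs + r1 regs }

-- ADDVEC does the same elementwise with RV0 and RV1.
addVec :: Regs -> Regs
addVec regs = regs { rv0 = zipWith (+) (rv0 regs) (rv1 regs) }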
Registers
Copying values to and from registers is possible through the LD(VEC) and ST(VEC) instructions (LOAD and STORE, respectively).
LD reg addr // loads value into reg from addr
LDVEC regvec addr len // loads len elements into regvec from addr
Please note that there is no MOVE instruction. This is a deliberate design decision that contributes to the simplicity of the processor: a move is simply a composition of LD and ST, so it can be left out of the instruction set.
Communication
Every program is first compiled: the compiler checks for type errors (which in our case also means dimensionality errors) and if none are found, translates the source code to assembly and then to machine code.
Machine code and data are then passed to the coprocessor through a bus and the processing takes place.
When all is done, the data is returned to the host.
Compiler
Parsing
Parsing is based on the idea that everything is an expression, which simplifies the Abstract Syntax Tree and the processing in general.
The parsing is done using parser combinators, specifically Haskell's Parsec library.
This allows for both monadic and applicative parsing. Parsing with combinators is much more intuitive and corresponds more directly to the language's grammar than the well-known LR parsers.
Parsing
The (simplified) AST looks like this:
type Module = [Expr]

data Expr = Lit Integer
          | VecLit [Integer]
          | VarE Var
          | BinOp Op Expr Expr
          | If Expr [Expr] [Expr]
          | Assign Var Expr
          | Decl Var Type Expr
          deriving (Show, Eq, Ord)

type Var = String
data Type = Scalar | Vector Integer deriving (Show, Eq, Ord)
data Op = Add | Sub | Mul | Div deriving (Show, Eq, Ord)
Parsing
Firstly, one would use a lexer to extract tokens from the program text. This is essentially "labelling" of atomic code pieces and is later used by the parser.
languageDef =
  emptyDef { Token.commentStart    = "{#"
           , Token.commentEnd      = "#}"
           , Token.commentLine     = "#"
           , Token.identStart      = letter
           , Token.identLetter     = alphaNum
           , Token.reservedNames   = [ "if"
                                     , "then"
                                     , "else"
                                     , "end"
                                     , "Int"
                                     , "IntVector"
                                     ]
           , Token.reservedOpNames = ["+", "-", "*", "/", "=", ":", "[", "]", ","]
           }
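The missing link between this definition and the parser below is Parsec's makeTokenParser, which turns languageDef into concrete token parsers (standard Parsec plumbing; we assume these are what the L prefix in ifStmt refers to):

lexer = Token.makeTokenParser languageDef

reserved   = Token.reserved   lexer -- a reserved word such as "if"
identifier = Token.identifier lexer -- a variable name
integer    = Token.integer    lexer -- an integer literal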
Parsing
Parsing itself is done by 'matching' appropriate token groups to definitions. For example, this is how one could parse a conditional expression:
ifStmt :: Parser A.Expr
ifStmt =
  do L.reserved "if"
     cond <- aExpression
     L.reserved "then"
     stmt1 <- statement
     L.reserved "else"
     stmt2 <- statement
     L.reserved "end"
     return $ A.If cond stmt1 stmt2
We match the parsed tokens to the definition, binding some to variables and discarding others.
Code generation
Generating not-so-clumsy assembly code is a non-trivial task. One has to manage register allocation and memory access with great care and some ingenuity.
Most modern languages use some form of Intermediate Representation to simplify this task. We generate assembly in two phases:
1. abstract assembly
2. real assembly that can be executed by the FPGA
Code generation
The first phase of assembly generation uses an infinite supply of "abstract" registers, which means we don't (yet) have to worry about swapping values between registers and memory.
Later on, the second phase rewrites the generated assembly so as not to exceed the number of registers available on the processor.
This kind of separation means not only easier code generation but also increased modularity (you can easily change the number of registers).
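A minimal sketch of the two phases (all names and types are assumptions, not the project's actual IR). Abstract instructions name their operands freely; lowering pins values onto the register convention from the architecture section:

-- Phase 1 output: abstract assembly over unlimited registers.
data AInstr = AAdd Int Int Int -- AAdd dst src1 src2
  deriving (Show)

-- Phase 2 output: real assembly mirroring the ISA (ADD implicitly
-- reads R0 and R1 and leaves its result in R0).
data Reg    = R0 | R1 deriving (Show)
data RInstr = Ld Reg Int | St Int Reg | Add deriving (Show)

-- Lower one abstract addition, assuming every abstract register has
-- been given a home address in scalar memory (a real allocator would
-- keep hot values in registers instead of spilling everything).
lowerAdd :: (Int -> Int) -> AInstr -> [RInstr]
lowerAdd homeOf (AAdd dst s1 s2) =
  [ Ld R0 (homeOf s1)
  , Ld R1 (homeOf s2)
  , Add
  , St (homeOf dst) R0
  ]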