Superoptimisation

by David Thomas — September 2016

This talk is about...

Using a superoptimiser program to discover optimal, useful, or just plain neat assembly-level code sequences

Fast or Small?

Compilers usually only give us a per-file choice between fast (-O2 / -O3) and small (-Os) code generation

But at different times we may need to express our code in ways which emphasise different properties:

as fast as possible
as small as possible
avoiding multiplication or division
avoiding any branching

So in these instances we may have to do some of the compiler's work ourselves

An Example

Let's say we want to avoid division but we still want to be as fast as possible and our divisor is constant

We can hunt around and find something like Jim Blinn's fast divide by 255 formula:

#define DIV_255(x) (((x) + 1 + (((x) + 1) >> 8)) >> 8)

But what if that's not exactly what we need?

Can we find others similar to it?
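A quick way to build confidence in the DIV_255 macro is to compare it with true integer division. The sketch below is a Python transcription (the name div_255 is ours, for illustration only), and it checks only the range 0..65535, which covers the product of any two 8-bit values.

```python
def div_255(x):
    """Python transcription of the DIV_255 macro (the function name is ours)."""
    return (x + 1 + ((x + 1) >> 8)) >> 8

# Compare against floor division over the whole 16-bit range.
mismatches = [x for x in range(0x10000) if div_255(x) != x // 255]
print("mismatches in 0..65535:", len(mismatches))
```

Nothing here says how the formula behaves outside that range, so treat the result as valid for 16-bit inputs only.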

Superoptimisers

Small sequences like these can be discovered with a superoptimiser

Superoptimisers are not as clever as the name might suggest: typically they perform an exhaustive search through a virtual instruction set

They generate tiny programs by constructing every possible sequence of instructions, then run each program against trial values until they find one which works
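The generate-and-test loop can be sketched in a few lines. Everything in this example is made up for illustration: the five-operation instruction set, the two immediates, the trial values and the target function (clear the lowest set bit of x). It is not the instruction set or search strategy of any real superoptimiser.

```python
from itertools import product

MASK = 0xFFFFFFFF  # emulate 32-bit registers

# A made-up mini instruction set: name -> function of two 32-bit operands.
OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}
CONSTANTS = [1, MASK]  # immediates 1 and -1 (hypothetical choice)
TRIALS = [0, 1, 2, 3, 6, 8, 12, 0x80000000, 0xFFFFFFFF, 0x01234567]

def run(program, x):
    """Execute a program; operands index into [x, constants..., earlier results...]."""
    values = [x] + CONSTANTS
    for op, a, b in program:
        values.append(OPS[op](values[a], values[b]))
    return values[-1]

def candidates(length):
    """Yield every program of the given length."""
    base = 1 + len(CONSTANTS)
    slots = [
        [(op, a, b)
         for op in OPS
         for a in range(base + i)
         for b in range(base + i)]
        for i in range(length)
    ]
    return product(*slots)

def search(target, max_length):
    """Return the first shortest program that matches target on every trial value."""
    for length in range(1, max_length + 1):
        for program in candidates(length):
            if all(run(program, x) == target(x) for x in TRIALS):
                return program
    return None

def target(x):
    """The function we want code for: clear the lowest set bit of x."""
    return x & (x - 1) & MASK

found = search(target, 2)
print(found)
```

A returned program is only known to agree with the target on TRIALS, so it still needs separate verification before it can be trusted for all inputs.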

Limitations

Sequences have to be quite short
- Practical limit of 4 ops max
  - (unless you want to wait for days or weeks)
Multi-axis search space:
- A 32-instruction set gives 32^4 ~= 1M 4-instruction programs to try out
Each program needs many trial constants pumped through it
⇒ Long runtimes
Have to be careful to limit each axis
Can't branch or use the CPU flags
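The 32^4 figure above can be reproduced, and extended to other lengths, with a few lines of arithmetic. This sketch counts opcode choices only; the helper name is ours, and the choice of operands for each instruction multiplies the totals further.

```python
def opcode_sequences(instruction_count, program_length):
    """Number of ways to pick one opcode per slot, ignoring operand choices."""
    return instruction_count ** program_length

for length in range(1, 7):
    print(f"{length} ops: {opcode_sequences(32, length):>13,} sequences")
```

Each extra operation multiplies the count by the size of the instruction set, which is why the practical limit arrives so quickly.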

GNU superopt (1991)

"a function sequence generator that uses exhaustive generate-and-test approach to find the shortest instruction sequence for a given function."
"You must tell the superoptimizer which function and which CPU you want to get code for."
"It cannot generate very long sequences unless you have a very fast computer."

Targets a number of real CPUs
Gained AVR support a couple of years ago: https://github.com/embecosm/gnu-superopt

Aha! (2002)

"A Hacker's Assistant"
by Henry Warren
This is the one we'll look at
Targets a "generic RISC" instruction set

STOKE (2013+)

Stochastic optimiser
Random search
x86-64 only
Does full verification
http://stoke.stanford.edu/

Souper (2014+)

Superoptimiser for LLVM IR
Uses SMT solver
Can cache results using Redis
Not an official Google project
https://github.com/google/souper

Let's Look at Aha

Aha Input

/* artificial.frag.c */

#include "aha.h"

int userfun(int x)
{
  if (x == 0)      return 1;
  else if (x == 1) return 2;
  else             return 0;
}

Note:

No branches

No state

No side-effects

Build Aha

$ make EXAMPLE=artificial aha
gcc -c -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o aha.o aha.c
gcc -c -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o simulator.o simulator.c
gcc -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o aha aha.o simulator.o

$ ./aha 3
Searching for programs with 3 operations.

Found 0 solutions.
Counters = 372751, 382255, 952561, total = 1707567
Process time = 0.029 secs

$ ./aha 4
Searching for programs with 4 operations.

Found a 4-operation program:
   add   r1,rx,-2
   bic   r2,r1,rx
   shr   r3,r2,31
   shl   r4,r3,rx
   Expr: ((((x + -2) & ~x) >>u 31) << x)
(... omitted ...)

Found 4 solutions.
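To see why the 4-operation program works, it helps to step through it by hand. The sketch below emulates the four instructions on 32-bit values in Python and compares them with userfun. The function names and the sample values are ours. The final shift is guarded because r3 is zero for every x other than 0 and 1, so the result does not depend on how the machine treats large shift amounts.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit registers

def userfun(x):
    """The function given to Aha in artificial.frag.c."""
    if x == 0:
        return 1
    elif x == 1:
        return 2
    else:
        return 0

def found_program(x):
    """Step-by-step emulation of the 4-operation program Aha printed."""
    rx = x & MASK
    r1 = (rx + -2) & MASK            # add r1,rx,-2
    r2 = r1 & ~rx & MASK             # bic r2,r1,rx  (r1 AND NOT rx)
    r3 = r2 >> 31                    # shr r3,r2,31  (unsigned shift)
    # shl r4,r3,rx: shifting zero gives zero whatever the shift amount is
    r4 = (r3 << rx) & MASK if r3 else 0
    return r4

samples = [0, 1, -1, 2, -2, 3, -3, 64, -64,
           -0x80000000, 0x7FFFFFFF, 0x01234567, 0x0000FFFF]
samples += list(range(-1000, 1000))
bad = [x for x in samples if found_program(x) != userfun(x)]
print("mismatches:", bad)
```

The top bit of (x - 2) & ~x is set only when x - 2 is negative and x is not, which happens only for x = 0 and x = 1. The shift by x then turns that single 1 into 1 or 2.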

Trial Values

Trial values here are values we test our synthesised routine against
You can add more but it will slow Aha down
Being able to control the trial values is useful:
- Say you have a routine for which x is only ever 0..255
- Solutions may exist for that range which won't work in the fully general case

#define TRIAL {1, 0, -1, \
               MAXNEG, MAXPOS, MAXNEG + 1, MAXPOS - 1, \
               0x01234567, 0x89ABCDEF, -2, 2, -3, 3, -64, 64, -5, -31415, \
               0x0000FFFF, 0xFFFF0000, \
               0x000000FF, 0x0000FF00, 0x00FF0000, 0xFF000000, \
               0x0000000F, 0x000000F0, 0x00000F00, 0x0000F000, \
               0x000F0000, 0x00F00000, 0x0F000000, 0xF0000000}
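The effect of the input range can be illustrated with the divide-by-255 formula from earlier in the talk. This is a sketch only: div_255 and first_failure are our own helper names, and the two ranges are arbitrary choices. It reports the first input, if any, where a candidate disagrees with a reference function.

```python
def div_255(x):
    """Python transcription of the DIV_255 macro (the function name is ours)."""
    return (x + 1 + ((x + 1) >> 8)) >> 8

def first_failure(candidate, reference, values):
    """Return the first value where candidate and reference differ, else None."""
    for x in values:
        if candidate(x) != reference(x):
            return x
    return None

def exact(x):
    return x // 255

narrow = first_failure(div_255, exact, range(1 << 16))  # 16-bit inputs only
wide = first_failure(div_255, exact, range(1 << 20))    # a wider range
print("16-bit range:", narrow)
print("20-bit range:", wide)
```

A candidate that passes on the narrow range is a perfectly good solution if the inputs really are that narrow, which is the reason for tailoring the trial values to the problem.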

Customisable Instruction Set

Aha has its own mini "generic RISC" instruction set defined in machine.h
You can add or remove instructions to enhance or limit output
e.g. to make it more like 32-bit ARM I added
- BIC (bitwise clear / AND NOT) and
- RSB (reverse subtract)
And disabled division as ARMs don't usually have hardware division
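For reference, the two added operations behave as follows on 32-bit values. This is a Python sketch of their semantics only, with helper names of our own choosing. It does not show how instructions are declared in machine.h.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit registers

def bic(a, b):
    """BIC: bitwise clear, i.e. a AND NOT b."""
    return a & ~b & MASK

def rsb(a, b):
    """RSB: reverse subtract, i.e. b - a rather than a - b."""
    return (b - a) & MASK

print(hex(bic(0xFF, 0x0F)))
print(rsb(1, 10))
```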

Verification

We saw in the "Trial Values" slide that Aha does an exhaustive search but not an exhaustive test
It may produce solutions which work only for its trial values
So how do we make sure the routines really work?
- We could write a test shell
- Exercise the code for 0..UINT_MAX
But that's not going to work for 64-bit runs, is it?
- No, for that we would have to use an automated theorem prover...

Verification using Z3

Z3 is an automated theorem prover
We can use it to test the solutions which Aha! emits

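from z3 import *

x = BitVec('x', 32)
y = BitVec('y', 32)
output = BitVec('output', 32)

s = Solver()
# Reference function: output = x XOR y
s.add(x ^ y == output)
# Ask for any x, y where the candidate differs from the reference
s.add(((y & x) * 0xFFFFFFFE) + (y + x) != output)
# "unsat" means no counterexample exists, so the candidate is equivalent
print(s.check())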
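The candidate being checked rests on the identity x + y = (x ^ y) + 2 * (x & y): XOR is addition without carries, and AND marks the positions that carry. Rearranged, x ^ y = x + y - 2 * (x & y), and multiplying by 0xFFFFFFFE is multiplying by -2 modulo 2^32. The sketch below is a cheap spot check of that identity in plain Python, using a fixed random seed and sample count of our own choosing. Unlike the Z3 run it proves nothing, but it catches transcription mistakes quickly.

```python
import random

MASK = 0xFFFFFFFF  # emulate 32-bit arithmetic

def candidate(x, y):
    """The multiply-and-add expression, reduced modulo 2^32."""
    return (((y & x) * 0xFFFFFFFE) + (y + x)) & MASK

rng = random.Random(2016)  # fixed seed so the run is repeatable
samples = [(0, 0), (MASK, MASK), (0, MASK), (0x80000000, 0x80000000)]
samples += [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(10000)]

failures = [(x, y) for x, y in samples if candidate(x, y) != x ^ y]
print("failures:", len(failures))
```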