Superoptimisation
by David Thomas — September 2016
This talk is about...
- Using a superoptimiser program to discover optimal, useful, or just plain neat, assembly level code sequences
Fast or Small?
Compilers usually only give us a per-file choice between fast (-O2 / -O3) and small (-Os) code generation
But at different times we may need to express our code in ways which emphasise different properties:
- as fast as possible
- as small as possible
- avoiding multiplication or division
- avoiding any branching
So in these instances we may have to do some of the compiler's work ourselves
An Example
Let's say we want to avoid division but we still want to be as fast as possible and our divisor is constant
We can hunt around and find something like Jim Blinn's fast divide-by-255 formula:
#define DIV_255(x) (((x) + 1 + (((x) + 1) >> 8)) >> 8)
But what if that's not exactly what we need?
Can we find others similar to it?
Superoptimisers
Small sequences like these can be discovered with a superoptimiser
Superoptimisers are not as clever as the name might suggest: typically they perform an exhaustive search through a virtual instruction set
They generate tiny programs by constructing every possible permutation of instructions and then run these programs against trial values until they find one which works
Limitations
- Sequences have to be quite short
- Practical limit of 4 ops max
- (unless you want to wait for days or weeks)
- Multiple axis space to search:
- 32-instruction set would be 32^4 ~= 1M 4-instruction programs to try out
- Each program needs many trial constants pumped through it
- ⇒ Long runtimes
- Have to be careful to limit each axis
- Can't branch or use the CPU flags
GNU superopt (1991)
- "a function sequence generator that uses exhaustive generate-and-test approach to find the shortest instruction sequence for a given function."
- "You must tell the superoptimizer which function and which CPU you want to get code for."
- "It cannot generate very long sequences unless you have a very fast computer."
- Targets a number of real CPUs
- Gained AVR support a couple of years ago: https://github.com/embecosm/gnu-superopt
Aha! (2002)
- "A Hacker's Assistant"
- by Henry Warren
- This is the one we'll look at
- Targets a "generic RISC" instruction set
STOKE (2013+)
- Stochastic optimiser
- Random search
- x86-64 only
- Does full verification
- http://stoke.stanford.edu/
Souper (2014+)
- Superoptimiser for LLVM IR
- Uses SMT solver
- Can cache results using Redis
- Non-official Google project
- https://github.com/google/souper
Let's Look at Aha
Aha Input
/* artificial.frag.c */
#include "aha.h"
int userfun(int x)
{
if (x == 0) return 1;
else if (x == 1) return 2;
else return 0;
}
Note:
No branches
No state
No side-effects
Build Aha
$ make EXAMPLE=artificial aha
gcc -c -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o aha.o aha.c
gcc -c -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o simulator.o simulator.c
gcc -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o aha aha.o simulator.o
$ ./aha 3
Searching for programs with 3 operations.
Found 0 solutions.
Counters = 372751, 382255, 952561, total = 1707567
Process time = 0.029 secs
$ ./aha 4
Searching for programs with 4 operations.
Found a 4-operation program:
add r1,rx,-2
bic r2,r1,rx
shr r3,r2,31
shl r4,r3,rx
Expr: ((((x + -2) & ~x) >>u 31) << x)
(... omitted ...)
Found 4 solutions.
Trial Values
- Trial values here are values we test our synthesised routine against
- You can add more but it will slow Aha down
- Being able to control the trial values is useful:
- Say you have a routine for which x is only ever 0..255
- Solutions may exist for that range which won't work in the fully general case
#define TRIAL {1, 0, -1, \
MAXNEG, MAXPOS, MAXNEG + 1, MAXPOS - 1, \
0x01234567, 0x89ABCDEF, -2, 2, -3, 3, -64, 64, -5, -31415, \
0x0000FFFF, 0xFFFF0000, \
0x000000FF, 0x0000FF00, 0x00FF0000, 0xFF000000, \
0x0000000F, 0x000000F0, 0x00000F00, 0x0000F000, \
0x000F0000, 0x00F00000, 0x0F000000, 0xF0000000}
Customisable Instruction Set
- Aha has its own mini "generic RISC" instruction set defined in machine.h
- You can add or remove instructions to enhance or limit output
- e.g. to make it more like 32-bit ARM I added
- BIC (bitwise clear / AND NOT) and
- RSB (reverse subtract)
- And disabled division as ARMs don't usually have hardware division
Verification
- We saw in the "Trial Values" slide that Aha does an exhaustive search but not an exhaustive test
- It may produce solutions which work only for its trial values
- So how do we make sure the routines really work?
- We could write a test shell
- Exercise the code for 0..UINT_MAX
- But that's not going to work for 64-bit runs, is it?
- No, for that we would have to use an automated theorem prover...
Verification using Z3
- Z3 is an automated theorem prover
- We can use it to test the solutions which Aha! emits
from z3 import *

x = BitVec('x', 32)
y = BitVec('y', 32)
output = BitVec('output', 32)

s = Solver()
s.add(x ^ y == output)                             # reference: x XOR y
s.add(((y & x) * 0xFFFFFFFE) + (y + x) != output)  # candidate, asserted to differ
print(s.check())  # "unsat": no counterexample exists, so the two are equivalent
Links
- My article on Branchless Code Sequences:
- Quick introduction into SAT/SMT solvers and symbolic execution (Dennis Yurichev)