by David Thomas — September 2016
This talk is about...
- Using a superoptimiser program to discover optimal, useful, or just plain neat assembly-level code sequences
Fast or Small?
Compilers usually only give us a per-file choice between fast (-O2 / -O3) and small (-Os) code generation
But at different times we may need to express our code in ways which emphasise different properties:
- as fast as possible
- as small as possible
- avoiding multiplication or division
- avoiding any branching
So in these instances we may have to do some of the compiler's work ourselves
Let's say we want to avoid division, but we still want to be as fast as possible, and our divisor is constant
We can hunt around and find something like Jim Blinn's fast divide by 255 formula:
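The form of this trick usually attributed to Blinn rounds x / 255 to the nearest integer, where x is the product of two 8-bit values. A minimal sketch follows; the function names are made up for illustration, and the reference uses exact integer arithmetic so the two can be compared directly:

```python
def div255_blinn(x):
    """Rounded x / 255 using only adds and shifts (valid for 0 <= x <= 255 * 255)."""
    t = x + 128
    return (t + (t >> 8)) >> 8


def div255_reference(x):
    """round(x / 255) computed exactly: floor((2x + 255) / 510)."""
    return (2 * x + 255) // 510


# Compare the two over every product of two 8-bit values.
mismatches = [x for x in range(255 * 255 + 1)
              if div255_blinn(x) != div255_reference(x)]
print(len(mismatches))  # → 0
```

Note the restricted input range: the shift-and-add version is only claimed to be exact for 0..65025, which is all that 8-bit blending needs.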
But what if that's not exactly what we need?
Can we find others similar to it?
- Sequences have to be quite short
- Practical limit of 4 ops max
- (unless you want to wait for days or weeks)
- Multi-axis search space:
- A 32-instruction set gives 32^4 ≈ 1M four-instruction programs to try out
- Each program needs many trial constants pumped through it
- ⇒ Long runtimes
- Have to be careful to limit each axis
- Can't branch or use the CPU flags
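To get a feel for how quickly the search space grows, here is a rough count. It deliberately ignores operand choices (registers, immediates), so real numbers are larger; the function name is just for illustration:

```python
def program_count(instruction_set_size, length):
    """Number of opcode sequences of the given length (operands ignored)."""
    return instruction_set_size ** length


for length in range(1, 7):
    print(length, program_count(32, length))

# Four instructions from a 32-instruction set:
print(program_count(32, 4))  # → 1048576
```

Each extra instruction multiplies the work by 32, and every candidate still has to be run against all the trial values.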
GNU superopt (1991)
- "a function sequence generator that uses exhaustive generate-and-test approach to find the shortest instruction sequence for a given function."
- "You must tell the superoptimizer which function and which CPU you want to get code for."
- "It cannot generate very long sequences unless you have a very fast computer."
- Targets a number of real CPUs
- Gained AVR support a couple of years ago: https://github.com/embecosm/gnu-superopt
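The generate-and-test idea can be sketched in a few lines. This is a toy under stated assumptions, not how GNU superopt is implemented: the five-operation instruction set, the immediates and all the names are invented for illustration.

```python
from itertools import product

MASK = 0xFFFFFFFF  # model a 32-bit machine word

# Hypothetical two-operand instruction set.
OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}
IMMEDIATES = (0, 1, MASK)


def run(program, x):
    """Execute a program; operands index into [x, immediates..., earlier results...]."""
    values = [x, *IMMEDIATES]
    for op, i, j in program:
        values.append(OPS[op](values[i], values[j]))
    return values[-1]


def candidates(length):
    """Yield every program of exactly `length` instructions."""
    base = 1 + len(IMMEDIATES)
    steps = [
        [(op, i, j) for op in OPS for i in range(base + k) for j in range(base + k)]
        for k in range(length)
    ]
    return product(*steps)


def search(target, trials, max_len=2):
    """Return the first shortest program that matches target on every trial value."""
    for length in range(1, max_len + 1):
        for program in candidates(length):
            if all(run(program, x) == target(x) for x in trials):
                return program
    return None


def clear_lowest_set_bit(x):
    return (x & (x - 1)) & MASK


TRIALS = [0, 1, 2, 3, 6, 8, 12, 0x80000000, MASK]
found = search(clear_lowest_set_bit, TRIALS)
print(found)
```

A match only means the program agrees with the target on the trial values, so anything it prints still needs checking separately.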
"A Hacker's Assistant"
by Henry Warren
This is the one we'll look at
Targets a "generic RISC" instruction set
Does full verification
Superoptimiser for LLVM IR
Uses SMT solver
Can cache results using Redis
Non-official Google project
Let's Look at Aha
- Trial values here are values we test our synthesised routine against
- You can add more but it will slow Aha down
- Being able to control the trial values is useful:
- Say you have a routine for which x is only ever in 0..255
- Solutions may exist for that range which won't work in the fully general case
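As a small illustration of a range-restricted solution (an invented example, not one taken from Aha's output): testing whether x equals 255 can be done with an add and a shift, provided x never leaves 0..255.

```python
def is_255_narrow(x):
    """Returns 1 if x == 255, else 0. Only valid for 0 <= x <= 255."""
    return (x + 1) >> 8


def is_255_reference(x):
    return 1 if x == 255 else 0


# Agrees with the reference everywhere in the restricted range...
narrow_ok = all(is_255_narrow(x) == is_255_reference(x) for x in range(256))

# ...but breaks as soon as the input is allowed to grow.
counterexample = next(
    x for x in range(256, 1 << 16)
    if is_255_narrow(x) != is_255_reference(x)
)
print(narrow_ok, counterexample)  # → True 256
```

Restricting the trial values to the range you care about lets the search find shortcuts like this one.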
Customisable Instruction Set
- Aha has its own mini "generic RISC" instruction set defined in machine.h
- You can add or remove instructions to enhance or limit output
- e.g. to make it more like 32-bit ARM I added
- BIC (bitwise clear / AND NOT) and
- RSB (reverse subtract)
- And disabled division as ARMs don't usually have hardware division
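For reference, the two added operations behave as follows on 32-bit words. This is a Python model of the ARM semantics only; it is not the format Aha's machine.h uses, and the function names are illustrative:

```python
MASK = 0xFFFFFFFF  # 32-bit word


def bic(a, b):
    """Bit clear: a AND NOT b."""
    return a & ~b & MASK


def rsb(a, b):
    """Reverse subtract: b - a, wrapping at 32 bits (ARM: RSB Rd, Rn, Op2 gives Op2 - Rn)."""
    return (b - a) & MASK


print(hex(bic(0xFF, 0x0F)))  # → 0xf0
print(hex(rsb(1, 0)))        # → 0xffffffff
```

RSB with a zero second operand is a negate, which is one reason it tends to show up in short branchless sequences.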
- We saw in the "Trial Values" slide that Aha does an exhaustive search but not an exhaustive test
- It may produce solutions which work only for its trial values
- So how do we make sure the routines really work?
- We could write a test shell
- Exercise the code for 0..UINT_MAX
- But that's not going to work for 64-bit runs, is it?
- No, for that we would have to use an automated theorem prover...
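The test-shell idea can be sketched like this. It runs over a 16-bit range to keep it quick in Python, and the candidate routine (rounding up to a multiple of 8) plus the assumed rate of 10^9 tests per second are illustrative choices, not figures from the talk:

```python
MASK16 = 0xFFFF


def first_counterexample(candidate, reference, bits):
    """Return the first input on which the two routines disagree, else None."""
    for x in range(1 << bits):
        if candidate(x) != reference(x):
            return x
    return None


def years_to_exhaust(bits, tests_per_second=1e9):
    """Rough time to try every input, at an assumed throughput."""
    seconds = (1 << bits) / tests_per_second
    return seconds / (365.25 * 24 * 3600)


def round_up_8(x):
    return (x + 7) & ~7 & MASK16


def round_up_8_reference(x):
    return ((x + 7) // 8 * 8) & MASK16


print(first_counterexample(round_up_8, round_up_8_reference, 16))  # → None
print(years_to_exhaust(32), years_to_exhaust(64))
```

At that assumed rate a 32-bit sweep takes seconds, while a 64-bit sweep would take several hundred years.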
Verification using Z3
- Z3 is an automated theorem prover
- We can use it to test the solutions which Aha emits
- My article on Branchless Code Sequences:
- Quick introduction into SAT/SMT solvers and symbolic execution (Dennis Yurichev)