Superoptimisation

 

by David Thomas — September 2016

This talk is about...

  • Using a superoptimiser program to discover optimal, useful, or just plain neat assembly-level code sequences

Fast or Small?

Compilers usually only give us a per-file choice between fast (-O2 / -O3) and small (-Os) code generation

But at different times we may need to express our code in ways which emphasise different properties:

  • as fast as possible
  • as small as possible
  • avoiding multiplication or division
  • avoiding any branching

So in these instances we may have to do some of the compiler's work ourselves

An Example

Let's say we want to avoid division but we still want to be as fast as possible and our divisor is constant

We can hunt around and find something like Jim Blinn's fast divide by 255 formula:

#define DIV_255(x) (((x) + 1 + (((x) + 1) >> 8)) >> 8)

But what if that's not exactly what we need?

Can we find others similar to it?
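Blinn's formula is exact only over a limited input range, and a brute-force comparison against true division makes that range concrete. The following is a minimal sketch in Python; the helper names `div_255` and `first_mismatch` are made up for this illustration.

```python
def div_255(x):
    """Jim Blinn's divide-by-255 formula, using only adds and shifts."""
    return (x + 1 + ((x + 1) >> 8)) >> 8


def first_mismatch(limit):
    """Return the first x in [0, limit) where div_255(x) != x // 255, else None."""
    for x in range(limit):
        if div_255(x) != x // 255:
            return x
    return None


# Every 16-bit input, which covers any product of two 8-bit values.
print(first_mismatch(1 << 16))
# Searching further shows where the formula stops being exact.
print(first_mismatch(1 << 17))
```

If the inputs you care about fall outside the range where the two agree, you need a different sequence, which is where a superoptimiser comes in.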

Superoptimisers

Small sequences like these can be discovered with a superoptimiser

Superoptimisers are not as clever as the name might suggest: typically they perform an exhaustive search through a virtual instruction set

They generate tiny programs by enumerating every possible sequence of instructions, then run each program against trial values until they find one which produces the expected results
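The generate-and-test loop can be sketched in a few dozen lines of Python. This is a toy, not how any of the real tools are implemented: the instruction set in `OPS`, the `CONSTANTS`, the `TRIALS` list and the function names are all assumptions made for this illustration.

```python
from itertools import product

MASK = 0xFFFFFFFF  # emulate 32-bit registers

# A hypothetical mini instruction set; real superoptimisers use larger ones.
OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "shr": lambda a, b: a >> (b & 31),
}
CONSTANTS = [1, 31]
TRIALS = [0, 1, 2, 3, 6, 12, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 0x01234567]


def run(program, x):
    """Execute a program; each instruction may read any earlier value."""
    values = [x] + CONSTANTS
    for op, i, j in program:
        values.append(OPS[op](values[i], values[j]))
    return values[-1]


def programs(length):
    """Yield every program of the given length (the exhaustive part)."""
    base = 1 + len(CONSTANTS)
    slots = [
        [(op, i, j) for op in OPS for i in range(base + n) for j in range(base + n)]
        for n in range(length)
    ]
    return product(*slots)


def search(target, max_length):
    """Return the first shortest program that matches target on every trial."""
    for length in range(1, max_length + 1):
        for program in programs(length):
            if all(run(program, x) == (target(x) & MASK) for x in TRIALS):
                return program
    return None


# Look for a sequence that clears the lowest set bit of x.
found = search(lambda x: x & (x - 1), max_length=2)
print(found)
```

Each instruction is an `(opcode, operand index, operand index)` triple. A program that passes is only known to agree with the target on `TRIALS`, so it still needs separate verification.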

Limitations

  • Sequences have to be quite short
    • Practical limit of about 4 ops
      • (unless you want to wait for days or weeks)
  • Multi-axis search space:
    • A 32-instruction set gives 32^4 ≈ 1M four-instruction programs to try, before operand choices are counted
  • Each program needs many trial constants pumped through it
  • ⇒ Long runtimes
  • Have to be careful to limit each axis
  • Can't branch or use the CPU flags
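The 32^4 figure and its growth with sequence length are easy to reproduce. A rough sketch, assuming every program is run against every trial value (a real tool can stop at the first failing trial) and counting opcode choices only; the function names are made up for this illustration.

```python
def program_count(instructions, length):
    """Number of opcode sequences: one instruction choice per slot."""
    return instructions ** length


def evaluations(instructions, length, trials):
    """Upper bound on program runs if every program sees every trial value."""
    return program_count(instructions, length) * trials


# How the search space grows with sequence length for a 32-instruction set.
for n in range(1, 7):
    print(n, program_count(32, n))

# Four-instruction programs against 31 trial values.
print(evaluations(32, 4, 31))
```

Every extra instruction multiplies the work by another factor of 32, which is why the practical limit arrives so quickly.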

GNU superopt (1991)

  • "a function sequence generator that uses exhaustive generate-and-test approach to find the shortest instruction sequence for a given function."
  • "You must tell the superoptimizer which function and which CPU you want to get code for."
  • "It cannot generate very long sequences unless you have a very fast computer."

 

  • Targets a number of real CPUs
  • Gained AVR support a couple of years ago: https://github.com/embecosm/gnu-superopt

Aha! (2002)

  • "A Hacker's Assistant"

  • by Henry Warren

  • This is the one we'll look at

  • Targets a "generic RISC" instruction set

STOKE (2013+)

  • Stochastic optimiser

  • Random search

  • x86-64 only

  • Does full verification

  • http://stoke.stanford.edu/

Souper (2014+)

  • Superoptimiser for LLVM IR

  • Uses SMT solver

  • Can cache results using Redis

  • Non-official Google project

  • https://github.com/google/souper

Let's Look at Aha

Aha Input

/* artificial.frag.c */

#include "aha.h"

int userfun(int x)
{
  if (x == 0)      return 1;
  else if (x == 1) return 2;
  else             return 0;
}

Note:

No branches

No state

No side-effects

Build Aha

$ make EXAMPLE=artificial aha
gcc -c -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o aha.o aha.c
gcc -c -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o simulator.o simulator.c
gcc -O3 -Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -MMD -I. -DINC=\"artificial.frag.c\" -DOFILE=\"artificial.out\" -o aha aha.o simulator.o
        
$ ./aha 3
Searching for programs with 3 operations.

Found 0 solutions.
Counters = 372751, 382255, 952561, total = 1707567
Process time = 0.029 secs

$ ./aha 4
Searching for programs with 4 operations.

Found a 4-operation program:
   add   r1,rx,-2
   bic   r2,r1,rx
   shr   r3,r2,31
   shl   r4,r3,rx
   Expr: ((((x + -2) & ~x) >>u 31) << x)
(... omitted ...)

Found 4 solutions.
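The printed expression can be checked against `userfun` by transcribing it into Python with explicit 32-bit masking. This is a sketch, not part of Aha!: the handling of large shift amounts is an assumption, although it cannot affect the result here because the value being shifted is non-zero only when x is 0 or 1.

```python
MASK = 0xFFFFFFFF  # 32-bit registers


def userfun(x):
    """Reference behaviour from artificial.frag.c."""
    if x == 0:
        return 1
    if x == 1:
        return 2
    return 0


def aha_solution(x):
    """((((x + -2) & ~x) >>u 31) << x), with x as a 32-bit unsigned value."""
    r1 = (x - 2) & MASK      # add r1,rx,-2
    r2 = r1 & ~x & MASK      # bic r2,r1,rx
    r3 = r2 >> 31            # shr r3,r2,31 (logical shift)
    # Assumption: shift amounts are reduced mod 64 and amounts >= 32 give 0.
    s = x & 63
    r4 = 0 if s >= 32 else (r3 << s) & MASK  # shl r4,r3,rx
    return r4


# Spot-check low values plus a few patterns with the top bit set.
samples = list(range(1 << 16)) + [0x7FFFFFFF, 0x80000000, 0xFFFFFFFE, 0xFFFFFFFF]
print(all(aha_solution(x) == userfun(x) for x in samples))
```

The trick is that `(x - 2) & ~x` has its top bit set only for x = 0 and x = 1, so the logical shift right by 31 yields a 1 that is then shifted left by x itself.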

Trial Values

  • Trial values here are values we test our synthesised routine against
  • You can add more but it will slow Aha down
  • Being able to control the trial values is useful:
    • Say you have a routine for which x is only ever 0..255
    • Solutions may exist for that range which won't work in the fully general case
#define TRIAL {1, 0, -1, \
               MAXNEG, MAXPOS, MAXNEG + 1, MAXPOS - 1, \
               0x01234567, 0x89ABCDEF, -2, 2, -3, 3, -64, 64, -5, -31415, \
               0x0000FFFF, 0xFFFF0000, \
               0x000000FF, 0x0000FF00, 0x00FF0000, 0xFF000000, \
               0x0000000F, 0x000000F0, 0x00000F00, 0x0000F000, \
               0x000F0000, 0x00F00000, 0x0F000000, 0xF0000000}
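To see why the choice of trial values matters, consider a candidate that is correct for byte-sized inputs only. The sketch below uses a multiply-and-shift division by 3 as the candidate; the candidate, the helper names and the shortened trial list (a subset of the values above, written as unsigned 32-bit numbers) are all choices made for this illustration.

```python
MASK = 0xFFFFFFFF  # 32-bit registers


def reference(x):
    """What we actually want: unsigned division by 3."""
    return x // 3


def candidate(x):
    """Multiply-and-shift division by 3; 171/512 is slightly more than 1/3."""
    return ((x * 171) & MASK) >> 9


def passes(trials):
    """True if the candidate agrees with the reference on every trial value."""
    return all(candidate(x) == reference(x) for x in trials)


BYTE_TRIALS = range(256)
# A subset of the general-purpose trial values, as unsigned 32-bit numbers.
GENERAL_TRIALS = [1, 0, 0xFFFFFFFF, 0x80000000, 0x7FFFFFFF,
                  0x01234567, 0x89ABCDEF, 2, 3, 64, 0x0000FFFF, 0x000000FF]

print(passes(BYTE_TRIALS))
print(passes(GENERAL_TRIALS))
```

With the byte-range trials the candidate is accepted; with the general trials it is rejected, so a search using the general list would never report it.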

Customisable Instruction Set

  • Aha has its own mini "generic RISC" instruction set defined in machine.h
  • You can add or remove instructions to enhance or limit output
  • e.g. to make it more like 32-bit ARM I added
    • BIC (bitwise clear / AND NOT) and
    • RSB (reverse subtract)
  • And disabled division as ARMs don't usually have hardware division

Verification

  • We saw in the "Trial Values" slide that Aha does an exhaustive search but not an exhaustive test
  • It may produce solutions which work only for its trial values
  • So how do we make sure the routines really work?
    • We could write a test shell
    • Exercise the code for 0..UINT_MAX
  • But that's not going to work for 64-bit runs, is it?
    • No, for that we would have to use an automated theorem solver...
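A test shell of this kind is just a loop that compares the candidate with a reference over the whole input domain. The sketch below runs at a reduced width of 8 bits so that a two-input check finishes quickly in Python; the function names are made up, and passing at 8 bits does not by itself prove the 32-bit case.

```python
def first_counterexample(candidate, reference, bits):
    """Try every (x, y) pair of the given width; return the first mismatch."""
    mask = (1 << bits) - 1
    for x in range(1 << bits):
        for y in range(1 << bits):
            if candidate(x, y, mask) != reference(x, y, mask):
                return (x, y)
    return None


def xor_reference(x, y, mask):
    return x ^ y


def xor_candidate(x, y, mask):
    """(x & y) * -2 + (x + y), with -2 written modulo 2**bits."""
    minus_two = mask - 1
    return ((x & y) * minus_two + (x + y)) & mask


def or_candidate(x, y, mask):
    """A deliberately wrong candidate, to show a failure being reported."""
    return x | y


print(first_counterexample(xor_candidate, xor_reference, 8))
print(first_counterexample(or_candidate, xor_reference, 8))
```

The loop count grows as 2^(bits × inputs), which is why exhaustive testing stops being an option for 64-bit values or several inputs.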

Verification using Z3

  • Z3 is an automated theorem prover
  • We can use it to test the solutions which Aha! emits
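from z3 import *

# 32-bit symbolic inputs and the reference result
x = BitVec('x', 32)
y = BitVec('y', 32)
output = BitVec('output', 32)

s = Solver()
s.add(x ^ y == output)  # reference: XOR
# Ask for inputs where the candidate differs from the reference
s.add(((y & x) * 0xFFFFFFFE) + (y + x) != output)
print(s.check())  # "unsat" means no counterexample exists, so the two agree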

Links
