Autotuning in Futhark
Art by Robert Schenck
Philip Munksgaard, Svend Lund Breddam, Troels Henriksen, Fabian Cristian Gieseke & Cosmin Oancea
* Subject to some assumptions
let mapscan [m][n] (xss: [m][n]i32) : [m][n]i32 =
  map2 (\(row: [n]i32) (i: i32) ->
          loop (row: [n]i32) for _ in 0..<64 do
            let row' = map (+ i) row
            in scan (+) 0 row')
       xss (0..<m)
Two levels of parallelism
Ways to parallelize
Tall matrix?
\(m\) threads, sequential inner code
Wide matrix?
\(n\) threads, sequential outer code
So which version to use?
Inner or outer?
We don't know at compile-time!
Instead, let's generate multiple code versions at compile-time, and choose between them at run-time
Henriksen, Troels, et al. "Incremental flattening for nested data parallelism." Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. 2019.
Incremental flattening!

[Tuning tree diagram: a chain of true/false threshold comparisons selects a code version. At the root, \(p_1 < t_1\): the false branch runs Version 1, which parallelizes the outer map; the true branch descends to \(p_2 < t_2\), whose false branch runs Version 2 and whose true branch runs Version 3, which parallelizes the inner code.]
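The tuning tree can be sketched in ordinary code. Below is a hypothetical Python sketch, not Futhark's actual generated code; the names `p1`, `p2`, `t1`, `t2` and versions `v1`–`v3` follow the slides:

```python
def select_version(p1, p2, t1, t2):
    """Walk the tuning tree. A comparison p < t means "not enough
    parallelism at this level", so the true branch descends to a
    more flattened version; the false branch runs this version."""
    if p1 < t1:           # outer parallelism insufficient?
        if p2 < t2:       # inner parallelism insufficient too?
            return "v3"   # bottom-most, most flattened version
        return "v2"
    return "v1"           # enough outer parallelism: parallelize outer map

INF = float("inf")
# Setting all thresholds to infinity makes every comparison true,
# so the bottom-most version always runs:
print(select_version(1000, 1000, INF, INF))  # -> v3
```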
But how do we determine the set of thresholds \(t\) that gives us the best performance, for all datasets?
\(p_1\) can be size-variant and \(p_2\) can be size-invariant on a given dataset; the analysis is done per-threshold
Example: if \(p_1\) is the height of the matrix (\(m\)), its size is invariant for a fixed input
What if there's a loop? If the loop does not change the size of its input, \(p_1\) is still invariant
For each dataset, a single code version is best
Therefore, to tune a program on a single dataset, run each code version once and pick thresholds such that the fastest version is executed
To tune a program on multiple datasets, tune individually and combine thresholds, somehow
How do we make sure each version is run exactly once?
The program and dataset are given as-is; we only control the thresholds
Bottom-up traversal of the tuning tree
Setting all thresholds to \(\infty\) forces the bottom-most version to run
This allows us to record the run-time of \(v_3\)
With a bit of compiler instrumentation, we can also record what \(p_1\) and \(p_2\) were when they were compared against the thresholds
This allows us to pick \(v_2\) next
Setting \(t_2 = 100\) (or any value below that) will force execution of \(v_2\)
Now, we can set the threshold \(t_2\) optimally for the bottom-most branch
Any value for \(t_2\) larger than 100 will select \(v_3\). Any value lower than 100 will select \(v_2\)
Thus, the optimal choice is a range:
If \(v_2\) is preferable, \(0 \leq t_2 \leq 100\); if \(v_3\) is preferable, \(100 < t_2 \leq \infty\)
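Deriving that range from one pair of measurements can be sketched as follows; this is a hypothetical helper of my own, not part of the Futhark tuner:

```python
def optimal_range(p_observed, time_current, time_deeper, inf=float("inf")):
    """Given the parameter value p_observed seen at the comparison,
    the run-time of this version (chosen when p >= t) and of the
    more flattened version (chosen when p < t), return the range of
    thresholds t that selects the faster of the two."""
    if time_current < time_deeper:
        return (0, p_observed)    # 0 <= t <= p_observed picks this version
    return (p_observed, inf)      # t > p_observed picks the deeper version

# p2 was 100 when compared; if v2 (current) beat v3 (deeper):
print(optimal_range(100, time_current=3.0, time_deeper=5.0))  # -> (0, 100)
```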
To continue tuning, collapse bottom nodes into one and repeat
We already know the best run-time for \(v_2'\), so we can jump straight to running \(v_1\)
For each dataset, we have found an optimal range for each threshold
We need to combine the tuning results from each dataset
Example:
Dataset 1: \(0 \leq t_2 \leq 100\) is optimal
Dataset 2: \(50 < t_2 \leq \infty\) is optimal
Intersecting those ranges, \(50 < t_2 \leq 100\) is optimal for both datasets!
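Combining per-dataset results by intersecting ranges can be sketched as below; this is a minimal sketch of my own, with boundary inclusivity elided for simplicity:

```python
def intersect(r1, r2):
    """Intersect two threshold ranges given as (lo, hi) pairs;
    returns None if the intersection is empty."""
    lo = max(r1[0], r2[0])
    hi = min(r1[1], r2[1])
    return (lo, hi) if lo < hi else None

INF = float("inf")
# Dataset 1 prefers t2 in (0, 100); dataset 2 prefers t2 in (50, INF):
print(intersect((0, 100), (50, INF)))  # -> (50, 100), optimal for both
```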
But is there always a valid intersection?
Example:
Dataset 1: \(0 \leq t_2 \leq 100\)
Dataset 2: \(500 < t_2 \leq \infty\)
\(p_1\) represents parallelism of \(v_1\)
If \(v_1\) is faster than \(v_2\) for a given value of \(p_1\), it should also be faster for larger values
\(p_1\) could represent something else, but we assume that the same property holds
Besides, if no range intersection exists, then no single choice of \(t_2\) can select the best code version for all datasets anyway
This method allows us to optimally tune size-invariant programs using exactly \(n \times d\) runs
\(n\): number of code versions
\(d\): number of datasets
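The whole size-invariant procedure can be sketched as a driver loop. Here `run_program` is a hypothetical stand-in for executing the compiled Futhark program with the thresholds set to force one code version:

```python
def tune(datasets, versions, run_program):
    """Run each code version exactly once per dataset (bottom-up)
    and record its run-time: exactly n * d runs in total."""
    times = {}
    for d in datasets:
        for v in versions:          # bottom-up: v3, then v2, then v1
            times[(d, v)] = run_program(d, v)
    assert len(times) == len(versions) * len(datasets)
    return times
```

From the recorded times and observed parameter values, the optimal threshold ranges follow as described above.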
But what if the loop changes the size of its input between iterations? Then \(p_1\) takes a new value each time it is compared against the threshold: this program is size-variant!
When the program is size-invariant, there is always a single best version of the code for each dataset: e.g., always prefer \(v_1\) for this dataset
When the program is size-variant, there is not always a single best version of the code for each dataset: e.g., when \(p_1\) is \(5\) or \(50\), prefer \(v_1\); otherwise prefer \(v_2\)
It's no longer enough to run each code version once for each dataset
Each dataset does not necessarily have a single best code version
Each dataset does not necessarily have a single best code version: one version can be best for \(p_2 = 50, 100\) while another is best for \(p_2 = 5\)
The result is still a range, and combining results across datasets works as before. But how do we efficiently find the best range for each dataset?
Only values in \(\{0, 5, 50, 100, \infty\}\) are relevant to test
But there could be many distinct values: what about a loop that iterates a million times?
Binary search!
Measure the run-time at candidate threshold values and follow the gradient. Assuming such a gradient exists!
If a gradient exists, we find the optimal tuning range for a single dataset and threshold in \(O(\log p)\) runs
\(p\): number of distinct parameter values for the given threshold
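Assuming run-time is unimodal in the threshold (the "gradient exists" assumption above), the search can be sketched as a binary search over the sorted distinct parameter values. `measure` is a hypothetical stand-in for timing one run with a given threshold:

```python
def tune_threshold(candidates, measure):
    """candidates: sorted distinct parameter values; measure(t) gives
    the run-time with threshold t. Assuming measure is unimodal over
    the candidates, narrow down to the optimum in O(log p) steps."""
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if measure(candidates[mid]) <= measure(candidates[mid + 1]):
            hi = mid            # optimum is at mid or to its left
        else:
            lo = mid + 1        # optimum is strictly to the right
    return candidates[lo]

print(tune_threshold([0, 5, 50, 100], lambda t: abs(t - 50)))  # -> 50
```

A real tuner would cache measurements rather than re-running the program for both neighbours at every step.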
Compared to the previous tuning tool, which was based on OpenTuner:
Only benchmarks with differing performance are shown
Our tuner is reliable even for programs with a small number of thresholds
Our technique combines multi-versioned compilation with a one-time autotuning process, producing a single executable that selects the most efficient combination of code versions for any dataset.