Making Code Fast
Note
- This is an overview of a variety of techniques for making code faster
- The presentation will be C/C++ focused, but similar libraries/tools exist in other languages
- There won't be interactive sections due to the logistical issues of getting everyone set up with Linux and the libraries
Feel free to ask questions!
Some of the syntax will be ugly and confusing
Compiler Optimizations!
How to apply compiler optimizations?
- -O1: simple optimizations to reduce code size and execution time
- -O2: applies even more optimizations, but avoids most optimizations that trade space for speed
- -O3: adds more aggressive optimizations on top of -O2 while staying standards compliant, might increase binary size
- https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
How does the compiler make code fast?
- Inlining small functions
- Loop unrolling
- Applying some operations at compile time
- Alignment
- Dead code elimination*
*Dealing with this when writing benchmarks is fun
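A minimal sketch of two items from the list above: work done at compile time, and dead code elimination getting in the way of a benchmark. The names `square`, `sum_squares`, `benchmark_sink`, and `benchmark_sum_squares` are hypothetical. Writing to a `volatile` variable is one common way to keep the optimizer from throwing away a result nobody reads; it is an illustration, not the only approach.

```cpp
#include <cstdint>

// constexpr lets the compiler evaluate this at compile time when the
// argument is a constant; it is also small enough to be an inlining candidate.
constexpr std::uint64_t square(std::uint64_t x) { return x * x; }

// Checked entirely by the compiler; no code runs for this at runtime.
static_assert(square(12) == 144, "computed at compile time");

// Sum of squares of 0..n-1: the work we would like to time.
std::uint64_t sum_squares(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += square(i);
    }
    return total;
}

// If a benchmark calls sum_squares and ignores the result, an optimizing
// compiler is allowed to delete the call as dead code. Storing the result
// in a volatile object means the result has to be produced.
volatile std::uint64_t benchmark_sink = 0;

void benchmark_sum_squares(std::uint64_t n) {
    benchmark_sink = sum_squares(n);
}
```

In a real benchmark the timing code would wrap the call to `benchmark_sum_squares`, and the input size would ideally come from something the compiler cannot see at compile time.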
A few other things
- -Ofast: enables everything in -O3 plus some non-standards-compliant optimizations like -ffast-math
- There are a variety of other niche optimization settings
- Enabling optimization can often reveal undefined behavior in your code
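A small sketch of how this can happen, using signed integer overflow (undefined behavior in C and C++). The function names are hypothetical. What the broken version does at each optimization level depends on the compiler, so treat the comments as what is permitted rather than what is guaranteed.

```cpp
#include <climits>

// BROKEN: when x == INT_MAX, x + 1 overflows, which is undefined behavior.
// An optimizing compiler may assume that never happens and reduce this
// function to "return false". Unoptimized builds often appear to work,
// which hides the bug until optimization is turned on.
bool next_overflows_broken(int x) {
    return x + 1 < x;
}

// Well-defined alternative: compare against the limit instead of overflowing.
bool next_overflows(int x) {
    return x == INT_MAX;
}
```

Building the same file with -O0 and with -O2 and comparing results is a quick way to spot this kind of problem.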
Compiler optimization demo
Profiling!
(with gprof)
What do profilers do?
- Show how long each function call takes
- This lets you see what calls are happening the most and taking the longest
- Many also offer other features, such as memory usage analysis
- All sorts of crazy things if you look at the documentation closely
What are some profilers?
- gprof
- perf
- valgrind (callgrind)
- gperftools
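A sketch of the kind of code worth pointing a profiler at: two functions that compute the same answer at very different cost. All function names and the file name `workload.cpp` are hypothetical, and the fast version assumes every value is in the range 0 to 255. With gprof, the usual workflow is to build with `-pg`, run the program once (which writes `gmon.out`), and then run `gprof` on the binary and that file.

```cpp
// Assumed workflow, with these functions called from a main() of your own:
//   g++ -pg workload.cpp -o workload
//   ./workload
//   gprof ./workload gmon.out
#include <cstddef>
#include <cstdint>
#include <vector>

// Deliberately slow: O(n^2) count of pairs i < j with v[i] == v[j].
std::uint64_t count_equal_pairs_slow(const std::vector<int>& v) {
    std::uint64_t pairs = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        for (std::size_t j = i + 1; j < v.size(); ++j) {
            if (v[i] == v[j]) {
                ++pairs;
            }
        }
    }
    return pairs;
}

// Faster: O(n) by counting occurrences. Assumes 0 <= x < 256 for every x.
std::uint64_t count_equal_pairs_fast(const std::vector<int>& v) {
    std::uint64_t counts[256] = {};
    for (int x : v) {
        ++counts[x];
    }
    std::uint64_t pairs = 0;
    for (std::uint64_t c : counts) {
        pairs += c * (c - 1) / 2;  // c values form c*(c-1)/2 equal pairs
    }
    return pairs;
}

// Runs both versions on the same input; returns true if they agree.
bool run_workload(std::size_t n) {
    std::vector<int> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = static_cast<int>(i % 256);
    }
    return count_equal_pairs_slow(v) == count_equal_pairs_fast(v);
}
```

In the profile, the slow version should account for nearly all of the time once `n` is large. At higher optimization levels the functions may be inlined, which makes the report harder to read.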
Drawbacks?
- Profilers become trickier to use with multithreaded programs
- Function calls can be hard to read/interpret
- Compiler optimizations can sometimes make results strange
Demo!
Threads!
(OpenMP)
Quick thread overview
- What if our program could have multiple sections execute at the same time?
- Most modern CPUs have multiple cores
- Each core can run a thread
- If a program is fully parallel, then 4x cores = 4x speed in theory
- Most programming languages have threading libraries (pthreads in C, Thread in Java)
- We'll cover these next semester probably
What is OpenMP
- An API, available for C, C++, and Fortran, for writing multithreaded code
- Used by writing compiler directives (#pragma omp ...) in C and C++
- Simplifies writing multithreaded code
General idea
- Sections of the code can be defined as parallel
- A common use case is to make a parallel for loop in one of these sections (shown in demo)
- Variables can be marked as private to each thread or shared between the different threads OpenMP creates
- Sections can also be marked as critical, meaning only one thread will execute that code at a time
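A minimal sketch of these ideas, assuming GCC or Clang with the `-fopenmp` flag. The function names are hypothetical. Without `-fopenmp` the pragmas are ignored and the loops run serially with the same results.

```cpp
#include <vector>

// Parallel for loop with a reduction: each thread adds into its own
// private copy of total, and OpenMP combines the copies at the end.
double sum_parallel(const std::vector<double>& v) {
    double total = 0.0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

// Parallel for loop updating a shared variable. The critical section
// makes sure only one thread increments count at a time.
long count_negative(const std::vector<double>& v) {
    long count = 0;  // shared between all threads
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for shared(count)
    for (long i = 0; i < n; ++i) {
        if (v[i] < 0.0) {
            #pragma omp critical
            {
                ++count;
            }
        }
    }
    return count;
}
```

The reduction form is usually the faster of the two, since threads only synchronize once at the end instead of on every update.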
Demo!
SIMD
Single Instruction, Multiple Data
What is SIMD
- Short version: if your data is in the right format, the CPU can do several operations (e.g. 8) on it at once
- The CPU has special registers and instructions that can operate on multiple data elements at the same time
- A SIMD instruction may cost about the same as, or somewhat more than, a normal instruction, but can do 4x, 8x, or 16x the work depending on the instruction
What is SIMD
- SSE, AVX, AVX2, and AVX-512 are different SIMD instruction sets
- SSE is the oldest of these, supports 128-bit registers
- AVX supports 256-bit registers
- AVX2 extends the 256-bit registers to integer operations
- AVX-512 is the newest, supports 512-bit registers
- The CPU has a fixed number of these special registers and they can be interacted with using Intel intrinsics
- https://software.intel.com/sites/landingpage/IntrinsicsGuide/#
Sounds crazy, how do we use it?
- #include <immintrin.h>
- Compile with -mavx
- Requires a processor that supports AVX
- Move chunks of data into AVX variables (such as __m256)
- Call AVX intrinsic functions on these registers
Example
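#include <immintrin.h>

// SIZE is the array length, defined elsewhere; it must be a multiple of 8.
// a, b, and res must be 32-byte aligned for the aligned load/store calls.
void add_arrays(float* a, float* b, float* res){
    for(int i = 0; i < SIZE; i += 8){
        //Load 8 floats from each input into 256-bit SIMD regs
        __m256 mma = _mm256_load_ps(a+i);
        __m256 mmb = _mm256_load_ps(b+i);
        //Calculate all 8 sums with one instruction
        __m256 mmres = _mm256_add_ps(mma, mmb);
        //Store the result in res for the current 8 values
        _mm256_store_ps(res+i, mmres);
    }
}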
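The same idea can be sketched for arrays of any length and alignment. This variant is an illustration rather than part of the original demo: `add_arrays_sse` is a hypothetical name, and it uses 128-bit SSE intrinsics (4 floats per register) instead of AVX, because SSE is available on every x86-64 CPU and should need no extra compiler flag there. It assumes an x86 target.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...

// Adds a[i] + b[i] into res[i] for any n and any alignment.
void add_arrays_sse(const float* a, const float* b, float* res, std::size_t n) {
    std::size_t i = 0;
    // Main loop: 4 floats per iteration in 128-bit registers.
    // The "u" load/store variants accept unaligned addresses.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(res + i, _mm_add_ps(va, vb));
    }
    // Scalar tail: the 0 to 3 leftover elements that do not fill a register.
    for (; i < n; ++i) {
        res[i] = a[i] + b[i];
    }
}
```

The AVX version would follow the same pattern with `_mm256_loadu_ps` and `_mm256_storeu_ps`, stepping by 8 instead of 4.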
Demo!
CUDA to be discussed in the future due to logistical issues
Why does this matter
High performance fields
- Game engines
- High frequency trading
- High performance simulations
(Also threads will be important no matter what)
That's it!
Making Code Fast: Profiles, SIMD, threads, and CUDA!
By theasocialmatzah