Making Code Fast

Note

  • This is going to be an overview of a variety of different techniques for making code faster
  • The presentation will be C/C++ focused but similar libraries/tools exist in other languages
  • There won't be interactive sections due to the logistics of getting everyone set up with Linux and the libraries

Feel free to ask questions!

Some of the syntax will be ugly and confusing

Compiler Optimizations!

How to apply compiler optimizations?

  • -O1
    • Simple optimizations to reduce code size and execution time
  • -O2
    • Applies nearly all optimizations that don't involve a space-speed tradeoff
  • -O3
    • Turns on further optimizations while staying standards compliant, might increase binary size
  • https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

How does the compiler make code fast?

  • Inlining small functions
  • Loop unrolling
  • Applying some operations at compile time
  • Alignment
  • Dead code elimination*

 

*Dealing with this when writing benchmarks is fun
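
A minimal sketch of why dead code elimination makes benchmarks "fun", assuming GCC or Clang with -O2. The names sum_squares, bench_discarded, bench_kept, and sink are hypothetical and only for illustration.

```c
/* Hypothetical workload: sum of i*i for i in [0, n). */
static long sum_squares(int n) {
    long total = 0;
    for (int i = 0; i < n; ++i) {
        total += (long)i * i;
    }
    return total;
}

/* The result is never used, so an optimizing compiler is allowed to
   remove the whole computation. A timer around this call may end up
   measuring nothing. */
void bench_discarded(int n) {
    (void)sum_squares(n);
}

/* Writing the result to a volatile object is an observable side effect,
   so the value still has to be produced (the compiler may still
   simplify how it gets computed). */
volatile long sink;

void bench_kept(int n) {
    sink = sum_squares(n);
}
```

Comparing the assembly of the two bench functions (for example with gcc -O2 -S) is a quick way to check whether the loop survived.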

A few other things

  • -Ofast
    • Applies some optimizations that are not standards compliant, like -ffast-math
  • There are a variety of other niche optimization settings
  • Enabling optimization can often reveal undefined behavior in your code
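
A small sketch of the last point, using signed integer overflow as the undefined behavior. The function names are hypothetical; what the optimizer actually does depends on the compiler and flags.

```c
#include <limits.h>
#include <stdbool.h>

/* Undefined behavior when x == INT_MAX: x + 1 overflows a signed int.
   An optimizer may assume that never happens and reduce this function
   to "return false", so the check can silently disappear at -O2. */
bool next_overflows_ub(int x) {
    return x + 1 < x;
}

/* Well defined: compare against the limit before doing any arithmetic. */
bool next_overflows_safe(int x) {
    return x == INT_MAX;
}
```

An unoptimized build may appear to "work" with the first version, which is why the bug often only shows up once optimization is turned on.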

Compiler optimization demo

Profiling!

(with gprof)

What do profilers do?

  • Show how long each function call takes
    • This lets you see what calls are happening the most and taking the longest
  • Often offer other features involving things like memory management
  • All sorts of crazy things if you look at the documentation closely
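
To make the first bullet concrete, here is a hand-rolled sketch of the kind of measurement a profiler collects automatically for every function. The names busy_work and cpu_seconds are hypothetical; only clock() and CLOCKS_PER_SEC come from the C standard library.

```c
#include <time.h>

/* Hypothetical workload to measure. The volatile keeps the loop from
   being optimized away. */
static void busy_work(void) {
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000UL; ++i) {
        x = x + i;
    }
}

/* Returns the CPU time, in seconds, spent inside one call to fn(). */
double cpu_seconds(void (*fn)(void)) {
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling cpu_seconds(busy_work) gives one number for one function. A profiler does this bookkeeping for the whole program without you editing the code.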

What are some profilers?

  • gprof
  • perf
  • valgrind (callgrind)
  • gperftools

Drawbacks?

  • Profilers become trickier to use with multithreaded programs
  • Function calls can be hard to read/interpret
  • Compiler optimizations can sometimes make results strange

Demo!

Threads!

(OpenMP)

Quick thread overview

  • What if our program could have multiple sections execute at the same time?
    • Most modern CPUs have multiple cores
    • Each core can run a thread
    • If a program is fully parallel, then 4x cores = 4x speed in theory
  • Most programming languages have threading libraries
    • pthreads in C, Thread in Java
    • We'll cover these next semester probably
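
The "4x cores = 4x speed" estimate above only holds when the whole program runs in parallel. A rough sketch of how the estimate changes when only part of the runtime is parallel, using Amdahl's law (the function name amdahl_speedup is hypothetical):

```c
#include <assert.h>

/* Amdahl's law: theoretical speedup on `cores` cores when a fraction
   `parallel_fraction` (0.0 to 1.0) of the runtime can run in parallel. */
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

With 4 cores, a fully parallel program (fraction 1.0) gets a speedup of 4, but a program that is only half parallel (fraction 0.5) gets about 1.6.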

What is OpenMP

  • An API for writing multithreaded code in C, C++, and Fortran
  • Used by writing compiler directives (#pragma omp ...) in C/C++
  • Simplifies writing multithreaded code

General idea

  • Sections of the code can be defined as parallel
  • A common use case is to make a parallel for loop in one of these sections (shown in demo)
  • Variables can be marked as private to each thread or shared between the threads OpenMP creates
  • Sections can also be marked as critical, meaning only one thread will execute that code at a time
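
The ideas above can be sketched as follows, assuming GCC or Clang with -fopenmp. The function names dot and max_element are hypothetical; the pragmas and clauses (parallel for, reduction, shared, critical) are standard OpenMP.

```c
#include <assert.h>

/* Dot product with an OpenMP parallel for loop.
   Without -fopenmp the pragmas are ignored and the loop runs serially. */
double dot(const double *a, const double *b, int n) {
    assert(n >= 0);
    double sum = 0.0;
    /* a, b, and n are shared by all threads; the loop index i is private
       to each thread. reduction(+:sum) gives every thread its own
       partial sum and adds the partial sums together at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Largest element, using a critical section to guard the shared result.
   For illustration only: taking the critical section on every iteration
   serializes most of the work. */
int max_element(const int *v, int n) {
    assert(n > 0);
    int best = v[0];
    #pragma omp parallel for shared(best)
    for (int i = 1; i < n; ++i) {
        #pragma omp critical
        {
            if (v[i] > best) {
                best = v[i];
            }
        }
    }
    return best;
}
```

The reduction clause is usually the better tool when it fits, since each thread works on its own copy and no locking is needed inside the loop.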

Demo!

SIMD

Single Instruction, Multiple Data

What is SIMD

  • Short version: if your data is in the right format the CPU can do 8 operations at once on it
  • The CPU has special registers and instructions that can operate on multiple data elements at once
  • A single SIMD instruction can cost a bit more than its scalar counterpart, but it does 4x, 8x, or 16x the work depending on the instruction and data type

What is SIMD

  • SSE, AVX, AVX2, and AVX-512 are different generations of SIMD instruction sets
    • SSE is the oldest, supports 128 bit registers
    • AVX supports 256 bit registers for floating point operations
    • AVX2 extends the 256 bit registers to integer operations
    • AVX-512 is the newest of these, supports 512 bit registers
  • The CPU has a fixed number of these special registers, and they can be used from C/C++ through Intel intrinsics
  • https://software.intel.com/sites/landingpage/IntrinsicsGuide/#

Sounds crazy, how do we use it?

  • #include <immintrin.h>
  • Compile with -mavx
    • Requires a processor with AVX support
  • Move chunks of data into AVX variables (such as __m256)
  • Call AVX intrinsics on these variables

Example

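#include <immintrin.h>

// Adds a and b element-wise into res.
// Assumes SIZE is defined elsewhere and is a multiple of 8, and that
// a, b, and res are 32-byte aligned (required by _mm256_load_ps/_mm256_store_ps).
void add_arrays(const float* a, const float* b, float* res){
  for(int i = 0; i < SIZE; i += 8){
    //Load 8 floats from each input into SIMD regs
    __m256 mma = _mm256_load_ps(a+i);
    __m256 mmb = _mm256_load_ps(b+i);

    //Calculate the 8 sums at once
    __m256 mmres = _mm256_add_ps(mma, mmb);

    //Store the result in res for the current 8 values
    _mm256_store_ps(res+i, mmres);
  }
}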
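
The example above needs aligned arrays whose length is a multiple of 8. A hedged variant that drops both requirements is sketched below: it uses the unaligned intrinsics _mm256_loadu_ps and _mm256_storeu_ps, finishes the leftover elements with a scalar loop, and falls back to plain C when the compiler is not targeting AVX. The name add_arrays_any and the size parameter n are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__            /* defined by the compiler when AVX is enabled, e.g. -mavx */
#include <immintrin.h>
#endif

/* Adds a and b element-wise into res for any length n and any alignment. */
void add_arrays_any(const float *a, const float *b, float *res, size_t n) {
    assert(a != NULL && b != NULL && res != NULL);
    size_t i = 0;
#ifdef __AVX__
    /* Main loop: 8 floats per iteration, unaligned loads and stores. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(res + i, _mm256_add_ps(va, vb));
    }
#endif
    /* Scalar tail: the last n % 8 elements, or everything without AVX. */
    for (; i < n; ++i) {
        res[i] = a[i] + b[i];
    }
}
```

Unaligned loads can be slower than aligned ones on some processors, so the aligned version is still worth using when you control how the arrays are allocated.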

Demo!

CUDA will be discussed in the future due to logistical issues

Why does this matter

High performance fields

  • Game engines
  • High frequency trading
  • High performance simulations

(Also threads will be important no matter what)

That's it!

Making Code Fast: Profiles, SIMD, threads, and CUDA!

By theasocialmatzah
