Your Performance Todo List

The most important optimisation opportunities and pitfalls to remember about

by

Jan Bielak

Jan Bielak

Warsaw Staszic High School, Poland

Self-taught C++ Developer

Realtime rendering

Game development

janbielak.com
github.com/janekb04
youtube.com/@janbielak

Practically Correct, Just-in-Time Shell Script Parallelization

Konstantinos Kallas, Tammam Mustafa, Jan Bielak, Dimitris Karnikis, Thurston H.Y. Dang, Michael Greenberg, Nikos Vasilakis. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI22)

Performance

1. No unnecessary work

2. Use all computing power

3. Avoid waits and stalls

4. Use hardware efficiently

No unnecessary copying

No unnecessary allocations

Use all cores

Use SIMD

Lockless data structures

Asynchronous APIs

Job Systems

Cache friendliness

Well predictable code

5. OS-level efficiency

Effective use of C++

Build pipeline modification

Manual hardware oriented optimisations

Build pipeline modification

1. Enable compiler optimisations

Optimize for speed

GCC, LLVM, ICC -O2 or -O3
MSVC /Ox or /O2

Optimize for size

GCC, LLVM, ICC -Os
MSVC /O1

Optimization #1

Longer compile time

Build pipeline modification

2. Set target architecture

For x86

GCC, LLVM, ICC -march=native -mtune=native
automatic detection of current processor's features

MSVC /arch:IA32
or /arch:SSE
or /arch:SSE2
or /arch:AVX
or /arch:AVX2
or /arch:AVX512
needs to be specified manually

For ARM

GCC, LLVM -mcpu=native
automatic detection of current processor's features

MSVC /arch:ARMv7VE
or /arch:VFPv4
or /arch:armv8.0
...
or /arch:armv8.8
needs to be specified manually

Build pipeline modification

3. Use fast math

GCC, LLVM -ffast-math (included in -Ofast)
MSVC /fp:fast
ICC -fp-model=fast

Faster computation

Less precise results

Non standard-compliant

Build pipeline modification

4. Disable exceptions and RTTI

No exceptions

GCC, LLVM, ICC -fno-exceptions
MSVC /EHs-c- /D_HAS_EXCEPTIONS=0

No RTTI

GCC, LLVM, ICC -fno-rtti
MSVC /GR-

Limited performance gains

Non standard-compliant

Breaks code using exceptions

Build pipeline modification

5. Enable Link Time Optimization

[Diagram: three Compiler stages feeding one Linker]

GCC, LLVM -flto
MSVC /GL
ICC -ipo

Build pipeline modification

6. Use Unity Builds

[Diagram: Compiler and Linker stages, with and without a Unity Build]

CMake -DCMAKE_UNITY_BUILD=ON

Build pipeline modification

7. Link statically

Static Linking

Better optimisable

Dynamic Linking

More space efficient

Can be updated independently of executable

Build pipeline modification

8. Use Profile Guided Optimisation

Build pipeline (instrumented), then Execute, then Build pipeline (using the profile)

Instrumented build

GCC, LLVM -fprofile-generate
MSVC /GENPROFILE
ICC -prof-gen

Build using the collected profile

GCC, LLVM -fprofile-use
MSVC /USEPROFILE
ICC -prof-use

Build pipeline modification

9. Try different compilers

Build pipeline modification

10. Try different standard libraries

Build pipeline modification

11. Keep your tools updated

Build pipeline modification

12. Preload with a replacement lib

Linux, BSD

env LD_PRELOAD=/usr/lib/libSUPERmalloc.so ./myprogram

macOS

env DYLD_INSERT_LIBRARIES=/usr/lib/libSUPERmalloc.dylib ./myprogram

Windows

Requires DLL injection

Build pipeline modification

13. Use binary post processing tools

LLVM BOLT

perf record
perf2bolt
llvm-bolt

Effective use of C++

Build pipeline modification

Manual hardware oriented optimisations

Your Performance Todo List

  1. Enable compiler optimisations
  2. Set target architecture
  3. Use fast math
  4. Disable exceptions and RTTI
  5. Enable Link Time Optimisation
  6. Use Unity Builds
  7. Link statically
  8. Use Profile Guided Optimisation
  9. Try different compilers
  10. Try different standard library implementations
  11. Keep your tools updated
  12. Preload your program with a replacement library
  13. Use binary post processing tools

Annotate  your code

14. Constexpr all the things

Effective use of C++

Constant expressions:

Literals:

1, 3.0f, nullptr, "Hello"

Arithmetic:

2 + 3, 4.0 / 3.0

Sizes and alignments:

sizeof(int), alignof(std::vector<int>)

...

14. Constexpr all the things

Effective use of C++

Constexpr functions:

invocation MAY be a constant expression

constexpr int f(int x) { return 3 * x + 5; }

f(5)              // is a constant expression
int x;
std::cin >> x;
f(x);             // is NOT a constant expression

Is a given invocation evaluated at compile time?

(inside function body)

if (std::is_constant_evaluated()) { ... }
if consteval { ... }

Immediate functions:

invocation MUST be a constant expression

consteval int f(int x) { return 3 * x + 5; }

f(5)              // is a constant expression
int x;
std::cin >> x;
f(x);             // is a COMPILE ERROR

If constexpr:

if constexpr (compile_time_condition) {...}

if constexpr (std::is_constant_evaluated()) {...}   // ALWAYS TRUE
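
As a minimal sketch of the distinction above: the same constexpr function can feed a static_assert when its argument is a constant expression, and behaves like an ordinary function otherwise. The helper names width_in_bytes and f_at_runtime are hypothetical and exist only for this illustration; the sketch stays within C++17, so it leaves out consteval.

```cpp
#include <cassert>
#include <type_traits>

// Same shape as the slide's f: usable at compile time and at run time.
constexpr int f(int x) { return 3 * x + 5; }

// Hypothetical helper: `if constexpr` picks a branch from a compile-time
// condition, here a property of the argument type.
template <typename T>
constexpr int width_in_bytes(T) {
    if constexpr (std::is_integral_v<T>) {
        return static_cast<int>(sizeof(T));
    } else {
        return 0;  // chosen for non-integral types
    }
}

// A constant-expression argument lets the result feed a static_assert...
static_assert(f(5) == 20);
static_assert(width_in_bytes('a') == 1);

// ...while a run-time argument makes the very same function an ordinary call.
inline int f_at_runtime(int user_value) { return f(user_value); }
```

Calling f_at_runtime with a value read from input compiles fine, whereas using that result in a static_assert would not, because the invocation is no longer a constant expression.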

14. Constexpr all the things

Effective use of C++

Constexpr variables:

constexpr std::array<int, 5> primes{ 2, 3, 5, 7, 11 };

variable must be initialised at its declaration (by a constant expression)

constexpr int x;
x = 3;                // is a COMPILE ERROR

implies const

primes[0] = 1;        // is a COMPILE ERROR

accessing it is a constant expression

constexpr int f(int x) { return x + 1; }

int main()
{
    int x1 = 3;
    constexpr int y1 = f(x1);   // is a COMPILE ERROR

    constexpr int x2 = 3;
    constexpr int y2 = f(x2);
}

Constinit variables:

constinit std::array<int, 5> primes{ 2, 3, 5, 7, 11 };

variable must be initialised at its declaration by a constant expression

14. Constexpr all the things

Effective use of C++

Constexpr variables:

constexpr std::array<int, 5> primes{ 2, 3, 5, 7, 11 };

Constinit variables:

constinit std::array<int, 5> primes{ 2, 3, 5, 7, 11 };

Constexpr functions:

constexpr int f(int x) { return 3 * x + 5; }

Is a given invocation evaluated at compile time?

if (std::is_constant_evaluated()) { ... }
if consteval { ... }

Immediate functions:

consteval int f(int x) { return 3 * x + 5; }

If constexpr:

if constexpr (compile_time_condition) {...}

15. Make variables const

Effective use of C++

std::vector<float>
get_mean_deltas(std::vector<float> data)
{
    float sum = 0;
    for (auto&& num : data) 
    	sum += num;
        
    for (auto& num : data)
        num -= sum / data.size();
    return data;
}

Declare variables const

15. Make variables const

Effective use of C++

std::vector<float>
get_mean_deltas(std::vector<float> data)
{
    const float sum = std::accumulate(
        data.begin(),
        data.end(),
        0.0f
    );
        
    for (auto& num : data)
        num -= sum / data.size();
    return data;
}
std::vector<float>
get_mean_deltas(std::vector<float> data)
{
    const float sum = std::accumulate(
        data.begin(),
        data.end(),
        0.0f
    );
        
    const float __mean = sum / data.size();
    for (auto& num : data)
        num -= __mean;
    return data;    
}

sum is const...

...so this expression is loop-invariant and can be hoisted

and no expensive division in loop!

~ compiler's thought process (paraphrased)

Declare variables const

15. Make variables const

Effective use of C++

template <typename T>
class vector {
    T* begin;
    T* end;
    T* capacity;
    
    /* ... */
    
public:
    constexpr size_t size() const noexcept {
    	return end - begin;
    }
};
template <typename T>
class vector {
    T* begin;
    T* end;
    T* capacity;
    
    /* ... */
    
public:
    constexpr size_t size(this const vector& self) noexcept {
    	return self.end - self.begin;
    }
};

Declare member functions const

15. Make variables const

Effective use of C++

Copy globals to const locals

(if copying is cheap)

struct {
    /* ... */
    bool fill;
} _internal__state;

void set_draw_mode_filled();
void set_draw_mode_wireframe();

void draw_mesh(const mesh* m) {
    for (const primitive* prim = m->begin(); prim != m->end(); ++prim) {
        if(_internal__is_frontfacing(*prim)) {
            if (_internal__state.fill) {
                _internal__draw_prim_filled(*prim);
            }
            else {
                _internal__draw_prim_wireframe(*prim);
            }
        }
    }
}

15. Make variables const

Effective use of C++

Copy globals to const locals

(if copying is cheap)

void draw_mesh(const mesh* m) {
    for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
        if(_internal__is_frontfacing(*prim))
            if (_internal__state.fill) 
                _internal__draw_prim_filled(*prim);
            else
                _internal__draw_prim_wireframe(*prim);
}
void draw_mesh(const mesh* m) {
    if (_internal__state.fill) {
        for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
            if(_internal__is_frontfacing(*prim))
                _internal__draw_prim_filled(*prim);
    } else {
        for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
            if(_internal__is_frontfacing(*prim))
                _internal__draw_prim_wireframe(*prim);
    }
}

could modify _internal__state.fill

could modify _internal__state.fill

15. Make variables const

Effective use of C++

Copy globals to const locals

(if copying is cheap)

void draw_mesh(const mesh* m) {
    const bool fill = _internal__state.fill;
    for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
        if(_internal__is_frontfacing(*prim))
            if (fill) 
                _internal__draw_prim_filled(*prim);
            else
                _internal__draw_prim_wireframe(*prim);
}
void draw_mesh(const mesh* m) {
    if (_internal__state.fill) {
        for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
            if(_internal__is_frontfacing(*prim))
                _internal__draw_prim_filled(*prim);
    } else {
        for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
            if(_internal__is_frontfacing(*prim))
                _internal__draw_prim_wireframe(*prim);
    }
}

could modify _internal__state.fill

but we don't care

16. Noexcept all the things

Effective use of C++

void f();                  // COULD throw an exception

void f() noexcept;         // WILL NEVER throw an exception

void f() noexcept(true);
void f() noexcept(false);

noexceptness depends on T

template <typename T>
void swap(T& lhs, T& rhs)
    noexcept(std::is_nothrow_move_constructible_v<T>
          && std::is_nothrow_move_assignable_v<T>)
{
    T tmp = std::move(lhs);
    lhs = std::move(rhs);
    rhs = std::move(tmp);
}

template <typename T>
void swap(T& lhs, T& rhs)
    noexcept(noexcept(T(std::move(lhs)))
          && noexcept(lhs = std::move(rhs)))
{
    T tmp = std::move(lhs);
    lhs = std::move(rhs);
    rhs = std::move(tmp);
}
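
One place where noexcept visibly changes generated work is std::vector growth: the mainstream standard library implementations relocate elements with the move constructor only when it is noexcept, and fall back to copying otherwise. The sketch below illustrates that under stated assumptions; NoexceptMove, ThrowingMove and copies_during_growth are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical types: identical except for noexcept on the move constructor.
struct NoexceptMove {
    static inline int copies = 0;
    NoexceptMove() = default;
    NoexceptMove(const NoexceptMove&) { ++copies; }
    NoexceptMove(NoexceptMove&&) noexcept {}
};

struct ThrowingMove {
    static inline int copies = 0;
    ThrowingMove() = default;
    ThrowingMove(const ThrowingMove&) { ++copies; }
    ThrowingMove(ThrowingMove&&) {}  // not declared noexcept
};

static_assert(std::is_nothrow_move_constructible_v<NoexceptMove>);
static_assert(!std::is_nothrow_move_constructible_v<ThrowingMove>);

// Forces one reallocation of a one-element vector and reports how many
// copy constructions that reallocation performed.
template <typename T>
int copies_during_growth() {
    std::vector<T> v;
    v.reserve(1);
    v.emplace_back();
    T::copies = 0;
    v.reserve(v.capacity() + 1);  // must move the element to a new buffer
    return T::copies;
}
```

With these types, growth costs no copies for NoexceptMove and one copy for ThrowingMove, because the vector has to keep its strong exception guarantee.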

17. Use static for internal linkage

Effective use of C++

Static variables

int counter() {
    static int counter = 0;
    return ++counter;
}

Static member functions

namespace fs = std::filesystem;
struct image {
    static image from_file(fs::path path);
};

17. Use static for internal linkage

Effective use of C++

a.cpp

static int global_value;       // Internal linkage variable
static void global_func();     // Internal linkage function

int global_value2;
void global_func2();

b.cpp

// Forward declarations
extern int global_value;
void global_func();
extern int global_value2;
void global_func2();

// Use
void example() {
    global_value = 42;     // unresolved external symbol
    global_func();         // unresolved external symbol
    global_value2 = 42;
    global_func2();
}

18. Use [[noreturn]]

Effective use of C++

[[noreturn]] void Log::Error(const String& msg) {
    logfile << msg << '\n';
    std::cerr << msg << '\n';
    throw Engine::RuntimeError(msg);
}

19. Use [[likely]] and [[unlikely]]

Effective use of C++

bool require_init = true;
void init_lib();
void internal_work();


void work()
{
    if(require_init) {
        init_lib();
        require_init = false;
    }

    internal_work();
}
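
A possible way to annotate the function above, as a sketch: the initialisation branch runs once, so it can be marked [[unlikely]] (standard since C++20). The bodies of init_lib and internal_work and the two call counters are hypothetical stand-ins added so the example is self-contained.

```cpp
#include <cassert>

// Hypothetical stand-ins for the slide's library functions.
inline int init_calls = 0;
inline int work_calls = 0;
inline bool require_init = true;

inline void init_lib() { ++init_calls; }
inline void internal_work() { ++work_calls; }

inline void work() {
    // Only the first call takes this branch, so it is marked as the cold path.
    if (require_init) [[unlikely]] {
        init_lib();
        require_init = false;
    }
    internal_work();
}
```

The attribute is only a hint: behaviour is identical with or without it, and whether the generated code layout changes depends on the compiler.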

Effective use of C++

C++23 [[assume(condition)]];
GCC if (!condition) __builtin_unreachable();
MSVC, ICC __assume(condition);
LLVM __builtin_assume(condition);

20. Use [[assume(condition)]];

Effective use of C++

[[assume(condition)]];
    Condition must be true
    For the optimiser
    If !condition then
        Undefined Behaviour

assert(condition);
    Condition must be true
    For the programmer
    If !condition then
        std::abort() in Debug Mode
        noop in Release Mode

20. Use [[assume(condition)]];

Effective use of C++

Assume that pointer is non null*

*better use a reference

void implementation(internal_t* obj) {
    if (obj) {
        internal_work(*obj);
    }
}

void interface(public_t* obj) {
    if (obj) {
        [[assume(obj->internal)]];
        implementation(obj->internal);
    }
}

Assume pointer alignment*

*or use std::assume_aligned

void limiter(float* samples, size_t count) {
    [[assume(reinterpret_cast<std::uintptr_t>(samples) % 32 == 0)]];
    [[assume(count > 0)]];

    for (size_t i = 0; i < count; ++i) {
        samples[i] = std::clamp(samples[i], -1.0f, 1.0f);
    }
}

example taken from P1774 (the [[assume]] proposal)

20. Use [[assume(condition)]];

Effective use of C++

Declare a code path unreachable*

*or use std::unreachable

const char* get_name(TextureType type) {
    switch(type) {
        case TextureType::Texture2D:
            return "Texture2D";
        case TextureType::Texture3D:
            return "Texture3D";
        case TextureType::Texture2DArray:
            return "Texture2DArray";
        case TextureType::Cubemap:
            return "Cubemap";
        default:
            [[assume(false)]];
    }
}

20. Use [[assume(condition)]];

21. Use __restrict

Effective use of C++

float* __restrict buffer0;
float* __restrict buffer1;

UB if overlap
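
A minimal sketch of how the qualifier is typically applied to function parameters. __restrict is a compiler extension accepted by GCC, Clang and MSVC rather than standard C++, and the function name scale_add is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// The caller promises that dst and src never overlap; passing overlapping
// buffers is undefined behaviour. In exchange the compiler may keep values
// in registers and vectorise without re-reading src after each store.
inline void scale_add(float* __restrict dst, const float* __restrict src,
                      float factor, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] += factor * src[i];
    }
}
```

Whether the annotation changes the generated code depends on the compiler and optimisation level; the result for non-overlapping buffers is the same either way.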

21. Use __restrict

Effective use of C++

pointer provenance

21. Use __restrict

Effective use of C++

GCC, LLVM, ICC __attribute__((malloc))
MSVC __declspec(restrict)

22. Make functions pure

Effective use of C++

[Diagram: f takes param0 and param1, may read global state, and produces output]

GCC, LLVM, ICC __attribute__((pure))
or [[gnu::pure]]
MSVC Not Supported

[Diagram: f takes only param0 and param1 and produces output]

GCC, LLVM, ICC __attribute__((const))
or [[gnu::const]]
MSVC Not Supported


Effective use of C++

Build pipeline modification

Manual hardware oriented optimisations

Your Performance Todo List

  1. Enable compiler optimisations
  2. Set target architecture
  3. Use fast math
  4. Disable exceptions and RTTI
  5. Enable Link Time Optimisation
  6. Use Unity Builds
  7. Link statically
  8. Use Profile Guided Optimisation
  9. Try different compilers
  10. Try different standard library implementations
  11. Keep your tools updated
  12. Preload your program with a replacement library
  13. Use binary post processing tools

14. Use constexpr

15. Make variables const

16. Use noexcept

17. Use static for internal linkage

18. Use [[noreturn]]

19. Use [[likely]] and [[unlikely]]

20. Use [[assume]]

21. Mark pointers restrict

22. Mark functions as pure

Annotate  your code

No redundant copies

23. Take parameters properly

Effective use of C++

call site:

func(x);

declaration?

void func(??? x);

[Flowchart for choosing the type of x, built up over several slides; the last slide of this item shows every branch.]

23. Take parameters properly

Effective use of C++

void f(const std::string& s);
f("Hello");                          // implicit conversion to string (allocation)
                                     // (safe - lifetime of temporary extended)

void f(const char* s);
f(std::string{"Hello"}.c_str());     // verbose (safe)

void f(std::string_view s);
f(std::string{"Hello"});
f("Hello");                          // works for both (no copies) (safe)
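
As a small sketch of the std::string_view signature in use: one function accepts string literals, std::string objects and substrings without allocating. The function name count_spaces is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical read-only string consumer. string_view is cheap to copy,
// so it is taken by value.
inline std::size_t count_spaces(std::string_view text) {
    std::size_t spaces = 0;
    for (char c : text) {
        if (c == ' ') ++spaces;
    }
    return spaces;
}
```

The view does not own its characters, so it should not be stored beyond the lifetime of the string it was created from.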

23. Take parameters properly

Effective use of C++

[Flowchart, flattened: each condition on x with the parameter type it leads to. START HERE marks the entry point.]

is x a raw memory address: use a raw pointer

is x an invocable: try in this order:
    std::invocable<Args...> auto&& x
    return_t(*x)(Args...)
    std::move_only_function<return_t(Args...)>&& x
    std::function<return_t(Args...)> x

if x is a readonly string: take std::string_view

if x is a range:
    does x need to be a contiguous array: take std::span
    can x be an arbitrary range: take std::ranges::***
    does x need to be a specific container: take the container
    otherwise: take iterator pair

does x need to be perfectly forwarded: take by "universal reference" (type&& x)

if x can be null: take std::optional of x

if needing ownership of x: take by unique_ptr, shared_ptr

if x is copied: take by value (type x)

if x is moved from: take by rvalue reference (type&& x)

if x is modified: take by lvalue reference (type& x)

otherwise (x is only read from):
    if x is cheap to copy: take by value (type x)
    otherwise: take by const lvalue reference (const type& x)

24. Avoid allocations in loops

Effective use of C++

move objects out of loops

.clear() if necessary

while (true) {
    std::string line;
    std::getline(std::cin, line);
    if (!std::cin)
        break;
    process_line(line);
}

std::string line;
while (true) {
    std::getline(std::cin, line);
    if (!std::cin)
        break;
    process_line(line);
}

reserve() when an upper bound on size is known ahead of time

std::vector<int> shiny;
for (int i = 1; i <= 100; ++i)
    if (is_shiny(i))
        shiny.push_back(i);

std::vector<int> shiny;
shiny.reserve(100);
for (int i = 1; i <= 100; ++i)
    if (is_shiny(i))
        shiny.push_back(i);
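
The effect of reserve() can be observed directly: once capacity for the upper bound is in place, push_back never moves the buffer. A sketch under stated assumptions follows; the is_shiny predicate here is a hypothetical stand-in, as are Collected and collect_shiny.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical predicate standing in for the slide's is_shiny.
inline bool is_shiny(int i) { return i % 7 == 0; }

struct Collected {
    std::vector<int> values;
    bool reallocated;  // true if the buffer moved while filling
};

inline Collected collect_shiny() {
    std::vector<int> shiny;
    shiny.reserve(100);  // upper bound: at most 100 candidates
    const int* const buffer = shiny.data();
    for (int i = 1; i <= 100; ++i)
        if (is_shiny(i))
            shiny.push_back(i);
    // Compare before moving the vector into the result.
    const bool moved = shiny.data() != buffer;
    return {std::move(shiny), moved};
}
```

The single up-front allocation replaces the series of grow-and-relocate steps that repeated push_back would otherwise trigger.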

25. Avoid copying exceptions

Effective use of C++

catch by reference

catch(std::exception e) {
    std::cerr << e.what() << '\n';
}
catch(const std::exception& e) {
    std::cerr << e.what() << '\n';
}

rethrow current exception

catch(mutable_err& e) {
    e.append("Caught in foo");
    throw e;
}
catch(mutable_err& e) {
    e.append("Caught in foo");
    throw;
}

26. Avoid copies in range-for

Effective use of C++

std::vector<std::string> names;
for (auto name : names) {
	process(name);
}
std::vector<std::string> names;
for (const auto& name : names) {
	process(name);
}

avoid copying the iterated object
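
The difference between the two loops can be made measurable with an element type that counts its copy constructions. This is only an illustrative sketch; Tracked, sum_by_value and sum_by_reference are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical element type that counts how often it is copy-constructed.
struct Tracked {
    static inline int copies = 0;
    int value = 0;
    Tracked() = default;
    explicit Tracked(int v) : value(v) {}
    Tracked(const Tracked& other) : value(other.value) { ++copies; }
    Tracked& operator=(const Tracked&) = default;
};

inline long sum_by_value(const std::vector<Tracked>& items) {
    long total = 0;
    for (auto item : items) total += item.value;  // copies every element
    return total;
}

inline long sum_by_reference(const std::vector<Tracked>& items) {
    long total = 0;
    for (const auto& item : items) total += item.value;  // binds, no copy
    return total;
}
```

For a type such as std::string each of those copies may also allocate, which is where the cost in the slide's example comes from.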

27. Avoid copies in lambda captures

Effective use of C++

std::flat_set<std::string, std::less<>> deviceLayers;
auto supported = [deviceLayers](std::string_view layer) {
    return deviceLayers.contains(layer);
};

std::flat_set<std::string, std::less<>> deviceLayers;
auto supported = [&deviceLayers](std::string_view layer) {
    return deviceLayers.contains(layer);
};

capture [&object]

28. Avoid copies in str. bindings

Effective use of C++

auto [first_person, age] = *map.begin();
const auto& [first_person, age] = *map.begin();

bind reference

29. Provide ref qualified methods

Effective use of C++

template <typename T>
class simple_optional {
    T data;
    bool has_data;
public:
    /* *** */
    T& value() {
    	if (!has_data)
            throw bad_optional_access();
        return data;
    }
    const T& value() const {
        if (!has_data)
            throw bad_optional_access();
        return data;
    }
};
simple_optional<Queue> get_transfer_queue();

try {
    Queue q = get_transfer_queue().value();
    // ...

Queue gets copied

Effective use of C++

template <typename T>
class simple_optional {
    T data;
    bool has_data;
public:
    /* *** */
    T& value() & {
    	if (!has_data)
            throw bad_optional_access();
        return data;
    }
    const T& value() const& {
        if (!has_data)
            throw bad_optional_access();
        return data;
    }
    T&& value() && {
        if (!has_data)
            throw bad_optional_access();
        return std::move(data);
    }
};
simple_optional<Queue> get_transfer_queue();

try {
    Queue q = get_transfer_queue().value();
    // ...

Queue gets moved
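
The copy-versus-move claim can be checked with a payload that counts both operations. The sketch below reduces the optional to a wrapper that always holds a value; Holder, make_holder and the counting Queue are hypothetical stand-ins for the slide's types.

```cpp
#include <cassert>
#include <utility>

// Hypothetical payload that counts copies and moves.
struct Queue {
    static inline int copies = 0;
    static inline int moves = 0;
    Queue() = default;
    Queue(const Queue&) { ++copies; }
    Queue(Queue&&) noexcept { ++moves; }
};

// Reduced wrapper: always holds a value, so there is no empty-state check.
class Holder {
    Queue data;
public:
    Queue& value() & { return data; }
    const Queue& value() const& { return data; }
    Queue&& value() && { return std::move(data); }  // chosen for temporaries
};

inline Holder make_holder() { return Holder{}; }
```

Calling value() on the temporary returned by make_holder selects the && overload, so the Queue is moved; calling it on a named Holder selects the & overload and the Queue is copied.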

29. Provide ref qualified methods

Effective use of C++

template <typename T>
class simple_optional {
    T data;
    bool has_data;
public:
    /* *** */
    decltype(auto) value(this auto&& self) {
        if (!self.has_data)
            throw bad_optional_access();
        return std::forward_like<decltype(self)>(self.data);
    }
};

no code duplication

29. Provide ref qualified methods

Effective use of C++

Build pipeline modification

Manual hardware oriented optimisations

Your Performance Todo List

  1. Enable compiler optimisations
  2. Set target architecture
  3. Use fast math
  4. Disable exceptions and RTTI
  5. Enable Link Time Optimisation
  6. Use Unity Builds
  7. Link statically
  8. Use Profile Guided Optimisation
  9. Try different compilers
  10. Try different standard library implementations
  11. Keep your tools updated
  12. Preload your program with a replacement library
  13. Use binary post processing tools

14. Use constexpr

15. Make variables const

16. Use noexcept

17. Use static for internal linkage

18. Use [[noreturn]]

19. Use [[likely]] and [[unlikely]]

20. Use [[assume]]

21. Mark pointers restrict

22. Mark functions as pure

Annotate  your code

No redundant copies

23. Take function parameters properly

24. Avoid allocations in loops

25. Avoid copying exceptions

26. Avoid copies in range-for

27. Avoid copies in lambda captures

28. Avoid copies in structured bindings

29. Provide && method overloads

Cache-friendly code

Memory

Is memory a contiguous sequence of bytes?

C++ Standard: NO

Process address space (logical, virtual address space): YES

Virtual address space in the Physical address space: NO

Physical address space: YES

Hardware caching: Not even a sequence...

[Diagram labels: C++ memory model, Process address space, Virtual memory, Page table, memory page, Physical address space, Caches]

Virtual Memory

Access virtual memory address
Translate to physical address
Get data

[Diagram labels: Working set, RAM, Swap, Disk, DATA]

Access virtual memory address
Check page table
    in the working set: Fetch from RAM
    page fault, thrashing: Swap page in (Disk)
Translate to physical address
Get data

Memory friendly code

30. Keep the working set size small

Caching

Access virtual memory address
Check TLB (TLB hit or TLB miss)
Check cache: L1, L2, L3 (hit or miss at each level)
Check page table
    in the working set: Fetch from RAM
    page fault, thrashing: Swap page in (Disk)
Translate to physical address
Get data

High latency

Prefetching, Data locality, Cache line:

you wanted ar[0]? well, here's the whole ar

Temporal locality, CPU cache

[Diagram labels: Processor's Execution Units, μop Cache, Loopback buffer, L1 Instruction Cache, L1 Data Cache, Register renaming and register files, L2 Cache, L3 Cache, TLB, Core, CPU, RAM, Working set, Page table, Memory]

Cache-friendly code

31. Exploit data locality

std::array

std::vector

std::deque

std::flat_map

std::flat_set

std::list

std::set

std::unordered_set

std::map

std::unordered_map

Cache-friendly code

int matrix[rows][cols];

for (int row = 0; row < rows; ++row)
    for (int col = 0; col < cols; ++col)
        process(matrix[row][col]);

int matrix[rows][cols];

for (int col = 0; col < cols; ++col)
    for (int row = 0; row < rows; ++row)
        process(matrix[row][col]);
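
To spell out the difference between the two loop nests: a two-dimensional C++ array is stored row by row, so the first nest walks adjacent addresses while the second jumps a whole row ahead on every step. The sketch below is self-contained; the dimensions and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t rows = 4;
constexpr std::size_t cols = 8;
using Matrix = std::array<std::array<int, cols>, rows>;

// Fills the matrix with 0, 1, 2, ... in storage order.
inline Matrix make_matrix() {
    Matrix m{};
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            m[r][c] = static_cast<int>(r * cols + c);
    return m;
}

// Row-major traversal: consecutive iterations touch adjacent addresses.
inline long sum_row_major(const Matrix& m) {
    long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r][c];
    return total;
}

// Column-major traversal: each step skips cols * sizeof(int) bytes.
inline long sum_column_major(const Matrix& m) {
    long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r][c];
    return total;
}
```

Both functions compute the same sum; only the memory access pattern differs, which starts to matter once the matrix no longer fits in cache.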

31. Exploit data locality

Cache-friendly code

struct DebugInfo {
    std::string name;
    time_point creation;
    size_t use_cnt;
};

class DescriptorSet {
    VkDescriptorSet handle;
    DebugInfo debug;
    
    // guaranteed to outlive,
    // not dangling
    const Device& device;
public: 
    // ...
};

31. Exploit data locality

device

some c-string

handle

debug

debug.name

debug.name.m_data

debug.name.m_len

debug.creation

debug.use_cnt

device

...

Cache-friendly code

struct DebugInfo {
    std::string name;
    time_point creation;
    size_t use_cnt;
};

class DescriptorSet {
    VkDescriptorSet handle;
    VkDevice device_raw;
    std::unique_ptr<DebugInfo> debug;
    const Device& device;
public: 
    // ...
};

31. Exploit data locality

handle

debug

device_raw

...

device

Cache-friendly code

32. Exploit temporal locality

Pin thread to a core

Linux pthread_setaffinity_np
Windows SetThreadAffinityMask
macOS thread_policy_set
with thread_affinity_policy_t

Cache-friendly code

32. Exploit temporal locality

Set priority of the process

Linux, macOS setpriority
Windows SetPriorityClass

Set priority of a thread

Linux pthread_setschedprio
Windows SetThreadPriority
macOS setThreadPriority (Objective C)
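A small POSIX sketch of the process-level call. Raising priority usually requires privileges, so this hypothetical helper, `lower_process_priority`, goes the other way and makes the process "nicer"; the same setpriority call with a smaller value would raise priority where permitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <sys/resource.h>  // getpriority, setpriority (POSIX)

// Hypothetical helper: increases the nice value of the calling process by
// `delta` steps, which lowers its scheduling priority.
// Returns true if the new value was applied.
bool lower_process_priority(int delta) {
    assert(delta >= 0);
    errno = 0;
    int current = getpriority(PRIO_PROCESS, 0);
    if (current == -1 && errno != 0)  // -1 is also a valid nice value
        return false;
    int target = std::min(current + delta, 19);  // 19: weakest nice value on Linux
    return setpriority(PRIO_PROCESS, 0, target) == 0;
}
```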

Cache-friendly code

Contiguous data structures

Data oriented design

SOA vs AOS

Sequential memory access

Entity Component Systems

NUMA architectures
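The SOA vs AOS entry above can be sketched as follows. The particle types and the `total_mass` overloads are hypothetical examples: when a loop reads only one field, the structure-of-arrays layout lets it stream through a dense array instead of skipping over unused fields.

```cpp
#include <cassert>
#include <vector>

// Array of structures: summing only `mass` still pulls positions and
// velocities through the cache.
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};

float total_mass(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const ParticleAoS& p : particles)
        sum += p.mass;
    return sum;
}

// Structure of arrays: each field is its own contiguous array.
struct ParticlesSoA {
    std::vector<float> x, y, z, vx, vy, vz, mass;

    void push_back(const ParticleAoS& p) {
        assert(x.size() == mass.size());  // all columns stay the same length
        x.push_back(p.x);   y.push_back(p.y);   z.push_back(p.z);
        vx.push_back(p.vx); vy.push_back(p.vy); vz.push_back(p.vz);
        mass.push_back(p.mass);
    }
};

float total_mass(const ParticlesSoA& particles) {
    float sum = 0.0f;
    for (float m : particles.mass)  // sequential access over a dense array
        sum += m;
    return sum;
}
```

The array-of-structures layout can still be the better choice when every field of an element is used together.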

Cache-friendly code

33. Avoid false sharing

int thread1_data{};
int thread2_data{};

std::thread t1{work, std::ref(thread1_data)};
std::thread t2{work, std::ref(thread2_data)};

likely on the same cache line

false sharing

Cache-friendly code

33. Avoid false sharing

alignas(std::hardware_destructive_interference_size) int thread1_data{};
alignas(std::hardware_destructive_interference_size) int thread2_data{};

std::thread t1{work, std::ref(thread1_data)};
std::thread t2{work, std::ref(thread2_data)};

on different cache lines

no dependencies
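For per-thread data kept in an array, the same alignment idea can be wrapped in a type. The sketch below shows only the layout and starts no threads; `PaddedCounter`, `kCacheLine` and `byte_distance` are hypothetical names, and the 64-byte fallback is an assumption for standard libraries that do not provide the interference-size constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>  // std::hardware_destructive_interference_size, where provided

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;  // assumption: typical x86-64 line size
#endif

// One counter per thread, each padded out to its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) % kCacheLine == 0,
              "array elements must not share a cache line");

// Hypothetical helper: distance in bytes between two objects.
std::size_t byte_distance(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    auto pa = reinterpret_cast<std::uintptr_t>(a);
    auto pb = reinterpret_cast<std::uintptr_t>(b);
    return static_cast<std::size_t>(pa > pb ? pa - pb : pb - pa);
}
```

With `PaddedCounter counters[2];`, each worker thread can be given its own element without the two writes invalidating each other's cache line.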

Cache-friendly code

34. Use non temporal stores

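[Diagram: a regular store travels from the CPU through the cache to RAM; a non-temporal store goes from the CPU to RAM without being kept in the cache.]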
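A hedged sketch of a non-temporal fill using SSE2 intrinsics, with a plain loop as a fallback on targets without SSE2. The helper name `fill_streaming` and its alignment preconditions are assumptions made for this example; whether streaming stores help depends on the buffer being large and not read again soon.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence
#endif

// Hypothetical helper: fills `count` ints starting at `dst` with `value`.
// Precondition: dst is 16-byte aligned and count is a multiple of 4.
void fill_streaming(std::int32_t* dst, std::size_t count, std::int32_t value) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0 && count % 4 == 0);
#if defined(__SSE2__) || defined(_M_X64)
    const __m128i v = _mm_set1_epi32(value);
    for (std::size_t i = 0; i < count; i += 4) {
        // Non-temporal hint: the data is written without being kept in the cache.
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    _mm_sfence();  // order the streamed stores before any later stores
#else
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = value;  // portable fallback: regular stores
#endif
}
```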

Your Performance Todo List

Build pipeline modification

  1. Enable compiler optimisations
  2. Set target architecture
  3. Use fast math
  4. Disable exceptions and RTTI
  5. Enable Link Time Optimisation
  6. Use Unity Builds
  7. Link statically
  8. Use Profile Guided Optimisation
  9. Try different compilers
  10. Try different standard library implementations
  11. Keep your tools updated
  12. Preload your program with a replacement library
  13. Use binary post processing tools

Effective use of C++

Annotate your code

  14. Use constexpr
  15. Make variables const
  16. Use noexcept
  17. Use static for internal linkage
  18. Use [[noreturn]]
  19. Use [[likely]] and [[unlikely]]
  20. Use [[assume]]
  21. Mark pointers restrict
  22. Mark functions as pure

No redundant copies

  23. Take function parameters properly
  24. Avoid allocations in loops
  25. Avoid copying exceptions
  26. Avoid copies in range-for
  27. Avoid copies in lambda captures
  28. Avoid copies in structured bindings
  29. Provide && method overloads

Manual hardware oriented optimisations

Cache-friendly code

  30. Keep the working set small
  31. Exploit data locality
  32. Exploit temporal locality
  33. Avoid false sharing
  34. Use non temporal stores

Branch predictor friendly code

  35. Avoid indirected calls
  36. Make branches predictable
  37. Use branchless optimisations

  38. Use SIMD intrinsics

&s and else WHERE credit's due

In the order of appearance

Presentation made using slides.com

Nexa font family by Fontfabric

SVGs made in Pixelmator Pro

“Memory Tape” rendered in Blender

Your Performance Todo List: The Most Important Performance Opportunities and Pitfalls to Remember About

By Jan Bielak

This is the interactive slide deck for my CppCon 2022 talk. Writing efficient programs is hard because it requires a lot of knowledge, experience and strategic thinking. There have been many talks on optimization, and each often addresses a single concept. Gaining a bird’s-eye view of the factors affecting performance often takes many hours of research. To lessen the mental burden of optimizing programs, I have picked out the techniques I believe are most important. During the talk, I will present them in an organized manner and provide practical examples of how they can be applied. I will first discuss what I believe are the main goals efficient programs strive to achieve. Then, I will present the general methods of achieving those goals. Then, for the majority of the talk, we will discuss a few dozen performance opportunities. For each of them, I will explain the underlying mechanism of how the optimisation works. I will avoid bluntly giving guidelines to follow without explanation. Each of the techniques naturally comes with its costs, and those will be discussed as well. I will additionally discuss various performance pitfalls. These are sometimes called “premature pessimisations” in contrast to the often used term of “premature optimizations”. I will show examples of optimizations which do not incur any cost on program readability or ma
