Your Performance Todo List
The most important optimisation opportunities and pitfalls to remember about
by
Jan Bielak
Jan Bielak
Warsaw Staszic High School, Poland
Self-taught C++ Developer
Realtime rendering
Game development
janbielak.com github.com/janekb04 youtube.com/@janbielak
Practically Correct, Just-in-Time Shell Script Parallelization
Konstantinos Kallas, Tammam Mustafa, Jan Bielak, Dimitris Karnikis, Thurston H.Y. Dang, Michael Greenberg, Nikos Vasilakis. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI22)
Performance
1. No unnecessary work
2. Use all computing power
3. Avoid waits and stalls
4. Use hardware efficiently
No unnecessary copying
No unnecessary allocations
Use all cores
Use SIMD
Lockless data structures
Asynchronous APIs
Job Systems
Cache friendliness
Well predictable code
5. OS-level efficiency
Performance
1. No unnecessary work
2. Use all computing power
3. Avoid waits and stalls
4. Use hardware efficiently
5. OS-level efficiency
Effective use of C++
Build pipeline modification
Manual hardware oriented optimisations
Your Performance Todo List
Effective use of C++
Build pipeline modification
Manual hardware oriented optimisations
Build pipeline modification
1. Enable compiler optimisations
Optimize for speed
GCC, LLVM, ICC | -O2 or -O3 |
---|---|
MSVC | /Ox or /O2 |
Optimize for size
GCC, LLVM, ICC | -Os |
---|---|
MSVC | /O1 |
Optimization #1
Longer compile time
Build pipeline modification
2. Set target architecture
For x86
GCC, LLVM, ICC | -march=native -mtune=native | automatic detection of current processor's features |
---|---|---|
MSVC | /arch:IA32 or /arch:SSE or /arch:SSE2 or /arch:AVX or /arch:AVX2 or /arch:AVX512 | needs to be specified manually |
For ARM
GCC, LLVM | -mcpu=native | automatic detection of current processor's features |
---|---|---|
MSVC | /arch:ARMv7VE or /arch:VFPv4 or /arch:armv8.0 ... or /arch:armv8.8 | needs to be specified manually |
Build pipeline modification
3. Use fast math
GCC, LLVM | -ffast-math (included in -Ofast) |
---|---|
MSVC | /fp:fast |
ICC | -fp-model=fast |
Faster computation
Less precise results
Non standard-compliant
Build pipeline modification
4. Disable exceptions and RTTI
No exceptions
GCC, LLVM, ICC | -fno-exceptions |
---|---|
MSVC | /EHs-c- /D_HAS_EXCEPTIONS=0 |
No RTTI
GCC, LLVM, ICC | -fno-rtti |
---|---|
MSVC | /GR- |
Limited performance gains
Non standard-compliant
Breaks code using exceptions
Build pipeline modification
5. Enable Link Time Optimization
(Diagram: three Compiler stages feeding a single Linker)
GCC, LLVM | -flto |
---|---|
MSVC | /GL |
ICC | -ipo |
Build pipeline modification
6. Use Unity Builds
(Diagram: Compiler stages feeding the Linker, with sources grouped into Unity Builds)
CMake | -DCMAKE_UNITY_BUILD=ON |
---|
Build pipeline modification
7. Link statically
Static Linking | Dynamic Linking |
---|---|
Better optimisable | More space efficient |
 | Can be updated independently of executable |
Build pipeline modification
8. Use Profile Guided Optimisation
Build pipeline
GCC, LLVM | -fprofile-generate |
---|---|
MSVC | /GENPROFILE |
ICC | -prof-gen |
Execute
Build pipeline
GCC, LLVM | -fprofile-use |
---|---|
MSVC | /USEPROFILE |
ICC | -prof-use |
Build pipeline modification
9. Try different compilers
Build pipeline modification
10. Try different standard libraries
Build pipeline modification
11. Keep your tools updated
Build pipeline modification
12. Preload with a replacement lib
Linux, BSD: env LD_PRELOAD=/usr/lib/libSUPERmalloc.so ./myprogram
macOS: env DYLD_INSERT_LIBRARIES=/usr/lib/libSUPERmalloc.dylib ./myprogram
Windows: Requires DLL injection
Build pipeline modification
13. Use binary post processing tools
LLVM BOLT
perf record
perf2bolt
llvm-bolt
Effective use of C++
Build pipeline modification
Manual hardware oriented optimisations
Your Performance Todo List
- Enable compiler optimisations
- Set target architecture
- Use fast math
- Disable exceptions and RTTI
- Enable Link Time Optimisation
- Use Unity Builds
- Link statically
- Use Profile Guided Optimisation
- Try different compilers
- Try different standard library implementations
- Keep your tools updated
- Preload your program with a replacement library
- Use binary post processing tools
Annotate your code
14. Constexpr all the things
Effective use of C++
Constant expressions:
Literals:
1, 3.0f, nullptr, "Hello"
Arithmetic:
2 + 3, 4.0 / 3.0
Sizes and alignments:
sizeof(int), alignof(std::vector<int>)
...
14. Constexpr all the things
Effective use of C++
Constexpr functions:
constexpr int f(int x) { return 3 * x + 5; }
invocation MAY be a constant expression
f(5)
is a constant expression
int x;
std::cin >> x;
f(x);
is NOT a constant expression
Is a given invocation evaluated at compile time?
(inside function body)
if (std::is_constant_evaluated()) { ... }
if consteval { ... }
Immediate functions:
consteval int f(int x) { return 3 * x + 5; }
invocation MUST be a constant expression
f(5)
is a constant expression
int x;
std::cin >> x;
f(x);
is a COMPILE ERROR
If constexpr:
if constexpr (compile_time_condition) {...}
if constexpr (std::is_constant_evaluated()) {...}
ALWAYS TRUE
14. Constexpr all the things
Effective use of C++
Constexpr variables:
constexpr std::array primes{ 2, 3, 5, 7, 11 };
variable must be initialised at its declaration (by a constant expression)
constexpr int x;
x = 3;
is a COMPILE ERROR
implies const
primes[0] = 1;
is a COMPILE ERROR
accessing it is a constant expression
constexpr int f(int x) { return x + 1; }
int main()
{
int x1 = 3;
constexpr int y1 = f(x1); // is a COMPILE ERROR
constexpr int x2 = 3;
constexpr int y2 = f(x2);
}
Constinit variables:
constinit std::array primes{ 2, 3, 5, 7, 11 };
variable must be initialised at its declaration by a constant expression
14. Constexpr all the things
Effective use of C++
Constexpr variables:
constexpr std::array primes{ 2, 3, 5, 7, 11 };
Constinit variables:
constinit std::array primes{ 2, 3, 5, 7, 11 };
Constexpr functions:
constexpr int f(int x) { return 3 * x + 5; }
Is a given invocation evaluated at compile time?
if (std::is_constant_evaluated()) { ... }
if consteval { ... }
Immediate functions:
consteval int f(int x) { return 3 * x + 5; }
If constexpr:
if constexpr (compile_time_condition) {...}
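A minimal sketch of the constexpr ideas above, restricted to C++17 so it builds on older toolchains (consteval, constinit and if consteval need C++20 or C++23). The names scale, sum_primes and scale_at_runtime are hypothetical and only illustrate the pattern.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: usable both at compile time and at run time.
constexpr int scale(int x) { return 3 * x + 5; }

// A constexpr variable must be initialised by a constant expression.
constexpr std::array<int, 5> primes{2, 3, 5, 7, 11};

// Evaluated by the compiler when used in a constant expression.
constexpr int sum_primes() {
    int total = 0;
    for (std::size_t i = 0; i < primes.size(); ++i)
        total += primes[i];
    return total;
}

// These checks happen entirely at compile time.
static_assert(scale(5) == 20, "3 * 5 + 5");
static_assert(sum_primes() == 28, "2 + 3 + 5 + 7 + 11");

// With a run-time argument the same function is an ordinary call.
int scale_at_runtime(int x) {
    assert(x < 1000000); // illustrative precondition to keep the result in range
    return scale(x);
}
```

The static_assert lines are the cheapest way to confirm that a call really is a constant expression: if it is not, the build fails instead of silently falling back to run-time evaluation.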
15. Make variables const
Effective use of C++
std::vector<float>
get_mean_deltas(std::vector<float> data)
{
float sum = 0;
for (auto&& num : data)
sum += num;
for (auto& num : data)
num -= sum / data.size();
return data;
}
Declare variables const
15. Make variables const
Effective use of C++
std::vector<float>
get_mean_deltas(std::vector<float> data)
{
const float sum = std::accumulate(
data.begin(),
data.end(),
0.0f
);
for (auto& num : data)
num -= sum / data.size();
return data;
}
std::vector<float>
get_mean_deltas(std::vector<float> data)
{
const float sum = std::accumulate(
data.begin(),
data.end(),
0.0f
);
const float __mean = sum / data.size();
for (auto& num : data)
num -= __mean;
return data;
}
sum is const...
...so this expression is loop-invariant and can be hoisted
and no expensive division in loop!
~ compiler's thought process
(paraphrased)
Declare variables const
15. Make variables const
Effective use of C++
template <typename T>
class vector {
T* begin;
T* end;
T* capacity;
/* ... */
public:
constexpr size_t size() const noexcept {
return end - begin;
}
};
template <typename T>
class vector {
T* begin;
T* end;
T* capacity;
/* ... */
public:
constexpr size_t size(this const vector& self) noexcept {
return self.end - self.begin;
}
};
Declare member functions const
15. Make variables const
Effective use of C++
Copy globals to const locals
(if copying is cheap)
struct {
/* ... */
bool fill;
} _internal__state;
void set_draw_mode_filled();
void set_draw_mode_wireframe();
void draw_mesh(const mesh* m) {
for (const primitive* prim = m->begin(); prim != m->end(); ++prim) {
if(_internal__is_frontfacing(*prim)) {
if (_internal__state.fill) {
_internal__draw_prim_filled(*prim);
}
else {
_internal__draw_prim_wireframe(*prim);
}
}
}
}
15. Make variables const
Effective use of C++
Copy globals to const locals
(if copying is cheap)
void draw_mesh(const mesh* m) {
for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
if(_internal__is_frontfacing(*prim))
if (_internal__state.fill)
_internal__draw_prim_filled(*prim);
else
_internal__draw_prim_wireframe(*prim);
}
void draw_mesh(const mesh* m) {
if (_internal__state.fill) {
for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
if(_internal__is_frontfacing(*prim))
_internal__draw_prim_filled(*prim);
}
else {
for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
if(_internal__is_frontfacing(*prim))
_internal__draw_prim_wireframe(*prim);
}
}
could modify _internal__state.fill
could modify _internal__state.fill
15. Make variables const
Effective use of C++
Copy globals to const locals
(if copying is cheap)
void draw_mesh(const mesh* m) {
const bool fill = _internal__state.fill;
for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
if(_internal__is_frontfacing(*prim))
if (fill)
_internal__draw_prim_filled(*prim);
else
_internal__draw_prim_wireframe(*prim);
}
void draw_mesh(const mesh* m) {
if (_internal__state.fill) {
for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
if(_internal__is_frontfacing(*prim))
_internal__draw_prim_filled(*prim);
}
else {
for (const primitive* prim = m->begin(); prim != m->end(); ++prim)
if(_internal__is_frontfacing(*prim))
_internal__draw_prim_wireframe(*prim);
}
}
could modify _internal__state.fill
but we don't care
16. Noexcept all the things
Effective use of C++
void f();
COULD throw an exception
void f() noexcept;
WILL NEVER throw an exception
void f() noexcept(true);
void f() noexcept(false);
template <typename T>
void swap(T& lhs, T& rhs)
noexcept(std::is_nothrow_move_constructible_v<T>
&& std::is_nothrow_move_assignable_v<T>)
{
T tmp = std::move(lhs);
lhs = std::move(rhs);
rhs = std::move(tmp);
}
noexceptness
depends on T
template <typename T>
void swap(T& lhs, T& rhs)
noexcept(noexcept(T(std::move(lhs)))
&& noexcept(lhs = std::move(rhs)))
{
T tmp = std::move(lhs);
lhs = std::move(rhs);
rhs = std::move(tmp);
}
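As a sketch of how conditional noexcept behaves in practice, the block below declares a swap-like function and then queries its noexceptness with the noexcept operator. The names swap_values and ThrowingMove are hypothetical, chosen to avoid clashing with std::swap.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical swap: noexcept only when T's move operations are.
template <typename T>
void swap_values(T& lhs, T& rhs)
    noexcept(std::is_nothrow_move_constructible_v<T> &&
             std::is_nothrow_move_assignable_v<T>)
{
    T tmp = std::move(lhs);
    lhs = std::move(rhs);
    rhs = std::move(tmp);
}

// Hypothetical type whose move operations are allowed to throw.
struct ThrowingMove {
    ThrowingMove() = default;
    ThrowingMove(ThrowingMove&&) noexcept(false) {}
    ThrowingMove& operator=(ThrowingMove&&) noexcept(false) { return *this; }
};

// noexcept(expr) asks at compile time whether expr is declared non-throwing.
static_assert(noexcept(swap_values(std::declval<int&>(), std::declval<int&>())));
static_assert(!noexcept(swap_values(std::declval<ThrowingMove&>(),
                                    std::declval<ThrowingMove&>())));

// Small run-time check that the swap itself works.
inline void swap_values_selftest() {
    std::string a = "left", b = "right";
    swap_values(a, b);
    assert(a == "right" && b == "left");
}
```

Containers such as std::vector consult exactly this kind of information (through std::move_if_noexcept) when deciding whether elements can be moved or must be copied during reallocation.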
17. Use static for internal linkage
Effective use of C++
Static variables
int counter() {
static int counter = 0;
return ++counter;
}
Static member functions
namespace fs = std::filesystem;
struct image {
static image from_file(fs::path path);
};
17. Use static for internal linkage
Effective use of C++
static int global_value;
static void global_func();
Internal linkage variables
Internal linkage functions
a.cpp
b.cpp
// Forward declarations
extern int global_value;
void global_func();
extern int global_value2;
void global_func2();
//Use
void example() {
global_value = 42;
global_func();
global_value2 = 42;
global_func2();
}
17. Use static for internal linkage
Effective use of C++
a.cpp
Internal linkage functions
static int global_value;
static void global_func();
int global_value2;
void global_func2();
b.cpp
// Forward declarations
extern int global_value;
void global_func();
extern int global_value2;
void global_func2();
// Use
void example() {
global_value = 42; // unresolved external symbol
global_func(); // unresolved external symbol
global_value2 = 42;
global_func2();
}
18. Use [[noreturn]]
Effective use of C++
[[noreturn]] void Log::Error(const String& msg) {
logfile << msg << '\n';
std::cerr << msg << '\n';
throw Engine::RuntimeError(msg);
}
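A small sketch of what [[noreturn]] buys at the call site. The helpers fail and checked_divide are hypothetical; the point is that the compiler can treat everything after the call as unreachable.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical error helper: it always throws, so control never returns.
[[noreturn]] void fail(const std::string& msg) {
    throw std::runtime_error(msg);
}

// Because fail() is [[noreturn]], the compiler knows the path after the call
// is dead, so no return statement is needed there.
int checked_divide(int a, int b) {
    if (b != 0)
        return a / b;
    fail("division by zero");
}

// Convenience wrapper used to observe the throwing path.
inline bool divide_throws(int a, int b) {
    try {
        (void)checked_divide(a, b);
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

inline void noreturn_selftest() {
    assert(checked_divide(6, 3) == 2);
    assert(divide_throws(1, 0));
}
```

Marking a function [[noreturn]] and then returning from it is undefined behaviour, so the attribute belongs only on functions that always throw, terminate or loop forever.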
19. Use [[likely]] and [[unlikely]]
Effective use of C++
bool require_init = true;
void init_lib();
void internal_work();
void work()
{
if(require_init) {
init_lib();
require_init = false;
}
internal_work();
}
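One way the work() function above might be annotated, as a sketch: the initialisation branch runs only once, so it is marked [[unlikely]] (a C++20 attribute; older language modes may warn and ignore it). The counters init_calls and work_calls are hypothetical additions so the behaviour can be observed.

```cpp
#include <cassert>

bool require_init = true;
int init_calls = 0; // hypothetical counter, for observation only
int work_calls = 0; // hypothetical counter, for observation only

void init_lib() { ++init_calls; }
void internal_work() { ++work_calls; }

void work() {
    // True only on the very first call, so hint that this path is cold.
    if (require_init) [[unlikely]] {
        init_lib();
        require_init = false;
    }
    internal_work();
}

// Calls work() n times and checks that initialisation happened at most once.
inline void run_work(int n) {
    for (int i = 0; i < n; ++i)
        work();
    assert(init_calls <= 1);
}
```

The attribute changes code layout and static branch hints, not behaviour, so a wrong annotation costs speed rather than correctness.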
Effective use of C++
C++23 | [[assume(condition)]]; |
---|---|
GCC | if (!condition) __builtin_unreachable(); |
MSVC, ICC | __assume(condition); |
LLVM | __builtin_assume(condition); |
20. Use [[assume(condition)]];
Effective use of C++
[[assume(condition)]]; | assert(condition); |
---|---|
Condition must be true | Condition must be true |
For the optimiser | For the programmer |
If !condition then Undefined Behaviour | If !condition then std::abort() in Debug Mode, no-op in Release Mode |
20. Use [[assume(condition)]];
Effective use of C++
void implementation(internal_t* obj) {
if (obj) {
internal_work(*obj);
}
}
void interface(public_t* obj) {
if (obj) {
[[assume(obj->internal)]];
implementation(obj->internal);
}
}
Assume that pointer is non null*
*better use a reference
void limiter(float* samples, size_t count) {
[[assume(reinterpret_cast<std::uintptr_t>(samples) % 32 == 0)]];
[[assume(count > 0)]];
for (size_t i = 0; i < count; ++i) {
samples[i] = std::clamp(samples[i], -1.0f, 1.0f);
}
}
Assume pointer alignment*
*or use std::assume_aligned
example taken from P1774 (the [[assume]] proposal)
20. Use [[assume(condition)]];
Effective use of C++
const char* get_name(TextureType type) {
switch(type) {
case TextureType::Texture2D:
return "Texture2D";
case TextureType::Texture3D:
return "Texture3D";
case TextureType::Texture2DArray:
return "Texture2DArray";
case TextureType::Cubemap:
return "Cubemap";
default:
[[assume(false)]];
}
}
Declare a code path unreachable
*or use std::unreachable
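Until C++23 is available everywhere, the compiler-specific spellings listed for item 20 can be wrapped in a macro. The following is a sketch under that assumption; ASSUME and sum_blocks are hypothetical names, and the fallback branch simply does nothing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portability macro built from the per-compiler spellings.
#if defined(__clang__)
#  define ASSUME(cond) __builtin_assume(cond)
#elif defined(_MSC_VER)
#  define ASSUME(cond) __assume(cond)
#elif defined(__GNUC__)
#  define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#  define ASSUME(cond) ((void)0)
#endif

// Precondition (the caller's responsibility): count is a non-zero multiple of 8.
// Violating it is undefined behaviour, exactly as with [[assume]].
float sum_blocks(const float* samples, std::size_t count) {
    ASSUME(count > 0);
    ASSUME(count % 8 == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        total += samples[i];
    return total;
}

inline void assume_selftest() {
    const float ones[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    assert(sum_blocks(ones, 8) == 8.0f);
}
```

With the trip count known to be a positive multiple of the vector width, an optimiser is free to drop the scalar remainder loop; whether it actually does depends on the compiler and flags.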
20. Use [[assume(condition)]];
21. Use __restrict
Effective use of C++
float* __restrict buffer0;
float* __restrict buffer1;
UB if overlap
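A sketch of __restrict on function parameters, which is where it usually pays off. The function add_arrays is hypothetical; __restrict is a compiler extension accepted by GCC, Clang and MSVC rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: the caller promises that out, a and b never overlap,
// so a store to out[i] cannot change any a[j] or b[j].
void add_arrays(float* __restrict out,
                const float* __restrict a,
                const float* __restrict b,
                std::size_t n) {
    assert(out != a && out != b); // cheap sanity check of the no-overlap promise
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Without the qualifier the compiler has to assume that writing out[i] might modify the inputs, which can force reloads or block vectorisation. Passing overlapping buffers to this function is undefined behaviour.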
21. Use __restrict
Effective use of C++
pointer provenance
21. Use __restrict
Effective use of C++
GCC, LLVM, ICC | __attribute__((malloc)) |
---|---|
MSVC | __declspec(restrict) |
22. Make functions pure
Effective use of C++
(Diagram: f maps param0, param1 and global state to output)
GCC, LLVM, ICC | __attribute__((pure)) or [[gnu::pure]] |
---|---|
MSVC | Not Supported |
(Diagram: f maps param0 and param1 to output)
GCC, LLVM, ICC | __attribute__((const)) or [[gnu::const]] |
---|---|
MSVC | Not Supported |
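To make the difference between the two attributes concrete, here is a sketch with one function of each kind. ATTR_PURE, ATTR_CONST, g_scale, scaled_sum and square are hypothetical names; on compilers without the GNU attributes the macros expand to nothing.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
#  define ATTR_PURE  __attribute__((pure))
#  define ATTR_CONST __attribute__((const))
#else
#  define ATTR_PURE
#  define ATTR_CONST
#endif

int g_scale = 2; // global state

// pure: no side effects, but may read global state and memory reachable
// through its pointer arguments.
ATTR_PURE int scaled_sum(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total * g_scale;
}

// const: the result depends only on the argument values themselves.
ATTR_CONST int square(int x) { return x * x; }

inline void pure_selftest() {
    const int d[3] = {1, 2, 3};
    assert(scaled_sum(d, 3) == 12);
    assert(square(7) == 49);
}
```

Both attributes are promises to the optimiser: repeated calls with the same inputs may be merged or hoisted out of loops, so a function that writes to observable state must not carry either of them.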
Effective use of C++
Build pipeline modification
Manual hardware oriented optimisations
Your Performance Todo List
- Enable compiler optimisations
- Set target architecture
- Use fast math
- Disable exceptions and RTTI
- Enable Link Time Optimisation
- Use Unity Builds
- Link statically
- Use Profile Guided Optimisation
- Try different compilers
- Try different standard library implementations
- Keep your tools updated
- Preload your program with a replacement library
- Use binary post processing tools
14. Use constexpr
15. Make variables const
16. Use noexcept
17. Use static for internal linkage
18. Use [[noreturn]]
19. Use [[likely]] and [[unlikely]]
20. Use [[assume]]
21. Mark pointers restrict
22. Mark functions as pure
Annotate your code
No redundant copies
23. Take parameters properly
Effective use of C++
call site:
func(x);
declaration?
void func(??? x);
23. Take parameters properly
Effective use of C++
void f(const std::string& s);
f("Hello");
f(std::string{"Hello"}.c_str());
void f(const char* s);
(safe - lifetime of temporary extended)
implicit conversion to string
(allocation)
verbose
(safe)
f(std::string{"Hello"});
f("Hello");
void f(std::string_view s);
works for both
(no copies)
(safe)
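A short sketch of the std::string_view option: one function accepts string literals, C strings and std::string objects without allocating. The function count_vowels is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical read-only consumer: it inspects the characters but never
// stores the view, so the caller's buffer only has to outlive the call.
std::size_t count_vowels(std::string_view s) {
    constexpr std::string_view vowels = "aeiouAEIOU";
    std::size_t n = 0;
    for (char c : s)
        if (vowels.find(c) != std::string_view::npos)
            ++n;
    return n;
}

inline void string_view_selftest() {
    assert(count_vowels("Hello") == 2);              // literal, no std::string built
    assert(count_vowels(std::string{"Hello"}) == 2); // temporary lives until the call ends
}
```

The usual caveat applies: a string_view does not own its characters, so it should not be stored beyond the lifetime of the string it was created from.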
23. Take parameters properly
Effective use of C++
START HERE
- if x is a readonly string: take std::string_view
- is x an invocable: try in this order:
  - std::invocable<Args...> auto&& x
  - return_t(*x)(Args...)
  - std::move_only_function<return_t(Args...)>&& x
  - std::function<return_t(Args...)> x
- is x a raw memory address: use a raw pointer
- if x is a range:
  - does x need to be a contiguous array: take std::span
  - can x be an arbitrary range: take std::ranges::***
  - does x need to be a specific container: take the container
  - otherwise: take iterator pair
- if x can be null: take std::optional of x
- if needing ownership of x: take by unique_ptr, shared_ptr
- does x need to be perfectly forwarded: take by "universal reference" (type&& x)
- if x is modified: take by lvalue reference (type& x)
- if x is cheap to copy: take by value (type x)
- if x is copied: take by value (type x)
- if x is moved from: take by rvalue reference (type&& x)
- otherwise (x is only read from): take by const lvalue reference (const type& x)
24. Avoid allocations in loops
Effective use of C++
while (true) {
std::string line;
std::getline(std::cin, line);
if (!std::cin)
break;
process_line(line);
}
std::string line;
while (true) {
std::getline(std::cin, line);
if (!std::cin)
break;
process_line(line);
}
std::vector<int> shiny;
for (int i = 1; i <= 100; ++i)
if (is_shiny(i))
shiny.push_back(i);
std::vector<int> shiny;
shiny.reserve(100);
for (int i = 1; i <= 100; ++i)
if (is_shiny(i))
shiny.push_back(i);
move objects out of loops
.clear() if necessary
reserve() when an upper bound on size is known ahead of time
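Both pieces of advice can be sketched in runnable form. The names is_multiple_of_seven (standing in for is_shiny), collect and total_length are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical predicate standing in for is_shiny on the slide.
bool is_multiple_of_seven(int i) { return i % 7 == 0; }

// The upper bound on the result size is known, so reserve once up front.
std::vector<int> collect(int upper) {
    assert(upper >= 0);
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(upper));
    for (int i = 1; i <= upper; ++i)
        if (is_multiple_of_seven(i))
            out.push_back(i);
    return out;
}

// One std::string is reused for every line; getline overwrites its contents
// and implementations typically keep the already allocated capacity.
std::size_t total_length(std::istream& in) {
    std::string line; // declared once, outside the loop
    std::size_t total = 0;
    while (std::getline(in, line))
        total += line.size();
    return total;
}
```

Reserving the upper bound trades some memory for a single allocation; when the bound is very loose, a later shrink_to_fit() call may be worth considering.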
25. Avoid copying exceptions
Effective use of C++
catch(std::exception e) {
std::cerr << e.what() << '\n';
}
catch(const std::exception& e) {
std::cerr << e.what() << '\n';
}
catch(mutable_err& e) {
e.append("Caught in foo");
throw e;
}
catch(mutable_err& e) {
e.append("Caught in foo");
throw;
}
catch by reference
rethrow current exception
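Besides the extra copy, `throw e;` has a second cost that a sketch makes visible: it throws a new object of the static type of e, so a derived exception is sliced. The names parse_error, rethrow_current, rethrow_copy and throws_parse_error are hypothetical.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical derived exception type.
struct parse_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// `throw;` rethrows the exception object currently being handled.
void rethrow_current() {
    try { throw parse_error("bad token"); }
    catch (std::runtime_error&) { throw; }
}

// `throw e;` copy-constructs a new std::runtime_error from e,
// so the parse_error part is lost.
void rethrow_copy() {
    try { throw parse_error("bad token"); }
    catch (std::runtime_error& e) { throw e; }
}

// Helper: reports whether calling f ends in a parse_error.
template <typename F>
bool throws_parse_error(F f) {
    try { f(); }
    catch (const parse_error&) { return true; }
    catch (const std::exception&) { return false; }
    return false;
}

inline void rethrow_selftest() {
    assert(throws_parse_error(rethrow_current));
    assert(!throws_parse_error(rethrow_copy));
}
```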
26. Avoid copies in range-for
Effective use of C++
std::vector<std::string> names;
for (auto name : names) {
process(name);
}
std::vector<std::string> names;
for (const auto& name : names) {
process(name);
}
avoid copying the iterated object
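The difference can be measured with a type that counts its own copies. This is a sketch; Tracked, sum_by_value and sum_by_reference are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical element type that counts how often it is copy-constructed.
struct Tracked {
    static inline int copies = 0;
    int value = 0;
    Tracked() = default;
    explicit Tracked(int v) : value(v) {}
    Tracked(const Tracked& other) : value(other.value) { ++copies; }
    Tracked& operator=(const Tracked&) = default;
};

int sum_by_value(const std::vector<Tracked>& items) {
    int total = 0;
    for (auto item : items)        // copy-constructs every element
        total += item.value;
    return total;
}

int sum_by_reference(const std::vector<Tracked>& items) {
    int total = 0;
    for (const auto& item : items) // binds to the element in place
        total += item.value;
    return total;
}

// Builds a vector of n elements without triggering any copies.
inline std::vector<Tracked> make_items(int n) {
    assert(n >= 0);
    std::vector<Tracked> items;
    items.reserve(static_cast<std::size_t>(n));
    for (int i = 1; i <= n; ++i)
        items.emplace_back(i);
    return items;
}
```

For a type like std::string each of those copies may also mean a heap allocation, which is why the by-reference loop is the safer default.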
27. Avoid copies in lambda captures
Effective use of C++
std::flat_set<std::string, std::less<>> deviceLayers;
auto supported = [deviceLayers](std::string_view layer) {
return deviceLayers.contains(layer);
};
std::flat_set<std::string, std::less<>> deviceLayers;
auto supported = [&deviceLayers](std::string_view layer) {
return deviceLayers.contains(layer);
};
capture [&object]
28. Avoid copies in str. bindings
Effective use of C++
auto [first_person, age] = *map.begin();
const auto& [first_person, age] = *map.begin();
bind reference
29. Provide ref qualified methods
Effective use of C++
template <typename T>
class simple_optional {
T data;
bool has_data;
public:
/* *** */
T& value() {
if (!has_data)
throw bad_optional_access();
return data;
}
const T& value() const {
if (!has_data)
throw bad_optional_access();
return data;
}
};
simple_optional<Queue> get_transfer_queue();
try {
Queue q = get_transfer_queue().value();
// ...
Queue gets copied
Effective use of C++
template <typename T>
class simple_optional {
T data;
bool has_data;
public:
/* *** */
T& value() & {
if (!has_data)
throw bad_optional_access();
return data;
}
const T& value() const& {
if (!has_data)
throw bad_optional_access();
return data;
}
T&& value() && {
if (!has_data)
throw bad_optional_access();
return std::move(data);
}
};
simple_optional<Queue> get_transfer_queue();
try {
Queue q = get_transfer_queue().value();
// ...
Queue gets moved
29. Provide ref qualified methods
Effective use of C++
template <typename T>
class simple_optional {
T data;
bool has_data;
public:
/* *** */
decltype(auto) value(this auto&& self) {
if (!self.has_data)
throw bad_optional_access();
return std::forward_like<decltype(self)>(self.data);
}
};
no code duplication
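For toolchains without C++23's deducing this, the ref-qualified overloads can be written out by hand. A minimal sketch, with Holder and make_holder as hypothetical names:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical wrapper with ref-qualified accessors (available since C++11).
class Holder {
    std::string data_;
public:
    explicit Holder(std::string s) : data_(std::move(s)) {}

    // Called on lvalues: hand out a reference, the caller decides about copying.
    const std::string& value() const& { return data_; }

    // Called on rvalues: the object is about to die, so let the caller move out.
    std::string&& value() && { return std::move(data_); }
};

// Returns a temporary, so value() below selects the && overload.
Holder make_holder() { return Holder(std::string(64, 'x')); }

inline void holder_selftest() {
    Holder h("abc");
    assert(h.value() == "abc");              // const& overload
    std::string moved = make_holder().value(); // && overload, moves the buffer
    assert(moved.size() == 64);
}
```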
29. Provide ref qualified methods
Effective use of C++
Build pipeline modification
Manual hardware oriented optimisations
Your Performance Todo List
- Enable compiler optimisations
- Set target architecture
- Use fast math
- Disable exceptions and RTTI
- Enable Link Time Optimisation
- Use Unity Builds
- Link statically
- Use Profile Guided Optimisation
- Try different compilers
- Try different standard library implementations
- Keep your tools updated
- Preload your program with a replacement library
- Use binary post processing tools
14. Use constexpr
15. Make variables const
16. Use noexcept
17. Use static for internal linkage
18. Use [[noreturn]]
19. Use [[likely]] and [[unlikely]]
20. Use [[assume]]
21. Mark pointers restrict
22. Mark functions as pure
Annotate your code
No redundant copies
23. Take function parameters properly
24. Avoid allocations in loops
25. Avoid copying exceptions
26. Avoid copies in range-for
27. Avoid copies in lambda captures
28. Avoid copies in structured bindings
29. Provide && method overloads
Cache-friendly code
Memory
Is memory a contiguous sequence of bytes?
C++ Standard: NO
Process address space (logical, virtual address space): YES
Virtual address space in the Physical address space: NO
Physical address space: YES
Hardware caching: Not even a sequence...
Virtual memory
Caches
Physical address space
Process address space
C++ memory model
memory page
Page table
Access virtual memory address
Translate to physical address
Get data
Virtual Memory
Physical address space
Process address space
C++ memory model
memory page
Page table
Access virtual memory address
Translate to physical address
Get data
Swap
Disk
Working set
Access virtual memory address
Check page table
Swap page in
Fetch from RAM
RAM
Disk
Translate to physical address
Get data
DATA
in the working set
page fault
, thrashing
Virtual Memory
Memory friendly code
30. Keep the working set size small
Caching
Access virtual memory address
Check TLB
Check cache
Check page table
Swap page in
Fetch from RAM
L1
L2
L3
CPU
RAM
Disk
Translate to physical address
Get data
DATA
in the working set
page fault
, thrashing
High latency
Prefetching
you wanted ar[0]?
well, here's the whole ar
Data locality
Cache line
Caching
you wanted ar[0]?
well, here's the whole ar
Temporal locality
CPU cache
Caching
Processor's
Execution Units
μop Cache
Loopback buffer
L1 Instruction Cache
L1 Data Cache
Register renaming and register files
L2 Cache
L3 Cache
Working set
TLB
CPU
RAM
Core
Page table
Memory
Access virtual memory address
Check TLB
Check cache
Check page table
Swap page in
Fetch from RAM
L1
L2
L3
CPU
RAM
Translate to physical address
Get data
DATA
in the working set
page fault
, thrashing
TLB hit
hit
hit
miss
miss
miss
TLB miss
Cache-friendly code
31. Exploit data locality
std::array
std::vector
std::deque
std::flat_map
std::flat_set
std::list
std::set
std::unordered_set
std::map
std::unordered_map
Cache-friendly code
int matrix[rows][cols];
for (int row = 0; row < rows; ++row)
for (int col = 0; col < cols; ++col)
process(matrix[row][col]);
int matrix[rows][cols];
for (int col = 0; col < cols; ++col)
for (int row = 0; row < rows; ++row)
process(matrix[row][col]);
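The two loop orders above compute the same result but touch memory very differently. A sketch with a flat row-major matrix makes the address pattern explicit; Matrix, sum_row_major and sum_column_major are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat row-major matrix: element (r, c) lives at index r * cols + c.
struct Matrix {
    std::size_t rows, cols;
    std::vector<int> cells;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), cells(r * c, 1) {}
    int at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return cells[r * cols + c];
    }
};

// Inner loop walks adjacent addresses: each fetched cache line is fully used.
long long sum_row_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            total += m.at(r, c);
    return total;
}

// Inner loop jumps m.cols elements per step: for large matrices almost every
// access can land on a different cache line.
long long sum_column_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            total += m.at(r, c);
    return total;
}
```

Timing both functions on a matrix much larger than the last-level cache is a simple way to see the effect on a given machine; the size of the gap depends on the hardware.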
31. Exploit data locality
Cache-friendly code
struct DebugInfo {
std::string name;
time_point creation;
size_t use_cnt;
};
class DescriptorSet {
VkDescriptorSet handle;
DebugInfo debug;
// guaranteed to outlive,
// not dangling
const Device& device;
public:
// ...
};
31. Exploit data locality
device
some c-string
handle
debug
debug.name
debug.name.m_data
debug.name.m_len
debug.creation
debug.use_cnt
device
...
Cache-friendly code
struct DebugInfo {
std::string name;
time_point creation;
size_t use_cnt;
};
class DescriptorSet {
VkDescriptorSet handle;
VkDevice device_raw;
unique_ptr<DebugInfo> debug;
const Device& device;
public:
// ...
};
31. Exploit data locality
handle
debug
device_raw
...
device
Cache-friendly code
32. Exploit temporal locality
Pin thread to a core
Linux | pthread_setaffinity_np |
---|---|
Windows | SetThreadAffinityMask |
macOS | thread_policy_set with thread_affinity_policy_t |
Cache-friendly code
32. Exploit temporal locality
Set priority of the process
Linux, macOS | setpriority |
---|---|
Windows | SetPriorityClass |
Set priority of a thread
Linux | pthread_setschedprio |
---|---|
Windows | SetThreadPriority |
macOS | setThreadPriority (Objective C) |
Cache-friendly code
Contiguous data structures
Data oriented design
SOA vs AOS
Sequential memory access
Entity Component Systems
NUMA architectures
Cache-friendly code
33. Avoid false sharing
int thread1_data{};
int thread2_data{};
std::thread t1{work, std::ref(thread1_data)};
std::thread t2{work, std::ref(thread2_data)};
likely on the same cache line
false sharing
Cache-friendly code
33. Avoid false sharing
alignas(std::hardware_destructive_interference_size) int thread1_data{};
alignas(std::hardware_destructive_interference_size) int thread2_data{};
std::thread t1{work, std::ref(thread1_data)};
std::thread t2{work, std::ref(thread2_data)};
on different cache lines
no dependencies
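The same padding idea can be packaged in a type, which also works for arrays of per-thread data. This sketch assumes a 64-byte cache line (kCacheLine is a hypothetical constant; std::hardware_destructive_interference_size is the portable spelling where the standard library provides it) and only checks the memory layout, leaving the threads out.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64 bytes is the cache-line size on the target machine.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot: alignas rounds both the alignment and the
// size up to a full cache line, so neighbouring slots never share one.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

static_assert(alignof(PaddedCounter) == kCacheLine, "aligned to a cache line");
static_assert(sizeof(PaddedCounter) == kCacheLine, "padded to a full cache line");

// Distance in bytes between two counters.
inline std::ptrdiff_t byte_distance(const PaddedCounter& a, const PaddedCounter& b) {
    return reinterpret_cast<const char*>(&b) - reinterpret_cast<const char*>(&a);
}

inline void padding_selftest() {
    PaddedCounter slots[2]; // one slot per thread
    assert(byte_distance(slots[0], slots[1]) ==
           static_cast<std::ptrdiff_t>(kCacheLine));
}
```

Each thread would then be handed its own slot by reference, as on the slide. The price is memory: every counter now occupies a whole cache line.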
Cache-friendly code
34. Use non temporal stores
CPU
RAM
Cache
regular store
non-temporal store
Effective use of C++
Build pipeline modification
Manual hardware oriented optimisations
Your Performance Todo List
- Enable compiler optimisations
- Set target architecture
- Use fast math
- Disable exceptions and RTTI
- Enable Link Time Optimisation
- Use Unity Builds
- Link statically
- Use Profile Guided Optimisation
- Try different compilers
- Try different standard library implementations
- Keep your tools updated
- Preload your program with a replacement library
- Use binary post processing tools
14. Use constexpr
15. Make variables const
16. Use noexcept
17. Use static for internal linkage
18. Use [[noreturn]]
19. Use [[likely]] and [[unlikely]]
20. Use [[assume]]
21. Mark pointers restrict
22. Mark functions as pure
Annotate your code
No redundant copies
30. Keep the working set small
31. Exploit data locality
32. Exploit temporal locality
33. Avoid false sharing
34. Use non temporal stores
Cache-friendly code
Branch predictor friendly code
23. Take function parameters properly
24. Avoid allocations in loops
25. Avoid copying exceptions
26. Avoid copies in range-for
27. Avoid copies in lambda captures
28. Avoid copies in structured bindings
29. Provide && method overloads
Branch predictor
35. Avoid indirected calls
36. Make branches predictable
37. Use branchless optimisations
38. Use SIMD intrinsics
Your Performance Todo List
Build pipeline modification
1. Enable compiler optimisations
2. Set target architecture
3. Use fast math
4. Disable exceptions and RTTI
5. Enable Link Time Optimisation
6. Use Unity Builds
7. Link statically
8. Use Profile Guided Optimisation
9. Try different compilers
10. Try different standard library implementations
11. Keep your tools updated
12. Preload your program with a replacement library
13. Use binary post processing tools
Effective use of C++
Annotate your code
14. Use constexpr
15. Make variables const
16. Use noexcept
17. Use static for internal linkage
18. Use [[noreturn]]
19. Use [[likely]] and [[unlikely]]
20. Use [[assume]]
21. Mark pointers restrict
22. Mark functions as pure
No redundant copies
23. Take function parameters properly
24. Avoid allocations in loops
25. Avoid copying exceptions
26. Avoid copies in range-for
27. Avoid copies in lambda captures
28. Avoid copies in structured bindings
29. Provide && method overloads
Manual hardware oriented optimisations
Cache-friendly code
30. Keep the working set small
31. Exploit data locality
32. Exploit temporal locality
33. Avoid false sharing
34. Use non temporal stores
Branch predictor friendly code
35. Avoid indirected calls
36. Make branches predictable
37. Use branchless optimisations
38. Use SIMD intrinsics
Thanks and credit where credit's due
In the order of appearance
- Inigo Quilez & Pol Jeremias “Shadertoy” (The background for all slides is procedurally generated in a GPU shader. Without their tools, this wouldn’t be possible)
- Hazel Quantock “Interstellar” shader (Background for animated “Performance” slide)
- Fedor G Pikus “The Art of Writing Efficient Programs” (Inspired my five main performance goals and provided many valuable insights)
- Evan Nemerson “Hedley” (Great library which aggregates different compiler-specific attributes and flags)
- Microsoft Docs (Provides flags for MSVC)
- GCC Documentation (Flags and attributes in GCC)
- Daniel Lemire “Its more complicated…” (insightful post about -march and -mtune)
- John Linford “[…] -march, -mtune and -mcpu” (difference between these flags and how they work on x86 vs on ARM)
- Contributors from StackOverflow “What does fast-math do” (“question in title”)
- Simon Byrne “Beware of fast math”
- LLVM Docs “LTO: Design and implementation”
- Contributors from Wikipedia “Interprocedural optimization”
- CMake Docs “UNITY_BUILD”
- Nicolas Fleury “C++ in Huge AAA Games” CppCon 2014 (unity builds)
- Contributors from StackOverflow “Static linking vs dynamic linking”
- Intel “C++ CCDG&R” (PGO)
- Microsoft & GitHub contributors “mimalloc” (info on preloading)
- Meta & LLVM contributors “BOLT”
- C++ Draft Host "eel.is"
- The ISO C++ Standard Draft
- Matt Godbolt “Compiler Explorer”
- P0847 authors “Deducing This”
- Khronos Group “OpenGL” (for being a great example of how to be a terrible API)
- cppreference.com
- Timur Doumler “P1174: Portable assumptions”
- Ralf “Pointers are complicated” (pointer provenance)
Presentation made using slides.com
Nexa font family by Fontfabric
SVGs made in Pixelmator Pro
“Memory Tape” rendered in Blender
Your Performance Todo List: The Most Important Performance Opportunities and Pitfalls to Remember About
By Jan Bielak
This is the interactive slide deck for my CppCon 2022 talk. Writing efficient programs is hard because it requires a lot of knowledge, experience and strategic thinking. There have been many talks on optimization, and each often addresses a single concept. Achieving a bird’s-eye view of the factors affecting performance often requires many hours of research. To lessen the mental burden of optimizing programs, I have picked out the techniques I believe are most important. During the talk, I will present them in an organized manner and provide practical examples of how they can be applied. I will first discuss what I believe are the main goals efficient programs strive to achieve. Then, I will present the general methods of achieving those goals. Then, for the majority of the talk, we will discuss a few dozen performance opportunities. For each of them, I will explain the underlying mechanism of how the optimisation works. I will avoid bluntly giving guidelines to follow without explanation. Each of the techniques naturally comes with its costs, and those will be discussed as well. I will additionally discuss various performance pitfalls. These are sometimes called “premature pessimisations”, in contrast to the often used term “premature optimizations”. I will show examples of optimizations which do not incur any cost on program readability or ma