Kevin Song
I'm a student at UT (that's the one in Austin) who studies things.
Multiple noncommunicating managers are an awful idea everywhere
std::unique_ptr: the manager that disallows other managers
std::shared_ptr: the manager that talks to other managers
Both classes allow us to manage the lifetimes of objects using RAII
template <typename T>
class unique_ptr {
    T* ptr;
    ...
    unique_ptr(const unique_ptr& other) = delete;
    unique_ptr& operator=(const unique_ptr& other) = delete;
};
[Diagram: a shared_ptr points to its data and to a separate control block holding the reference count, here 1]
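A minimal sketch of that bookkeeping (the example itself is mine, not from the slides; the printed counts come from use_count()): copying a shared_ptr talks to the control block and bumps the count, and the data is freed only when the count reaches zero.

#include <iostream>
#include <memory>

int main(){
    // One control block is created with the data; its count starts at 1.
    std::shared_ptr<int> p = std::make_shared<int>(42);
    std::cout << p.use_count() << '\n';     // 1

    {
        std::shared_ptr<int> q = p;         // Copying increments the shared count...
        std::cout << p.use_count() << '\n'; // ...so this prints 2.
    }                                       // q destroyed: count drops back to 1.

    std::cout << p.use_count() << '\n';     // 1
}                                           // Count hits 0 here: the int is freed.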
We can pass a forwarding reference along with its original value category by using std::forward
template<typename T>
T make_T(T&& arg){
    T(arg); // Calls copy constructor ONLY
    T(std::forward<T>(arg)); // Calls move OR copy constructor
}
T&& in a type-deduced context forms a forwarding reference, which preserves the value category of whatever it was initialized with.
int main(){
    int x = 3;
    auto&& rvalue_reference = 3; // 3 is an rvalue: this binds as an int&&
    auto&& lvalue_reference = x; // x is an lvalue: this binds as an int&
}
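A sketch of why preserving the value category matters (the sink/wrapper names are mine, purely for illustration): a wrapper that forwards its argument with std::forward lets the rvalue overload be chosen for temporaries and the lvalue overload for named variables.

#include <iostream>
#include <utility>

void sink(int&)  { std::cout << "lvalue\n"; }
void sink(int&&) { std::cout << "rvalue\n"; }

template<typename T>
void wrapper(T&& arg){
    sink(arg);                  // arg has a name, so this always picks the lvalue overload
    sink(std::forward<T>(arg)); // forwards the original value category
}

int main(){
    int x = 3;
    wrapper(x); // prints: lvalue, lvalue
    wrapper(3); // prints: lvalue, rvalue
}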
Zero-cost abstractions
Even abstractions as basic as function calls and loops can add significant amounts of overhead!
So much so that we have optimization techniques (inlining and unrolling) to deal with the overheads of these abstractions!
So what the heck does it mean when we say that C++ offers zero-cost abstractions?
What you don’t use, you don’t pay for. And further: What you do use, you couldn’t hand code any better.
-- Bjarne Stroustrup
In practice, something we strive for more than something we accomplish.
a = [1, "1", 1.0]
Python has heterogeneous collections: you can add objects of different types into e.g. a list.
How convenient!
a = [1, "1", 1.0]
Since we can put any object into a list, we can put different-sized objects into a list.
How do we store them?
a = [1, "1", 1.0]
One option: store the variable-sized elements directly in-line, one after another. Finding element \(i\) then means walking past every element before it, so indexing costs \(O(N)\).
Implication: simple for loops are now \(O(N^2)\)!
a = [1, "1", 1.0]
The other option: store a same-sized pointer to each element instead (see the sketch below). Indexing is \(O(1)\) again, but every access chases a pointer and every element needs its own heap allocation.
Any container storing heterogeneous elements must either pay an \( O(N) \) indexing cost or store elements indirectly.
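To make the indirect option concrete, here is a hedged C++ sketch (the Boxed name and the choice of variant/unique_ptr are mine; CPython's lists actually store arrays of PyObject pointers): the container holds same-sized pointers, so indexing stays \(O(1)\), but every element is a separate allocation reached through a pointer chase.

#include <iostream>
#include <memory>
#include <string>
#include <variant>
#include <vector>

// Illustrative only: each element lives in its own heap allocation and can hold
// one of several types; the vector itself stores same-sized pointers.
using Boxed = std::variant<int, std::string, double>;

int main(){
    std::vector<std::unique_ptr<Boxed>> a;
    a.push_back(std::make_unique<Boxed>(1));
    a.push_back(std::make_unique<Boxed>(std::string("1")));
    a.push_back(std::make_unique<Boxed>(1.0));

    // Indexing is O(1) -- every slot is pointer-sized -- but reading an element
    // costs an extra pointer dereference.
    std::cout << std::get<int>(*a[0]) << '\n';
}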
a = [1, 2, 3]
Can I just place the elements in-line like I would for a homogeneous array?
a = [1, 2, 3, "4"]
Can I just place the elements in-line like I would for a homogeneous array?
As long as it's possible to add an element of a different type, the representation has to be able to handle it.
a = [1, 1, 1]
This has very real runtime implications for Python!
Let's add 1 to a bunch of numbers to test it out...
#include <vector>
#include <iostream>

int main(){
    std::vector<int> a(100'000'000, 1);
    for(auto& x : a){
        x++;
    }
}
def main():
    x = [1] * 100_000_000
    list(map(lambda z: z+1, x))

if __name__ == "__main__":
    main()
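Neither slide shows the timing harness; as a sketch, one way to time the C++ loop is to wrap it in std::chrono timers (the reporting format below is my own):

#include <chrono>
#include <iostream>
#include <vector>

int main(){
    std::vector<int> a(100'000'000, 1);

    auto start = std::chrono::steady_clock::now();
    for(auto& x : a){
        x++;
    }
    auto stop = std::chrono::steady_clock::now();

    // Elapsed time for the increment loop, in milliseconds.
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
}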
#include <fstream>
#include <string>

int* read_unknown_data(const std::string& filename){
    std::ifstream inf(filename);
    int num_elements;
    inf >> num_elements;
    int* data = new int[num_elements]; // Ugh, when can I delete this?
    return data;
}
Recall: Stack allocation/deallocation is lightning fast, but size of object needs to be known at compile-time.
Unknown-size objects have to be allocated on the heap and cleaned up later.
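For comparison, a sketch of the same function with ownership handed to a standard container (the _owned suffix is my own naming), so "when can I delete this?" stops being our problem:

#include <fstream>
#include <string>
#include <vector>

std::vector<int> read_unknown_data_owned(const std::string& filename){
    std::ifstream inf(filename);
    int num_elements;
    inf >> num_elements;

    std::vector<int> data(num_elements); // heap allocation owned by the vector
    for(int& x : data){
        inf >> x;
    }
    return data; // freed automatically when the caller is done with it
}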
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class Program {
    public int[] readInput(String fname) throws FileNotFoundException {
        Scanner scanner = new Scanner(new File(fname));
        int[] vals = new int[scanner.nextInt()]; // Java will take care of it!
        for(int i = 0; i < vals.length; i++){
            vals[i] = scanner.nextInt();
        }
        return vals;
    }
}
Garbage collection makes this fun and easy, at the cost of a little extra time to run the GC algorithm!
But there's a small cost for every single value we use, and that can add up.
Can I tell the garbage collector to only worry about certain data in my heap?
I can manually force garbage collection to occur, but I can't tell the language "trust me, drop that memory now" without violating gc safety requirements.
A lightweight threading system which relies on the language runtime to schedule threads instead of the OS. Found in Go (goroutines), among other languages.
package main

import (
	"fmt"
	"time"
)

func f(from string) {
	for i := 0; i < 3; i++ {
		fmt.Println(from, ":", i)
	}
}

func main() {
	go f("goroutine")
	go func(msg string) {
		fmt.Println(msg)
	}("going")
	// Without this wait, main would exit before the goroutines get to run.
	time.Sleep(time.Second)
}
goroutine : 0
going
goroutine : 1
goroutine : 2
example from https://gobyexample.com/goroutines
Threads managed by Go, not by the OS!
Can I tell my programming language that I don't want to use green threads?
Can I remove the code dealing with threads from the language runtime to reduce the complexity/binary size of the language?
Just don't use green threads.
Use your intuition to ask questions, not answer them
--John Ousterhout, inventor of Tcl
Software engineers are full of horror stories of someone who spent weeks optimizing the {memory access, program logic, parallelization} only to realize that they didn't actually make the program run any faster!
Use your intuition to ask questions, not answer them
--John Ousterhout, inventor of Tcl
We can look at some code and say "it looks like this should be slower"...
...but we must back this up with measurements showing it to be the case!
std::unique_ptr
Do we have to pay a cost for unique_ptr if we don't use it?
Runtime cost: No.
Compile-time cost: No: no template is instantiated, so there's no extra stuff to compile.
Code complexity cost: No: unique_ptr (in its simplest form) just wraps new and delete.
If we use unique_ptr instead of handwritten allocation and deallocation, do we lose any performance?
#include <memory>
#include <iostream>

int main(){
    int* databuf = new int[10000];
    int* rp = databuf;
    std::unique_ptr<int[]> up(databuf); // array form, so delete[] is used

    std::cout << "Size of unique ptr = " << sizeof(up) << '\n';
    std::cout << "Size of raw ptr = " << sizeof(rp) << '\n';
}
❯ ./a.out
Size of unique ptr = 8
Size of raw ptr = 8
Apparently not.
Allocate + Deallocate 100,000,000 ints using smart pointers and raw pointers
Less than a 0.1% difference in a program that spends all of its time doing allocation!
If your program spends 5% of its time doing allocation, that shrinks to a 0.005% difference
No extra cost if unused
Just as fast as handwritten code
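A sketch of the kind of allocate/deallocate benchmark described above (the original harness isn't shown; the iteration structure and reporting here are my own, and an optimizer may elide some of these allocations, so a real measurement needs more care):

#include <chrono>
#include <iostream>
#include <memory>

// Time a callable and return elapsed milliseconds.
template <typename F>
long long time_ms(F f){
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

int main(){
    constexpr int N = 100'000'000;

    auto raw = time_ms([&]{
        for(int i = 0; i < N; i++){
            int* p = new int(i); // handwritten allocate...
            delete p;            // ...and deallocate
        }
    });

    auto smart = time_ms([&]{
        for(int i = 0; i < N; i++){
            auto p = std::make_unique<int>(i); // freed automatically each iteration
        }
    });

    std::cout << "raw:   " << raw   << " ms\n"
              << "smart: " << smart << " ms\n";
}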
std::vector
Do we have to pay a cost for vector if we don't use it?
Runtime cost: No.
Compile-time cost: No: no template is instantiated, so there's no extra stuff to compile.
Code complexity cost: Minimal: the code for the abstraction has to exist, but we don't have to put it into our program.
Do we need to pay the indirect storage cost like we do in Python?
After all, we can store any type of data in a C++ vector...
std::vector<int>
std::vector<float>
std::vector<Cow>
Since any given vector only stores one type of data, we can store all of them in-line in the heap!
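A small sketch of what "in-line" means here (the example itself is mine): the elements sit back-to-back in one heap buffer, so neighbouring elements are exactly one int apart.

#include <iostream>
#include <vector>

int main(){
    std::vector<int> a = {1, 2, 3};

    // All three ints live contiguously in a single heap allocation,
    // so pointer distances between neighbours are exactly one element.
    std::cout << (&a[1] - &a[0]) << '\n'; // 1
    std::cout << (&a[2] - &a[0]) << '\n'; // 2
}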
std::vector<int> a = {1, 2, 3};
Element access in a raw array is a simple procedure:
In principle, access in a vector is harder:
_Z8access_NRSt6vectorIiSaIiEEi:
.LFB853:
.cfi_startproc
movq (%rdi), %rax
movslq %esi, %rsi
movl (%rax,%rsi,4), %eax
ret
.cfi_endproc
_Z8access_NPPii:
.LFB854:
.cfi_startproc
movq (%rdi), %rax
movslq %esi, %rsi
movl (%rax,%rsi,4), %eax
ret
.cfi_endproc
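The mangled names above demangle to access_N(std::vector<int>&, int) and access_N(int**, int); a reconstruction of source that produces this kind of identical assembly under optimization (my sketch, e.g. with g++ -O2) might look like:

#include <vector>

// Both functions load the data pointer and then index into it,
// which is why the optimized assembly comes out the same.
int access_N(std::vector<int>& v, int i){
    return v[i];
}

int access_N(int** p, int i){
    return (*p)[i];
}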
How do I make a copy of a vector in C++? (Without using the built-in copy constructor)
template <typename T>
std::vector<T> makeCopy(const std::vector<T>& in){
    std::vector<T> out(in.size());
    for(std::size_t i = 0; i < in.size(); i++){
        out[i] = in[i];
    }
    return out;
}
Time for 10 million elements:
97 milliseconds
std::vector<int> makeCopySTL(const std::vector<int>& a) {
    std::vector<int> x = a;
    return x;
}
Time for 10 million elements:
13 milliseconds
Most modern processors provide vector units which can load/store multiple addresses at once.
[Chart: loads/stores per CPU cycle with AVX2 vector units vs. ordinary loads/stores]
But there's a catch: the memory block has to be aligned to a multiple of 16! Most memory allocated in a program doesn't have this property!
The vector units require that memory is aligned (e.g. the beginning address is a multiple of 16). An actual data buffer may not have this property.
Orange blocks can be copied via aligned vector instructions--red blocks cannot.
Should we give up on using vector instructions because some of the memory is ineligible?
Copy orange blocks via vector instructions, then clean up red blocks with manual copying!
Determining which blocks can be copied and which cannot at runtime is a little annoying....
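As a sketch of that bookkeeping (illustration only, not the actual libstdc++ or compiler-generated logic; the function name and block size are my own choices): split the destination into an unaligned head, a 16-byte-aligned middle that vector instructions could handle, and an unaligned tail.

#include <cstddef>
#include <cstdint>

// Illustration only: partition [0, n) into a "red" head, an "orange" aligned
// middle (whole 16-byte blocks, i.e. 4 ints), and a "red" tail.
void copy_with_alignment(int* dst, const int* src, std::size_t n){
    auto addr = reinterpret_cast<std::uintptr_t>(dst);
    // Number of elements to copy before dst reaches a 16-byte boundary.
    std::size_t head = ((16 - addr % 16) % 16) / sizeof(int);
    if(head > n) head = n;
    std::size_t middle = (n - head) / 4 * 4;

    for(std::size_t i = 0; i < head; i++){             // unaligned prologue
        dst[i] = src[i];
    }
    // Aligned middle: a real implementation would use vector loads/stores here.
    for(std::size_t i = head; i < head + middle; i++){
        dst[i] = src[i];
    }
    for(std::size_t i = head + middle; i < n; i++){    // unaligned epilogue
        dst[i] = src[i];
    }
}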
_Z11makeCopySTLRKSt6vectorIiSaIiEE:
.LFB853:
.cfi_startproc
;; blah blah blah
.L3:
; Associated code does "patch-up" on red blocks which can't be copied by aligned memmove
movq %rcx, %xmm0
addq %rcx, %rbx
punpcklqdq %xmm0, %xmm0
movq %rbx, 16(%r12)
movups %xmm0, (%r12)
movq 8(%rbp), %rax
movq 0(%rbp), %rsi
movq %rax, %rbx
subq %rsi, %rbx
cmpq %rsi, %rax
je .L6
movq %rcx, %rdi
movq %rbx, %rdx
call memmove@PLT ; Does accelerated memory copies using vector instructions!
movq %rax, %rcx
Reality is much more complicated than presented here--sometimes both your source and destination need to be aligned!
...maybe you could. I probably couldn't.
Zero-cost abstractions broadly try to fulfill two goals: pay nothing for abstractions you don't use, and pay no more than handwritten code for the ones you do use.
Some abstractions that don't qualify:
Heterogeneous collections: force layout changes in the data structure that affect all usages of that collection.
Garbage collection: all data must be collected by the GC, even if we know exactly when it can be safely collected.
Green threads: we can opt not to use them, but the implementation must remain in the language runtime.
std::vector: uses compile-time type information to make operations just as fast as handwritten counterparts; its copy scheme is faster than naive handwritten code!
std::unique_ptr: uses no more memory and is essentially no slower than a raw pointer.