Vladislav Shpilevoy
Database C developer at Tarantool. Backend C++ developer at VirtualMinds.
Latency
CPU & Memory Usage
Requests Per Second
Lookup takes time
Allocations affect each other
Thread contention
Avoid the heap when you can: use the stack, store by value
Optimize frequent usages in critical places
malloc() / new
free() / delete
std::vector / map / list / stack / queue ... - they all use the heap
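A tiny illustration of the store-by-value advice above; the Point type and the element count are made up for the example:

#include <vector>

struct Point { double x = 0, y = 0; };

void byPointer()
{
    // One heap allocation per element, plus the vector's own buffer.
    std::vector<Point*> points;
    for (int i = 0; i < 1000; ++i)
        points.push_back(new Point());
    for (Point* p : points)
        delete p;
}

void byValue()
{
    // Only the vector's buffer touches the heap; the elements live in it by value.
    std::vector<Point> points;
    points.reserve(1000);
    for (int i = 0; i < 1000; ++i)
        points.emplace_back();
}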
A big (>1 KB) structure describing a request with all its data
It is allocated on each incoming request
And deleted on completion
Expensive
class Request
{
uint64_t userID;
std::string data;
uint64_t startTime;
std::map<std::string, std::string> fields;
// ...
// many fields so the size is big, > 1KB.
};
void processRequest(const Params& params)
{
    Request* r = new Request();
    r->userID = params.userID;
    r->data = parseData(params);
    r->startTime = timeNow();
    r->fields = parseFields(params);
    // ...
    r->start();
}
void Request::onComplete()
{
    sendResponse();
    delete this;
}
Heap allocation of such big objects is slow
RequestPool theRequestPool;
void processRequest(const Params& params)
{
    Request* r = theRequestPool.take();
    // ...
    r->start();
}
void
Request::onComplete()
{
sendResponse();
theRequestPool.put(this);
}
Do not pay for the heap lookup of a free block of the needed size
Deal with concurrency in a more efficient way than the standard heap does
class RequestPool
{
public:
    Request* take();
    void put(Request* req);

private:
    std::mutex mutex;
    std::stack<Request*> pool;
};

Request* RequestPool::take()
{
    std::unique_lock lock(mutex);
    if (not pool.empty())
    {
        Request* res = pool.top();
        pool.pop();
        return res;
    }
    return new Request();
}

void RequestPool::put(Request* req)
{
    std::unique_lock lock(mutex);
    pool.push(req);
}
A mutex-protected list, stack, or queue
Mutex contention
An STL container will use the heap under the hood
[Diagram: one global pool shared by several per-thread local pools]
The heap is used rarely and in bulk. Eventually it is not used at all
No contention on a single global pool
A limited pool of objects in each thread
As long as they can, threads use their local pools
When a local pool is empty, a batch is taken from the global pool
Full local pools are moved into the global pool for reuse
Step by step:
All pools are empty in the beginning
One thread needs an object. It allocates a whole batch
One object is used. Another object is taken
The second thread also allocates something and does some work
More objects are allocated and used
Now imagine some objects end up in another thread
They are freed there and fill up that thread's local pool
To free the rest, the thread moves a full batch into the global pool. Now it can free more
The threads do some more random work
Thread #2 wants to allocate more. It takes a batch from the global pool
And the work continues
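A minimal sketch of this idea; it is not the exact ThreadPooled<> helper used in the benchmark below, the names, batch size, and flush policy are illustrative, and a real pool would also trim memory. Each thread owns one instance, e.g. a thread_local ThreadLocalPool<Request>:

#include <mutex>
#include <vector>

template<typename T, size_t BatchSize = 64>
class ThreadLocalPool
{
public:
    T* take()
    {
        if (m_local.empty())
            refillFromGlobal();
        if (m_local.empty())
            return new T();            // The heap is touched only here.
        T* res = m_local.back();
        m_local.pop_back();
        return res;
    }

    void put(T* obj)
    {
        m_local.push_back(obj);
        // A full local pool is returned to the global one in bulk.
        if (m_local.size() >= 2 * BatchSize)
            flushToGlobal();
    }

private:
    void refillFromGlobal()
    {
        std::lock_guard<std::mutex> lock(s_globalMutex);
        while (!s_global.empty() && m_local.size() < BatchSize)
        {
            m_local.push_back(s_global.back());
            s_global.pop_back();
        }
    }

    void flushToGlobal()
    {
        std::lock_guard<std::mutex> lock(s_globalMutex);
        while (m_local.size() > BatchSize)
        {
            s_global.push_back(m_local.back());
            m_local.pop_back();
        }
    }

    // Local free list, accessed without any locking.
    std::vector<T*> m_local;
    // Shared overflow pool, touched rarely and only in batches.
    static inline std::mutex s_globalMutex;
    static inline std::vector<T*> s_global;
};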
template<uint32_t PaddingSize>
struct Value
{
    uint8_t myPadding[PaddingSize];
};

template<uint32_t PaddingSize>
struct ValuePooled
    : public Value<PaddingSize>
    , public ThreadPooled<...>
{
};
int main()
{
    run<Value<1>>();
    run<ValuePooled<1>>();
    run<Value<512>>();
    run<ValuePooled<512>>();
    run<Value<1024>>();
    run<ValuePooled<1024>>();
    return 0;
}
template<typename ValueT>
void run()
{
    std::vector<std::thread> threads;
    for (int threadI = 0; threadI < threadCount; ++threadI)
    {
        threads.emplace_back([&]() {
            std::vector<ValueT*> values;
            values.resize(valueCount);
            for (int iterI = 0; iterI < iterCount; ++iterI)
            {
                for (int valI = 0; valI < valueCount; ++valI)
                    values[valI] = new ValueT();
                for (int valI = 0; valI < valueCount; ++valI)
                    deleteRandom(values);
            }
        });
    }
    for (std::thread& t : threads)
        t.join();
}
1 byte:     heap: 8119 ms   pool: 5804 ms
512 bytes:  heap: 14480 ms  pool: 7249 ms
1024 bytes: heap: 14853 ms  pool: 7166 ms
[Diagram: each list node is a separately allocated link pointing at the stored object]
void
std::list::push_back(const T& item)
{
link* l = new link();
l->m_item = item;
l->m_prev = m_tail;
if (m_tail)
m_tail->m_next = l;
m_tail = l;
}
std::list<Object*> objects;
// ... fill with data.
std::list::iterator it = objects.find(...);
print(it->m_any_member);
// Is the same as:
print(it.m_link->m_data->m_any_member);
Heap usage on push/pop can be expensive
Non-intrusive = +1 memory lookup for access
struct MyObject
{
    // other members ...
    MyObject* next;
    MyObject* prev;
};
void intr_list::push(T* item)
{
    item->prev = m_tail;
    if (m_tail)
        m_tail->next = item;
    m_tail = item;
}
intr_list<Object> objects;
// ... fill with data.
intr_list::iterator it = objects.find(...);
print(it->m_any_member);
// Is the same as:
print(it.m_item->m_any_member);
No heap usage by the container at all
Intrusive = direct memory access to the stored object
template<typename T, T* T::*myLink = &T::myNext>
struct ForwardList
{
    void Prepend(T* aItem);
    void Append(T* aItem);
    void Clear();
    T* PopAll();
    T* PopFirst();
    T* GetFirst();
    const T* GetFirst() const;
    T* GetLast();
    const T* GetLast() const;
    bool IsEmpty() const;
    Iterator begin();
    Iterator end();
    ConstIterator begin() const;
    ConstIterator end() const;
    // ... and more.
};
struct MyObject
{
    // ... any data.
    MyObject* myNext;
};
ForwardList<MyObject> list;
list.Append(object);
object = list.PopFirst();
for (MyObject* it : list)
    DoSomething(it);
// ... and more.
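A possible implementation of the core operations behind that interface, using the same member-pointer template parameter; a sketch only, the real ForwardList above has many more methods:

template<typename T, T* T::*myLink = &T::myNext>
struct ForwardList
{
    void Append(T* aItem)
    {
        aItem->*myLink = nullptr;
        if (myLast != nullptr)
            myLast->*myLink = aItem;
        else
            myFirst = aItem;
        myLast = aItem;
    }

    T* PopFirst()
    {
        T* res = myFirst;
        if (res == nullptr)
            return nullptr;
        myFirst = res->*myLink;
        if (myFirst == nullptr)
            myLast = nullptr;
        return res;
    }

    bool IsEmpty() const { return myFirst == nullptr; }

private:
    // No heap usage: the list only stores two raw pointers,
    // and the links live inside the objects themselves.
    T* myFirst = nullptr;
    T* myLast = nullptr;
};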
template<typename T, template<typename> class List>
void run()
{
    std::vector<T> items = createItems();
    List<T> list;

    time startTime = now();
    for (T& i : items)
        list.push_back(i);
    time createTime = now() - startTime;

    startTime = now();
    for (T& t : list)
        touch(t);
    time walkTime = now() - startTime;
}
int main()
{
    // List of pointers.
    run<Item*, std::list>();
    // Intrusive list (always pointers).
    run<Item, intr_list>();
}
Test list population speed
Test list iteration speed
Both lists store pre-allocated objects by pointers. No copying
Population: std: 967681 us   intr: 337408 us
Iteration:  std: 920796 us   intr: 860463 us
struct object
{
    uint64_t input = 0;
    uint64_t output = 0;
};
int main()
{
    object obj;
    time startTime = now();
    std::thread inputThread(inputF, std::ref(obj), target);
    std::thread outputThread(outputF, std::ref(obj), target);
    inputThread.join();
    outputThread.join();
    print(now() - startTime);
    return 0;
}
static void inputF(object &obj, uint64_t target)
{
while (++obj.input < target)
continue;
}
static void outputF(object &obj, uint64_t target)
{
    while (++obj.output < target)
        continue;
}
10078861 us
struct object
{
    uint64_t input = 0;
    char padding[64];
    uint64_t output = 0;
};
Without padding: 10078861 us
With padding: 1922510 us
Memory access always goes through the CPU cache, which acts like a proxy
The cache stores a small subset of the main memory in the form of "cache lines"
When more than one core references the same address, their caches need to sync on it
[Diagram: uint64_t input and uint64_t output share one cache line]
Logically unrelated data, but they intersect in hardware
[Diagram: padding pushes uint64_t output into its own cache line]
Split the data into separate cache lines
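An alternative to manual byte padding is alignas; a sketch assuming a 64-byte cache line, which is typical for x86-64:

#include <cstdint>

struct object
{
    // Each counter starts on its own 64-byte boundary, so the two never
    // share a cache line. C++17 also offers
    // std::hardware_destructive_interference_size in <new> where implemented.
    alignas(64) uint64_t input = 0;
    alignas(64) uint64_t output = 0;
};

static_assert(sizeof(object) >= 128, "counters land on separate cache lines");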
std::atomic<...>
__sync_...
__atomic_...
relaxed, consume, acquire, release, acq_rel, seq_cst
Initially: a, b, c = 0

Thread 1:              Thread 2:
a = 1                  print("c: ", c)
b = 2                  print("b: ", b)
c = 3                  print("a: ", a)

Possible outputs:
c: 3  b: 2  a: 1
c: 0  b: 2  a: 1
c: 3  b: 0  a: 0
c: 0  b: 2  a: 0
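If the writer releases c last and the reader acquires c first, then seeing c == 3 guarantees that a and b are visible too, so an output like "c: 3 b: 0 a: 0" becomes impossible. A sketch of that pairing, with the slide's variables made atomic (everything else is illustrative):

#include <atomic>
#include <cstdio>

std::atomic<int> a{0}, b{0}, c{0};

void writer()
{
    a.store(1, std::memory_order_relaxed);
    b.store(2, std::memory_order_relaxed);
    c.store(3, std::memory_order_release);   // Publishes a and b.
}

void reader()
{
    if (c.load(std::memory_order_acquire) == 3)
    {
        // Guaranteed to print b: 2 a: 1 here.
        std::printf("b: %d a: %d\n",
                    b.load(std::memory_order_relaxed),
                    a.load(std::memory_order_relaxed));
    }
}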
uint64_t run(std::atomic_uint64_t& value)
{
    for (uint64_t i = 0; i < targetValue; ++i)
        value.store(1, MEMORY_ORDER);
    return value.load(std::memory_order_relaxed);
}

int main()
{
    std::atomic_uint64_t value(0);
    time startTime = now();
    run(value);
    print(now() - startTime);
    return 0;
}
#define MEMORY_ORDER \
std::memory_order_relaxed
#define MEMORY_ORDER \
std::memory_order_seq_cst
relaxed: 269647 us    seq_cst: 4511327 us
memory_order_relaxed - the store is a plain mov:
.L2:
        mov     QWORD PTR [rdi], 1
        sub     rax, 1
        jne     .L2
        mov     rax, QWORD PTR [rdi]
        ret

memory_order_seq_cst - the store becomes an implicitly locked xchg:
.L6:
        mov     rcx, rdx
        xchg    rcx, QWORD PTR [rdi]
        sub     rax, 1
        jne     .L6
        mov     rax, QWORD PTR [rdi]
        ret
For fun have a look at the asm on x86-64 clang 17.0.1
class Queue
{
    std::mutex mutex;
    // std::queue / std::vector / std::list / std::stack / std::deque / ...
    std::queue<...> data;
};
Contention on the mutex
Queue flavors:
Single-Producer-Single-Consumer
Single-Producer-Multi-Consumer
Multi-Producer-Single-Consumer
Multi-Producer-Multi-Consumer

Throughput comparison:
Single-Producer-Single-Consumer: 9 mln/sec
Multi-Producer-Single-Consumer: 5.8 mln/sec
Single-Producer-Multi-Consumer: 2.5 mln/sec
Multi-Producer-Multi-Consumer: 1.7 mln/sec
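A minimal sketch of the simplest flavor, a bounded Single-Producer-Single-Consumer ring buffer; the names are illustrative, the capacity is required to be a power of two, and a real implementation would also pad the indices onto separate cache lines:

#include <atomic>
#include <cstddef>

template<typename T, size_t Capacity>
class SpscQueue
{
public:
    bool push(const T& item)        // Called only by the producer thread.
    {
        size_t head = m_head.load(std::memory_order_relaxed);
        size_t tail = m_tail.load(std::memory_order_acquire);
        if (head - tail == Capacity)
            return false;           // Full.
        m_buf[head % Capacity] = item;
        m_head.store(head + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& out)                // Called only by the consumer thread.
    {
        size_t tail = m_tail.load(std::memory_order_relaxed);
        size_t head = m_head.load(std::memory_order_acquire);
        if (head == tail)
            return false;           // Empty.
        out = m_buf[tail % Capacity];
        m_tail.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    T m_buf[Capacity];
    std::atomic<size_t> m_head{0};  // Written only by the producer.
    std::atomic<size_t> m_tail{0};  // Written only by the consumer.
};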
A list of buffers to send
The simplest way to send them
struct Message
{
    void* data;
    size_t size;
};
void sendAll(int sock, const vector<Message>& msgs)
{
for (const Message& m : msgs)
send(sock, m.data, m.size);
}
2.52 GB/sec
void sendAll(int sock, const vector<Message>& msgs)
{
struct iovec vecs[limit];
size_t count = min(limit, msgs.size());
for (size_t i = 0; i < count; ++i)
{
vecs[i].iov_base = msgs[i].data;
vecs[i].iov_len = msgs[i].size;
}
struct msghdr msg = {0};
msg.msg_iov = vecs;
msg.msg_iovlen = count;
sendmsg(sock, &msg, 0);
}
6.08 GB/sec
ssize_t sendmsg( int sockfd, const struct msghdr *msg, int flags);
struct msghdr { struct iovec *msg_iov; size_t msg_iovlen; ... };
ssize_t recvmsg( int sockfd, struct msghdr *msg, int flags);
The iovec count must be > 1 for this to make sense
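The receive side works the same way; a sketch that fills two separate buffers with one recvmsg() call (the helper name and the buffer split are illustrative):

#include <sys/socket.h>
#include <sys/uio.h>
#include <cstring>

ssize_t recvHeaderAndBody(int sock, void* header, size_t headerSize,
                          void* body, size_t bodySize)
{
    struct iovec vecs[2];
    vecs[0].iov_base = header;
    vecs[0].iov_len = headerSize;
    vecs[1].iov_base = body;
    vecs[1].iov_len = bodySize;

    struct msghdr msg;
    std::memset(&msg, 0, sizeof(msg));
    msg.msg_iov = vecs;
    msg.msg_iovlen = 2;
    // One system call fills both buffers in order.
    return recvmsg(sock, &msg, 0);
}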
These are bad:
ssize_t readv( int fd, const struct iovec *iov, int iovcnt);
ssize_t writev( int fd, const struct iovec *iov, int iovcnt);
They spoil the statistics in /proc/self/io
[Diagram: timelines of read (r) and write (w) socket operations under Periodic Polling, Reactive Polling, and an Event Queue]
void
Client::onOutput()
{
if (m_out_size == 0)
return;
ssize_t rc = send(m_sock,
    m_out_buf, m_out_size, 0);
handleOutput(rc);
}
void Server::run()
{
    while (true)
    {
        int s = accept(m_sock, nullptr, nullptr);
        if (s >= 0)
            addClient(s);
        for (Client* c : m_clients)
        {
            c->onInput();
            c->onOutput();
        }
        usleep(100'000); // 100ms
    }
}
void
Client::onInput()
{
ssize_t rc = recv(m_sock,
    m_in_buf, m_in_size, 0);
handleInput(rc);
}
Work with every socket in a loop with a period
Additional latency
Increased CPU usage for EWOULDBLOCK system calls
int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);

void FD_CLR(int fd, fd_set *set);
int FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
Deprecated*. Crash on descriptors with value > 1024.
* Not on Windows
** FD_SETSIZE on Mac
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

struct pollfd
{
    int fd;
    short events;
    short revents;
};
Usable, fast on small descriptor count
Takes file descriptors with events to listen for
Returns the happened events here
void Server::run()
{
    while (true)
    {
        int rc = poll(m_fds, m_fd_count, -1);
        for (int i = 0; i < m_fd_count; ++i)
        {
            pollfd& fd = m_fds[i];
            if (isServer(fd))
            {
                if (fd.revents & POLLIN)
                    tryAccept();
                continue;
            }
            Client* c = m_clients[i];
            if (fd.revents & POLLIN)
                c->onInput();
            if (fd.revents & POLLOUT)
                c->onOutput();
        }
    }
}
Wait for events on any of those descriptors
Perform only the work which was signaled with an event
No sleeps
No parasite system calls
int epoll_create1(int flags);

int epoll_ctl(int epfd, int op, int fd,
              struct epoll_event *event);

int epoll_wait(int epfd, struct epoll_event *events,
               int maxevents, int timeout);

struct epoll_event
{
    uint32_t events;
    epoll_data_t data;
};
int kqueue(void);

int kevent(int kq,
           const struct kevent *changelist, int nchanges,
           struct kevent *eventlist, int nevents,
           const struct timespec *timeout);

EV_SET(kev, ident, filter, flags, fflags, data, udata);
Sockets are tracked inside of the kernel
Need to add them once
Fetch events just for the sockets which are signaled
No need to pass all sockets into the kernel each time when fetching events
void Server::run()
{
    epoll_event events[maxEvents];
    while (true)
    {
        int count = epoll_wait(m_epoll, events, maxEvents, -1);
        for (int i = 0; i < count; ++i)
        {
            epoll_event& e = events[i];
            if (e.data.fd == m_server_sock)
            {
                if (e.events & EPOLLIN)
                    tryAccept();
                continue;
            }
            Client* c = (Client*)e.data.ptr;
            if (e.events & EPOLLIN)
                c->onInput();
            if (e.events & EPOLLOUT)
                c->onOutput();
        }
    }
}
Wait for events on sockets in this event-queue
Process only the events. No fullscan of all sockets
Perform only the work which was signaled with an event
No sleeps
No parasite system calls
No socket array fullscan
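The loop above assumes the sockets were added to the epoll instance beforehand; a minimal Linux-only registration sketch with illustrative helper names:

#include <sys/epoll.h>

struct Client;  // Defined elsewhere in the server.

// Create the event queue once, at server start.
int createEpoll()
{
    return epoll_create1(0);
}

// Register a client socket once; the same pointer comes back in
// event.data.ptr every time epoll_wait() reports this socket.
bool addClient(int epollFd, int sock, Client* c)
{
    struct epoll_event ev = {};
    ev.events = EPOLLIN | EPOLLOUT;
    ev.data.ptr = c;
    return epoll_ctl(epollFd, EPOLL_CTL_ADD, sock, &ev) == 0;
}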
void Queue::postFast(Task *t)
{
std::unique_lock lock(m_mutex);
bool wasEmpty = m_queue.empty();
m_queue.push_back(t);
if (wasEmpty)
m_cond.notify_all();
}
void Queue::postSlow(Task *t)
{
std::unique_lock lock(m_mutex);
m_queue.push_back(t);
m_cond.notify_all();
}
Task *Queue::pop()
{
    std::unique_lock lock(m_mutex);
    while (m_queue.empty())
        m_cond.wait(lock);
    Task *t = m_queue.front();
    m_queue.pop_front();
    return t;
}
When the queue is highly loaded, there will be fewer system calls on the condvar
template<typename T, uint Size>
class Array
{
private:
    uint m_size;
    T* m_dynamic;
    T m_static[Size];
};

void Array::append(const T& obj)
{
    if (m_size < Size)
    {
        m_static[m_size++] = obj;
        return;
    }
    makeDynamic();
    m_dynamic[m_size++] = obj;
}
Better cache locality if the containers grow beyond the static size rarely
Less heap usage and fragmentation
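Ready-made containers of this kind exist, e.g. boost::container::small_vector, absl::InlinedVector, or llvm::SmallVector, so hand-rolling is rarely necessary. A small usage sketch with Boost; the inline capacity of 16 is an arbitrary choice:

#include <boost/container/small_vector.hpp>

void collect()
{
    // The first 16 elements live inside the object itself; the heap is
    // touched only if the vector grows past that.
    boost::container::small_vector<int, 16> values;
    for (int i = 0; i < 10; ++i)
        values.push_back(i);   // No heap allocation at this size.
}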
struct MyStructSlow
{
    void inc()
    {
        std::unique_lock lock(m_mutex);
        ++m_count;
    }

    uint64_t get() const
    {
        std::unique_lock lock(m_mutex);
        return m_count;
    }

    mutable std::mutex m_mutex;
    uint64_t m_count;
};
struct MyStructFast
{
    void inc()
    {
        m_count.fetch_add(1, std::memory_order_relaxed);
    }

    uint64_t get() const
    {
        return m_count.load(std::memory_order_relaxed);
    }

    std::atomic_uint64_t m_count;
};
Smaller type size
More parallelism in case of contention
void ServerSlow::accept()
{
    // ...
    int sock = accept(m_server, nullptr, nullptr);
    onClient(sock);
    // ...
}

void ServerSlow::onClient(int sock)
{
    // ...
    Client *c = new Client(sock);
    socklen_t len = sizeof(c->m_peerAddr);
    getpeername(sock, &c->m_peerAddr, &len);
    // ...
}
getsockname() is simply useless most of the time
accept() returns the peer address via out params
void ServerFast::accept()
{
    // ...
    sockaddr_storage addr;
    socklen_t len = sizeof(addr);
    int sock = accept(m_server, (sockaddr*)&addr, &len);
    onClient(sock, (sockaddr*)&addr);
    // ...
}

void ServerFast::onClient(int sock, const sockaddr *addr)
{
    // ...
    Client *c = new Client(sock, addr);
    // ...
}
Dynamic allocation with the speed of the thread's stack
void Server::acceptFast()
{
int sock = accept4(m_server, &addr,
&len, SOCK_NONBLOCK);
}
void Client::createFast()
{
int sock = socket(domain,
type | SOCK_NONBLOCK, prot);
}
-2 system calls per socket; can give an actual perf bump under a high-RPS one-connection-one-request load
*Linux only
void Server::acceptSlow()
{
int sock = accept(m_server, &addr, &len);
int flags = fcntl(sock, F_GETFL, 0);
fcntl(sock, F_SETFL, flags | O_NONBLOCK);
}
void Client::createSlow()
{
int sock = socket(domain, type, prot);
int flags = fcntl(sock, F_GETFL, 0);
fcntl(sock, F_SETFL, flags | O_NONBLOCK);
}
void Client::connectStart()
{
    m_sock = socket(SOCK_NONBLOCK);
    int rc = connect(m_sock, host);
    if (rc == 0)
    {
        m_state = CONNECTED;
        return;
    }
    if (errno == EINPROGRESS)
    {
        m_state = CONNECTING;
        return;
    }
    m_state = ERROR;
}
void Client::connectUpdate()
{
    pollfd fd = { .fd = m_sock, .events = POLLOUT };
    if (poll(&fd, 1, 0) != 1)
        return;
    int err = 0;
    socklen_t len = sizeof(err);
    int rc = getsockopt(m_sock, SOL_SOCKET, SO_ERROR, &err, &len);
    if (rc != 0 || err != 0)
    {
        m_state = ERROR;
        return;
    }
    if (fd.revents & POLLOUT)
        m_state = CONNECTED;
}
Can connect thousands of clients asynchronously, even in the same epoll as the main IO loop
1 - Start
2 - Instant complete?
3 - Started async
4 - Becomes writable when done
5 - getsockopt() to get error
void Server::process()
{
    while (notAllDone())
    {
        for (Client *c : m_clients)
            c->process();
    }
}
void Server::process()
{
    uint yield = 0;
    while (notAllDone())
    {
        for (Client *c : m_clients)
            c->process();
        if (++yield % 1024 == 0)
            usleep(1000);
    }
}
Free the CPU core for other threads sometimes - it can boost total perf
The CPU core time is wasted. Can hurt perf in other threads a lot when core count < thread count
Especially hurtful when the busy-loop thread waits on something done in the other threads, but also steals and burns their time
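A portable variant of the same yield trick, reusing the Server and Client names from the code above; the sleep period is arbitrary, and std::this_thread works where usleep() is unavailable:

#include <chrono>
#include <thread>

void Server::process()
{
    uint64_t yield = 0;
    while (notAllDone())
    {
        for (Client* c : m_clients)
            c->process();
        // Give the core away periodically instead of spinning non-stop.
        if (++yield % 1024 == 0)
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}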
By Vladislav Shpilevoy
I categorize the primary sources of code performance degradation into three groups:
- Thread contention. For instance, too-hot mutexes, overly strict ordering in lock-free operations, and false sharing.
- Heap utilization. Loss is often caused by frequent allocation and deallocation of large objects, and by the absence of intrusive containers at hand.
- Network IO. Socket reads and writes are expensive because they are system calls. They can also block the thread for a long time, which leads to hacks like adding tens or hundreds more threads. Such measures intensify contention as well as CPU and memory usage, while neglecting the underlying issue.
I present a series of concise and straightforward low-level recipes for gaining performance via code optimizations. While often requiring just a handful of changes, the proposals can amplify performance N-fold. The suggestions target the mentioned bottlenecks caused by certain typical mistakes. The proposed optimizations might make architectural changes unnecessary, or even allow simplifying the setup if the existing servers start coping with the load effortlessly. As a side effect, the changes can make the code cleaner and reveal more bottlenecks to investigate.