First Aid Kit for C/C++ server performance

Vladislav Shpilevoy

FOSDEM'24

Plan

Target metrics

Latency

CPU & Memory Usage

Requests Per Second

Issue Categories

Category: Heap

Lookup takes time

Allocations affect each other

Thread contention

Avoid the heap when you can: use the stack, store by value

Optimize frequent usages in critical places

malloc()
new
free()
delete

std::vector / map / list / stack / queue ... - all use the heap

Category: Heap

Topic: Object Pooling

Usecase

Category: Heap

A big (>1 KB) structure describing a request with all its data

It is allocated on each incoming request

And deleted on completion

    - Object Pooling

Expensive

Example Code

Category: Heap

class Request
{
    uint64_t userID;
    std::string data;
    uint64_t startTime;
    std::map<std::string, std::string> fields;
    // ...
    // many fields so the size is big, > 1KB.
};
void
processRequest(const Params& params)
{
    Request* r = new Request();
    r->userID = params.userID;
    r->data = parseData(params);
    r->startTime = timeNow();
    r->fields = parseFields(params);
    // ...

    r->start();
}
void
Request::onComplete()
{
    sendResponse();
    delete this;
}

Heap allocation of such big objects is slow

    - Object Pooling

Object Pooling

Category: Heap

RequestPool theRequestPool;
void
processRequest(const Params& params)
{
    Request* r = theRequestPool.take();
    // ...
    r->start();
}
void
Request::onComplete()
{
    sendResponse();
    theRequestPool.put(this);
}

Do not pay for the heap lookup of a free block of the needed size

Deal with concurrency in a more efficient way than the standard heap does

    - Object Pooling

Keep the objects in a pool and reuse them

Trivial Solution

Category: Heap

class RequestPool
{
    std::mutex mutex;
    std::stack<Request*> pool;
};
Request*
RequestPool::take()
{
    std::unique_lock lock(mutex);
    if (not pool.empty())
    {
        Request* res = pool.top();
        pool.pop();
        return res;
    }
    return new Request();
}
void
RequestPool::put(Request *req)
{
    std::unique_lock lock(mutex);
    pool.push(req);
}

A mutex protected list or stack or queue

Mutex contention

An STL container will use the heap under the hood

    - Object Pooling

Good Solution

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

Heap is used rarely and in bulk. Eventually not used at all

No contention on a single global pool

Limited pool of objects in each thread

While threads can, they use the local pools

When the local pool is empty, a batch is taken from the global pool

Full local pools are moved for reuse into the global pool
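
Below is a minimal sketch of this scheme, assuming a fixed batch size; the names GlobalPool/LocalPool and the exact overflow policy are illustrative, not the talk's actual implementation:

#include <mutex>
#include <vector>

struct Request { /* ... the big request object ... */ };
constexpr size_t kBatchSize = 64; // assumed batch size

using Batch = std::vector<Request*>;

// Global pool: touched rarely, always in whole batches.
class GlobalPool
{
public:
    Batch takeBatch()
    {
        {
            std::lock_guard lock(m_mutex);
            if (!m_batches.empty())
            {
                Batch b = std::move(m_batches.back());
                m_batches.pop_back();
                return b;
            }
        }
        // Global pool is empty - allocate a whole batch in bulk.
        Batch b;
        b.reserve(kBatchSize);
        for (size_t i = 0; i < kBatchSize; ++i)
            b.push_back(new Request());
        return b;
    }
    void putBatch(Batch&& b)
    {
        std::lock_guard lock(m_mutex);
        m_batches.push_back(std::move(b));
    }
private:
    std::mutex m_mutex;
    std::vector<Batch> m_batches;
};

GlobalPool theGlobalPool;

// Per-thread pool: the fast path takes no locks at all.
class LocalPool
{
public:
    Request* take()
    {
        if (m_free.empty())
            m_free = theGlobalPool.takeBatch();
        Request* r = m_free.back();
        m_free.pop_back();
        return r;
    }
    void put(Request* r)
    {
        // When the local pool overflows, hand a full batch back
        // to the global pool for other threads to reuse.
        if (m_free.size() >= 2 * kBatchSize)
        {
            Batch full(m_free.end() - kBatchSize, m_free.end());
            m_free.resize(m_free.size() - kBatchSize);
            theGlobalPool.putBatch(std::move(full));
        }
        m_free.push_back(r);
    }
private:
    Batch m_free;
};

thread_local LocalPool theLocalPool;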

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

All empty in the beginning

One thread needs an object. It allocates a whole batch

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

One thread needs an object. It allocates a whole batch

One object is used

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

One object is used

Another object is taken

Second thread also allocates something and does some work

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

Second thread also allocates something and does some work

Allocate and use more objects

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

Allocate and use more objects

Now imagine some objects end up in another thread

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

Now imagine some objects end up in another thread

They are freed and fill up the local pool

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

They are freed and fill up the local pool

To free the rest the thread moves the batch into the global pool

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

To free the rest the thread moves the batch into the global pool

Now can free more

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

Now can free more

Threads do some more random work

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

Threads do some more random work

Thread #2 wants to allocate more. It takes a batch from the global pool

Demonstration

Category: Heap

    - Object Pooling

Global

Thread

Thread

Thread

Thread #2 wants to allocate more. It takes a batch from the global pool

And the work continues

Benchmark

Category: Heap

template<uint32_t PaddingSize>
struct Value
{
    uint8_t myPadding[PaddingSize];
};
template<uint32_t PaddingSize>
struct ValuePooled
    : public Value<PaddingSize>
    , public ThreadPooled<...>
{
};
int
main()
{
    run<Value<1>>();
    run<ValuePooled<1>>();
    run<Value<512>>();
    run<ValuePooled<512>>();
    run<Value<1024>>();
    run<ValuePooled<1024>>();
    return 0;
}
template<typename ValueT>
void
run()
{
    std::vector<std::thread> threads;
    for (int threadI = 0; threadI < threadCount; ++threadI)
    {
        threads.emplace_back([&]() {
            std::vector<ValueT*> values;
            values.resize(valueCount);
            for (int iterI = 0; iterI < iterCount; ++iterI)
            {
                for (int valI = 0; valI < valueCount; ++valI)
                    values[valI] = new ValueT();
                for (int valI = 0; valI < valueCount; ++valI)
                    deleteRandom(values);
            }
        });
    }
    for (std::thread& t : threads)
        t.join();
}

    - Object Pooling
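
ThreadPooled<...> itself is not shown in the talk. A minimal guess at its shape, assuming it is a base class that overrides operator new/delete to serve allocations from a per-thread free list (a real version would also spill full local lists into a shared global pool, as in the scheme above):

#include <cstddef>
#include <vector>

template<typename T>
struct ThreadPooled
{
    static void* operator new(size_t size)
    {
        std::vector<void*>& freeList = localFreeList();
        if (!freeList.empty())
        {
            void* p = freeList.back();
            freeList.pop_back();
            return p;
        }
        // Local pool is empty - fall back to the real heap.
        return ::operator new(size);
    }
    static void operator delete(void* ptr)
    {
        // Keep the block for reuse instead of returning it to the heap.
        localFreeList().push_back(ptr);
    }
private:
    static std::vector<void*>& localFreeList()
    {
        // One free list per thread and per pooled type T.
        thread_local std::vector<void*> list;
        return list;
    }
};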

Performance

Category: Heap

    - Object Pooling

Pool vs Heap (300'000 items, 5 threads):

1 byte:     x1.4    heap: 8119 ms,  pool: 5804 ms
512 bytes:  x2      heap: 14480 ms, pool: 7249 ms
1024 bytes: x2.07   heap: 14853 ms, pool: 7166 ms

Category: Heap

Topic: Intrusive Containers

Containers

Category: Heap

The stored object

The link

std::list

void
std::list::push_back(const T& item)
{
    link* l = new link();
    l->m_item = item;
    l->m_prev = m_tail;
    if (m_tail)
        m_tail->m_next = l;
    m_tail = l;
}
std::list<Object*> objects;
// ... fill with data.

std::list::iterator it = objects.find(...);

print(it -> m_any_member);
// Is the same as:
print(it.m_link -> m_data -> m_any_member);

Heap usage on push/pop can be expensive

Non-intrusive = +1 memory lookup for access

    - Intrusive Containers

Intrusive Containers

Category: Heap

intr_list

struct MyObject
{
    // other members ...

    MyObject* next;

    MyObject* prev;
};
void
intr_list::push(T* item)
{
    item->prev = m_tail;
    if (m_tail)
        m_tail->next = item;
    m_tail = item;
}
intr_list<Object> objects;
// ... fill with data.

intr_list::iterator it = objects.find(...);

print(it -> m_any_member);
// Is the same as:
print(it.m_item -> m_any_member);

No heap usage by the container at all

Intrusive = direct memory access to the stored object

    - Intrusive Containers

Store next and prev inside of object

Examples

Category: Heap

    - Intrusive Containers

template<typename T, T* T::*myLink = &T::myNext>
struct ForwardList
{
    void Prepend(T* aItem);
    void Append(T* aItem);
    void Clear();
    T* PopAll();
    T* PopFirst();

    T* GetFirst();
    const T* GetFirst() const;
    T* GetLast();
    const T* GetLast() const;

    bool IsEmpty() const;

    Iterator begin();
    Iterator end();

    ConstIterator begin() const;
    ConstIterator end() const;

    // ... and more.
};
struct MyObject
{
    // ... any data.

    MyObject* myNext;
};
ForwardList<MyObject> list;

list.Append(object);

object = list.PopFirst();

for (MyObject* it : list)
    DoSomething(it);

// ... and more.
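
The member-pointer template parameter is what lets the list reach the link field directly in the stored object, with no wrapper node and no heap usage. A sketch of how a couple of the methods could be implemented (an assumed implementation, not necessarily the talk's exact one):

template<typename T, T* T::*myLink = &T::myNext>
struct ForwardList
{
    void Append(T* aItem)
    {
        aItem->*myLink = nullptr;
        if (myLast != nullptr)
            myLast->*myLink = aItem;
        else
            myFirst = aItem;
        myLast = aItem;
    }
    T* PopFirst()
    {
        T* res = myFirst;
        if (res == nullptr)
            return nullptr;
        myFirst = res->*myLink;
        if (myFirst == nullptr)
            myLast = nullptr;
        return res;
    }
    bool IsEmpty() const { return myFirst == nullptr; }

private:
    T* myFirst = nullptr;
    T* myLast = nullptr;
};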

Benchmark

Category: Heap

    - Intrusive Containers

template<typename T, template<typename...> typename List>
void
run()
{
    std::vector<T> items = createItems();
    List<T> list;

    time startTime = now();
    for (T& i : items)
        list.push_back(i);
    time createTime = now() - startTime;

    startTime = now();
    for (T& t : list)
        touch(t);
    time walkTime = now() - startTime;
}
int
main()
{
    // List of pointers.
    run<Item*, std::list>();

    // Intrusive list (always pointers).
    run<Item,  intr_list>();
}

Test list population speed

Test list iteration speed

Both lists store pre-allocated objects by pointers. No copying

Performance

Category: Heap

    - Intrusive Containers

intr_list<Item> vs std::list<Item*> (50'000'000 items):

Create: x2.87   std: 967681 us, intr: 337408 us
Walk:   x1.07   std: 920796 us, intr: 860463 us

Category: Thread Contention

Topic: False Sharing

Example

Category: Thread Contention

struct object {
    uint64_t input = 0;

    uint64_t output = 0;
};
int main()
{
    object obj;
    time startTime = now();

    std::thread inputThread(inputF, std::ref(obj), target);
    std::thread outputThread(outputF, std::ref(obj), target);
    inputThread.join();
    outputThread.join();

    print(now() - startTime);
    return 0;
}
static void inputF(object &obj, uint64_t target)
{
    while (++obj.input < target)
        continue;
}
static void outputF(object &obj, uint64_t target)
{
    while (++obj.output < target)
        continue;
}
10078861 us

    - False Sharing

Example

Category: Thread Contention

int main()
{
    object obj;
    time startTime = now();

    std::thread inputThread(inputF, std::ref(obj), target);
    std::thread outputThread(outputF, std::ref(obj), target);
    inputThread.join();
    outputThread.join();

    print(now() - startTime);
    return 0;
}
static void inputF(object &obj, uint64_t target)
{
    while (++obj.input < target)
        continue;
}
static void outputF(object &obj, uint64_t target)
{
    while (++obj.output < target)
        continue;
}
struct object {
    uint64_t input = 0;
    char padding[64];
    uint64_t output = 0;
};

x5.24

10078861 us
 1922510 us
struct object {
    uint64_t input = 0;

    uint64_t output = 0;
};

400'000'000 ops

    - False Sharing

Padding

vs

Compact

CPU Caching

Category: Thread Contention

Main Memory

    - False Sharing

Cache

Sync

Memory access always goes through the CPU cache, which acts like a "proxy"

The cache stores a small subset of the main memory in the form of "cache lines"

When more than one core references the same address, their caches need to sync on that address

False Sharing

Category: Thread Contention

uint64_t input

uint64_t output

Cache Line

Logically unrelated data, but it intersects in hardware

    - False Sharing

False Sharing

Category: Thread Contention

uint64_t input

uint64_t output

Cache Line

Split the data into separate cache lines

padding

    - False Sharing
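
Counting padding bytes by hand is fragile when the struct evolves. The same fix can be expressed with alignment; a minimal sketch assuming a 64-byte cache line (C++17 also provides std::hardware_destructive_interference_size in <new> on some toolchains):

#include <cstdint>

struct object
{
    // Each field starts its own 64-byte cache line, so the two threads
    // writing them no longer invalidate each other's cache.
    alignas(64) uint64_t input = 0;
    alignas(64) uint64_t output = 0;
};
static_assert(sizeof(object) >= 128, "fields must land on separate cache lines");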

Category: Thread Contention

Topic: Memory Order

Memory Order

Category: Thread Contention

std::atomic<...>
__sync_...
__atomic_...

relaxed, consume, acquire, release, acq_rel, seq_cst

Atomic

Order

The order restricts the freedom of visibility and completion for lock-free operations

    - Memory Order
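
For example, a producer/consumer handoff only needs release on the writer side and acquire on the reader side rather than the default seq_cst. A minimal sketch (the names are illustrative):

#include <atomic>
#include <cstdint>

uint64_t payload = 0;
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // wait for the publication
        continue;
    // The acquire-load synchronizes with the release-store,
    // so the write to payload is guaranteed to be visible here.
    uint64_t v = payload;
    (void)v;
}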

No Order

Category: Thread Contention

a, b, c = 0

One thread writes:        Another thread reads:
a = 1                     print("c: ", c)
b = 2                     print("b: ", b)
c = 3                     print("a: ", a)

Without any ordering, all of these outputs are possible:

c: 3, b: 2, a: 1
c: 0, b: 2, a: 1
c: 3, b: 0, a: 0
c: 0, b: 2, a: 0
...

    - Memory Order

Benchmark

Category: Thread Contention

    - Memory Order

uint64_t run(std::atomic_uint64_t& value)
{
    for (uint64_t i = 0; i < targetValue; ++i)
        value.store(1, MEMORY_ORDER);

    return value.load(std::memory_order_relaxed);
}

int main()
{
    std::atomic_uint64_t value(0);
    time startTime = now();

    run(value);

    print(now() - startTime);
    return 0;
}
#define MEMORY_ORDER \
    std::memory_order_relaxed
#define MEMORY_ORDER \
    std::memory_order_seq_cst

Performance

Category: Thread Contention

    - Memory Order

relaxed vs sequential (1'000'000'000 ops, x86-64 gcc 11.4.0):

x16.7   rel: 269647 us, seq: 4511327 us

Reason

Category: Thread Contention

    - Memory Order

uint64_t run(std::atomic_uint64_t& value)
{
    for (uint64_t i = 0; i < targetValue; ++i)
        value.store(1, MEMORY_ORDER);

    return value.load(std::memory_order_relaxed);
}
.L2:
        mov     QWORD PTR [rdi], 1
        sub     rax, 1
        jne     .L2
        mov     rax, QWORD PTR [rdi]
        ret
.L6:
        mov     rcx, rdx
        xchg    rcx, QWORD PTR [rdi]
        sub     rax, 1
        jne     .L6
        mov     rax, QWORD PTR [rdi]
        ret

relaxed

sequential

Sequential order forces the stricter 'xchg' instruction instead of a plain 'mov'

For fun, have a look at the asm on x86-64 clang 17.0.1

Category: Thread Contention

Topic: Lock-Free Queues

Queues

Category: Thread Contention

class Queue
{
    std::mutex mutex;

    std::queue
    std::vector
    std::list    data;
    std::stack
    std::deque
         ...
};

    - Lock-Free Queues

Contention on the mutex

Lock-Free Queue

Category: Thread Contention

    - Lock-Free Queues

Single-Producer-
Single-Consumer

Single-Producer-
Multi-Consumer

Multi-Producer-
Single-Consumer

Multi-Producer-
Multi-Consumer

Solutions

Category: Thread Contention

    - Lock-Free Queues

Single-Producer-
Single-Consumer

Multi-Producer-
Single-Consumer

Single-Producer-
Multi-Consumer

Multi-Producer-
Multi-Consumer
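
To show how compact the simpler variants can be, here is a sketch of a multi-producer-single-consumer intrusive queue built on one atomic head: producers push with a CAS loop, the single consumer grabs the whole chain at once (a common pattern; the queues benchmarked in the talk may be structured differently):

#include <atomic>

struct Task
{
    Task* next = nullptr; // intrusive link, lives inside the object
};

class MPSCQueue
{
public:
    // Called by any number of producer threads.
    void push(Task* t)
    {
        Task* old = m_head.load(std::memory_order_relaxed);
        do {
            t->next = old;
        } while (!m_head.compare_exchange_weak(
            old, t,
            std::memory_order_release,
            std::memory_order_relaxed));
    }
    // Called by the single consumer thread. Returns the whole chain
    // in LIFO order; reverse it if FIFO processing is needed.
    Task* popAll()
    {
        return m_head.exchange(nullptr, std::memory_order_acquire);
    }
private:
    std::atomic<Task*> m_head{nullptr};
};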

Benchmark

Category: Thread Contention

    - Lock-Free Queues

Lock-Free vs With mutex

Producers:
5 prod-threads:  x1.5    9 mln/sec
10 prod-threads: x2.6    5.8 mln/sec

Consumers:
5 cons-threads:  x2.6    2.5 mln/sec
10 cons-threads: x4.5    1.7 mln/sec

Category: Network

Topic: Scatter-Gather IO

Benchmark

Category: Network

struct Message
{
    void* data;
    size_t size;  
};
void sendAll(int sock, const vector<Message>& msgs)
{
    for (const Message& m : msgs)
        send(sock, m.data, m.size, 0);
}

    - Scatter-Gather IO

2.52 GB/sec

A list of buffers to send

The simplest way to send them

16 x 1KB buffers / each sendAll() call

Benchmark

Category: Network

struct Message
{
    void* data;
    size_t size;  
};
void sendAll(int sock, const vector<Message>& msgs)
{
    for (const Message& m : msgs)
        send(sock, m.data, m.size, 0);
}

    - Scatter-Gather IO

2.52 GB/sec
void sendAll(int sock, const vector<Message>& msgs)
{
    struct iovec vecs[limit];
    size_t count = min(limit, msgs.size());
    for (size_t i = 0; i < count; ++i)
    {
        vecs[i].iov_base = msgs[i].data;
        vecs[i].iov_len = msgs[i].size;
    }

    struct msghdr msg = {0};
    msg.msg_iov = vecs;
    msg.msg_iovlen = count;
    sendmsg(sock, &msg, 0);
}

x2.4

6.08 GB/sec

sendmsg

vs

send

16 x 1KB buffers / each sendAll() call

Solution

Category: Network

    - Scatter-Gather IO

ssize_t sendmsg(
    int sockfd,
    const struct msghdr *msg,
    int flags);
struct msghdr {
    struct iovec *msg_iov;
    size_t msg_iovlen;
    ...
};
ssize_t recvmsg(
    int sockfd,
    struct msghdr *msg,
    int flags);

Must be > 1 to make sense

Note!

These are bad:

ssize_t readv(
    int fd,
    const struct iovec *iov,
    int iovcnt);
ssize_t writev(
    int fd,
    const struct iovec *iov,
    int iovcnt);

They spoil statistics in

/proc/self/io

Category: Network

Topic: Event Queue

Massive Load

Category: Network

    - Event Queue

[Diagram: a massive stream of read (r) and write (w) events arriving on many client sockets]

Load Handling

Category: Network

    - Event Queue

[Diagram comparing the three approaches: Periodic Polling (check all sockets every N sec), Reactive Polling (wake up on an event), and Event Queue (only the events that happened are delivered)]

Periodic Polling

Category: Network

    - Event Queue

void
Client::onOutput()
{
    if (m_out_size == 0)
        return;
    ssize_t rc = send(m_sock,
        m_out_buf, m_out_size);
    handleOutput(rc);
}
void
Server::run()
{
    while (true) {
        int s = accept(m_sock);
        if (s >= 0)
            addClient(s);

        for (Client* c : m_clients) {
            c->onInput();
            c->onOutput();
        }
        usleep(100'000); // 100ms
    }
}
void
Client::onInput()
{
    ssize_t rc = recv(m_sock,
        m_in_buf, m_in_size);
    handleInput(rc);
}

Work with every socket in a loop with a period

Additional latency

Increased CPU usage for EWOULDBLOCK system calls

Reactive Polling

Category: Network

    - Event Queue

int
select(int nfds,
       fd_set *readfds,
       fd_set *writefds,
       fd_set *exceptfds,
       struct timeval *timeout);

void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);

Deprecated*. Crash on descriptors with value > 1024.

* Not on Windows

** FD_SETSIZE on Mac

Reactive Polling

Category: Network

    - Event Queue

int
select(int nfds,
       fd_set *readfds,
       fd_set *writefds,
       fd_set *exceptfds,
       struct timeval *timeout);

void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
int
poll(
    struct pollfd *fds,
    nfds_t nfds,
    int timeout);

struct pollfd {
    int fd;
    short events;
    short revents;
};

Deprecated*. Crash on descriptors with value > 1024.

Usable, fast on small descriptor count

Takes file descriptors with events to listen for

Returns the happened events here

* Not on Windows

** FD_SETSIZE on Mac

Reactive Polling: Example

Category: Network

    - Event Queue

void
Server::run()
{
    while (true) {
        int rc = poll(m_fds, m_fd_count, 0);
        for (int i = 0; i < m_fd_count; ++i)
        {
            pollfd& fd = m_fds[i];
            if (isServer(fd)) {
                if (fd.revents & POLLIN)
                    tryAccept();
                continue;
            }

            Client* c = m_clients[i];
            if (fd.revents & POLLIN)
                c->onInput();
            if (fd.revents & POLLOUT)
                c->onOutput();
        }
    }
}

Wait for events on any of those descriptors

Perform only the work which was signaled with an event

No sleeps

No parasite system calls

Benchmark

Category: Network

    - Event Queue

Many clients, interacting in groups, in waves. Not all clients are active all the time

5 waves x 1000 clients x 1000 requests

Reactive

vs

Periodic

+25% speed

0

EWOULDBLOCK

119'006'466

EWOULDBLOCK

Busy-loop, no sleep

Sleep when no events

Event Queue

Category: Network

    - Event Queue

int
epoll_create1(int flags);

int
epoll_ctl(
    int epfd,
    int op,
    int fd,
    struct epoll_event *event);

int
epoll_wait(
    int epfd,
    struct epoll_event *events,
    int maxevents,
    int timeout);

struct epoll_event {
   uint32_t events;
   epoll_data_t data;
};
int
kqueue(void);

int
kevent(
    int kq,
    const struct kevent *changelist,
    int nchanges,
    struct kevent *eventlist,
    int nevents,
    const struct timespec *timeout);

EV_SET(kev, ident, filter, flags,
    fflags, data, udata);

Linux

Mac, BSD

Sockets are tracked inside of the kernel

Need to add them once

Fetch events just for the sockets which are signaled

No need to pass all sockets into the kernel each time when fetching events
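
Registration is a one-time step per socket, typically right after accept(). A minimal sketch of it (error handling omitted; the Client type and the epoll descriptor are assumed to exist elsewhere in the server):

#include <sys/epoll.h>

struct Client; // the server's per-connection object (assumed)

void
addToEpoll(int epfd, int sock, Client* c)
{
    epoll_event e = {};
    e.events = EPOLLIN | EPOLLOUT; // events to be woken up for
    e.data.ptr = c;                // handed back by epoll_wait()

    // Add the socket once; the kernel tracks it from now on.
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &e);
}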

Event Queue: Example

Category: Network

    - Event Queue

void
Server::run()
{
    epoll_event events[maxEvents];

    while (true) {
        int count = epoll_wait(m_epoll, events, maxEvents, 0);
        for (int i = 0; i < count; ++i)
        {
            epoll_event& e = events[i];
            if (e.data.fd == m_server_sock) {
                if (e.events & EPOLLIN)
                    tryAccept();
                continue;
            }

            Client* c = (Client*)e.data.ptr;
            if (e.events & EPOLLIN)
                c->onInput();
            if (e.events & EPOLLOUT)
                c->onOutput();
        }
    }
}

Wait for events on sockets in this event-queue

Process only the events. No fullscan of all sockets

Perform only the work which was signaled with an event

No sleeps

No parasite system calls

No socket array fullscan

Benchmark

Category: Network

    - Event Queue

+30% speed

5 waves x 1000 clients x 1000 requests

Epoll

vs

Poll

CPU burn for fullscan

Less CPU usage

Additional content

Signal non-empty queue

void Queue::postFast(Task *t)
{
    std::unique_lock lock(m_mutex);
    bool wasEmpty = m_queue.empty();
    m_queue.push_back(t);
    if (wasEmpty)
        m_cond.broadcast();
}
void Queue::postSlow(Task *t)
{
    std::unique_lock lock(m_mutex);
    m_queue.push_back(t);
    m_cond.broadcast();
}
Task *Queue::pop()
{
    std::unique_lock lock(m_mutex);
    while (m_queue.empty())
        m_cond.wait(lock);
    return m_queue.pop();
}

When the queue is highly loaded, there will be fewer system calls on the condvar

Signal the condition variable only when the queue state changes - from empty to non-empty

Hybrid Containers

template<typename T, uint Size>
class Array
{
private:
    uint m_size;
    T* m_dynamic;
    T m_static[Size];
};

void Array::append(const T& obj)
{
    if (m_size < Size)
    {
        m_static[m_size++] = obj;
        return;
    }
    makeDynamic();
    m_dynamic[m_size++] = obj;
}

Better cache locality if the container rarely grows beyond the static size

Up to the fixed size, store the items inside the container object instead of allocating them separately

Less heap usage and fragmentation

Lock-Free operations

Use std::atomic for simple counters and flags

struct MyStructSlow
{
    void inc()
    {
        std::unique_lock lock(m_mutex);
        ++m_count;
    }

    uint64_t get() const
    {
        std::unique_lock lock(m_mutex);
        return m_count;
    }

    mutable std::mutex m_mutex;
    uint64_t m_count;
};
struct MyStructFast
{
    void inc()
    {
        m_count.fetch_add(1, std::memory_order_relaxed);
    }


    uint64_t get() const
    {
        return m_count.load(std::memory_order_relaxed);
    }


    std::atomic_uint64_t m_count;
};

Smaller type size

More parallelism in case of contention

Drop getpeername()

Simply drop it. getsockname() too. These are system calls and are expensive

void ServerSlow::accept()
{
    // ...
    int sock = accept(m_server, nullptr, nullptr);
    onClient(sock);
    // ...
}



void ServerSlow::onClient(int sock)
{
    // ...
    Client *c = new Client(sock);
    socklen_t len = sizeof(c->m_peerAddr);
    getpeername(sock, &c->m_peerAddr, &len);
    // ...
}

getsockname() is simply useless most of the time

accept() returns the peer address via out params

void ServerFast::accept()
{
    // ...
    sockaddr_storage addr;
    socklen_t len = sizeof(addr);
    int sock = accept(m_server, (sockaddr *)&addr, &len);
    onClient(sock, (sockaddr *)&addr);
    // ...
}

void ServerFast::onClient(int sock, const sockaddr *addr)
{
    // ...
    Client *c = new Client(sock, addr);
    // ...
}

Scratch pad allocator

Dynamic allocation for in-scope usage without putting load on the heap

  • ScratchPad allocator;
  • Scope Stack allocator;
  • Linear allocator;
  • ...

Dynamic allocation with the speed of the thread's stack
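
A minimal sketch of the idea: a bump-pointer ("linear") allocator over a fixed thread-local buffer, where allocation is a pointer increment and the whole scope is freed with one reset. This is an assumed illustration, without the destructor support or heap fallback a production scratch pad would have:

#include <cstddef>
#include <cstdint>

class ScratchPad
{
public:
    void* alloc(size_t size, size_t align = alignof(std::max_align_t))
    {
        // Round the current position up to the requested alignment.
        size_t pos = (m_pos + align - 1) & ~(align - 1);
        if (pos + size > sizeof(m_buf))
            return nullptr; // out of scratch space
        m_pos = pos + size;
        return m_buf + pos;
    }
    void reset() { m_pos = 0; } // frees everything allocated in the scope

private:
    alignas(std::max_align_t) uint8_t m_buf[64 * 1024];
    size_t m_pos = 0;
};

thread_local ScratchPad theScratchPad;

// Usage: allocate freely while handling one request, reset at the end.
void handleOneRequest()
{
    void* tmp = theScratchPad.alloc(1024);
    // ... use tmp for temporary parsing buffers, etc. ...
    (void)tmp;
    theScratchPad.reset();
}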

Fast nonblock with no syscall

Set socket NONBLOCK flag with no additional system calls

void Server::acceptFast()
{
    int sock = accept4(m_server, &addr,
        &len, SOCK_NONBLOCK);
}
void Client::createFast()
{
    int sock = socket(domain,
        type | SOCK_NONBLOCK, prot);
}

Two fewer system calls per socket; can give a real perf bump under a high-RPS one-connection-per-request load

*Linux only

void Server::acceptSlow()
{
    int sock = accept(m_server, &addr, &len);
    int flags = fcntl(sock, F_GETFL, 0);
    fcntl(sock, F_SETFL, flags | O_NONBLOCK);
}
void Client::createSlow()
{
    int sock = socket(domain, type, prot);
    int flags = fcntl(sock, F_GETFL, 0);
    fcntl(sock, F_SETFL, flags | O_NONBLOCK);
}

Non-blocking connect()

connect() can be non-blocking, compatible with poll/epoll

void Client::connectStart()
{
    m_sock = socket(SOCK_NONBLOCK);
    int rc = connect(m_sock, host);
    if (rc == 0) {
        m_state = CONNECTED;
        return;
    }
    if (errno == EINPROGRESS) {
        m_state = CONNECTING;
        return;
    }
    m_state = ERROR;
}
void Client::connectUpdate()
{
    pollfd fd = {
        .fd = m_sock, .events = POLLOUT
    };
    if (poll(&fd, 1, 0) != 1)
        return;
    if (fd.revents & POLLOUT) {
        m_state = CONNECTED;
        return;
    }
    int err;
    socklen_t len = sizeof(err);
    int rc = getsockopt(m_sock,
        SOL_SOCKET, SO_ERROR, &err, &len);
    if (rc != 0 || err != 0) {
        m_state = ERROR;
        return;
    }
}

Can connect thousands of clients asynchronously, even in the same epoll as the main IO loop

1 - Start

2 - Instant complete?

3 - Started async

4 - Becomes writable when done

5 - getsockopt() to get error

Stop busy loops

Do not spin in unlimited no-sleep loops

void Server::process()
{
    while (notAllDone())
    {
        for (Client *c : m_clients)
            c->process();
    }
}
void Server::process()
{
    uint yield = 0;
    while (notAllDone())
    {
        for (Client *c : m_clients)
            c->process();
        if (++yield % 1024 == 0)
            usleep(1000);
    }
}

Free the CPU core for other threads sometimes - it can boost total perf

The CPU core time is wasted. Can hurt perf in other threads a lot when core count < thread count

Especially harmful when the busy-loop thread waits for work done in other threads, while also stealing and burning their CPU time

FOSDEM 2024: First Aid Kit for C/C++ server performance

By Vladislav Shpilevoy

I categorize the primary sources of code performance degradation into three groups:

- Thread contention. For instance, too-hot mutexes, overly strict ordering in lock-free operations, and false sharing.
- Heap utilization. Loss is often caused by frequent allocation and deallocation of large objects, and by the absence of intrusive containers at hand.
- Network IO. Socket reads and writes are expensive because they are system calls. They can also block the thread for a long time, which leads to hacks like adding tens or hundreds more threads. Such measures intensify contention as well as CPU and memory usage, while neglecting the underlying issue.

I present a series of concise and straightforward low-level recipes on how to gain performance via code optimizations. While often requiring just a handful of changes, the proposals can amplify performance N-fold. The suggestions target the mentioned bottlenecks caused by certain typical mistakes. The proposed optimizations might make architectural changes unnecessary, or even allow simplifying the setup if the existing servers start coping with the load effortlessly. As a side effect, the changes can make the code cleaner and reveal more bottlenecks to investigate.
