From Zero to Hero -
The Npgsql Optimization Story
whoami
- Shay Rojansky
- Long-time Linux/open source developer/enthusiast (installed Slackware somewhere in 1994/5)
- Lead dev of Npgsql
- .NET driver for PostgreSQL (took over in 2012)
- Provider for Entity Framework Core
- Joining Microsoft's data access team in January
- Logistician for Doctors Without Borders (MSF)
whois Npgsql
- Open-source .NET driver for PostgreSQL
- Implements ADO.NET (System.Data.Common)
- Supports fully async (but also sync)
- Fully binary driver
- Fully-featured (PostgreSQL special types, data import/export API, notification support...)
Focus on high performance...
"Database drivers are boring"
- Everyone, including me in 2012
Perf in .NET Core 2.1 and Npgsql 4.0
TechEmpower Fortunes results, round 17
The Raw Numbers
OS | .NET Core | Npgsql 3.2 RPS | Npgsql 4.0 RPS |
---|---|---|---|
Windows | 2.0 | 71,810 | 107,400 |
Windows | 2.1 | 101,546 | 132,778 |
Linux | 2.0 | 73,023 | 93,668 |
Linux | 2.1 | 78,301 | 107,437 |
- On Windows, 3.2 -> 4.0 gives 50% boost
- .NET Core 2.0 -> 2.1 gives a further 24% (85% total)
- Linux perf is lower...
Topics
- Lock-free connection pool
- Async perf techniques
- Buffering / Network I/O
- Command preparation
Breadth rather than depth
Lock-Free Database Connection Pool
What is a DB connection pool?
- Physical DB connections are very expensive to create - especially in PostgreSQL (process spawned in the backend).
- If we open/close all the time, the app grinds to a halt.
- So the driver includes a pool.
- Pooled Open() and Close() must be very efficient (connection sometimes held for very short, e.g. to service one web request)
- It's also the one place where synchronization happens: once a connection is taken out there's no thread-safety.
Identifying Contention Bottlenecks
- Low local CPU usage (was only 91%)
- Network bandwidth far below saturation (10Gbps Ethernet, very little iowait)
- No resource starvation on the PostgreSQL side
- Profiler will show Monitor.Enter()
- Try reducing lock scopes if possible (already optimized)
Lock-Free Programming
- Lock-free programming means, well, no locking.
- All operations are interleaved, all the time, everywhere.
- "Interlocked" CPU instructions are used to perform complex operations atomically.
- It's really, really hard to get right.
Simple Rent
Connection[] _idle = new Connection[PoolSize];
public Connection Rent()
{
for (var i = 0; _idle.Length; i++)
{
// If the slot contains null, move to the next slot.
if (_idle[i] == null)
continue;
// If we saw a connector in this slot, atomically exchange it with a null.
// Either we get a connector out which we can use, or we get null because
// someone has taken it in the meantime. Either way put a null in its place.
var conn = Interlocked.Exchange(ref _idle[i], null);
if (conn != null)
return conn;
}
// No idle connection was found, create a new one
return new Connection(...);
}
Simple Return
public void Return(Connection conn)
{
for (var i = 0; i < _idle.Length; i++)
if (Interlocked.CompareExchange(ref _idle[i], conn, null) == null)
return;
// No free slot was found, close the connection...
conn.Close();
}
This simple implementation was used as a proof-of-concept.
Let's take it up a notch
- Since physical connections are very expensive, we need to enforce an upper limit (think multiple app instances, one DB backend).
- Once the max limit has been reached, wait until a connection is released.
- Release waiters in FIFO, otherwise some get starved.
- So add a waiting queue next to our idle list: if no idle conn was found, enqueue a TaskCompletionSource.
- When releasing, first check the waiting queue to release any pending attempts. Otherwise return to the idle list.
Two data structures, no locks
- One idle list, one waiting queue. By the time we finish updating one, the other may have changed.
- Example: rent attempt enters waiting queue, at the same time as a release puts a connection back in the idle list. Rent attempt waits forever.
- Need a way to have a single, atomically-modified state (how many busy, idle, waiting)
Atomically manipulate states
[StructLayout(LayoutKind.Explicit)]
internal struct PoolState
{
[FieldOffset(0)]
internal short Idle;
[FieldOffset(2)]
internal short Busy;
[FieldOffset(4)]
internal short Waiting;
[FieldOffset(0)]
internal long All;
}
// Create a local copy of the pool state, in the stack
PoolState newState;
newState.All = _state.All;
// Modify it
newState.Idle--;
newState.Busy++;
// Commit the changes
_state.All = newState.All;
Problem: we have to have multiple counters (idle, busy, waiting), and manipulate them atomically.
But two operations may be racing, and we lose a change...
Handling race conditions
var sw = new SpinWait();
while (true)
{
// Create a local copy of the pool state, in the stack
PoolState oldState;
oldState.All = Volatile.Read(ref _state.All); // Volatile prevents unwanted "optimization"
var newState = oldState;
// Modify it
newState.Idle--;
newState.Busy++;
// Try to commit the changes
if (Interlocked.CompareAndExchange(ref _state.All, newState.All, oldState.All) == oldState.All)
break;
// Our attempt to increment the busy count failed, Loop again and retry.
sw.SpinOnce();
}
- This is how a lot of lock-free code looks like: try until success.
- SpinWait is a nice tool for spinning, and will automatically context switch after some time.
- Beware of the ABA problem!
Very subtle issues
- An open attempt creates a new connection
(idle == 0, total < max). - But a network error causes an exception. State has already been updated (busy++), need to decrement.
- But we already did busy++, so maybe busy == max. Someone may have gone into the waiting queue - so we need to un-wait them.
Conclusion
https://medium.com/@tylerneely/fear-and-loathing-in-lock-free-programming-7158b1cdd50c
8% perf increase
But probably the most dangerous 8% I've ever done.
No standard connection pool API
- Pooling is an internal concern of the provider, there's no user-facing API (unlike JDBC).
var conn = new NpgsqlConnection("Host=...");
conn.Open();
- This imposes a lookup on every open - we need to get the pool for the given connection string. This is bad especially since an application typically has just one pool.
- Npgsql implements this very efficiently (iterate over list, do reference comparison). But a first-class pooling API would obviate this entirely.
No standard connection pool API
- A 1st-class pool API would allow us to take the pool out of the provider:
- One strong implementation
- Reuse across providers
- Different pooling strategies
- No need to lookup by connection string
- https://github.com/dotnet/corefx/issues/26714
Async Perf Tips
Avoid async-related allocations
- ValueTask allows us to avoid the heap allocation of a Task<T> when the async method completes synchronously.
- Not necessary if the method returns void, bool and some other values.
- Since 2.1, IValueTaskSource also allows us to eliminate the allocation when the method completes asynchronously. This is advanced.
Best async tip: avoid async
- Conventional wisdom is that for asynchronously-completing async methods, any async-related CPU overhead will be dominated by I/O latency etc.
- Not true for DbFortunesRaw (low-latency 10Gbps Ethernet, simple query with data in memory...).
- Even synchronously-yielding method invocations need to be treated very aggressively. The cost of the state machine is noticeable.
Refactoring to Reduce Async Calls
var msg = await ReadExpecting<SomeType>();
async ValueTask<T> ReadExpecting<T>()
{
var msg = await ReadMessage();
if (msg is T expected)
return expected;
throw new NpgsqlException(...);
}
var msg = Expect<T>(await ReadMessage());
internal static T Expect<T>(IBackendMessage msg)
=> msg is T asT
? asT
: throw new NpgsqlException(...);
async ValueTask<T> ReadMessage() { ... }
Reduced from ~7 layers to 2-3, ~3.5% RPS boost
Allows inlining
Creating Fast Paths
- A more extreme way to eliminate async is to build fast paths for when data is in memory.
Task Open()
{
// Attempt to get a connection from the idle list
if (pool.TryAllocateFast(connString, out var connection))
return connection;
// No dice, do the slow path
return OpenLong();
async Task OpenLong() {
...
}
}
- This is similar to ValueTask: ValueTask eliminates the memory overhead of the async method, but we also want to get rid of the CPU overhead.
- Of course, this causes code duplication...
Small Detour: async & sync paths
- Required by the ADO.NET API, may be considered legacy
- But sync still has advantages over async
- How to maintain two paths?
- Duplicate - no way
- Sync over async - deadlocks
- Async over sync - not true async
Async & sync paths
- Made possible by ValueTask (otherwise sync methods allocate for nothing)
- We still pay useless async state machine CPU overhead in sync path, but it's acceptable.
public Task Flush(bool async)
{
if (async)
return Underlying.WriteAsync(Buffer, 0, WritePosition);
Underlying.Write(Buffer, 0, WritePosition);
return Task.CompletedTask;
}
Buffering, Network I/O and Minimizing Roundtrips
Golden Rules of I/O
- Do as little read/write system calls as possible.
- Pack as much data as possible in a memory buffer before writing.
- Read as much data from the socket in each read.
- No needless copying!
- If you do the above, for god's sake, disable Nagling (SO_NODELAY)
Read Buffering in ADO.NET
- The API requires at least row-level buffering - can access fields in random order.
using (var reader = cmd.ExecuteReader())
while (reader.Read())
Console.WriteLine(reader.GetString(1) + ": " + reader.GetInt32(0));
- A special flag allows column-level buffering, for huge column access.
Current Implementation
- PostgreSQL wire protocol has length prefixes.
- In Npgsql, each physical connection has 8KB read and write buffers (tweakable).
- Type handlers know how to read/write values.
- Problem: need to constantly check we have enough data/space
- Problem: rowsize > 8KB? Oversize buffer, lifetime management...
These two problems are obviated by the new Pipelines API
Roundtrip per Statement
INSERT INTO...
OK
INSERT INTO...
OK
Batching/Pipelining
INSERT INTO...; INSERT INTO ...
OK; OK
Batching / Pipelining
-
The performance improvement is dramatic.
-
The further away the server is, the more the speedup (think cloud).
-
Supported for all CRUD operations, not just insert.
-
For bulk insert, binary COPY is an even better option.
-
Entity Framework Core takes advantage of this.
var cmd = new NpgsqlCommand("INSERT INTO ...; INSERT INTO ...", conn);
cmd.ExecuteNonQuery();
Batching / Pipelining - No Real API
-
The batching "API" is shoving several statements into a single string.
-
This forces Npgsql to parse the text and split on semicolons...
-
(parsing needs to happen for other reasons)
var cmd = new NpgsqlCommand("INSERT INTO ...; INSERT INTO ...", conn);
cmd.ExecuteNonQuery();
Less Obvious Batching
var tx = conn.BeginTransaction(); // Sends BEGIN
// do stuff in the transaction
BEGIN
OK
INSERT INTO...
OK
Prepend, prepend
-
BeginTransaction() simply writes the BEGIN into the write buffer, but doesn’t send. It will be sent with the next query.
-
Another example: when closing a (pooled) connection, we need to reset its state.
-
This is also prepended, and will be sent when the connection is “reopened” and a query is sent.
The Future - Multiplexing?
- Physical connections are expensive, scarce resources
- Pooling is already hyper-efficient, but under extreme load we're starved for connections
- What if connections were shared between "clients", and we could send queries on any connection, at any time?
- The goal: have more pending queries out on the network, rather than waiting for a connection to be released.
Multiplexing
- When someone executes a query, we look for any physical connection and write.
- The client waits (on a TaskCompletionSource?). When the resultset arrives, we buffer it entirely in memory (need a good buffer pool, probably use Pipelines).
- Breaks the current ownership model: a connection only belongs to you when you're doing actual I/O
- Some exceptions need to be made for exclusive ownership (e.g. transactions)
Command Preparation
What are prepared statements?
- Typical applications execute the same SQL again and again with different parameters.
- Most database support prepared statements
- Preparation is very important in PostgreSQL:
SELECT table0.data+table1.data+table2.data+table3.data+table4.data
FROM table0
JOIN table1 ON table1.id=table0.id
JOIN table2 ON table2.id=table1.id
JOIN table3 ON table3.id=table2.id
JOIN table4 ON table4.id=table3.id
Unprepared:
TPS: 953
Prepared:
TPS: 8722
Persisting Prepared Statements
-
Prepared statements + short-lived connections =
using (var conn = new NpgsqlConnection(...))
using (var cmd = new NpgsqlCommand("<some_sql>", conn) {
conn.Open();
cmd.Prepare();
cmd.ExecuteNonQuery();
}
-
In Npgsql, prepared statements are persisted.
-
Each physical connection has a prepared statement cache, keyed on the SQL.
-
The first time the SQL is executed on a physical connection, it is prepared. The next time the prepared statement is reused.
Automatic Preparation
-
Prepared statements + O/RMs =
-
O/RMs (or Dapper) don't call DbCommand.Prepare()
-
Npgsql includes optional automatic preparation: all the benefits, no need to call Prepare().
-
Statement is prepared after being executed 5 times.
-
We eject least-recently used statements to make room for new ones.
Closing Remarks
The ADO.NET API
- The API hasn't received significant attention in a long time.
- Pooling API
- Batching API
- Better support for prepared statements
- Generic parameters to avoid boxing
- However, it's still possible to do remarkably well on ADO.NET (no need to throw it out entirely).
Be careful
- Always keep perf in mind, but resist the temptation to over- (or pre-) optimize.
- Reducing roundtrips and preparing commands is much more important than reducing an allocation or a doing fancy low-level stuff.
- (even if it is slightly less sexy)
It's a Good Time to do Perf
- Perf tools (BDN, profilers) are getting much better, but progress still needs to be made (e.g. Linux)
- Microsoft has really put perf at the center.
- Great open-source relationship with Microsoft data team, around EF Core and Npgsql perf.
- Thanks to Diego Vega, Arthur Vickers, Andrew Peters, Stephen Toub...
- The TechEmpower fortunes benchmark is one of the more realistic ones (e.g. involves DB)
- But still, how many webapps out there send identical simple queries over a 10Gbps link to PostgreSQL, for a small dataset that is cached in memory?
- Performance is usually about tradeoffs: code is more brittle/less safe (e.g. lock-free), duplication (e.g. fast paths).
Is this all useful in the real world?
- No good answer, especially difficult to decide when you're a library...
- Important for .NET Core evangelism: show the world what the stack is capable of, get attention.
- Your spouse & family may hate you, but...
- You learn a ton of stuff and have lots of fun in the process.
So, is this all useful in the real world??
Thank you!
https://slides.com/shayrojansky/dotnetos-2018-11-05/
http://npgsql.org, https://github.com/npgsql/npgsql
Shay Rojansky, roji@roji.org, @shayrojansky
Dotnetos Presentation on Npgsql Performance (full)
By Shay Rojansky
Dotnetos Presentation on Npgsql Performance (full)
- 1,165