Race conditions in distributed systems

Atomic updates, optimistic locking, ...

Spot the problem

// 1
const basketItem = await basketItems.find({ guid });
basketItem.amount++;
await basketItems.replace({ guid }, basketItem);

// 2
const basketItem = await basketItems.find({ guid });
await basketItems.update({ guid }, { $set: { amount: basketItem.amount + 1 } });

// 3
const basketItem = await basketItems.find({ itemId });
if (basketItem) {
  await basketItems.update({ guid: basketItem.guid }, { $set: { amount: basketItem.amount + 1 } });
} else {
  await basketItems.insert({ guid: genGuid(), itemId, amount: 1 });
}

// 4
const user = await users.find({ guid });
if (user && user.createdOn < new Date('2021-01-01')) {
  await users.update({ guid }, { $set: { legacy: true } });
}

// 5
const openPayments = await payments.find({ completedOn: null });
await payments.update(
  { guid: { $in: openPayments.map(x => x.guid) } },
  { $set: { completedOn: now, wasForceClosed: true } }
);

Examples from our code base

  • Applying coupons and changing amount at the same time (fixed)
  • Adding items before opening payment request
  • Calculating Deutschland Card points for multiple items
  • PLU search in cart UI: one lookup could overtake the other (fixed)
  • Mobile app search: one lookup could overtake the other (fixed!?)
  • Store processing: Messages of different priorities are concurrent
  • ...
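
The "one lookup could overtake the other" cases can be illustrated with a small sketch. The slides do not say how the search was actually fixed; this shows one common guard, assuming it is acceptable to simply drop responses that belong to an outdated request. `latestOnly`, `fakeLookup`, and the delays are hypothetical names and values made up for the example.

```javascript
// Wraps an async lookup so that only the response of the most recently
// started request is passed on (hypothetical helper, for illustration).
function latestOnly(lookup, onResult) {
  let latestRequest = 0;
  return async function search(term) {
    const requestId = ++latestRequest;
    const result = await lookup(term);
    // A newer search was started in the meantime: ignore this response.
    if (requestId !== latestRequest) return;
    onResult(result);
  };
}

// Fake lookup in which the first request is slower than the second one.
const delays = { ap: 20, apple: 1 };
const fakeLookup = (term) =>
  new Promise((resolve) =>
    setTimeout(() => resolve(`results for "${term}"`), delays[term])
  );

async function demo() {
  const shown = [];
  const search = latestOnly(fakeLookup, (result) => shown.push(result));
  // "ap" is sent first but answers last; without the guard it would win.
  await Promise.all([search("ap"), search("apple")]);
  return shown;
}

demo().then((shown) => console.log(shown)); // [ 'results for "apple"' ]
```

The guard only fixes what the user sees. Both lookups still run, so it does not help when the lookup itself has side effects.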

"Everything that can possibly go wrong will go wrong"

Murphy's law

Chance for an inconsistency to happen per month

\begin{aligned}
\text{Chance for one particular case} &= 1 : 1{,}000{,}000{,}000\\
\text{Count of similar cases} &= 100\\
\text{Count of events} &= 500{,}000 \times 20\\
\text{Expected bug count} &= \frac{100 \times 500{,}000 \times 20}{1{,}000{,}000{,}000} = 1\\
&\Rightarrow \text{once per month}
\end{aligned}
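
The estimate can be reproduced in a few lines of JavaScript. This is a minimal sketch: the function and parameter names are made up for illustration, and the numbers are the ones from the slide. The second function is an addition that is not on the slide; it assumes the occurrences are independent and rare, so that a Poisson approximation applies.

```javascript
// Expected number of occurrences: every event can hit each of the similar
// cases with a chance of 1 : oddsAgainst (hypothetical helper names).
function expectedBugCount({ oddsAgainst, similarCases, events }) {
  return (similarCases * events) / oddsAgainst;
}

// Assumption: independent, rare occurrences (Poisson approximation).
function chanceOfAtLeastOne(expected) {
  return 1 - Math.exp(-expected);
}

const expected = expectedBugCount({
  oddsAgainst: 1_000_000_000, // 1 : 1,000,000,000
  similarCases: 100,
  events: 500_000 * 20,
});

console.log(expected);                     // 1
console.log(chanceOfAtLeastOne(expected)); // roughly 0.63
```

An expected count of 1 per month does not mean that exactly one bug shows up every month. Under the stated assumption, about 63% of months see at least one.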

\text{And for each 100,000 carts?}\\ + 20\\ \text{20 times, 40 times, 80 times, ... per month}

Locks

  • In local system: Lock, Semaphore, Mutex, Queue, ...
  • In distributed system: Distributed locks/sessions with Redis, Service Bus, ...

Pro:

  • Covers a lot of cases

  • May be implemented centrally => don't need to think about it most of the time

Con:

  • Costly, potentially slow

  • Can become a bottleneck

  • Deadlocks
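
For the local case, a lock can be as small as a promise chain. A minimal sketch, assuming a single Node.js process and an in-memory `Map` standing in for the basket collection; `Mutex`, `addOne`, and `demo` are hypothetical names, not part of any library. It does not help when a second process touches the same data, which is where distributed locks come in.

```javascript
// Minimal in-process mutex: callers queue up on a promise chain.
class Mutex {
  constructor() {
    this.tail = Promise.resolve();
  }

  // Runs fn after all previously queued callers have finished.
  runExclusive(fn) {
    const result = this.tail.then(() => fn());
    // Keep the chain usable even if fn rejects.
    this.tail = result.catch(() => {});
    return result;
  }
}

const tick = () => new Promise((resolve) => setTimeout(resolve, 1));

// Read-modify-write with an async gap between the read and the write.
async function addOne(store, guid) {
  const item = { ...store.get(guid) }; // read a copy, like a find()
  await tick();                        // simulated network latency
  item.amount++;
  store.set(guid, item);               // write back, like a replace()
}

async function demo() {
  // Without the lock both calls read amount 0, so one update is lost.
  const plain = new Map([["item-1", { amount: 0 }]]);
  await Promise.all([addOne(plain, "item-1"), addOne(plain, "item-1")]);

  // With the lock the second call only starts after the first has finished.
  const guarded = new Map([["item-1", { amount: 0 }]]);
  const mutex = new Mutex();
  await Promise.all([
    mutex.runExclusive(() => addOne(guarded, "item-1")),
    mutex.runExclusive(() => addOne(guarded, "item-1")),
  ]);

  return {
    unlocked: plain.get("item-1").amount,
    locked: guarded.get("item-1").amount,
  };
}

demo().then((result) => console.log(result)); // { unlocked: 1, locked: 2 }
```

The price is visible in the sketch as well: the two guarded calls run one after the other, which is the "potentially slow" and "bottleneck" point from the list of cons.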

Atomic updates

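// Increment on the server in one step instead of find + replace or find + $set
await basketItems.update({ guid }, { $inc: { amount: 1 } });

// Insert or increment in one step; guid is only generated when a new document is inserted
await basketItems.update(
  { itemId },
  { $inc: { amount: 1 }, $setOnInsert: { guid: genGuid() } },
  { upsert: true }
);

// Select and close the open payments in one step; the server sets completedOn
await payments.update(
  { completedOn: null },
  { $set: { wasForceClosed: true }, $currentDate: { completedOn: true } }
);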

Pro: Fast, safe, simple

Con: Not always feasible
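
The difference between a read-modify-write and an atomic increment can be tried out without a database. A minimal sketch, assuming an in-memory stand-in for a collection: `FakeCollection` is made up for this example, is not a MongoDB driver, and its `update` only understands `$inc`. Every call takes one simulated network round trip.

```javascript
const delay = () => new Promise((resolve) => setTimeout(resolve, 1));

// In-memory stand-in for a collection (assumption, for illustration only).
class FakeCollection {
  constructor(docs) {
    this.docs = new Map(docs.map((doc) => [doc.guid, { ...doc }]));
  }
  async find({ guid }) {
    await delay();
    return { ...this.docs.get(guid) };
  }
  async replace({ guid }, doc) {
    await delay();
    this.docs.set(guid, { ...doc });
  }
  // Supports only { $inc: { field: n } }. The change is applied in one
  // synchronous step, so no other call can run in between.
  async update({ guid }, { $inc }) {
    await delay();
    const doc = this.docs.get(guid);
    for (const [field, by] of Object.entries($inc)) doc[field] += by;
  }
}

// The problematic pattern: the value is computed on the client.
async function readModifyWrite(coll, guid) {
  const item = await coll.find({ guid });
  item.amount++;
  await coll.replace({ guid }, item);
}

// The atomic pattern: the value is computed where the data lives.
async function atomicIncrement(coll, guid) {
  await coll.update({ guid }, { $inc: { amount: 1 } });
}

// Two clients add one item at the same time; returns the final amount.
async function run(addOne) {
  const coll = new FakeCollection([{ guid: "a", amount: 1 }]);
  await Promise.all([addOne(coll, "a"), addOne(coll, "a")]);
  return (await coll.find({ guid: "a" })).amount;
}

async function main() {
  console.log("read-modify-write:", await run(readModifyWrite)); // 2 (one update lost)
  console.log("atomic $inc:", await run(atomicIncrement));       // 3
}
main();
```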

Optimistic locking

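database                    {_id: 1, version: 1, x: 1 }
client 1: oldVersion = 1    {_id: 1, version: 1, x: 1 }    (reads)
client 2: oldVersion = 1    {_id: 1, version: 1, x: 1 }    (reads)
client 1: oldVersion = 1    {_id: 1, version: 1, x: 2 }    (changes x locally)
client 2: oldVersion = 1    {_id: 1, version: 1, x: 3 }    (changes x locally)
database                    {_id: 1, version: 2, x: 2 }    (client 1's write is accepted, version is bumped)
client 2: oldVersion = 2    {_id: 1, version: 2, x: 2 }    (write rejected because the version is no longer 1; reads again)
client 2: oldVersion = 2    {_id: 1, version: 2, x: 4 }    (reapplies its change)
database                    {_id: 1, version: 3, x: 4 }    (client 2's write is accepted)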

Pro:

  • Usually next to no overhead (collisions are rare)
  • Versatile and safe

Con: All clients must use this scheme

// Usage: increment the amount of one basket item inside a session document.
await optimisticLock(
  sessionCollection,
  Builders<SessionModel>.Filter.Eq(x => x.guid, model.sessionGuid),
  session => {
    var item = session.Items.First(x => x.BasketItemGuid == model.BasketItemGuid);
    item.amount++;
    return Task.FromResult(session);
  }
);

What could it look like?

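using System;
using System.Threading.Tasks;
using MongoDB.Driver;

// Versioned is assumed to expose Id and version.
async Task optimisticLock<T>(
  IMongoCollection<T> coll,
  FilterDefinition<T> filter,
  Func<T, Task<T>> fn
) where T : Versioned
{
  for (var attempt = 0; attempt < 100; attempt++) {
    // Read the current document and remember the version it had.
    var item = await coll.Find(filter).Limit(1).FirstAsync();
    var prevVersion = item.version;

    // Apply the change locally and bump the version.
    var updated = await fn(item);
    updated.version = prevVersion + 1;

    // Replace only if nobody else has written in the meantime.
    var result = await coll.ReplaceOneAsync(
      Builders<T>.Filter.And(new FilterDefinition<T>[] {
        Builders<T>.Filter.Eq(x => x.Id, item.Id),
        Builders<T>.Filter.Eq(x => x.version, prevVersion)
      }),
      updated
    );
    if (result.ModifiedCount > 0) return;
    // Version mismatch: another client was faster, so read again and retry.
  }
  throw new Exception("optimisticLock: gave up after 100 attempts");
}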
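
The retry loop can also be tried without a database. The following JavaScript sketch replays the two-client sequence from the optimistic locking slide. It assumes an in-memory stand-in for the collection: `VersionedStore` and `replaceIfVersion` are hypothetical names made up for this example, and the second client is assumed to add 2 to `x`, which matches the values shown on the slide.

```javascript
const delay = () => new Promise((resolve) => setTimeout(resolve, 1));

// In-memory stand-in holding one versioned document (assumption, not a
// MongoDB driver API). Every call takes one simulated round trip.
class VersionedStore {
  constructor(doc) {
    this.doc = { ...doc };
  }
  async read() {
    await delay();
    return { ...this.doc };
  }
  // Replaces the document only if its version still equals expectedVersion.
  async replaceIfVersion(expectedVersion, next) {
    await delay();
    if (this.doc.version !== expectedVersion) return false;
    this.doc = { ...next };
    return true;
  }
}

// Read, change locally, write conditionally; retry on a version mismatch.
async function optimisticLock(store, fn, maxAttempts = 100) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const item = await store.read();
    const prevVersion = item.version;
    const updated = await fn(item);
    updated.version = prevVersion + 1;
    if (await store.replaceIfVersion(prevVersion, updated)) return updated;
  }
  throw new Error(`optimisticLock: gave up after ${maxAttempts} attempts`);
}

async function demo() {
  const store = new VersionedStore({ _id: 1, version: 1, x: 1 });
  let client2Attempts = 0;
  await Promise.all([
    // client 1 adds 1 to x
    optimisticLock(store, (doc) => ({ ...doc, x: doc.x + 1 })),
    // client 2 adds 2 to x and counts how often its change had to be applied
    optimisticLock(store, (doc) => {
      client2Attempts++;
      return { ...doc, x: doc.x + 2 };
    }),
  ]);
  return { doc: await store.read(), client2Attempts };
}

demo().then((result) => console.log(result));
// { doc: { _id: 1, version: 3, x: 4 }, client2Attempts: 2 }
```

Because the change function may run more than once, it should only modify the document it receives and must not have side effects of its own.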

Thank you for your attention

Backup slides

Race conditions in distributed systems: Atomic updates, optimistic locking, ...

By Marco Schumacher
