The Database

--; DROP TABLE swsUser

But first, what is it?

  • The source of truth for all company data
  • The main storage of all user data
  • Required by pretty much every service SWS has
  • It's kind of a big deal

The Database through time

What is this? A database for ants?

CIQ Data

Database server

Rackspace

We heard you like credits

Let's put it all on one big server lol

Things are getting slow...

Let's make lots of little baby servers

Why is everything still going down?

Hello darkness my old friend

CIQ, we need to break up

  • We started with SWS data and CIQ data living together
  • The combined database size was 1.9TB
  • The total disk space was 2.0TB
  • Clearing space before we hit 0 bytes free was a daily chore for me
  • Backing up SWS data required hacky methods

What a database shouldn't do:

  • Be at constant high CPU usage
  • Have <50GB of free disk space
  • Be almost impossible to back up
  • Also have an uptime like this
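The disk-space point above is the kind of thing a small automated check can watch for, instead of a daily manual chore. A minimal sketch, assuming the 50GB threshold from the slide and a hypothetical data-volume path; the function names are illustrative and not part of any existing SWS tooling:

```python
import shutil

GB = 1024 ** 3  # bytes in a gibibyte


def free_space_ok(free_bytes, min_free_gb=50):
    """Return True when the free space meets the minimum threshold."""
    return free_bytes >= min_free_gb * GB


def check_volume(path, min_free_gb=50):
    """Check the volume that contains `path` (hypothetical data directory)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return free_space_ok(usage.free, min_free_gb)
```

Calling `check_volume` from a scheduled job and alerting on `False` would flag the problem before the disk fills. The path passed in is an assumption about where the data files live.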

What we need from the database:

  • High availability
  • High performance
  • Scalable
  • Frequent & reliable backups

SQL Availability Groups

  • Three or more SQL servers sharing the same databases
  • Consists of one Primary and two or more Secondaries
  • Writes are persisted on the Primary, which then synchronises the Secondaries
  • Reads optionally go to the Secondaries for read-heavy apps like the Batch
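One way a read-heavy app can opt in to reading from a Secondary is through its connection string. A minimal sketch, assuming the apps connect over ODBC through an availability group listener and that read-only routing has been configured on the group; the listener and database names used with it are hypothetical:

```python
def build_connection_string(listener, database, read_only=False,
                            driver="ODBC Driver 18 for SQL Server"):
    """Build an ODBC connection string for an availability group listener.

    ApplicationIntent=ReadOnly asks the listener to route the session to a
    readable Secondary, which only happens if read-only routing is set up.
    """
    intent = "ReadOnly" if read_only else "ReadWrite"
    parts = [
        f"Driver={{{driver}}}",
        f"Server=tcp:{listener},1433",   # listener name, not a single node
        f"Database={database}",
        "Trusted_Connection=Yes",        # assumes Windows authentication
        "MultiSubnetFailover=Yes",       # faster reconnect after failover
        f"ApplicationIntent={intent}",
    ]
    return ";".join(parts) + ";"
```

A batch job would call `build_connection_string("ag-listener.example", "ExampleDb", read_only=True)`, while services that write keep the default read-write intent and always land on the Primary.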

SQL Availability Groups

How is this better than an even bigger server?

  • One server is one big point of failure
  • Maintenance requires outages
  • There's a cap on database performance
  • Disaster recovery can potentially take days

A Failure scenario

  1. Primary node blows up
  2. Secondaries decide on who becomes the new Primary
  3. New Primary takes over the primary IPs
  4. Apps reconnect after short downtime (60-90s)
  5. Old Primary is fixed, comes back online
  6. Rejoins the group
  7. Becomes a secondary and starts synchronising
  8. Availability Group returns to Healthy
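Step 4 relies on the apps retrying their connection while the new Primary takes over. A minimal sketch of that reconnect loop, assuming a 120-second budget to cover the 60-90s window and a hypothetical `connect` callable that raises `ConnectionError` while the database is unreachable; real drivers raise their own exception types, so the `except` clause would need adjusting:

```python
import time


def run_with_reconnect(connect, operation, max_wait_s=120.0,
                       first_delay_s=1.0, max_delay_s=15.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Run operation(conn), retrying with exponential backoff on failure.

    Gives up and re-raises once the next wait would pass the deadline.
    """
    deadline = clock() + max_wait_s
    delay = first_delay_s
    while True:
        try:
            conn = connect()          # may fail while failover is in progress
            return operation(conn)
        except ConnectionError:
            if clock() + delay > deadline:
                raise                 # failover took longer than the budget
            sleep(delay)
            delay = min(delay * 2, max_delay_s)  # back off, capped
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. The cap on the delay keeps the app from sitting idle long after the group has returned to Healthy.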

Backups

  • Now we can do them properly
  • Restoring to staging is a lot easier
  • Backups are stored at 15-minute resolution

Live demo

This could go terribly wrong

SWL Availability Group

By Jabin Bastian

SWL Availability Group

  • 190