MSc, Computer and Systems Eng @ Tallinn Technical University
BSc, Applied Mathematics @ Yildiz Technical University
(was on) Board of Directors @ PostgreSQL Europe
Organizer @ Prague PostgreSQL Meetup
Working with databases for 10+ years
(lives in) Prague
(from) Turkey
(is) New Mom
PostgreSQL Fault Tolerance: WAL
Fault Types in Database Systems
Transaction - Commit - Checkpoint
Replication Methods in PostgreSQL
Failover and Switchover
Managing Timeline Issues: pg_rewind
Synchronous Replication (synchronous_commit)
Logical Decoding
Backups
A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely, when some part of the system fails.
User application bugs
Administrator (human) errors
Database software failures
Operating system failures
Hardware failures (disk)
Network failures
Datacenter-level events
Write-ahead logging (WAL) is PostgreSQL's main fault-tolerance mechanism; it ensures the durability of all database changes.
Database changes themselves are not written to data files on disk at transaction commit.
Standard SQL Transaction Isolation Levels
Writes to data files are done sometime later by the background writer or checkpointer on a server.
Crash recovery replays the WAL, but from what point does it start to recover?
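The commit, checkpoint and recovery steps above can be sketched with a toy key-value store. This is a minimal illustration, not PostgreSQL code: the class `ToyWalStore`, its file names and its integer LSNs are all hypothetical, and it assumes recovery starts from the redo point recorded by the last checkpoint.

```python
import json
import os
import tempfile


class ToyWalStore:
    """Toy key-value store illustrating write-ahead logging (hypothetical, not PostgreSQL)."""

    def __init__(self, directory):
        self.wal_path = os.path.join(directory, "wal.log")
        self.data_path = os.path.join(directory, "data.json")
        self.pages = {}      # in-memory state, standing in for shared buffers
        self.next_lsn = 1    # position of the next WAL record (simplified LSN)
        self.replayed = 0    # number of WAL records replayed by the last recovery
        self.recover()

    def commit(self, key, value):
        # Commit appends a WAL record and forces it to disk; the data file is not touched.
        record = {"lsn": self.next_lsn, "key": key, "value": value}
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self.pages[key] = value
        self.next_lsn += 1
        return record["lsn"]

    def checkpoint(self):
        # Write the in-memory state to the data file and record the redo point.
        snapshot = {"redo_lsn": self.next_lsn, "pages": self.pages}
        tmp_path = self.data_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.data_path)  # atomic swap of the data file

    def recover(self):
        # Load the last checkpoint, then replay only WAL records at or after its redo point.
        redo_lsn = 1
        if os.path.exists(self.data_path):
            with open(self.data_path, encoding="utf-8") as f:
                snapshot = json.load(f)
            self.pages = snapshot["pages"]
            redo_lsn = snapshot["redo_lsn"]
        last_lsn = redo_lsn - 1
        self.replayed = 0
        if os.path.exists(self.wal_path):
            with open(self.wal_path, encoding="utf-8") as wal:
                for line in wal:
                    record = json.loads(line)
                    last_lsn = max(last_lsn, record["lsn"])
                    if record["lsn"] >= redo_lsn:
                        self.pages[record["key"]] = record["value"]
                        self.replayed += 1
        self.next_lsn = last_lsn + 1
```

If a store commits one change, checkpoints, commits a second change and is then reopened from the same directory (simulating a crash), recovery reads the first change from the data file and replays only the second from the WAL.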
Database replication is the term we use to describe the technology used to maintain a copy of a set of data on a remote system.
[Timeline figure: 2000, 2005, 2010, 2014, 2017]
WAL over network from master to standby
The wal_level parameter determines how much information is written to the WAL.
WAL level | Suitable for |
---|---|
minimal | crash recovery |
replica (default at PG12) | physical replication, file-based archiving |
logical | logical replication |
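Because each WAL level also includes what the lower levels log, the table can be read as "lowest level required per feature". A small sketch of that lookup follows; the names `WAL_LEVELS`, `REQUIRED_LEVEL` and `supports` are hypothetical helpers for illustration, not part of PostgreSQL.

```python
# WAL levels ordered from least to most information written to the WAL.
WAL_LEVELS = ["minimal", "replica", "logical"]

# Feature -> lowest wal_level that supports it, following the table above.
REQUIRED_LEVEL = {
    "crash recovery": "minimal",
    "physical replication": "replica",
    "file-based archiving": "replica",
    "logical replication": "logical",
}


def supports(wal_level, feature):
    """Return True if the given wal_level writes enough WAL for the feature."""
    return WAL_LEVELS.index(wal_level) >= WAL_LEVELS.index(REQUIRED_LEVEL[feature])
```

For example, `supports("replica", "logical replication")` is false, while `supports("logical", "physical replication")` is true.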
Failover
Switchover
Timelines provide protection from connecting to the wrong upstream after promotion (failover, switchover).
[Diagrams: timelines TL1 and TL2 for Master (Old master) and Standby (New master)]
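The check behind timeline handling can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not how pg_rewind is implemented: `needs_rewind` and its arguments are hypothetical, and the new master's history is assumed to be given as (parent timeline, switchpoint LSN) pairs. Only the LSN text format ("hi/lo" in hexadecimal) is taken from PostgreSQL itself.

```python
def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' in hex (for example '0/3000060') to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def needs_rewind(old_master_timeline, old_master_lsn, new_master_history):
    """Return True if the old master wrote WAL past the point where the new master forked.

    new_master_history is a list of (parent_timeline, switchpoint_lsn_text) pairs
    describing where each timeline in the new master's ancestry ended.
    """
    for parent_timeline, switchpoint in new_master_history:
        if parent_timeline == old_master_timeline:
            # WAL beyond the switchpoint exists only on the old master: it has diverged.
            return parse_lsn(old_master_lsn) > parse_lsn(switchpoint)
    raise ValueError("old master's timeline is not an ancestor of the new master")
```

If the old master kept accepting writes on TL1 after the standby was promoted to TL2, its LSN is past the switchpoint and it cannot simply follow the new master; that is the situation pg_rewind is meant to resolve.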
Synchronous replication guarantees that data is written to at least two nodes before the user or application is told that a transaction has committed.
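A toy model of that guarantee is sketched below. The `Primary` and `Standby` classes and the `required_acks` parameter are hypothetical and exist only for illustration; in particular, the sketch raises an error when too few standbys acknowledge, whereas a real PostgreSQL commit would keep waiting for the acknowledgement.

```python
class Standby:
    """Toy standby that stores WAL records and acknowledges them."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.wal = []

    def receive(self, record):
        # Return True as an acknowledgement once the record is stored.
        if not self.reachable:
            return False
        self.wal.append(record)
        return True


class Primary:
    """Toy primary that reports a commit only after enough standbys acknowledge."""

    def __init__(self, standbys, required_acks=1):
        self.standbys = standbys
        self.required_acks = required_acks
        self.wal = []

    def commit(self, record):
        self.wal.append(record)  # local WAL write on the primary
        acks = sum(1 for standby in self.standbys if standby.receive(record))
        if acks < self.required_acks:
            # Simplification: PostgreSQL would block here instead of failing.
            raise TimeoutError("not enough standbys acknowledged the commit")
        return acks
```

With one reachable and one unreachable standby, `commit` still succeeds because the record now exists on two nodes: the primary and the reachable standby.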
Logical replication allows us to stream logical data changes between two nodes.
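To make "logical data changes" concrete, here is a minimal sketch in which changes are row-level operations rather than raw WAL bytes. The change format (dictionaries with `op`, `table`, `key` and `row`) and the functions `apply_change` and `replicate` are assumptions made for illustration; they do not mirror PostgreSQL's actual logical decoding output.

```python
def apply_change(tables, change):
    """Apply one row-level change to the subscriber's tables (dict of table -> rows)."""
    rows = tables.setdefault(change["table"], {})
    if change["op"] in ("insert", "update"):
        rows[change["key"]] = change["row"]
    elif change["op"] == "delete":
        rows.pop(change["key"], None)
    else:
        raise ValueError("unknown operation: " + change["op"])


def replicate(changes, subscribed_tables):
    """Build the subscriber's state from a stream of changes, filtered by table."""
    tables = {}
    for change in changes:
        if change["table"] in subscribed_tables:
            apply_change(tables, change)
    return tables
```

Because changes are expressed per table and per row, the receiving node can subscribe to only some tables, which streaming the WAL itself does not allow.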
What will be the next big leap of fault tolerance?