Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud

Gülçin Yıldırım 156398IASM

Master's Thesis, June 2017

Supervisor: Tarmo Robal, PhD

Agenda

Goals & Motivation

Problems & How to Solve?

Platform Implementation

Case Studies & Results

Outlook

Conclusion & Summary

Problems?

DOWNTIME

Revenue Loss

Reputation Loss

High Availability?

SLAs?

Low Capacity?

Database Anyone?

Banks

Social Networks

Desktop Apps

Startups

Medium-size

Enterprises

To Upgrade, or not to Upgrade

New features
Security patches
Perfomance Updates
Bug fixes

Outdated, no support
Vulnerable to attacks
Poor Perfomance
Buggy, hard to maintain

Why Automate?

Risk & Errors
Cost
Time-to-market

Reproducibility
Repeatability
Efficiency

Updating nasa.gov: 1 hr to 5 mins
Patching updates: Multi-day to 45 mins
Application stack setup: 1-2 hrs to 10 mins

Ansible Loves PostgreSQL

( in the s )

Postgres Modules: 6

AWS Modules: 100

PostgreSQL

252

Database

208

Cloud

116

Upgrade

Power of Logical Replication

Offline Conversion
- pg_dump/pg_restore
- pg_upgrade

Online Conversion
- pglogical
- pglupgrade

1

2 Elements of the solution

pglogical

pgbouncer

Ansible

AWS

pglupgrade

How

Does It Work?

First Case: High Availability

First Case: High Availability (2)

Second Case: Read Scalability

Second Case: Read Scalability(2)

Results

Metric (1st case)	pg_dump/pg_restore	pg_upgrade	pglupgrade
Primary Downtime	00:24:27	00:16:25	00:00:03
Partial cluster HA	00:24:27	00:28:56	00:00:03
Full cluster capacity	01:02:27	00:28:56	00:38:00
Length of upgrade	01:02:27	00:28:56	01:38:10
Extra disk space	800 MB	27 GB	10 GB

Metric (2nd case)	pg_dump/pg_restore	pg_upgrade	pglupgrade
Primary Downtime	00:23:52	00:17:03	00:00:05
Partial cluster HA	00:23:52	00:54:29	00:00:05
Full cluster capacity	00:23:52	03:19:16	00:00:05
Length of upgrade	00:23:52	03:19:16	01:02:10
Extra disk space	800 MB	27 GB	10 GB

Back to the Future

There is a need for a near-zero downtime automated upgrade solution for PostgreSQL [PgCon 2017 Developer Meeting]
Solution will be introduced to the community at PGDay'17 Russia on 6th July
The tool will have a use in 2ndQuadrant and our customers
Applicable to other DBMSs through logical replication
Applicable to other Operating Systems
PostgreSQL 10 will have built-in logical replication

Conclusion

Database clusters can be upgraded with minimal downtime without users being affected while the upgrade is happening.
An application can still respond to the request only with a small drop in performance.
Thesis and the case studies prove that pglupgrade minimizes the downtime to the level of 3-5 seconds.

Summary

THE GOAL: Minimize the downtime for database upgrades, specifically on PostgreSQL
METHODOLOGY: Engineer logical replication as an upgrade method
IMPLEMENTATION: Develop pglupgrade tool to automate the upgrade process using Ansible
CASE STUDIES: Benchmark on two different PostgreSQL clusters in cloud
RESULTS & CONCLUSION: Near-Zero downtime achieved without affecting the users

Thanks!

Questions

First Question

[old-primary]
54.171.211.188

[new-primary]
54.246.183.100

[old-standbys]
54.77.249.81
54.154.49.180

[new-standbys:children]
old-standbys

[pgbouncer]
54.154.49.180

Inventory file hosts.ini


$ ansible-playbook -i hosts.ini pglupgrade.yml

Running pglupgrade playbook

Second Question

- name: Switch the pgbouncer (and applications) to new primary
  hosts: pgbouncer
  become: true
  become_user: postgres
  vars_files:
      - config.yml
  roles:
      - role: pgbouncer
        tags:
          - pgbouncer

pgbouncer play in pglupgrade playbook

remove

add

Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud

Agenda

Goals & Motivation

Problems & How to Solve?

Platform Implementation

Case Studies & Results

Outlook

Conclusion & Summary

Problems?

DOWNTIME

Revenue Loss

Reputation Loss

High Availability?

SLAs?

Low Capacity?

Database Anyone?

Banks

Social Networks

Desktop Apps

Startups

Medium-size

Enterprises

To Upgrade, or not to Upgrade

New features

Security patches

Perfomance Updates

Bug fixes

Outdated, no support

Vulnerable to attacks

Poor Perfomance

Buggy, hard to maintain

Why Automate?

Ansible Loves PostgreSQL

( in the s )

Postgres Modules: 6

AWS Modules: 100

Power of Logical Replication

pg_dump/pg_restore

pg_upgrade

pglogical

pglupgrade

1

2

Elements of the solution

pglogical

pgbouncer

Ansible

AWS

pglupgrade

How

Does It Work?

First Case: High Availability

First Case: High Availability (2)

Second Case: Read Scalability

Second Case: Read Scalability(2)

Results

Back to the Future

Conclusion

Summary

Thanks!

Questions

First Question

Second Question