Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud

Gülçin Yıldırım 156398IASM

Master's Thesis, June 2017

Supervisor: Tarmo Robal, PhD

Agenda

Goals & Motivation

Problems & How to Solve?

Platform Implementation

Case Studies & Results

Outlook

Conclusion & Summary

Problems?

DOWNTIME

Revenue Loss

Reputation Loss

High Availability?

SLAs?

Low Capacity?

Database Anyone?

Banks

Social Networks

Desktop Apps

Startups

Medium-size

Enterprises

To Upgrade, or not to Upgrade

  • New features

  • Security patches

  • Perfomance Updates

  • Bug fixes

  • Outdated, no support

  • Vulnerable to attacks

  • Poor Perfomance

  • Buggy, hard to maintain

Why Automate?

  • Risk & Errors
  • Cost
  • Time-to-market

 

  • Reproducibility
  • Repeatability
  • Efficiency

 

  • Updating nasa.gov: 1 hr to 5 mins
  • Patching updates: Multi-day to 45 mins
  • Application stack setup: 1-2 hrs to 10 mins

Ansible Loves PostgreSQL

( in the           s )

Postgres Modules: 6

AWS Modules: 100

PostgreSQL

252

Database

208

Cloud

116

Upgrade

50

Power of Logical Replication

  • Offline Conversion

    • pg_dump/pg_restore

    • pg_upgrade

  • Online Conversion

    • pglogical

    • pglupgrade

1

2

Elements of the solution

pglogical

pgbouncer

Ansible

AWS

pglupgrade

How

Does It Work?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

First Case: High Availability

First Case: High Availability (2)

Second Case: Read Scalability

Second Case: Read Scalability(2)

Results

Metric (1st case) pg_dump/pg_restore pg_upgrade pglupgrade
Primary Downtime 00:24:27 00:16:25 00:00:03
Partial cluster HA   00:24:27 00:28:56 00:00:03
Full cluster capacity 01:02:27 00:28:56 00:38:00
Length of upgrade 01:02:27 00:28:56 01:38:10
Extra disk space 800 MB 27 GB 10 GB
Metric (2nd case)  pg_dump/pg_restore pg_upgrade pglupgrade
Primary Downtime 00:23:52 00:17:03 00:00:05
Partial cluster HA   00:23:52 00:54:29 00:00:05
Full cluster capacity 00:23:52 03:19:16 00:00:05
Length of upgrade 00:23:52 03:19:16 01:02:10
Extra disk space 800 MB 27 GB 10 GB

Back to the Future

  • There is a need for a near-zero downtime automated upgrade solution for PostgreSQL [PgCon 2017 Developer Meeting]
  • Solution will be introduced to the community at PGDay'17 Russia on 6th July
  • The tool will have a use in 2ndQuadrant and our customers
  • Applicable to other DBMSs through logical replication
  • Applicable to other Operating Systems
  • PostgreSQL 10 will have built-in logical replication

Conclusion

  • Database clusters can be upgraded with minimal downtime without users being affected while the upgrade is happening.

  • An application​ can still respond to the request only with a small drop in performance.

  • Thesis and the case studies prove that pglupgrade minimizes the downtime to the level of 3-5 seconds.

Summary

  • THE GOAL: Minimize the downtime for database upgrades, specifically on PostgreSQL

  • METHODOLOGY: Engineer logical replication as an upgrade method

  • IMPLEMENTATION: Develop pglupgrade tool to automate the upgrade process using Ansible

  • CASE STUDIES: Benchmark on two different PostgreSQL clusters in cloud

  • RESULTS & CONCLUSION: Near-Zero downtime achieved without affecting the users 

Thanks!

Questions

First Question

[old-primary]
54.171.211.188

[new-primary]
54.246.183.100

[old-standbys]
54.77.249.81
54.154.49.180

[new-standbys:children]
old-standbys

[pgbouncer]
54.154.49.180

Inventory file hosts.ini


$ ansible-playbook -i hosts.ini pglupgrade.yml

Running pglupgrade playbook

Second Question

- name: Switch the pgbouncer (and applications) to new primary
  hosts: pgbouncer
  become: true
  become_user: postgres
  vars_files:
      - config.yml
  roles:
      - role: pgbouncer
        tags:
          - pgbouncer

pgbouncer play in pglupgrade playbook

remove

add