Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud

Gülçin Yıldırım 156398IASM

Master's Thesis, June 2017

Supervisor: Tarmo Robal, PhD

Agenda

Goals & Motivation

Problems & How to Solve?

Platform Implementation

Case Studies & Results

Outlook

Conclusion & Summary

Problems?

DOWNTIME

Revenue Loss

Reputation Loss

High Availability?

SLAs?

Low Capacity?

Database Anyone?

Banks

Social Networks

Desktop Apps

Startups

Medium-size

Enterprises

To Upgrade, or not to Upgrade

  • New features

  • Security patches

  • Perfomance Updates

  • Bug fixes

  • Outdated, no support

  • Vulnerable to attacks

  • Poor Perfomance

  • Buggy, hard to maintain

Why Automate?

  • Risk & Errors
  • Cost
  • Time-to-market

 

  • Reproducibility
  • Repeatability
  • Efficiency

 

  • Updating nasa.gov: 1 hr to 5 mins
  • Patching updates: Multi-day to 45 mins
  • Application stack setup: 1-2 hrs to 10 mins

Ansible Loves PostgreSQL

( in the           s )

Postgres Modules: 6

AWS Modules: 100

PostgreSQL

252

Database

208

Cloud

116

Upgrade

50

Power of Logical Replication

  • Offline Conversion

    • pg_dump/pg_restore

    • pg_upgrade

  • Online Conversion

    • pglogical

    • pglupgrade

1

2

Elements of the solution

pglogical

pgbouncer

Ansible

AWS

pglupgrade

How

Does It Work?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

First Case: High Availability

First Case: High Availability (2)

Second Case: Read Scalability

Second Case: Read Scalability(2)

Results

Metric (1st case) pg_dump/pg_restore pg_upgrade pglupgrade
Primary Downtime 00:24:27 00:16:25 00:00:03
Partial cluster HA   00:24:27 00:28:56 00:00:03
Full cluster capacity 01:02:27 00:28:56 00:38:00
Length of upgrade 01:02:27 00:28:56 01:38:10
Extra disk space 800 MB 27 GB 10 GB
Metric (2nd case)  pg_dump/pg_restore pg_upgrade pglupgrade
Primary Downtime 00:23:52 00:17:03 00:00:05
Partial cluster HA   00:23:52 00:54:29 00:00:05
Full cluster capacity 00:23:52 03:19:16 00:00:05
Length of upgrade 00:23:52 03:19:16 01:02:10
Extra disk space 800 MB 27 GB 10 GB

Back to the Future

  • There is a need for a near-zero downtime automated upgrade solution for PostgreSQL [PgCon 2017 Developer Meeting]
  • Solution will be introduced to the community at PGDay'17 Russia on 6th July
  • The tool will have a use in 2ndQuadrant and our customers
  • Applicable to other DBMSs through logical replication
  • Applicable to other Operating Systems
  • PostgreSQL 10 will have built-in logical replication

Conclusion

  • Database clusters can be upgraded with minimal downtime without users being affected while the upgrade is happening.

  • An application​ can still respond to the request only with a small drop in performance.

  • Thesis and the case studies prove that pglupgrade minimizes the downtime to the level of 3-5 seconds.

Summary

  • THE GOAL: Minimize the downtime for database upgrades, specifically on PostgreSQL

  • METHODOLOGY: Engineer logical replication as an upgrade method

  • IMPLEMENTATION: Develop pglupgrade tool to automate the upgrade process using Ansible

  • CASE STUDIES: Benchmark on two different PostgreSQL clusters in cloud

  • RESULTS & CONCLUSION: Near-Zero downtime achieved without affecting the users 

Thanks!

Questions

First Question

[old-primary]
54.171.211.188

[new-primary]
54.246.183.100

[old-standbys]
54.77.249.81
54.154.49.180

[new-standbys:children]
old-standbys

[pgbouncer]
54.154.49.180

Inventory file hosts.ini


$ ansible-playbook -i hosts.ini pglupgrade.yml

Running pglupgrade playbook

Second Question

- name: Switch the pgbouncer (and applications) to new primary
  hosts: pgbouncer
  become: true
  become_user: postgres
  vars_files:
      - config.yml
  roles:
      - role: pgbouncer
        tags:
          - pgbouncer

pgbouncer play in pglupgrade playbook

remove

add

Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud

By Gülçin Yıldırım Jelínek

Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud

This presentation is prepared as Master's Thesis defence of Gulcin Yildirim at Tallinn University of Technology. All rights are reserved. Copying without permission is not allowed.

  • 2,264