MySQL @ Synthesio

@gestassy // gestassy@synthesio.com
  • Used to handle PB of web traffic @ eTF1

  • Now cope with billions of Mentions @ Synthesio

  • Some music, also

whoami

  • Synthesio is the leading social intelligence tool for social media monitoring & social analytics.

  • Synthesio crawls the Web for relevant data, enriches it with sentiment analysis and demographics to build social analytics dashboards.

Synthesio

  • MySQL, SGBD since 1995
  • Your workload is not mine

    • = my solution is certainly not what you need

    • = you can’t rely on generic workload tests

  • I won’t go into mysql internals/system/kernel tracing

Disclaimer

our main clusters

  • Historical storage for… everything, until 2015.

  • Percona PXC (Galera) or Master/Slave replication

  • InnoDB

  • a few clients, quite a lot of queries

  • simple structure and transactions

  • avg 30% storage organic growth

  • Stable architecture since 2015

MySQL @Synthesio, Oct 2017

MySQL @Synthesio, Oct 2017

Metadata Cluster

  • Percona PXC 5.6

  • 3.5To of deduplicated data

  • 650 inserts/s

  • 800 updates/s

  • 11k questions/s

  • 250 sim. connections

  • Percona PXC 5.6

  • 8To of deduplicated data

  • 1k inserts/s

  • 1.5k updates/s

  • 8k questions/s

  • 1k sim. connections

Frontend storage

  • Percona Master-Slave 5.7
  • 26To of deduplicated data
  • 700 insert/s ~= 60M Mentions/day
  • 500 sim. connections

Mentions cold storage

  • Galera principles
  • Galera not principles
  • Cluster - Qorum
  • Garbd
  • Replication sequence
  • Flow control
  • Node provisionning
  • SST / IST
  • Schema upgrade RSU/TOI/pt-online-schema-change
  • Load balacing with haproxy
  • Performance setup
  • Partitionned Tables
  • Monitoring
  • Backup policy

 

 

Agenda

Galera 3

  • InnoDB (or XtraDB) required

  • Multi-Master

  • Synchronous 

  • Automatic nodes provisioning

Galera main principles

wsrep_provider=/usr/lib/libgalera_smm.so
  • Write scalable : don't add nodes if you need more writes

  • Distributed :

    • Each node contains the full dataset

    • A slow node in write will impact the whole cluster write speed (outside gcs.fc_limit)

  • Perfectly synchronous : commit are applied in order but asynchronously  (by default wsrep_sync_wait=0)

Galera is not

Based on

  • Percona Server
  • Percona Xtrabackup
  • Codership Galera library
  • Performance improvements, bug fixes, security
  • pxc_strict_mode

 

Percona XtraDB Cluster

  • Galera is Qorum-based
  • Avoid split brain with an odd number of nodes
  • The 3rd node can be a Galera Arbitrator : garbd

Cluster - Qorum

  • It's only a facade
  • A process that is part of the cluster, receives and ack all the traffic of the cluster, but does not process any SQL
  • Cannot provoke Flow control
  • It can be hosted on a small machine : no CPU or Disk load
  • Essential component to replace a classic Master/Master Replication with a 2 nodes Galera cluster

Galera Arbitrator - Garbd

  • Galera communication is not mysql replication
  • Nodes speak to each other on port 4567 (Group communication)
  • ports 4568 and 4444 for IST/SST (initial setup)
  • Uses binary log events (but no physical binary log needed)
  • Parallel Replication (wsrep_slave_threads) is efficient
  • Each Transaction is replicated on commit

Cluster - Multi Master

  • Each Transaction is replicated on commit
  • On replication : first commit wins
  • 2-phase commit : Certification / Apply commit
  • Before responding to a commit on a node, the node has a mechanism to accept it = certification

Multi Master Replication

  • Node 1 receives commit: Query is in state "wsrep in pre-commit stage"
  • Node 1 creates a write-set (bin log event +  all references to PK, UK, FK)
  • Node 1 Replicates the write-set to the gcomm channel (wsrep_gcomm_uuid)
  • Awaits for certification of all nodes
  • Sends response : OK or DEADLOCK

Multi Master Replication

  • On the Group communication :
  • Node 1 receive the write-set, certifies it.
  • Node 2 and 3 certify the write-set and put the transaction in queue to be applied asynchronously (wsrep_local_recv_queue)

Multi Master Replication

  • Certification process ensure writeset can be applied on a node
  • It can fails if another writeset in the queue impacts the same keys. 
  • wsrep_local_cert_failures/ ​wsrep_local_bf_aborts​​​:
  • ​When a locally ongoing transaction conflits with the writeset to be applied, it is killed during the certification process
  • Long-lasting transactions may be more vulnerables

Certification

  • Traditional MySQL replication garantees at best a replication lag of 0 second
  • Galera replication garantees at worst a replication lag in numbers of transactions
  • If a node reach its gcs.fc_limit, it sends to the cluster a pause signal : wsrep_flow_control_sent, blocking all writes on the cluster, until (gcs.fc_limit x gcs.fc_factor) is met.

Galera Flow Control

wsrep_provider_options = "gcs.fc_limit = 10000; gcs.fc_factor = 0.8; gcs.fc_master_slave = yes"
  • gcs.fc_master_slave allows a dynamic change of the gcs.fc_limit (gcs.fc_limit = gcs.fc_limit x nodes quantity)
  • In most cases : set to yes to disable

Galera Flow Control

wsrep_provider_options = "gcs.fc_limit = 10000; gcs.fc_factor = 0.8; gcs.fc_master_slave = yes"
  • IST : Incremental State Transfer
  • SST : State SnapshotTransfer (Full dataset)

Automatic Node Provisioning

Joining Cluster Sequence

 wsrep_local_state

IST

  • Occurs when a previously member of the cluster re-joins it, with a consistent data set
  • The donor (wsrep_sst_donor) need to have all missing writesets in its Galera cache (and a little more)
  • => gcache file should be sized to avoid an SST after a maintenance period

IST - gcache calculation

show global status like 'wsrep_replicated_bytes'; 
show global status like 'wsrep_received_bytes';
+------------------------+-------------+
| Variable_name          | Value       |
+------------------------+-------------+
| wsrep_replicated_bytes | 14024686476 |
+------------------------+-------------+
1 row in set (0.01 sec)

+----------------------+----------------+
| Variable_name        | Value          |
+----------------------+----------------+
| wsrep_received_bytes | 25424591549980 |
+----------------------+----------------+
1 row in set (0.00 sec)
\frac{wsrep\_received\_bytes_{t1} - wsrep\_received\_bytes_{t0} + wsrep\_replicated\_bytes_{t1} - wsrep\_replicated\_bytes_{t0}} {(t1-t0)} * t(downtime)
wsrep_provider_options = "gcache.size=10G"

SST with xtrabackup

wsrep_sst_method=xtrabackup-v2
  • xtrabackup uses innoDB redo log, reads the LSN
  • xtrabackup copy the innodb data files and then performs crash recovery
  • Non-blocking for innoDB
  • my.cnf 

 

[sst]
inno-apply-opts="--use-memory=40G"

Schema Upgrades

  • TOI - Total Isolation Order
  • RSU - Rolling Schema upgrade
  • Percona's pt-online-schema-change

 

Total Order Isolation

  • DDL query is replicated as a normal statement
  • All nodes executes the modification at the same time, in a isolated way. Deadlock are sent to ongoing transactions, then everything is blocked until the modification is complete.
  • We noticed, on a heavy-write cluster, the DDL Query won't even occur.
SET SESSION wsrep_OSU_method='TOI';

Rolling Schema Upgrade

  • DDL query is applied locally only
  • The node is Desynchronized during the schema upgrade
  • The query must be applied manually to each node
  • There's an intermediate state where the schema is not consistent between the nodes
SET SESSION wsrep_OSU_method='RSU';

RSU - Blue/Green deployment ?

IT WORKS IN THEORY

not in production

Rolling Schema Upgrade

What happens when a table is created

pt-online-schema-change

  • works with wsrep_OSU_method = TOI 
  • Copy the structure of the table to be modified
  • Apply modifications on the copy
  • Copy rows
  • Rename tables and drop the old one
  • No locks on the source table during operation
  • Many security check 
pt-online-schema-change --alter "ADD COLUMN c1" D=testdb,t=mytable --max-flow-ctl=33.3 --execute

pt-online-schema-change

pt-online-schema-change --alter "ADD COLUMN c1 INT" D=test,t=history_uint --max-flow-ctl=33.3 --execute
No slaves found.  See --recursion-method if host sql01 has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `test`.`history_uint`...
Creating new table...
Created new table test._history_uint_new OK.
Altering new table...
Altered `test`.`_history_uint_new` OK.
2017-10-19T10:40:16 Creating triggers...
2017-10-19T10:40:17 Created triggers OK.
2017-10-19T10:40:17 Copying approximately 1 rows...
2017-10-19T10:40:17 Copied rows OK.
2017-10-19T10:40:17 Analyzing new table...
2017-10-19T10:40:17 Swapping tables...
2017-10-19T10:40:17 Swapped original and new tables OK.
2017-10-19T10:40:17 Dropping old table...
2017-10-19T10:40:17 Dropped old table `test`.`_history_uint_old` OK.
2017-10-19T10:40:17 Dropping triggers...
2017-10-19T10:40:17 Dropped triggers OK.
Successfully altered `test`.`history_uint`.

Schema Upgrade

  • Schema updates are more complex to handle with Galera
  • Avoid breaking (incompatible) change, like column type
  • Disassemble DDL orders in several steps
  • Keep DBs schema simples 

Load-balancing for Galera

HA Proxy

## Check ports for Mysql:
## Xinetd can answer on these ports
## 9001 : Master : auth test + simple query
## 9002 : Slave  : slave is running + slave lag not too high
## 9003 : Galera : galera cluster_status is Primary and local_state is 4


listen vip_mysql
    bind 10.10.10.10:3306
    default_backend be_mysql

backend be_mysql
    balance     source
    option httpchk GET
  server mysql01 10.10.0.12:3306 check port 9003 inter 5s rise 2 fall 3
  server mysql02 10.10.0.13:3306 check port 9003 inter 5s rise 2 fall 3  backup
  server mysql03 10.10.0.14:3306 check port 9003 inter 5s rise 2 fall 3  backup

TCP Connection but HTTP healthcheck :

Xinetd scripts based on https://github.com/olafz/percona-clustercheck

HA Proxy

  • Essential for Galera Operations and failover
  • When a node (N1) goes down, a backup (N2) takes over the traffic
  • When the node is back : 
    • New connections are made on N1
    • Persistent connections are still operating on N2
    • => provokes certifications failures and deadlocks
  • Solution :
    • => FW N1(port 3306) while State Transfer
    • => un-FW N1 and FW N2 when N1 wsrep_local_recv_queue=0

Future for us: ProxySQL

  • Sounds promising : 
    • Seamless failover
    • R/W Split, sharding
    • operations on query level
    • binlogs

Main levers for performances

Hardware / Kernel

  • RAID 10, XFS (25% faster on seq write on HDDs)

  • Linux kernel 4.x (we use 4.9 on Debian 8)

  • System tuning :

    • disk schedulers = deadline on bare metal, noop on VM

    • vm.swapiness = 1

    • increase open-files-limit in limits.conf

    • MTU 9000 (+8% network perf)

  • ​For Galera : same machines for all nodes

 

  • Network tolerance for Galera:

evs.keepalive_period = PT3S; evs.inactive_check_period = PT10S; evs.suspect_timeout = PT30S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M;"

my.cnf

 

  • innodb_flush_log_at_trx_commit

    • 1 : ACID ! at each transaction : InnoDB log buffer -> log file -> disk.

    • 0 :   InnoDB log buffer (1/s) -> log file  (x/s)-> flush disk.

      • 1s lost if any mysqld crashes

    • 2 :   InnoDB log buffer (each transaction)-> log file (x/s)-> flush disk.

      • 1s lost if the OS/HW crashes

  • innodb_flush_log_at_timeout=x

my.cnf

innodb_write_io_threads=4

innodb_read_io_threads=4

innodb_io_capacity=10000

 

innodb_file_per_table = 1

innodb_file_format = barracuda

innodb_change_buffering=all

innodb_flush_method=O_DIRECT

 

my.cnf

 

  • table_definition_cache= 2000

  • table_open_cache = 5000

    table_open_cache_instances = 16

my.cnf

  • When a table grows and the code can't be modified 

Partitioned Tables

CREATE TABLE `history_uint` (
  `itemid` bigint(20) unsigned NOT NULL,
  `clock` int(11) NOT NULL DEFAULT '0',
  `value` bigint(20) unsigned NOT NULL DEFAULT '0',
  `ns` int(11) NOT NULL DEFAULT '0',
  KEY `history_uint_1` (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (itemid)
PARTITIONS 32 */
  • But as we work with big tables, it's costly : in disk space, I/O, and blocked transactions

Partitioned Tables

ALTER TABLE `history_uint` COALESCE PARTITION 4;
ALTER TABLE `history_uint` ADD PARTITION PARTITIONS 12;
ALTER TABLE `history_uint` REMOVE PARTITIONING;
  • Number of partitions can be modified at any time

Partitioned Tables

  • Maintenance operations can be done per partition
  • Alter table ... rebuild partition p0
  • Alter table ... optimize partition p0
  • And also :  analyze, repair, check

Disable ICP

 

Especially useful for queries on composite index that can return nothing, for instance :

SELECT * FROM history_uint h WHERE h.itemid='541969' AND h.clock>1492000245 

Partitioned Tables

SET optimizer_switch = 'index_condition_pushdown=off';

If we select only indexed fields, mysql read only index and it's fast. If a result is found, it's also fast.

Otherwise, ICP starts by filtering result with clock field and then filter by itemid.
The problem is, partition hash is by itemid. So there's all partitions to scroll before sending no result.

Partitioned Tables

  • Still, queries need to be adapted 
  • Wrong:

    • Select data from table_partitioned where id < 100 and id >10
  • Correct:
    • Select data from table_partitioned where id in ( 11, 12 … 99)

testing : Maxwell

  • Reads binary log events and write rows to Kafka
  mysql> insert into `test`.`maxwell` set id = 1, daemon = 'Stanislaw Lem';
  maxwell: {
    "database": "test",
    "table": "maxwell",
    "type": "insert",
    "ts": 1449786310,
    "xid": 940752,
    "commit": true,
    "data": { "id":1, "daemon": "Stanislaw Lem" }
  }

Moniroting 

Morinoting

Observability

  • Cluster Status  = Primary
  • Flow Control sent / received ~ 0
  • Replication queues: local_recv_queue_current < gcs.fc_limit
  • Certifications Failures ~ 0
  • Local State = 4
  • no ERROR in err file

Alerting

  • package percona-zabbix-templates
  • provides all necessary metrics to monitor properly a Galera or single MySQL Server 

Monitoring - zabbix

  • TCP Connections must be persistent
  • Client must be able to reconnect if connection is dropped
  • Optimizer hints are usefull ! 
  • All queries to databases must includes a comment :

 

Who sent this query ?

 SELECT /* machine:/path/to/code class:line */ foo 
   FROM ….

Backups

  • A main cluster has always its delayed slave

  • Binary logs are still collected in real time

  • SQL_Delay guarantees a delay to intervene in case of unwanted drop database or any other PEBCAK

Delayed Slave

CHANGE MASTER TO MASTER_DELAY = 7200;
  • the delayed slave runs periodically an innobackupex, streamed with xbstream to backup servers, then prepared (--apply-logs)

  • innobackupex --incremental : differential backup On frequently modified data, incremental backup are not really faster

Cold backups

  • Each full backup can be copied to a dedicated mysql server, to be use to operate the rollback operation

Restore backups

Questions ?

@gestassy // gestassy@synthesio.com

Also: We're Hiring ...

Mysql-Galera@Synthesio

By gestassy

Mysql-Galera@Synthesio

  • 2,978