@gestassy // gestassy@synthesio.com
Used to handle PB of web traffic @ eTF1
Now coping with billions of Mentions @ Synthesio
Synthesio is the leading social intelligence tool for social media monitoring & social analytics.
Synthesio crawls the Web for relevant data, enriches it with sentiment analysis and demographics to build social analytics dashboards.
Your workload is not mine
= my solution is certainly not what you need
= you can’t rely on generic workload tests
I won’t go into mysql internals/system/kernel tracing
Historical storage for… everything, until 2015.
Percona PXC (Galera) or Master/Slave replication
InnoDB
a few clients, quite a lot of queries
simple structure and transactions
avg 30% storage organic growth
Stable architecture since 2015
Percona PXC 5.6
3.5 TB of deduplicated data
650 inserts/s
800 updates/s
11k questions/s
250 simultaneous connections
Percona PXC 5.6
8 TB of deduplicated data
1k inserts/s
1.5k updates/s
8k questions/s
1k simultaneous connections
InnoDB (or XtraDB) required
Multi-Master
Synchronous
Automatic nodes provisioning
wsrep_provider=/usr/lib/libgalera_smm.so
“Write scalable” : don't add nodes if you need more writes
“Distributed” :
Each node contains the full dataset
A node that is slow on writes will throttle the whole cluster's write speed (once it falls behind by more than gcs.fc_limit)
“Perfectly synchronous” : commits are applied in order but asynchronously (by default wsrep_sync_wait=0)
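If a session needs read-your-writes guarantees despite the asynchronous apply, it can opt into causal reads (available since PXC 5.6 / Galera 3; the table name below is illustrative):

```sql
-- Make SELECTs in this session wait until all received write-sets are applied
SET SESSION wsrep_sync_wait = 1;
SELECT COUNT(*) FROM mentions;   -- now sees everything committed cluster-wide
SET SESSION wsrep_sync_wait = 0; -- back to the (faster) default
```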
Certification conflicts can be tracked with wsrep_local_cert_failures / wsrep_local_bf_aborts
wsrep_provider_options = "gcs.fc_limit = 10000; gcs.fc_factor = 0.8; gcs.fc_master_slave = yes"
wsrep_local_state
show global status like 'wsrep_replicated_bytes';
show global status like 'wsrep_received_bytes';
+------------------------+-------------+
| Variable_name | Value |
+------------------------+-------------+
| wsrep_replicated_bytes | 14024686476 |
+------------------------+-------------+
1 row in set (0.01 sec)
+----------------------+----------------+
| Variable_name | Value |
+----------------------+----------------+
| wsrep_received_bytes | 25424591549980 |
+----------------------+----------------+
1 row in set (0.00 sec)
wsrep_provider_options = "gcache.size=10G"
wsrep_sst_method=xtrabackup-v2
[sst]
inno-apply-opts="--use-memory=40G"
SET SESSION wsrep_OSU_method='TOI';
SET SESSION wsrep_OSU_method='RSU';
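On paper, a rolling schema change with RSU looks like this, run on each node in turn (table and column names are illustrative); the node desyncs from the cluster while the DDL runs locally:

```sql
-- RSU: the DDL is NOT replicated; repeat on every node, one at a time
SET SESSION wsrep_OSU_method = 'RSU';
ALTER TABLE mytable ADD COLUMN c1 INT;  -- applied on this node only
SET SESSION wsrep_OSU_method = 'TOI';   -- restore the default
```

TOI, the default, replicates the DDL in total order and blocks the whole cluster for its duration.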
IT WORKS IN THEORY
not in production
What happens when a table is created
pt-online-schema-change --alter "ADD COLUMN c1 INT" D=testdb,t=mytable --max-flow-ctl=33.3 --execute
pt-online-schema-change --alter "ADD COLUMN c1 INT" D=test,t=history_uint --max-flow-ctl=33.3 --execute
No slaves found. See --recursion-method if host sql01 has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
analyze_table, 10, 1
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Altering `test`.`history_uint`...
Creating new table...
Created new table test._history_uint_new OK.
Altering new table...
Altered `test`.`_history_uint_new` OK.
2017-10-19T10:40:16 Creating triggers...
2017-10-19T10:40:17 Created triggers OK.
2017-10-19T10:40:17 Copying approximately 1 rows...
2017-10-19T10:40:17 Copied rows OK.
2017-10-19T10:40:17 Analyzing new table...
2017-10-19T10:40:17 Swapping tables...
2017-10-19T10:40:17 Swapped original and new tables OK.
2017-10-19T10:40:17 Dropping old table...
2017-10-19T10:40:17 Dropped old table `test`.`_history_uint_old` OK.
2017-10-19T10:40:17 Dropping triggers...
2017-10-19T10:40:17 Dropped triggers OK.
Successfully altered `test`.`history_uint`.
## Check ports for Mysql:
## Xinetd can answer on these ports
## 9001 : Master : auth test + simple query
## 9002 : Slave : slave is running + slave lag not too high
## 9003 : Galera : galera cluster_status is Primary and local_state is 4
listen vip_mysql
bind 10.10.10.10:3306
default_backend be_mysql
backend be_mysql
balance source
option httpchk GET
server mysql01 10.10.0.12:3306 check port 9003 inter 5s rise 2 fall 3
server mysql02 10.10.0.13:3306 check port 9003 inter 5s rise 2 fall 3 backup
server mysql03 10.10.0.14:3306 check port 9003 inter 5s rise 2 fall 3 backup
TCP Connection but HTTP healthcheck :
Xinetd scripts based on https://github.com/olafz/percona-clustercheck
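As a sketch, the xinetd side can look like this (adapted from percona-clustercheck's sample config, with the port changed to match the Galera check above):

```
# /etc/xinetd.d/mysqlchk
service mysqlchk
{
        disable         = no
        flags           = REUSE
        socket_type     = stream
        port            = 9003
        wait            = no
        user            = nobody
        server          = /usr/bin/clustercheck
        log_on_failure  += USERID
        only_from       = 0.0.0.0/0
        per_source      = UNLIMITED
}
```

The service/port also needs an entry in /etc/services so xinetd can bind it.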
RAID 10, XFS (25% faster on sequential writes on HDDs)
Linux kernel 4.x (we use 4.9 on Debian 8)
System tuning :
disk schedulers = deadline on bare metal, noop on VM
vm.swappiness = 1
increase open-files-limit in limits.conf
MTU 9000 (+8% network perf)
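The tunables above end up in configuration roughly like this (file paths, device name and nofile values are illustrative):

```
# /etc/sysctl.d/90-mysql.conf
vm.swappiness = 1

# /etc/security/limits.conf
mysql  soft  nofile  65535
mysql  hard  nofile  65535

# disk scheduler, per block device (bare metal)
echo deadline > /sys/block/sda/queue/scheduler
```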
For Galera : same machines for all nodes
wsrep_provider_options = "evs.keepalive_period = PT3S; evs.inactive_check_period = PT10S; evs.suspect_timeout = PT30S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M"
innodb_flush_log_at_trx_commit
1 : ACID ! At each transaction : InnoDB log buffer -> log file -> flush to disk.
0 : InnoDB log buffer -> log file (1/s) -> flush to disk (x/s).
Up to 1s of transactions lost if mysqld crashes
2 : InnoDB log buffer -> log file (at each transaction) -> flush to disk (x/s).
Up to 1s of transactions lost if the OS/HW crashes
innodb_flush_log_at_timeout=x
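On a Galera cluster, where the other nodes hold full copies of the data, a common trade-off is to relax per-commit flushing (a sketch; whether losing up to 1s of transactions on an OS/HW crash is acceptable depends on your workload):

```
[mysqld]
# flush the redo log to disk once per second instead of at each commit
innodb_flush_log_at_trx_commit = 2
innodb_flush_log_at_timeout    = 1
```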
innodb_write_io_threads=4
innodb_read_io_threads=4
innodb_io_capacity=10000
innodb_file_per_table = 1
innodb_file_format = barracuda
innodb_change_buffering=all
innodb_flush_method=O_DIRECT
table_definition_cache= 2000
table_open_cache = 5000
table_open_cache_instances = 16
CREATE TABLE `history_uint` (
`itemid` bigint(20) unsigned NOT NULL,
`clock` int(11) NOT NULL DEFAULT '0',
`value` bigint(20) unsigned NOT NULL DEFAULT '0',
`ns` int(11) NOT NULL DEFAULT '0',
KEY `history_uint_1` (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (itemid)
PARTITIONS 32 */
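Partition pruning can be checked with EXPLAIN PARTITIONS (MySQL 5.6): an equality on the hashed column should resolve to a single partition, while a query without itemid hits all 32.

```sql
EXPLAIN PARTITIONS
SELECT * FROM `history_uint`
WHERE itemid = 541969 AND clock > 1492000245;
-- expect the "partitions" column to list a single partition out of 32
```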
ALTER TABLE `history_uint` COALESCE PARTITION 4;
ALTER TABLE `history_uint` ADD PARTITION PARTITIONS 12;
ALTER TABLE `history_uint` REMOVE PARTITIONING;
ALTER TABLE ... REBUILD PARTITION p0;
ALTER TABLE ... OPTIMIZE PARTITION p0;
And also : ANALYZE, REPAIR and CHECK PARTITION
Disabling ICP is especially useful for queries on a composite index that may return nothing, for instance :
SELECT * FROM history_uint h WHERE h.itemid='541969' AND h.clock>1492000245
SET optimizer_switch = 'index_condition_pushdown=off';
If we select only indexed fields, MySQL reads only the index and it's fast. If a result is found, it's also fast.
Otherwise, ICP starts by filtering on the clock field and then filters by itemid.
The problem : the partition hash is on itemid, so every partition has to be scanned before returning an empty result.
Wrong (a range predicate cannot prune HASH partitions) :
SELECT data FROM table_partitioned WHERE id < 100 AND id > 10
Right (an IN list prunes to only the matching partitions) :
SELECT data FROM table_partitioned WHERE id IN (11, 12 … 99)
mysql> insert into `test`.`maxwell` set id = 1, daemon = 'Stanislaw Lem';
maxwell: {
"database": "test",
"table": "maxwell",
"type": "insert",
"ts": 1449786310,
"xid": 940752,
"commit": true,
"data": { "id":1, "daemon": "Stanislaw Lem" }
}
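The JSON above is what the Maxwell daemon emits while reading the binlog (it requires binlog_format=ROW on the master); to try it, point it at a master with the stdout producer (credentials and host are placeholders):

```
bin/maxwell --user='maxwell' --password='XXXXXX' --host='127.0.0.1' \
            --producer=stdout
```

In production the producer would typically be kafka rather than stdout.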
SELECT /* machine:/path/to/code class:line */ foo
FROM ….
A main cluster always has its delayed slave
Binary logs are still collected in real time
SQL_Delay guarantees a window to intervene in case of an unwanted DROP DATABASE or any other PEBCAK
CHANGE MASTER TO MASTER_DELAY = 7200;
The delayed slave periodically runs an innobackupex, streamed with xbstream to the backup servers, then prepared (--apply-log)
innobackupex --incremental : differential backups. On frequently modified data, incremental backups are not really faster
Each full backup can be copied to a dedicated MySQL server, to be used for the rollback operation
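A sketch of that pipeline (hosts and paths are placeholders; exact options depend on the Percona XtraBackup version):

```
# on the delayed slave: stream a full backup to the backup server
innobackupex --stream=xbstream /tmp \
  | ssh backup01 "xbstream -x -C /backups/full"

# on the backup server: make the backup consistent
ssh backup01 "innobackupex --apply-log --use-memory=40G /backups/full"
```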
@gestassy // gestassy@synthesio.com
Also: We're Hiring ...