Background on Backup Locks
I was very excited to see backup locks support in the release notes for the latest Percona XtraDB Cluster 5.6.21 release. For those who are not aware, backup locks offer an alternative to FLUSH TABLES WITH READ LOCK (FTWRL) in Xtrabackup. While Xtrabackup can hot-copy Innodb, everything else in MySQL must be locked (usually briefly) to get a consistent snapshot that lines up with Innodb. This includes all other storage engines, but also things like table schemas (even on Innodb) and async replication binary logs. You can skip this lock, but the resulting backup isn't generally considered safe in every case.
Until recently, Xtrabackup (like most other backup tools) used FTWRL to accomplish this. This worked great, but had the unfortunate side-effect of locking every single table, even the Innodb ones. This functionally meant that even a hot-backup tool for Innodb had to take a (usually short) global lock to get a consistent backup with MySQL overall.
Backup locks change that by introducing a new locking command in Percona Server called LOCK TABLES FOR BACKUP. This works by blocking writes to non-transactional tables, as well as blocking DDL on all tables (including Innodb). If Xtrabackup (of a recent vintage) detects that it's backing up a Percona Server (also of recent vintage), it will automatically use LOCK TABLES FOR BACKUP instead of FLUSH TABLES WITH READ LOCK.
The TL;DR of this is that you can keep on modifying your Innodb data through the entire backup, since we don’t need to use FTWRL any longer.
This feature was introduced in Percona Server 5.6.16-64.0 and Percona XtraBackup 2.2. I do not believe you will find it in any other MySQL variant, though I could be corrected.
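The difference between the two lock types can be illustrated with a toy model. This is a deliberately simplified sketch of my own (the operation names and sets here are illustrative, not server internals); the real behavior has more nuance:

```python
# Simplified model of what each locking strategy blocks during a backup.
# Operation names and sets are illustrative only, not Percona Server internals.

FTWRL_BLOCKS = {
    "innodb_dml",     # under FTWRL even InnoDB writes stall
    "myisam_dml",     # writes to non-transactional tables
    "ddl",            # schema changes
}

BACKUP_LOCK_BLOCKS = {
    "myisam_dml",     # writes to non-transactional tables still stall
    "ddl",            # DDL is blocked on all tables, InnoDB included
}

def is_blocked(operation, lock_blocks):
    """Return True if `operation` would stall while the lock is held."""
    return operation in lock_blocks

# The key difference: InnoDB DML keeps flowing under backup locks.
print(is_blocked("innodb_dml", FTWRL_BLOCKS))        # True
print(is_blocked("innodb_dml", BACKUP_LOCK_BLOCKS))  # False
```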
What this means for Percona XtraDB Cluster (PXC)
The most common (and logical) SST method for Percona XtraDB Cluster is using Xtrabackup. This latest release of PXC includes support for backup locks, meaning that Xtrabackup donor nodes will no longer need to get a global lock. Practically for PXC users, this means that your Donor nodes can stay in rotation without causing client interruptions due to FTWRL.
Seeing it in action
To test this out, I spun up a 3-node cluster on AWS and fired up a sysbench run on the first node. I forced an SST on the node. Here is a snippet of the innobackup.backup.log (generated by all Xtrabackup donors in Percona XtraDB Cluster):
InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy and Percona LLC and/or its affiliates 2009-2013. All Rights Reserved.
This software is published under the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
Get the latest version of Percona XtraBackup, documentation, and help resources: https://www.percona.com/xb/p
141218 19:22:01 innobackupex: Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_file=/etc/my.cnf;mysql_read_default_group=xtrabackup;mysql_socket=/var/lib/mysql/mysql.sock' as 'sst' (using password: YES).
141218 19:22:01 innobackupex: Connected to MySQL server
141218 19:22:01 innobackupex: Starting the backup operation
IMPORTANT: Please check that the backup run completes successfully. At the end of a successful backup run innobackupex prints "completed OK!".
innobackupex: Using server version 5.6.21-70.1-56
innobackupex: Created backup directory /tmp/tmp.Rm0qA740U3
141218 19:22:01 innobackupex: Starting ibbackup with command: xtrabackup --defaults-file="/etc/my.cnf" --defaults-group="mysqld" --backup --suspend-at-end --target-dir=/tmp/tmp.dM03LgPHFY --innodb_data_file_path="ibdata1:12M:autoextend" --tmpdir=/tmp/tmp.dM03LgPHFY --extra-lsndir='/tmp/tmp.dM03LgPHFY' --stream=xbstream
innobackupex: Waiting for ibbackup (pid=21892) to suspend
innobackupex: Suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_suspended_2'
xtrabackup version 2.2.7 based on MySQL server 5.6.21 Linux (x86_64) (revision id: )
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: open files limit requested 0, set to 5000
xtrabackup: using the following InnoDB configuration:
xtrabackup:   innodb_data_home_dir = ./
xtrabackup:   innodb_data_file_path = ibdata1:12M:autoextend
xtrabackup:   innodb_log_group_home_dir = ./
xtrabackup:   innodb_log_files_in_group = 2
xtrabackup:   innodb_log_file_size = 1073741824
xtrabackup: using O_DIRECT
>> log scanned up to (10525811040)
xtrabackup: Generating a list of tablespaces
[01] Streaming ./ibdata1
>> log scanned up to (10529368594)
>> log scanned up to (10532685942)
>> log scanned up to (10536422820)
>> log scanned up to (10539562039)
>> log scanned up to (10543077110)
[01]        ...done
[01] Streaming ./mysql/innodb_table_stats.ibd
[01]        ...done
[01] Streaming ./mysql/innodb_index_stats.ibd
[01]        ...done
[01] Streaming ./mysql/slave_relay_log_info.ibd
[01]        ...done
[01] Streaming ./mysql/slave_master_info.ibd
[01]        ...done
[01] Streaming ./mysql/slave_worker_info.ibd
[01]        ...done
[01] Streaming ./sbtest/sbtest1.ibd
>> log scanned up to (10546490256)
>> log scanned up to (10550321726)
>> log scanned up to (10553628936)
>> log scanned up to (10555422053)
[01]        ...done
...
[01] Streaming ./sbtest/sbtest17.ibd
>> log scanned up to (10831343724)
>> log scanned up to (10834063832)
>> log scanned up to (10837100278)
>> log scanned up to (10840243171)
[01]        ...done
xtrabackup: Creating suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_suspended_2' with pid '21892'
>> log scanned up to (10843312323)
141218 19:24:06 innobackupex: Continuing after ibbackup has suspended
141218 19:24:06 innobackupex: Executing LOCK TABLES FOR BACKUP...
141218 19:24:06 innobackupex: Backup tables lock acquired
141218 19:24:06 innobackupex: Starting to backup non-InnoDB tables and files
innobackupex: in subdirectories of '/var/lib/mysql/'
innobackupex: Backing up files '/var/lib/mysql//mysql/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (74 files)
>> log scanned up to (10846683627)
>> log scanned up to (10847773504)
innobackupex: Backing up files '/var/lib/mysql//sbtest/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (21 files)
innobackupex: Backing up file '/var/lib/mysql//test/db.opt'
innobackupex: Backing up files '/var/lib/mysql//performance_schema/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (53 files)
>> log scanned up to (10852976291)
141218 19:24:09 innobackupex: Finished backing up non-InnoDB tables and files
141218 19:24:09 innobackupex: Executing LOCK BINLOG FOR BACKUP...
141218 19:24:09 innobackupex: Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
141218 19:24:09 innobackupex: Waiting for log copying to finish
>> log scanned up to (10856996124)
xtrabackup: The latest check point (for incremental): '9936050111'
xtrabackup: Stopping log copying thread.
.>> log scanned up to (10856996124)
xtrabackup: Creating suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_log_copied' with pid '21892'
141218 19:24:10 innobackupex: Executing UNLOCK BINLOG
141218 19:24:10 innobackupex: Executing UNLOCK TABLES
141218 19:24:10 innobackupex: All tables unlocked
141218 19:24:10 innobackupex: Waiting for ibbackup (pid=21892) to finish
xtrabackup: Transaction log of lsn (9420426891) to (10856996124) was copied.
innobackupex: Backup created in directory '/tmp/tmp.Rm0qA740U3'
141218 19:24:30 innobackupex: Connection to database server closed
141218 19:24:30 innobackupex: completed OK!
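The lock window can be quantified directly from this log by diffing the timestamps on the lock and unlock lines. A quick sketch (the function name is my own, and the timestamp format matches innobackupex's YYMMDD HH:MM:SS prefix shown above):

```python
import re
from datetime import datetime

# Minimal excerpt of the donor log above; only the two lines we need.
LOG = """\
141218 19:24:06 innobackupex: Executing LOCK TABLES FOR BACKUP...
141218 19:24:10 innobackupex: Executing UNLOCK TABLES
"""

def lock_hold_seconds(log_text):
    """Return how long the backup lock was held, in seconds."""
    ts = {}
    pattern = (r"(\d{6} \d{2}:\d{2}:\d{2}) innobackupex: Executing "
               r"(LOCK TABLES FOR BACKUP|UNLOCK TABLES)")
    for line in log_text.splitlines():
        m = re.match(pattern, line)
        if m:
            ts[m.group(2)] = datetime.strptime(m.group(1), "%y%m%d %H:%M:%S")
    return (ts["UNLOCK TABLES"] - ts["LOCK TABLES FOR BACKUP"]).total_seconds()

print(lock_hold_seconds(LOG))  # 4.0
```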
We can see the LOCK TABLES FOR BACKUP issued at 19:24:06 and unlocked at 19:24:10. Let’s see Galera apply stats from this node during that time:
mycluster / ip-10-228-128-220 (idx: 0) / Galera 3.8(rf6147dd)
Wsrep    Cluster  Node  Repl  Queue  Ops  Bytes  Conflct  Gcache  Window  Flow
time     P cnf #  Stat  Laten  Up  Dn  Up  Dn  Up  Dn  lcf bfa  ist  idx  dst appl comm  p_ms
19:23:55 P 5 3 Dono 698µs 0 72 0 5418 0.0 3.5M 0 0 187k 94 3k 3 2 0
19:23:56 P 5 3 Dono 701µs 0 58 0 5411 0.0 3.5M 0 0 188k 229 3k 3 2 0
19:23:57 P 5 3 Dono 701µs 0 2 0 5721 0.0 3.7M 0 0 188k 120 3k 3 2 0
19:23:58 P 5 3 Dono 689µs 0 5 0 5643 0.0 3.6M 0 0 188k 63 3k 3 2 0
19:23:59 P 5 3 Dono 679µs 0 55 0 5428 0.0 3.5M 0 0 188k 115 3k 3 2 0
19:24:01 P 5 3 Dono 681µs 0 1 0 4623 0.0 3.0M 0 0 188k 104 3k 3 2 0
19:24:02 P 5 3 Dono 690µs 0 0 0 4301 0.0 2.7M 0 0 188k 141 3k 3 2 0
19:24:03 P 5 3 Dono 688µs 0 2 0 4907 0.0 3.1M 0 0 188k 227 3k 3 2 0
19:24:04 P 5 3 Dono 692µs 0 44 0 4894 0.0 3.1M 0 0 188k 116 3k 3 2 0
19:24:05 P 5 3 Dono 706µs 0 0 0 5337 0.0 3.4M 0 0 188k 63 3k 3 2 0
Initially the node is keeping up ok with replication. The Down Queue (wsrep_local_recv_queue) is sticking around 0. We’re applying 4-5k transactions per second (Ops Dn). When the backup lock kicks in, we do see an increase in the queue size, but note that transactions are still applying on this node:
19:24:06 P 5 3 Dono 696µs 0 170 0 5671 0.0 3.6M 0 0 187k 130 3k 3 2 0
19:24:07 P 5 3 Dono 695µs 0 2626 0 3175 0.0 2.0M 0 0 185k 2193 3k 3 2 0
19:24:08 P 5 3 Dono 692µs 0 1248 0 6782 0.0 4.3M 0 0 186k 1800 3k 3 2 0
19:24:09 P 5 3 Dono 693µs 0 611 0 6111 0.0 3.9M 0 0 187k 651 3k 3 2 0
19:24:10 P 5 3 Dono 708µs 0 93 0 5316 0.0 3.4M 0 0 187k 139 3k 3 2 0
So this node isn't blocked from applying Innodb write transactions; it's just under some extra IO load while the backup copies its files. After that, the backup finishes up and the node returns to the Synced state fairly quickly:
19:24:11 P 5 3 Dono 720µs 0 1 0 4486 0.0 2.9M 0 0 188k 78 3k 3 2 0
19:24:12 P 5 3 Dono 715µs 0 0 0 3982 0.0 2.5M 0 0 188k 278 3k 3 2 0
19:24:13 P 5 3 Dono 1.2ms 0 0 0 4337 0.0 2.8M 0 0 188k 143 3k 3 2 0
19:24:14 P 5 3 Dono 1.2ms 0 1 0 4901 0.0 3.1M 0 0 188k 130 3k 3 2 0
19:24:16 P 5 3 Dono 1.1ms 0 0 0 5289 0.0 3.4M 0 0 188k 76 3k 3 2 0
19:24:17 P 5 3 Dono 1.1ms 0 42 0 4998 0.0 3.2M 0 0 188k 319 3k 3 2 0
19:24:18 P 5 3 Dono 1.1ms 0 15 0 3290 0.0 2.1M 0 0 188k 75 3k 3 2 0
19:24:19 P 5 3 Dono 1.1ms 0 0 0 4124 0.0 2.6M 0 0 188k 276 3k 3 2 0
19:24:20 P 5 3 Dono 1.1ms 0 4 0 1635 0.0 1.0M 0 0 188k 70 3k 3 2 0
19:24:21 P 5 3 Dono 1.1ms 0 0 0 5026 0.0 3.2M 0 0 188k 158 3k 3 2 0
19:24:22 P 5 3 Dono 1.1ms 0 20 0 4100 0.0 2.6M 0 0 188k 129 3k 3 2 0
19:24:23 P 5 3 Dono 1.1ms 0 0 0 5412 0.0 3.5M 0 0 188k 159 3k 3 2 0
19:24:24 P 5 3 Dono 1.1ms 0 315 0 4567 0.0 2.9M 0 0 187k 170 3k 3 2 0
19:24:25 P 5 3 Dono 1.0ms 0 24 0 5535 0.0 3.5M 0 0 188k 131 3k 3 2 0
19:24:26 P 5 3 Dono 1.0ms 0 0 0 5427 0.0 3.5M 0 0 188k 71 3k 3 2 0
19:24:27 P 5 3 Dono 1.0ms 0 1 0 5221 0.0 3.3M 0 0 188k 256 3k 3 2 0
19:24:28 P 5 3 Dono 1.0ms 0 0 0 5317 0.0 3.4M 0 0 188k 159 3k 3 2 0
19:24:29 P 5 3 Dono 1.0ms 0 1 0 5491 0.0 3.5M 0 0 188k 163 3k 3 2 0
19:24:30 P 5 3 Sync 1.0ms 0 0 0 5540 0.0 3.5M 0 0 188k 296 3k 3 2 0
19:24:31 P 5 3 Sync 992µs 0 106 0 5594 0.0 3.6M 0 0 187k 130 3k 3 2 0
19:24:33 P 5 3 Sync 984µs 0 19 0 5723 0.0 3.7M 0 0 188k 275 3k 3 2 0
19:24:34 P 5 3 Sync 976µs 0 0 0 5508 0.0 3.5M 0 0 188k 182 3k 3 2 0
Compared to Percona XtraDB Cluster 5.5
Backup locks are only a feature of Percona XtraDB Cluster 5.6, so if we repeat the experiment on 5.5, we can see a more severe lock:
141218 20:31:19 innobackupex: Executing FLUSH TABLES WITH READ LOCK...
141218 20:31:19 innobackupex: All tables locked and flushed to disk
141218 20:31:19 innobackupex: Starting to backup non-InnoDB tables and files
innobackupex: in subdirectories of '/var/lib/mysql/'
innobackupex: Backing up files '/var/lib/mysql//sbtest/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (21 files)
innobackupex: Backing up files '/var/lib/mysql//mysql/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (72 files)
>> log scanned up to (6633554484)
innobackupex: Backing up file '/var/lib/mysql//test/db.opt'
innobackupex: Backing up files '/var/lib/mysql//performance_schema/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (18 files)
141218 20:31:21 innobackupex: Finished backing up non-InnoDB tables and files
141218 20:31:21 innobackupex: Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
141218 20:31:21 innobackupex: Waiting for log copying to finish
xtrabackup: The latest check point (for incremental): '5420681649'
xtrabackup: Stopping log copying thread.
.>> log scanned up to (6633560488)
xtrabackup: Creating suspend file '/tmp/tmp.Cq5JRZEFki/xtrabackup_log_copied' with pid '23130'
141218 20:31:22 innobackupex: All tables unlocked
Our lock lasts from 20:31:19 until 20:31:21, so it's fairly short. Note that on larger databases with more schemas and tables, this can take quite a bit longer. Let's see the effect on the apply rate for this node:
mycluster / ip-10-229-68-156 (idx: 0) / Galera 2.11(r318911d)
Wsrep    Cluster  Node  Repl  Queue  Ops  Bytes  Conflct  Gcache  Window  Flow
time     P cnf #  Stat  Laten  Up  Dn  Up  Dn  Up  Dn  lcf bfa  ist  idx  dst appl comm  p_ms
20:31:13 P 5 3 Dono N/A 0 73 0 3493 0.0 1.8M 0 0 1.8m 832 746 2 2 0.0
20:31:14 P 5 3 Dono N/A 0 29 0 3578 0.0 1.9M 0 0 1.8m 850 749 3 2 0.0
20:31:15 P 5 3 Dono N/A 0 0 0 3513 0.0 1.8M 0 0 1.8m 735 743 2 2 0.0
20:31:16 P 5 3 Dono N/A 0 0 0 3651 0.0 1.9M 0 0 1.8m 827 748 2 2 0.0
20:31:17 P 5 3 Dono N/A 0 27 0 3642 0.0 1.9M 0 0 1.8m 840 762 2 2 0.0
20:31:18 P 5 3 Dono N/A 0 0 0 3840 0.0 2.0M 0 0 1.8m 563 776 2 2 0.0
20:31:19 P 5 3 Dono N/A 0 0 0 4368 0.0 2.3M 0 0 1.8m 823 745 2 1 0.0
20:31:20 P 5 3 Dono N/A 0 3952 0 339 0.0 0.2M 0 0 1.8m 678 751 1 1 0.0
20:31:21 P 5 3 Dono N/A 0 7883 0 0 0.0 0.0 0 0 1.8m 678 751 0 0 0.0
20:31:22 P 5 3 Dono N/A 0 4917 0 5947 0.0 3.1M 0 0 1.8m 6034 3k 7 6 0.0
20:31:24 P 5 3 Dono N/A 0 10 0 8238 0.0 4.3M 0 0 1.8m 991 1k 7 6 0.0
20:31:25 P 5 3 Dono N/A 0 0 0 3016 0.0 1.6M 0 0 1.8m 914 754 2 1 0.0
20:31:26 P 5 3 Dono N/A 0 0 0 3253 0.0 1.7M 0 0 1.8m 613 766 1 1 0.0
20:31:27 P 5 3 Dono N/A 0 1 0 3600 0.0 1.9M 0 0 1.8m 583 777 2 1 0.0
20:31:28 P 5 3 Dono N/A 0 0 0 3640 0.0 1.9M 0 0 1.8m 664 750 2 2 0.0
The drop here is more severe and the apply rate hits 0 (and stays there for the duration of the FTWRL).
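A simple way to see the stall programmatically is to scan the per-second apply counts (the Ops Dn column) for seconds where nothing applied at all. A sketch using the numbers from the two samples above (the list names and function are my own):

```python
# Per-second apply counts ("Ops Dn") from the 5.5 sample,
# covering 20:31:19 through 20:31:22 (the FTWRL window).
ops_dn_55 = [4368, 339, 0, 5947]

# The same kind of window on 5.6 with backup locks (19:24:06-19:24:10):
# the apply rate dips but never reaches zero.
ops_dn_56 = [5671, 3175, 6782, 6111, 5316]

def stalled_seconds(samples):
    """Count seconds during which the node applied nothing at all."""
    return sum(1 for s in samples if s == 0)

print(stalled_seconds(ops_dn_55))  # 1 (fully stalled under FTWRL)
print(stalled_seconds(ops_dn_56))  # 0 (backup locks never stop applying)
```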
Implications
Obviously Xtrabackup running on a PXC node will cause some load on the node itself, so there may still be good reasons to keep a Donor node out of rotation from your application. However, this is less of an issue than it was in the past, when writes would definitely stall on a Donor node, causing potentially intermittent stalls in the application.
How you allow applications to start using a Donor node automatically (or not) depends on how you have HA set up between the application and the cluster. If you use HAproxy or similar with clustercheck, you can either modify the script itself or change a command line argument. Below, the node is in the Donor/Desynced state:
[root@ip-10-229-64-35 ~]# /usr/bin/clustercheck clustercheckuser clustercheckpassword!
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 44

Percona XtraDB Cluster Node is not synced.
[root@ip-10-229-64-35 ~]# /usr/bin/clustercheck clustercheckuser clustercheckpassword! 1
HTTP/1.1 200 OK
Content-Type: text/plain
Connection: close
Content-Length: 40

Percona XtraDB Cluster Node is synced.
For those doing their own custom health checking, you basically just need to pass nodes that have a wsrep_local_state_comment of either ‘Synced’ or ‘Donor/Desynced’.
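The decision logic for such a custom check amounts to comparing wsrep_local_state_comment against an allow-list. A minimal sketch (the function and flag names here are my own, chosen to mirror clustercheck's "available when donor" behavior shown above):

```python
def node_available(state_comment, available_when_donor=False):
    """Decide whether an HA layer should route traffic to this node,
    based on the node's wsrep_local_state_comment value."""
    if state_comment == "Synced":
        return True
    if state_comment == "Donor/Desynced":
        # With backup locks, a Donor keeps applying writes, so it can be
        # reasonable to leave it in rotation.
        return available_when_donor
    return False

print(node_available("Donor/Desynced", available_when_donor=True))  # True
print(node_available("Donor/Desynced"))                             # False
print(node_available("Joined"))                                     # False
```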
The post Testing backup locks during Xtrabackup SST on Percona XtraDB Cluster appeared first on MySQL Performance Blog.