
Testing backup locks during Xtrabackup SST on Percona XtraDB Cluster


Background on Backup Locks

I was very excited to see backup locks support in the release notes for the latest Percona XtraDB Cluster 5.6.21 release. For those who are not aware, backup locks offer an alternative to FLUSH TABLES WITH READ LOCK (FTWRL) in Xtrabackup. While Xtrabackup can hot-copy InnoDB, everything else in MySQL must be locked (usually briefly) to get a consistent snapshot that lines up with InnoDB. This includes all other storage engines, but also things like table schemas (even for InnoDB tables) and async replication binary logs. You can skip this lock, but the resulting backup isn't generally considered 'safe' in every case.

Until recently, Xtrabackup (like most other backup tools) used FTWRL to accomplish this. This worked great, but had the unfortunate side-effect of locking every single table, even the InnoDB ones. Functionally this meant that even a hot-backup tool for InnoDB had to take a (usually short) global lock to get a backup consistent with MySQL overall.

Backup locks change that by introducing a new locking command in Percona Server called LOCK TABLES FOR BACKUP. This works by blocking writes to non-transactional tables, as well as blocking DDL on all tables (including InnoDB). If Xtrabackup (of a recent vintage) detects that it's backing up a Percona Server (also of recent vintage), it will automatically use LOCK TABLES FOR BACKUP instead of FLUSH TABLES WITH READ LOCK.
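To make the difference concrete, here is a simplified sketch of the two locking sequences; the actual statements in the backup-locks flow appear verbatim in the donor log further below:

-- Old approach: one global lock that blocks writes everywhere, InnoDB included
FLUSH TABLES WITH READ LOCK;
-- ... copy non-InnoDB files ...
UNLOCK TABLES;

-- Backup locks approach: InnoDB writes keep flowing the whole time
LOCK TABLES FOR BACKUP;                -- blocks non-transactional writes and all DDL
-- ... copy non-InnoDB files ...
LOCK BINLOG FOR BACKUP;                -- pins the binary log position
FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS;  -- flush InnoDB redo so the log copy can finish
UNLOCK BINLOG;
UNLOCK TABLES;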

The TL;DR of this is that you can keep on modifying your InnoDB data through the entire backup, since we don't need to use FTWRL any longer.

This feature was introduced in Percona Server 5.6.16-64.0 and Percona XtraBackup 2.2.  I do not believe you will find it in any other MySQL variant, though I could be corrected.
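If you want to verify that a given server build supports this, Percona Server exposes a read-only have_backup_locks variable (to the best of my knowledge; check the docs for your version), which should report YES on a supporting build:

SHOW GLOBAL VARIABLES LIKE 'have_backup_locks';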

What this means for Percona XtraDB Cluster (PXC)

The most common (and logical) SST method for Percona XtraDB Cluster is using Xtrabackup. This latest release of PXC includes support for backup locks, meaning that Xtrabackup donor nodes will no longer need to get a global lock. Practically for PXC users, this means that your Donor nodes can stay in rotation without causing client interruptions due to FTWRL.
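As a reminder, the SST method is chosen in my.cnf on each node; a minimal sketch (the sst credentials here are placeholders and must match a real MySQL user with the needed privileges):

[mysqld]
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth=sst:sstpassword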

Seeing it in action

To test this out, I spun up a 3-node cluster on AWS and fired up a sysbench run on the first node. I then forced an SST on the node.
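If you want to reproduce this, one common way to force a full SST is to remove the joiner's grastate.dat so it cannot IST; a rough sketch, assuming the default datadir:

# On the joining node: discard the local Galera state, forcing a full SST on restart
service mysql stop
rm /var/lib/mysql/grastate.dat
service mysql start

Here is a snippet of the innobackup.backup.log (generated by all Xtrabackup donors in Percona XtraDB Cluster):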

InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
and Percona LLC and/or its affiliates 2009-2013. All Rights Reserved.
This software is published under
the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
Get the latest version of Percona XtraBackup, documentation, and help resources:
https://www.percona.com/xb/p
141218 19:22:01 innobackupex: Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_file=/etc/my.cnf;mysql_read_default_group=xtrabackup;mysql_socket=/var/lib/mysql/mysql.sock' as 'sst' (using password: YES).
141218 19:22:01 innobackupex: Connected to MySQL server
141218 19:22:01 innobackupex: Starting the backup operation
IMPORTANT: Please check that the backup run completes successfully.
 At the end of a successful backup run innobackupex
 prints "completed OK!".
innobackupex: Using server version 5.6.21-70.1-56
innobackupex: Created backup directory /tmp/tmp.Rm0qA740U3
141218 19:22:01 innobackupex: Starting ibbackup with command: xtrabackup --defaults-file="/etc/my.cnf" --defaults-group="mysqld" --backup --suspend-at-end --target-dir=/tmp/tmp.dM03LgPHFY --innodb_data_file_path="ibdata1:12M:autoextend" --tmpdir=/tmp/tmp.dM03LgPHFY --extra-lsndir='/tmp/tmp.dM03LgPHFY' --stream=xbstream
innobackupex: Waiting for ibbackup (pid=21892) to suspend
innobackupex: Suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_suspended_2'
xtrabackup version 2.2.7 based on MySQL server 5.6.21 Linux (x86_64) (revision id: )
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: open files limit requested 0, set to 5000
xtrabackup: using the following InnoDB configuration:
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:12M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 1073741824
xtrabackup: using O_DIRECT
>> log scanned up to (10525811040)
xtrabackup: Generating a list of tablespaces
[01] Streaming ./ibdata1
>> log scanned up to (10529368594)
>> log scanned up to (10532685942)
>> log scanned up to (10536422820)
>> log scanned up to (10539562039)
>> log scanned up to (10543077110)
[01] ...done
[01] Streaming ./mysql/innodb_table_stats.ibd
[01] ...done
[01] Streaming ./mysql/innodb_index_stats.ibd
[01] ...done
[01] Streaming ./mysql/slave_relay_log_info.ibd
[01] ...done
[01] Streaming ./mysql/slave_master_info.ibd
[01] ...done
[01] Streaming ./mysql/slave_worker_info.ibd
[01] ...done
[01] Streaming ./sbtest/sbtest1.ibd
>> log scanned up to (10546490256)
>> log scanned up to (10550321726)
>> log scanned up to (10553628936)
>> log scanned up to (10555422053)
[01] ...done
...
[01] Streaming ./sbtest/sbtest17.ibd
>> log scanned up to (10831343724)
>> log scanned up to (10834063832)
>> log scanned up to (10837100278)
>> log scanned up to (10840243171)
[01] ...done
xtrabackup: Creating suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_suspended_2' with pid '21892'
>> log scanned up to (10843312323)
141218 19:24:06 innobackupex: Continuing after ibbackup has suspended
141218 19:24:06 innobackupex: Executing LOCK TABLES FOR BACKUP...
141218 19:24:06 innobackupex: Backup tables lock acquired
141218 19:24:06 innobackupex: Starting to backup non-InnoDB tables and files
innobackupex: in subdirectories of '/var/lib/mysql/'
innobackupex: Backing up files '/var/lib/mysql//mysql/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (74 files)
>> log scanned up to (10846683627)
>> log scanned up to (10847773504)
innobackupex: Backing up files '/var/lib/mysql//sbtest/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (21 files)
innobackupex: Backing up file '/var/lib/mysql//test/db.opt'
innobackupex: Backing up files '/var/lib/mysql//performance_schema/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (53 files)
>> log scanned up to (10852976291)
141218 19:24:09 innobackupex: Finished backing up non-InnoDB tables and files
141218 19:24:09 innobackupex: Executing LOCK BINLOG FOR BACKUP...
141218 19:24:09 innobackupex: Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
141218 19:24:09 innobackupex: Waiting for log copying to finish
>> log scanned up to (10856996124)
xtrabackup: The latest check point (for incremental): '9936050111'
xtrabackup: Stopping log copying thread.
.>> log scanned up to (10856996124)
xtrabackup: Creating suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_log_copied' with pid '21892'
141218 19:24:10 innobackupex: Executing UNLOCK BINLOG
141218 19:24:10 innobackupex: Executing UNLOCK TABLES
141218 19:24:10 innobackupex: All tables unlocked
141218 19:24:10 innobackupex: Waiting for ibbackup (pid=21892) to finish
xtrabackup: Transaction log of lsn (9420426891) to (10856996124) was copied.
innobackupex: Backup created in directory '/tmp/tmp.Rm0qA740U3'
141218 19:24:30 innobackupex: Connection to database server closed
141218 19:24:30 innobackupex: completed OK!
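The lock window is easy to pick out of a donor log like this one; for example, with a quick grep:

grep -E 'LOCK TABLES FOR BACKUP|UNLOCK TABLES' innobackup.backup.log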

We can see LOCK TABLES FOR BACKUP issued at 19:24:06 and the matching UNLOCK TABLES at 19:24:10. Let's see Galera apply stats from this node during that time:

mycluster / ip-10-228-128-220 (idx: 0) / Galera 3.8(rf6147dd)
Wsrep    Cluster  Node Repl  Queue     Ops       Bytes     Conflct   Gcache    Window        Flow
    time P cnf  # Stat Laten   Up   Dn   Up   Dn   Up   Dn  lcf  bfa  ist  idx dst appl comm  p_ms
19:23:55 P   5  3 Dono 698µs    0   72    0 5418  0.0 3.5M    0    0 187k   94  3k    3    2     0
19:23:56 P   5  3 Dono 701µs    0   58    0 5411  0.0 3.5M    0    0 188k  229  3k    3    2     0
19:23:57 P   5  3 Dono 701µs    0    2    0 5721  0.0 3.7M    0    0 188k  120  3k    3    2     0
19:23:58 P   5  3 Dono 689µs    0    5    0 5643  0.0 3.6M    0    0 188k   63  3k    3    2     0
19:23:59 P   5  3 Dono 679µs    0   55    0 5428  0.0 3.5M    0    0 188k  115  3k    3    2     0
19:24:01 P   5  3 Dono 681µs    0    1    0 4623  0.0 3.0M    0    0 188k  104  3k    3    2     0
19:24:02 P   5  3 Dono 690µs    0    0    0 4301  0.0 2.7M    0    0 188k  141  3k    3    2     0
19:24:03 P   5  3 Dono 688µs    0    2    0 4907  0.0 3.1M    0    0 188k  227  3k    3    2     0
19:24:04 P   5  3 Dono 692µs    0   44    0 4894  0.0 3.1M    0    0 188k  116  3k    3    2     0
19:24:05 P   5  3 Dono 706µs    0    0    0 5337  0.0 3.4M    0    0 188k   63  3k    3    2     0

Initially the node is keeping up OK with replication: the Down Queue (wsrep_local_recv_queue) is sticking around 0, and we're applying 4-5k transactions per second (Ops Dn). When the backup lock kicks in, we do see an increase in the queue size, but note that transactions are still applying on this node:

19:24:06 P   5  3 Dono 696µs    0  170    0 5671  0.0 3.6M    0    0 187k  130  3k    3    2     0
19:24:07 P   5  3 Dono 695µs    0 2626    0 3175  0.0 2.0M    0    0 185k 2193  3k    3    2     0
19:24:08 P   5  3 Dono 692µs    0 1248    0 6782  0.0 4.3M    0    0 186k 1800  3k    3    2     0
19:24:09 P   5  3 Dono 693µs    0  611    0 6111  0.0 3.9M    0    0 187k  651  3k    3    2     0
19:24:10 P   5  3 Dono 708µs    0   93    0 5316  0.0 3.4M    0    0 187k  139  3k    3    2     0

So this node isn’t locked from innodb write transactions, it’s just suffering a bit of IO load while the backup finishes copying its files and such. After this, the backup finished up and the node goes back to a Synced state pretty quickly:

19:24:11 P   5  3 Dono 720µs    0    1    0 4486  0.0 2.9M    0    0 188k   78  3k    3    2     0
19:24:12 P   5  3 Dono 715µs    0    0    0 3982  0.0 2.5M    0    0 188k  278  3k    3    2     0
19:24:13 P   5  3 Dono 1.2ms    0    0    0 4337  0.0 2.8M    0    0 188k  143  3k    3    2     0
19:24:14 P   5  3 Dono 1.2ms    0    1    0 4901  0.0 3.1M    0    0 188k  130  3k    3    2     0
19:24:16 P   5  3 Dono 1.1ms    0    0    0 5289  0.0 3.4M    0    0 188k   76  3k    3    2     0
19:24:17 P   5  3 Dono 1.1ms    0   42    0 4998  0.0 3.2M    0    0 188k  319  3k    3    2     0
19:24:18 P   5  3 Dono 1.1ms    0   15    0 3290  0.0 2.1M    0    0 188k   75  3k    3    2     0
19:24:19 P   5  3 Dono 1.1ms    0    0    0 4124  0.0 2.6M    0    0 188k  276  3k    3    2     0
19:24:20 P   5  3 Dono 1.1ms    0    4    0 1635  0.0 1.0M    0    0 188k   70  3k    3    2     0
19:24:21 P   5  3 Dono 1.1ms    0    0    0 5026  0.0 3.2M    0    0 188k  158  3k    3    2     0
19:24:22 P   5  3 Dono 1.1ms    0   20    0 4100  0.0 2.6M    0    0 188k  129  3k    3    2     0
19:24:23 P   5  3 Dono 1.1ms    0    0    0 5412  0.0 3.5M    0    0 188k  159  3k    3    2     0
19:24:24 P   5  3 Dono 1.1ms    0  315    0 4567  0.0 2.9M    0    0 187k  170  3k    3    2     0
19:24:25 P   5  3 Dono 1.0ms    0   24    0 5535  0.0 3.5M    0    0 188k  131  3k    3    2     0
19:24:26 P   5  3 Dono 1.0ms    0    0    0 5427  0.0 3.5M    0    0 188k   71  3k    3    2     0
19:24:27 P   5  3 Dono 1.0ms    0    1    0 5221  0.0 3.3M    0    0 188k  256  3k    3    2     0
19:24:28 P   5  3 Dono 1.0ms    0    0    0 5317  0.0 3.4M    0    0 188k  159  3k    3    2     0
19:24:29 P   5  3 Dono 1.0ms    0    1    0 5491  0.0 3.5M    0    0 188k  163  3k    3    2     0
19:24:30 P   5  3 Sync 1.0ms    0    0    0 5540  0.0 3.5M    0    0 188k  296  3k    3    2     0
19:24:31 P   5  3 Sync 992µs    0  106    0 5594  0.0 3.6M    0    0 187k  130  3k    3    2     0
19:24:33 P   5  3 Sync 984µs    0   19    0 5723  0.0 3.7M    0    0 188k  275  3k    3    2     0
19:24:34 P   5  3 Sync 976µs    0    0    0 5508  0.0 3.5M    0    0 188k  182  3k    3    2     0
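Incidentally, if you want to watch these counters on your own nodes without a separate monitoring tool, the underlying wsrep status variables can be polled directly; a minimal sketch:

SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';    -- the 'Queue Dn' column above
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- Donor/Desynced vs. Synced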

Compared to Percona XtraDB Cluster 5.5

Backup locks are a feature of Percona XtraDB Cluster 5.6 only, so if we repeat the experiment on 5.5 we can see a more severe lock:

141218 20:31:19  innobackupex: Executing FLUSH TABLES WITH READ LOCK...
141218 20:31:19  innobackupex: All tables locked and flushed to disk
141218 20:31:19  innobackupex: Starting to backup non-InnoDB tables and files
innobackupex: in subdirectories of '/var/lib/mysql/'
innobackupex: Backing up files '/var/lib/mysql//sbtest/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (21 files)
innobackupex: Backing up files '/var/lib/mysql//mysql/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (72 files)
>> log scanned up to (6633554484)
innobackupex: Backing up file '/var/lib/mysql//test/db.opt'
innobackupex: Backing up files '/var/lib/mysql//performance_schema/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (18 files)
141218 20:31:21  innobackupex: Finished backing up non-InnoDB tables and files
141218 20:31:21  innobackupex: Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
141218 20:31:21  innobackupex: Waiting for log copying to finish
xtrabackup: The latest check point (for incremental): '5420681649'
xtrabackup: Stopping log copying thread.
.>> log scanned up to (6633560488)
xtrabackup: Creating suspend file '/tmp/tmp.Cq5JRZEFki/xtrabackup_log_copied' with pid '23130'
141218 20:31:22  innobackupex: All tables unlocked

Our lock lasts from 20:31:19 until 20:31:21, so it's fairly short. Note that on larger databases with more schemas and tables, this lock can last quite a bit longer. Let's see the effect on the apply rate for this node:

mycluster / ip-10-229-68-156 (idx: 0) / Galera 2.11(r318911d)
Wsrep    Cluster  Node Repl  Queue     Ops       Bytes     Conflct   Gcache    Window        Flow
    time P cnf  # Stat Laten   Up   Dn   Up   Dn   Up   Dn  lcf  bfa  ist  idx dst appl comm  p_ms
20:31:13 P   5  3 Dono   N/A    0   73    0 3493  0.0 1.8M    0    0 1.8m  832 746    2    2   0.0
20:31:14 P   5  3 Dono   N/A    0   29    0 3578  0.0 1.9M    0    0 1.8m  850 749    3    2   0.0
20:31:15 P   5  3 Dono   N/A    0    0    0 3513  0.0 1.8M    0    0 1.8m  735 743    2    2   0.0
20:31:16 P   5  3 Dono   N/A    0    0    0 3651  0.0 1.9M    0    0 1.8m  827 748    2    2   0.0
20:31:17 P   5  3 Dono   N/A    0   27    0 3642  0.0 1.9M    0    0 1.8m  840 762    2    2   0.0
20:31:18 P   5  3 Dono   N/A    0    0    0 3840  0.0 2.0M    0    0 1.8m  563 776    2    2   0.0
20:31:19 P   5  3 Dono   N/A    0    0    0 4368  0.0 2.3M    0    0 1.8m  823 745    2    1   0.0
20:31:20 P   5  3 Dono   N/A    0 3952    0  339  0.0 0.2M    0    0 1.8m  678 751    1    1   0.0
20:31:21 P   5  3 Dono   N/A    0 7883    0    0  0.0  0.0    0    0 1.8m  678 751    0    0   0.0
20:31:22 P   5  3 Dono   N/A    0 4917    0 5947  0.0 3.1M    0    0 1.8m 6034  3k    7    6   0.0
20:31:24 P   5  3 Dono   N/A    0   10    0 8238  0.0 4.3M    0    0 1.8m  991  1k    7    6   0.0
20:31:25 P   5  3 Dono   N/A    0    0    0 3016  0.0 1.6M    0    0 1.8m  914 754    2    1   0.0
20:31:26 P   5  3 Dono   N/A    0    0    0 3253  0.0 1.7M    0    0 1.8m  613 766    1    1   0.0
20:31:27 P   5  3 Dono   N/A    0    1    0 3600  0.0 1.9M    0    0 1.8m  583 777    2    1   0.0
20:31:28 P   5  3 Dono   N/A    0    0    0 3640  0.0 1.9M    0    0 1.8m  664 750    2    2   0.0

The drop here is more severe and the apply rate hits 0 (and stays there for the duration of the FTWRL).

Implications

Obviously Xtrabackup running on a PXC node will cause some load on the node itself, so there still may be good reasons to keep a Donor node out of rotation from your application. However, this is less of an issue than it was in the past, when writes would definitely stall on a Donor node and could cause intermittent stalls in the application.

How you allow applications to start using a Donor node automatically (or not) depends on how you have set up HA between the application and the cluster. If you use HAproxy or similar with clustercheck, you can either modify the script itself or change a command line argument. In the example below the node is in the Donor/Desynced state; the extra '1' argument in the second invocation tells clustercheck to report a Donor node as available:

[root@ip-10-229-64-35 ~]# /usr/bin/clustercheck clustercheckuser clustercheckpassword!
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 44
Percona XtraDB Cluster Node is not synced.
[root@ip-10-229-64-35 ~]# /usr/bin/clustercheck clustercheckuser clustercheckpassword! 1
HTTP/1.1 200 OK
Content-Type: text/plain
Connection: close
Content-Length: 40
Percona XtraDB Cluster Node is synced.

For those doing their own custom health checking, you basically just need to pass nodes that report a wsrep_local_state_comment of either 'Synced' or 'Donor/Desynced'.
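A minimal sketch of such a check (a hypothetical script; adjust credentials and connection options to your environment):

#!/bin/bash
# Hypothetical health check: pass the node only if wsrep_local_state_comment
# is 'Synced' or 'Donor/Desynced'.
STATE=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}')
if [ "$STATE" = "Synced" ] || [ "$STATE" = "Donor/Desynced" ]; then
    exit 0   # healthy: OK to route traffic here
else
    exit 1   # not healthy: skip this node
fi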


