Ubuntu Manpage: pt-table-sync - Synchronize MySQL table data efficiently.

Provided by: percona-toolkit_3.2.1-1_all

NAME

       pt-table-sync - Synchronize MySQL table data efficiently.

SYNOPSIS

       Usage: pt-table-sync [OPTIONS] DSN [DSN]

       pt-table-sync synchronizes data efficiently between MySQL tables.

       This tool changes data, so for maximum safety, you should back up your data before using it.  When
       synchronizing a server that is a replication slave with the "--replicate" or "--sync-to-master" methods,
       it always makes the changes on the replication master, never the replication slave directly.  This is in
       general the only safe way to bring a replica back in sync with its master; changes to the replica are
       usually the source of the problems in the first place.  However, the changes it makes on the master
       should be no-op changes that set the data to their current values, and actually affect only the replica.

       Sync db.tbl on host1 to host2:

         pt-table-sync --execute h=host1,D=db,t=tbl h=host2

       Sync all tables on host1 to host2 and host3:

         pt-table-sync --execute host1 host2 host3

       Make slave1 have the same data as its replication master:

         pt-table-sync --execute --sync-to-master slave1

       Resolve differences that pt-table-checksum found on all slaves of master1:

         pt-table-sync --execute --replicate test.checksum master1

       Same as above but only resolve differences on slave1:

         pt-table-sync --execute --replicate test.checksum \
           --sync-to-master slave1

       Sync master2 in a master-master replication configuration, where master2's copy of db.tbl is known or
       suspected to be incorrect:

         pt-table-sync --execute --sync-to-master h=master2,D=db,t=tbl

       Note that in the master-master configuration, the following will NOT do what you want, because it will
       make changes directly on master2, which will then flow through replication and change master1's data:

         # Don't do this in a master-master setup!
         pt-table-sync --execute h=master1,D=db,t=tbl master2

RISKS

       WARNING: pt-table-sync changes data!  Before using this tool, please:

       •   Read the tool's documentation

       •   Review the tool's known "BUGS"

       •   Test the tool on a non-production server

       •   Back up your production server and verify the backups

       pt-table-sync is mature, proven in the real world, and well tested, but if used improperly it can
       have adverse consequences.  Always test syncing first with "--dry-run" and "--print".

DESCRIPTION

pt-table-sync does one-way and bidirectional synchronization of table data. It does not synchronize
table structures, indexes, or any other schema objects. The following describes one-way synchronization.
"BIDIRECTIONAL SYNCING" is described later.

This tool is complex and functions in several different ways. To use it safely and effectively, you
should understand three things: the purpose of "--replicate", finding differences, and specifying hosts.
These three concepts are closely related and determine how the tool will run. The following is the
abbreviated logic:

   if DSN has a t part, sync only that table:
      if 1 DSN:
         if --sync-to-master:
            The DSN is a slave.  Connect to its master and sync.
      if more than 1 DSN:
         The first DSN is the source.  Sync each DSN in turn.
   else if --replicate:
      if --sync-to-master:
         The DSN is a slave.  Connect to its master, find records
         of differences, and fix.
      else:
         The DSN is the master.  Find slaves and connect to each,
         find records of differences, and fix.
   else:
      if only 1 DSN and --sync-to-master:
         The DSN is a slave.  Connect to its master, find tables and
         filter with --databases etc, and sync each table to the master.
      else:
         find tables, filtering with --databases etc, and sync each
         DSN to the first.

pt-table-sync can run in one of two ways: with "--replicate" or without. The default is to run without
"--replicate" which causes pt-table-sync to automatically find differences efficiently with one of
several algorithms (see "ALGORITHMS"). Alternatively, the value of "--replicate", if specified, causes
pt-table-sync to use the differences already found by a previous run of pt-table-checksum with its
own "--replicate" option. Strictly speaking, you don't need to use "--replicate" because pt-table-sync
can find differences, but many people use "--replicate" if, for example, they checksum regularly using
pt-table-checksum then fix differences as needed with pt-table-sync. If you're unsure, read each tool's
documentation carefully and decide for yourself, or consult with an expert.

Regardless of whether "--replicate" is used or not, you need to specify which hosts to sync. There are
two ways: with "--sync-to-master" or without. Specifying "--sync-to-master" makes pt-table-sync expect
one and only one slave DSN on the command line. The tool will automatically discover the slave's master and
sync the slave so that its data matches the master's. This is accomplished by making changes on the master
which then flow through replication and update the slave to resolve its differences. Be careful though:
although this option specifies and syncs a single slave, if there are other slaves on the same master,
they will receive via replication the changes intended for the slave that you're trying to sync.

Alternatively, if you do not specify "--sync-to-master", the first DSN given on the command line is the
source host. There is only ever one source host. If you do not also specify "--replicate", then you
must specify at least one other DSN as the destination host. There can be one or more destination hosts.
Source and destination hosts must be independent; they cannot be in the same replication topology.
pt-table-sync will die with an error if it detects that a destination host is a slave because changes are
written directly to destination hosts (and it's not safe to write directly to slaves). Or, if you
specify "--replicate" (but not "--sync-to-master") then pt-table-sync expects one and only one master DSN
on the command line. The tool will automatically discover all the master's slaves and sync them to the
master. This is the only way to sync several (all) slaves at once (because "--sync-to-master" only
specifies one slave).

Each host on the command line is specified as a DSN. The first DSN (or only DSN for cases like
"--sync-to-master") provides default values for other DSNs, whether those other DSNs are specified on the
command line or auto-discovered by the tool. So in this example,

   pt-table-sync --execute h=host1,u=msandbox,p=msandbox h=host2

the host2 DSN inherits the "u" and "p" DSN parts from the host1 DSN. Use the "--explain-hosts" option to
see how pt-table-sync will interpret the DSNs given on the command line.
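
The inheritance rule above can be pictured with a small Python sketch. This is an illustration only, not the
tool's own parser: the function names are hypothetical, and it assumes that DSN parts are plain
comma-separated key=value pairs and that a bare word such as "host1" is a host name.

```python
def parse_dsn(dsn):
    """Split 'h=host1,u=user' into {'h': 'host1', 'u': 'user'}."""
    parts = {}
    for item in dsn.split(","):
        key, sep, value = item.partition("=")
        if sep:
            parts[key] = value
        else:
            # Assumption: a bare word with no '=' is treated as the host part.
            parts["h"] = key
    return parts


def inherit_defaults(first_dsn, other_dsn):
    """Fill in parts missing from other_dsn using values from first_dsn."""
    merged = parse_dsn(first_dsn)        # start from the first DSN's parts
    merged.update(parse_dsn(other_dsn))  # parts given explicitly always win
    return merged


host2 = inherit_defaults("h=host1,u=msandbox,p=msandbox", "h=host2")
print(host2)
```

In this sketch host2 keeps its own "h" part and picks up "u" and "p" from host1. The authoritative view of
how the real tool resolves DSNs is still the output of "--explain-hosts".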

LIMITATIONS

       Replicas using row-based replication
           pt-table-sync requires statement-based replication when used with the "--sync-to-master" or
           "--replicate" option.  Therefore it will set "binlog_format=STATEMENT" on the master for its session
           if required.  To do this, the user must have the "SUPER" privilege.

OUTPUT

If you specify the "--verbose" option, you'll see information about the differences between the tables.
There is one row per table. Each server is printed separately. For example,

   # Syncing h=host1,D=test,t=test1
   # DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
   #      0       0      3      0 Chunk     13:00:00 13:00:17 2    test.test1

Table test.test1 on host1 required 3 "INSERT" statements to synchronize and it used the Chunk algorithm
(see "ALGORITHMS"). The sync operation for this table started at 13:00:00 and ended 17 seconds later
(times taken from "NOW()" on the source host). Because differences were found, its "EXIT STATUS" was 2.
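
If you want to post-process this output, a per-table row can be split on whitespace. The following Python
sketch is a hypothetical helper, not part of the tool; it assumes the nine columns shown in the example
header and that the caller skips the "# Syncing" and header lines.

```python
FIELDS = ["delete", "replace", "insert", "update", "algorithm",
          "start", "end", "exit", "table"]
NUMERIC = ("delete", "replace", "insert", "update", "exit")


def parse_summary_row(line):
    """Turn one '--verbose' data row into a dict keyed by column name."""
    values = line.lstrip("#").split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of columns: %r" % line)
    row = dict(zip(FIELDS, values))
    for name in NUMERIC:
        row[name] = int(row[name])  # counters and exit status are integers
    return row


row = parse_summary_row("# 0 0 3 0 Chunk 13:00:00 13:00:17 2 test.test1")
print(row["table"], row["insert"], row["exit"])
```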

If you specify the "--print" option, you'll see the SQL statements that the tool uses to synchronize the
table; they are executed only if "--execute" is also specified.

If you want to see the SQL statements that pt-table-sync is using to select chunks, nibbles, rows, etc.,
then specify "--print" once and "--verbose" twice. Be careful though: this can print a lot of SQL
statements.

There are cases where no combination of "INSERT", "UPDATE" or "DELETE" statements can resolve differences
without violating some unique key. For example, suppose there's a primary key on column a and a unique
key on column b. Then there is no way to sync these two tables with straightforward UPDATE statements:

+---+---+ +---+---+
| a | b | | a | b |
+---+---+ +---+---+
| 1 | 2 | | 1 | 1 |
| 2 | 1 | | 2 | 2 |
+---+---+ +---+---+

The tool rewrites queries to "DELETE" and "REPLACE" in this case. This is automatically handled after
the first index violation, so you don't have to worry about it.
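
The unique-key problem can be reproduced on a small scale. The sketch below uses Python's built-in sqlite3
module as a stand-in for MySQL (an assumption made so that the example is self-contained; SQLite's
"REPLACE" also deletes every conflicting row before inserting). The table name "dst" is hypothetical, and
the sketch only demonstrates the principle, not the exact statements the tool generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dst (a INTEGER PRIMARY KEY, b INTEGER UNIQUE)")
conn.executemany("INSERT INTO dst VALUES (?, ?)", [(1, 1), (2, 2)])

# Rows as they exist on the source; the destination has b swapped.
source_rows = [(1, 2), (2, 1)]

# A plain UPDATE collides with the unique key on b.
update_failed = False
try:
    conn.execute("UPDATE dst SET b = 2 WHERE a = 1")
except sqlite3.IntegrityError:
    update_failed = True

# REPLACE removes any rows that conflict on either key, then inserts.
for row in source_rows:
    conn.execute("REPLACE INTO dst VALUES (?, ?)", row)

final_rows = conn.execute("SELECT a, b FROM dst ORDER BY a").fetchall()
print(update_failed, final_rows)
```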

Be careful when using pt-table-sync in any master-master setup. Master-master replication is inherently
tricky, and it's easy to make mistakes. You need to be sure you're using the tool correctly for
master-master replication. See the "SYNOPSIS" for the overview of the correct usage.

Also be careful with tables that have foreign key constraints with "ON DELETE" or "ON UPDATE" definitions
because these might cause unintended changes on the child tables. See "--[no]check-child-tables".

In general, this tool is best suited when your tables have a primary key or unique index. Although it
can synchronize data in tables lacking a primary key or unique index, it might be best to synchronize
that data by another means.

REPLICATION SAFETY

Synchronizing a replication master and slave safely is a non-trivial problem, in general. There are all
sorts of issues to think about, such as other processes changing data, trying to change data on the
slave, whether the destination and source are a master-master pair, and much more.

In general, the safe way to do it is to change the data on the master, and let the changes flow through
replication to the slave like any other changes. However, this works only if it's possible to REPLACE
into the table on the master. REPLACE works only if there's a unique index on the table (otherwise it
just acts like an ordinary INSERT).

If your table has unique keys, you should use the "--sync-to-master" and/or "--replicate" options to sync
a slave to its master. This will generally do the right thing. When there is no unique key on the
table, there is no choice but to change the data on the slave, and pt-table-sync will detect that you're
trying to do so. It will complain and die unless you specify "--no-check-slave" (see
"--[no]check-slave").

If you're syncing a table without a primary or unique key on a master-master pair, you must change the
data on the destination server. Therefore, you need to specify "--no-bin-log" for safety (see
"--[no]bin-log"). If you don't, the changes you make on the destination server will replicate back to
the source server and change the data there!

The generally safe thing to do on a master-master pair is to use the "--sync-to-master" option so you
don't change the data on the destination server. You will also need to specify "--no-check-slave" to
keep pt-table-sync from complaining that it is changing data on a slave.

ALGORITHMS

pt-table-sync has a generic data-syncing framework which uses different algorithms to find differences.
The tool automatically chooses the best algorithm for each table based on indexes, column types, and the
algorithm preferences specified by "--algorithms". The following algorithms are available, listed in
their default order of preference:

Chunk
Finds an index whose first column is numeric (including date and time types), and divides the
column's range of values into chunks of approximately "--chunk-size" rows. Syncs a chunk at a time
by checksumming the entire chunk. If the chunk differs on the source and destination, checksums each
chunk's rows individually to find the rows that differ.

It is efficient when the column has sufficient cardinality to make the chunks end up about the right
size.

The initial per-chunk checksum is quite small and results in minimal network traffic and memory
consumption. If a chunk's rows must be examined, only the primary key columns and a checksum are
sent over the network, not the entire row. If a row is found to be different, the entire row will be
fetched, but not before.

Note that this algorithm will not work if chunking a char column where all the values start with the
same character. In that case, the tool will exit and suggest picking a different algorithm.
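
The two-level comparison that Chunk performs can be sketched in Python. This is a simplified illustration
and not the tool's implementation: tables are modelled as dicts keyed by an integer column, chunks are
fixed-width key ranges (which match "--chunk-size" rows only when keys are dense), and SHA-256 stands in
for whatever checksum function the tool uses. All names are hypothetical.

```python
import hashlib


def checksum(rows):
    """Checksum a sequence of rows."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def chunk_rows(table, low, high):
    """Rows whose key k satisfies low <= k < high, in key order."""
    return [(k, table[k]) for k in sorted(table) if low <= k < high]


def find_differing_keys(source, dest, chunk_size):
    keys = sorted(set(source) | set(dest))
    differing = []
    if not keys:
        return differing
    low = keys[0]
    while low <= keys[-1]:
        high = low + chunk_size
        # Step 1: one cheap checksum per chunk on each side.
        if checksum(chunk_rows(source, low, high)) != checksum(chunk_rows(dest, low, high)):
            # Step 2: only for a differing chunk, compare per-row checksums.
            for key in (k for k in keys if low <= k < high):
                if checksum([(key, source.get(key))]) != checksum([(key, dest.get(key))]):
                    differing.append(key)
        low = high
    return differing


source = {1: "a", 2: "b", 3: "c", 1001: "x"}
dest = {1: "a", 2: "B", 1001: "x", 1002: "y"}
diff_keys = find_differing_keys(source, dest, 1000)
print(diff_keys)
```

Only the keys reported by the second step would need their full rows fetched, which is the source of the
low network traffic described above.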

Nibble
Finds an index and ascends the index in fixed-size nibbles of "--chunk-size" rows, using a
non-backtracking algorithm (see pt-archiver for more on this algorithm). It is very similar to "Chunk",
but instead of pre-calculating the boundaries of each piece of the table based on index cardinality,
it uses "LIMIT" to define each nibble's upper limit, and the previous nibble's upper limit to define
the lower limit.

It works in steps: one query finds the row that will define the next nibble's upper boundary, and the
next query checksums the entire nibble. If the nibble differs between the source and destination, it
examines the nibble row-by-row, just as "Chunk" does.
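
The way nibble boundaries are derived can be sketched in Python. This is an illustration under simplifying
assumptions, not the tool's code: the index is modelled as a sorted list of unique keys, the slice plays
the role of "LIMIT", and the function name is hypothetical.

```python
from bisect import bisect_right


def nibble_boundaries(sorted_keys, chunk_size):
    """Yield (lower_exclusive, upper_inclusive) pairs while ascending the index."""
    lower = None
    while True:
        # Start just past the previous nibble's upper limit (never backtrack).
        start = 0 if lower is None else bisect_right(sorted_keys, lower)
        nibble = sorted_keys[start:start + chunk_size]  # like LIMIT chunk_size
        if not nibble:
            return
        upper = nibble[-1]
        yield (lower, upper)
        lower = upper


bounds = list(nibble_boundaries([1, 2, 5, 8, 13, 21, 34], 3))
print(bounds)
```

Each pair would then be checksummed as one unit on the source and on the destination, and examined row by
row only if the checksums differ.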

GroupBy
Selects the entire table grouped by all columns, with a COUNT(*) column added. Compares all columns,
and if they're the same, compares the COUNT(*) column's value to determine how many rows to insert into or
delete from the destination. Works on tables with no primary key or unique index.
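
The counting idea behind GroupBy can be sketched with Python's collections.Counter, which plays the role
of "GROUP BY all columns" plus COUNT(*). This is an illustration only; the function name is hypothetical
and rows are modelled as tuples.

```python
from collections import Counter


def groupby_diff(source_rows, dest_rows):
    """Return (to_insert, to_delete): how many copies of each row to add or remove."""
    source_counts = Counter(source_rows)
    dest_counts = Counter(dest_rows)
    # Counter subtraction keeps only positive differences.
    to_insert = dict(source_counts - dest_counts)
    to_delete = dict(dest_counts - source_counts)
    return to_insert, to_delete


source_rows = [("x", 1), ("x", 1), ("y", 2)]
dest_rows = [("x", 1), ("z", 3), ("z", 3)]
to_insert, to_delete = groupby_diff(source_rows, dest_rows)
print(to_insert, to_delete)
```

Because identical rows are indistinguishable without a key, the result is expressed as counts per distinct
row rather than as specific rows.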

Stream
Selects all columns of the entire table in one big stream and compares them. Much less
efficient than the other algorithms, but works when there is no suitable index for them to use.

Future Plans
Possibilities for future algorithms are TempTable (what I originally called bottom-up in earlier
versions of this tool), DrillDown (what I originally called top-down), and GroupByPrefix (similar to
how SqlYOG Job Agent works). Each algorithm has strengths and weaknesses. If you'd like to
implement your favorite technique for finding differences between two sources of data on possibly
different servers, I'm willing to help. The algorithms adhere to a simple interface that makes it
pretty easy to write your own.

BIDIRECTIONAL SYNCING

Bidirectional syncing is a new, experimental feature. To make it work reliably there are a number of
strict limitations:

* only works when syncing one server to other independent servers
* does not work in any way with replication
* requires that the table(s) are chunkable with the Chunk algorithm
* is not N-way, only bidirectional between two servers at a time
* does not handle DELETE changes

For example, suppose we have three servers: c1, r1, r2. c1 is the central server, a pseudo-master to the
other servers (viz. r1 and r2 are not slaves to c1). r1 and r2 are remote servers. Rows in table foo
are updated and inserted on all three servers and we want to synchronize all the changes between all the
servers. Table foo has columns:

id int PRIMARY KEY
ts timestamp auto updated
name varchar

Auto-increment offsets are used so that new rows from any server do not create conflicting primary key
(id) values. In general, newer rows, as determined by the ts column, take precedence when a same but
differing row is found during the bidirectional sync. "Same but differing" means that two rows have the
same primary key (id) value but different values for some other column, like the name column in this
example. Same but differing conflicts are resolved by a "conflict". A conflict compares some column of
the competing rows to determine a "winner". The winning row becomes the source and its values are used
to update the other row.

There are subtle differences between three columns used to achieve bidirectional syncing that you should
be familiar with: chunk column ("--chunk-column"), comparison column(s) ("--columns"), and conflict
column ("--conflict-column"). The chunk column is only used to chunk the table; e.g. "WHERE id >= 5 AND
id < 10". Chunks are checksummed and when chunk checksums reveal a difference, the tool selects the rows
in that chunk and checksums the "--columns" for each row. If a column checksum differs, the rows have
one or more conflicting column values. In a traditional unidirectional sync, the conflict is a moot
point because it can be resolved simply by updating the entire destination row with the source row's
values. In a bidirectional sync, however, the "--conflict-column" (in accordance with other
"--conflict-*" options listed below) is compared to determine which row is "correct" or "authoritative";
this row becomes the "source".
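
As a rough illustration of how a conflict column picks the winner, the Python sketch below applies a
"newer value wins" rule to two same-but-differing rows. It is not the tool's implementation: the function
name is hypothetical, rows are plain dicts, timestamps are ISO-formatted strings (which compare correctly
as text), and raising an error on a tie is an assumption of this sketch.

```python
def resolve_conflict(left_row, right_row, conflict_column):
    """Return (winner, loser); the row with the larger conflict value wins."""
    left_value = left_row[conflict_column]
    right_value = right_row[conflict_column]
    if left_value == right_value:
        # Assumption: a tie cannot be resolved automatically.
        raise ValueError("conflict column values are equal")
    if left_value > right_value:
        return left_row, right_row
    return right_row, left_row


left = {"id": 5, "ts": "2020-01-01 10:00:00", "name": "old name"}
right = {"id": 5, "ts": "2020-01-02 09:00:00", "name": "new name"}
winner, loser = resolve_conflict(left, right, "ts")
print(winner["name"])
```

The winner then acts as the source for that row, and its values overwrite the loser's.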

To sync all three servers completely, two runs of pt-table-sync are required. The first run syncs c1 and
r1, then syncs c1 and r2 including any changes from r1. At this point c1 and r2 are completely in sync,
but r1 is missing any changes from r2 because c1 didn't have these changes when it and r1 were synced.
So a second run is needed which syncs the servers in the same order, but this time when c1 and r1 are
synced r1 gets r2's changes.

The tool does not sync N-ways, only bidirectionally between the first DSN given on the command line and
each subsequent DSN in turn. So the tool in this example would be run twice, like:

   pt-table-sync --bidirectional h=c1 h=r1 h=r2
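
Why two runs are needed can be checked with a toy simulation. In this Python sketch each server is a set
of row identifiers and a bidirectional sync simply leaves both sides holding the union; conflicts and
DELETEs are ignored, and the function names are hypothetical.

```python
def bidirectional_sync(left, right):
    """After syncing, both servers hold the union of their rows."""
    left.update(right)
    right.update(left)


def run_once(c1, r1, r2):
    """One run: the first DSN (c1) is synced with each other DSN in turn."""
    bidirectional_sync(c1, r1)
    bidirectional_sync(c1, r2)


c1, r1, r2 = {"a"}, {"b"}, {"c"}

run_once(c1, r1, r2)
after_first = (set(c1), set(r1), set(r2))   # r1 still lacks r2's row "c"

run_once(c1, r1, r2)
after_second = (set(c1), set(r1), set(r2))  # all three servers now agree

print(after_first)
print(after_second)
```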

The "--bidirectional" option enables this feature and causes various sanity checks to be performed. You
must specify other options that tell pt-table-sync how to resolve conflicts for same but differing rows.
These options are:

* --conflict-column
* --conflict-comparison
* --conflict-value
* --conflict-threshold
* --conflict-error (optional)

Use "--print" to test this option before "--execute". The printed SQL statements will have comments
saying on which host the statement would be executed if you used "--execute".

Technical side note: the first DSN is always the "left" server and the other DSNs are always the "right"
server. Since either server can become the source or destination it's confusing to think of them as
"src" and "dst". Therefore, they're generically referred to as left and right. It's easy to remember
this because the first DSN is always to the left of the other server DSNs on the command line.

EXIT STATUS

       The following are the exit statuses (also called return values, or return codes) when pt-table-sync
       finishes and exits.

          STATUS  MEANING
          ======  =======================================================
          0       Success.
          1       Internal error.
          2       At least one table differed on the destination.
          3       Combination of 1 and 2.
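
       Since status 3 is described as a combination of 1 and 2, a wrapper script can treat the value as
       two flags. The Python sketch below is a hypothetical helper based only on the table above; how it
       handles values outside 0 to 3 is an assumption of the sketch.

```python
def describe_exit_status(status):
    """Translate a pt-table-sync exit status into a list of meanings."""
    if status == 0:
        return ["success"]
    meanings = []
    if status & 1:
        meanings.append("internal error")
    if status & 2:
        meanings.append("at least one table differed on the destination")
    if status & ~3:
        # Assumption: anything beyond the documented values is reported as unknown.
        meanings.append("unknown status bits")
    return meanings


for code in (0, 1, 2, 3):
    print(code, describe_exit_status(code))
```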

OPTIONS

Specify at least one of "--print", "--execute", or "--dry-run".

"--where" and "--replicate" are mutually exclusive.

This tool accepts additional command-line arguments. Refer to the "SYNOPSIS" and usage information for
details.

--algorithms
type: string; default: Chunk,Nibble,GroupBy,Stream

Algorithm to use when comparing the tables, in order of preference.

For each table, pt-table-sync will check if the table can be synced with the given algorithms in the
order that they're given. The first algorithm that can sync the table is used. See "ALGORITHMS".

--ask-pass
Prompt for a password when connecting to MySQL.

--bidirectional
Enable bidirectional sync between first and subsequent hosts.

See "BIDIRECTIONAL SYNCING" for more information.

--[no]bin-log
default: yes

Log to the binary log ("SET SQL_LOG_BIN=1").

Specifying "--no-bin-log" will "SET SQL_LOG_BIN=0".

--buffer-in-mysql
Instruct MySQL to buffer queries in its memory.

This option adds the "SQL_BUFFER_RESULT" option to the comparison queries. This causes MySQL to
execute the queries and place their results in a temporary table internally before sending them back to
pt-table-sync. The advantage of this strategy is that pt-table-sync can fetch rows as desired
without using a lot of memory inside the Perl process, while releasing locks on the MySQL table (to
reduce contention with other queries). The disadvantage is that it uses more memory on the MySQL
server instead.

You probably want to leave "--[no]buffer-to-client" enabled too, because buffering into a temp table
and then fetching it all into Perl's memory is probably a silly thing to do. This option is most
useful for the GroupBy and Stream algorithms, which may fetch a lot of data from the server.

--[no]buffer-to-client
default: yes

Fetch rows one-by-one from MySQL while comparing.

This option enables "mysql_use_result" which causes MySQL to hold the selected rows on the server
until the tool fetches them. This allows the tool to use less memory but may keep the rows locked on
the server longer.

If this option is disabled by specifying "--no-buffer-to-client" then "mysql_store_result" is used
which causes MySQL to send all selected rows to the tool at once. This may result in the results
"cursor" being held open for a shorter time on the server, but if the tables are large, it could take
a long time anyway, and use all your memory.

For most non-trivial data sizes, you want to leave this option enabled.

This option is disabled when "--bidirectional" is used.

--channel
type: string

Channel name used when connected to a server using replication channels. Suppose you have two
masters, master_a at port 12345, master_b at port 1236 and a slave connected to both masters using
channels chan_master_a and chan_master_b. If you want to run pt-table-sync to synchronize the slave
against master_a, pt-table-sync won't be able to determine what's the correct master since SHOW SLAVE
STATUS will return 2 rows. In this case, you can use --channel=chan_master_a to specify the channel
name to use in the SHOW SLAVE STATUS command.

--charset
short form: -A; type: string

Default character set. If the value is utf8, sets Perl's binmode on STDOUT to utf8, passes the
mysql_enable_utf8 option to DBD::mysql, and runs SET NAMES UTF8 after connecting to MySQL. Any other
value sets binmode on STDOUT without the utf8 layer, and runs SET NAMES after connecting to MySQL.

--[no]check-child-tables
default: yes

Check if "--execute" will adversely affect child tables. When "--replace", "--replicate", or
"--sync-to-master" is specified, the tool may sync tables using "REPLACE" statements. If a table
being synced has child tables with "ON DELETE CASCADE", "ON UPDATE CASCADE", or "ON UPDATE SET NULL",
the tool prints an error and skips the table because "REPLACE" becomes "DELETE" then "INSERT", so the
"DELETE" will cascade to the child table and delete its rows. In the worst case, this can delete all
rows in child tables!

Specify "--no-check-child-tables" to disable this check. To completely avoid affecting child tables,
also specify "--no-foreign-key-checks" so MySQL will not cascade any operations from the parent to
child tables.

This check is only performed if "--execute" and one of "--replace", "--replicate", or
"--sync-to-master" is specified. "--print" does not check child tables.

The error message only prints the first child table found with an "ON DELETE CASCADE", "ON UPDATE
CASCADE", or "ON UPDATE SET NULL" foreign key constraint. There could be other affected child
tables.

--[no]check-master
default: yes

With "--sync-to-master", try to verify that the detected master is the real master.

--[no]check-slave
default: yes

Check whether the destination server is a slave.

If the destination server is a slave, it's generally unsafe to make changes on it. However,
sometimes you have to; "--replace" won't work unless there's a unique index, for example, so you
can't make changes on the master in that scenario. By default pt-table-sync will complain if you try
to change data on a slave. Specify "--no-check-slave" to disable this check. Use it at your own
risk.

--[no]check-triggers
default: yes

Check that no triggers are defined on the destination table.

Triggers were introduced in MySQL v5.0.2, so for older versions this option has no effect because
triggers will not be checked.

--chunk-column
type: string

Chunk the table on this column.

--chunk-index
type: string

Chunk the table using this index.

--chunk-size
type: string; default: 1000

Number of rows or data size per chunk.

The size of each chunk of rows for the "Chunk" and "Nibble" algorithms. The size can be either a
number of rows, or a data size. Data sizes are specified with a suffix of k=kibibytes, M=mebibytes,
G=gibibytes. Data sizes are converted to a number of rows by dividing by the average row length.
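
The conversion from a data size to a row count can be sketched in Python as follows. This is an
illustration of the arithmetic described above, not the tool's code: the function name is hypothetical,
and rounding down with a minimum of one row is an assumption.

```python
SUFFIXES = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def chunk_size_to_rows(chunk_size, avg_row_length):
    """Convert a --chunk-size value such as '1000' or '1M' to a number of rows."""
    suffix = chunk_size[-1]
    if suffix in SUFFIXES:
        size_in_bytes = int(chunk_size[:-1]) * SUFFIXES[suffix]
        # Assumption: round down, but never go below one row per chunk.
        return max(1, size_in_bytes // avg_row_length)
    return int(chunk_size)  # no suffix: the value is already a row count


print(chunk_size_to_rows("1000", 128))
print(chunk_size_to_rows("1M", 128))
```

For example, with an average row length of 128 bytes, a chunk size of 1M corresponds to 1048576 / 128 =
8192 rows.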

--columns
short form: -c; type: array

Compare this comma-separated list of columns.

--config
type: array

Read this comma-separated list of config files; if specified, this must be the first option on the
command line.

--conflict-column
type: string

Compare this column when rows conflict during a "--bidirectional" sync.

When a same but differing row is found the value of this column from each row is compared according
to "--conflict-comparison", "--conflict-value" and "--conflict-threshold" to determine which row has
the correct data and becomes the source. The column can be any type for which there is an
appropriate "--conflict-comparison" (this is almost all types except, for example, blobs).