There are three problems that make chained replication difficult to deploy. I expect most of these to be fixed by future MySQL releases.
- It is difficult to avoid replicating statements that don't need to be replicated.
- Each server in the chain has more replication delay from the master.
- Transactions may be lost when a failed node is removed from the chain.
As the slaves will be used for reporting queries, applications need to create working tables to evaluate queries in multiple steps, both to improve performance and to compensate for SQL features that MySQL does not support. These tables will be created in the test database. Statements that create those tables or modify their rows must not be replayed on other servers by the replication thread: they might fail because the tables may not exist (or conflicting tables may exist) on other servers, and even when they succeed they create useless load. Because all slaves have log_bin set, these statements would otherwise be written to each slave's binlog and received by the next slave in the replication chain.
The replicate-wild-ignore-table parameter prevents these statements from being replayed by replication. Unfortunately, the replication filter parameters are a constant source of confusion and a potential source of errors. On this I speak from experience, and others agree. The replicate-wild-ignore-table parameter works as long as:
- All applications use the test database for working tables.
- Tables from the test database are never used in a statement that changes rows for tables in the main database. If they are, then those statements will be ignored by slaves and transactions will be lost.
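Under those constraints, the filter is a one-line change on each slave. A minimal sketch of the my.cnf fragment (the `test.%` pattern is the usual way to match every table in the test database):

```ini
# On each slave: do not replay statements that reference
# any table in the test database.
[mysqld]
replicate-wild-ignore-table = test.%
```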
Row-based replication should fix all of this. Today, replication events are SQL statements, and filters are limited to skipping a statement that references a given table or skipping a statement based on the default database. When a replication event instead describes a change to one row in one table, filters can match the table that the row belongs to, so events that modify an ignored table are simply not replayed.
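In a release that supports row-based replication (a sketch, assuming MySQL 5.1 or later, where binlog_format controls this), the same table filter then operates on row events rather than on statement text:

```ini
# Log changes as row events; the slave-side filter below then
# matches the table each row belongs to, not the statement text.
[mysqld]
binlog_format = ROW
replicate-wild-ignore-table = test.%
```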
Before a transaction is replayed on a slave, it must first be replayed on the previous node in the chain, so each node in a chain adds replication delay. Unless all transactions are very fast or the replication chain is short, this creates a lot of replication delay toward the tail of the chain. Future MySQL releases will improve this, but not fix it completely, as row-based replication removes the overhead of processing queries during replication.
When a slave in the interior of the chain fails, the node replicating from the failed slave must change to use the predecessor of the failed slave. Transactions are likely to be lost when the CHANGE MASTER command is run to do this. The problem is that the replication state of a slave is local to the master used by that slave: a binlog filename and offset. That binlog file does not exist on any other server, and there is no easy way to map the position to a binlog file and offset on another master. It can be done in many cases, but it is difficult to automate, and I suspect that most deployments use the easy solution: begin replicating from the end of the binlog on the new master. The easy solution is likely to lose transactions.
Assume that the server bar has failed and baz is changed to replicate from the end of the active binlog on foo. This loses transactions that replicated from foo to bar but not yet from bar to baz, and it also loses transactions that had yet to reach bar.
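The easy solution looks like the following sketch. The binlog filename and position are made-up values for illustration; the commands are standard MySQL syntax:

```sql
-- On foo, find the end of the active binlog:
SHOW MASTER STATUS;
-- suppose it reports File: foo-bin.000042, Position: 98765

-- On baz, point replication at that position. Any events that
-- reached bar (or never reached bar) but were not yet replayed
-- on baz are silently skipped, i.e. lost:
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'foo',
  MASTER_LOG_FILE = 'foo-bin.000042',
  MASTER_LOG_POS = 98765;
START SLAVE;
```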
I don't like to lose transactions.
This too can be fixed, and MySQL intends to fix it. One solution is to add metadata to replication events that makes them globally unique. With that, the current replication state can be used to reconnect after failover without losing transactions. Assume the following is done:
- Metadata is added to replication events to include a globally unique ID. This can be the pair (server_id, event_id) where event_id is the value of a counter that is unique within one server and incremented for each event written to a binlog. The value is determined by the server that first logs the binlog event and does not change when binlog events are written by a slave because log_slave_updates is set. This behavior is already provided for server_id.
- A command is added to get the globally unique ID for the last replication event run by a node (SHOW SLAVE GLOBAL_OFFSET).
- A method is added to map a globally unique ID to a binlog filename and offset on a server (SHOW MASTER OFFSET FOR ID). This might be slow, as the server must scan existing binlogs to determine the offset.
With that support in place, a slave reconnects after failover as follows:
- Run this on the slave that must reconnect: SHOW SLAVE GLOBAL_OFFSET
- Run this on the server to which the slave reconnects: SHOW MASTER OFFSET FOR ID
- Run this on the slave that must reconnect using the values from the previous step: CHANGE MASTER ...
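Returning to the earlier example, the steps above might look like this. The commands are the hypothetical ones proposed above, so their exact syntax and output are illustrative, as are the ID and offset values:

```sql
-- bar has failed; baz must reconnect to foo without losing transactions.

-- On baz: get the globally unique ID of the last event applied.
SHOW SLAVE GLOBAL_OFFSET;
-- suppose it returns (server_id = 1, event_id = 5000)

-- On foo: map that ID to a local binlog filename and offset.
SHOW MASTER OFFSET FOR ID (1, 5000);
-- suppose it returns File: foo-bin.000042, Position: 98765

-- On baz: resume replication from exactly that point.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'foo',
  MASTER_LOG_FILE = 'foo-bin.000042',
  MASTER_LOG_POS = 98765;
START SLAVE;
```

Because the (server_id, event_id) pair is preserved as events flow down the chain, the mapping on foo identifies the same transaction that baz last applied, so no events are skipped or replayed twice.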