I have been fortunate to work with very productive people. Two of them, Wei and Justin, are experts in MySQL replication internals. Wei implemented mirror binlogs, transactional replication and semi-sync replication. Our current expert, Justin, has made replication much more robust and implemented binlog event checksums and global transaction IDs. Global transaction IDs make hierarchical replication much easier to manage.
Justin has a talk at the MySQL Conference. He also has a new idea that may allow us to remove the InnoDB-specific code we added for transactional replication while preserving the functionality. As the InnoDB-specific code has been very difficult to get right, I hope we can do it. The idea is to run slaves with --log-slave-updates set, enable another feature that logs only updates from the replication stream and not those from users connected to the slave, and use existing code in MySQL (internal XA) to keep the binlog and storage engines in sync on the slave. The slave SQL thread state that is currently stored in a separate file (relay-log.info) can then be generated on demand from the last complete transaction stored in the binlog on the slave. In practice the file will be retained but regenerated when the slave is not shut down cleanly.
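To make the recovery step concrete, here is a rough sketch in Python. It is not the server code: `BinlogEvent`, its fields, and `recover_slave_position` are hypothetical names, and the sketch assumes that every event the slave writes to its own binlog records the master binlog file and offset it came from. Under those assumptions, regenerating the SQL thread state amounts to scanning the slave binlog and keeping the position of the last transaction that has a commit event.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BinlogEvent:
    """Hypothetical, simplified view of one event in the slave's binlog."""
    kind: str          # "BEGIN", "ROW", or "COMMIT" (illustrative kinds only)
    master_file: str   # master binlog file the event was read from
    master_pos: int    # offset in that file just after the event


def recover_slave_position(events: List[BinlogEvent]) -> Optional[Tuple[str, int]]:
    """Return the master (file, offset) of the last complete transaction.

    A transaction counts as complete only if its COMMIT event is present.
    A trailing transaction cut short by a crash is ignored.
    """
    last_commit = None
    in_transaction = False
    for event in events:
        if event.kind == "BEGIN":
            in_transaction = True
        elif event.kind == "COMMIT" and in_transaction:
            last_commit = (event.master_file, event.master_pos)
            in_transaction = False
    return last_commit


# Example: the second transaction was being written when the slave crashed.
example_events = [
    BinlogEvent("BEGIN", "master-bin.000007", 120),
    BinlogEvent("ROW", "master-bin.000007", 180),
    BinlogEvent("COMMIT", "master-bin.000007", 210),
    BinlogEvent("BEGIN", "master-bin.000007", 250),
    BinlogEvent("ROW", "master-bin.000007", 310),
]
recovered = recover_slave_position(example_events)
```

In this example the slave would resume from offset 210 in master-bin.000007 and apply the second transaction again from the start, which is safe because internal XA would have rolled back its partial changes in the storage engine.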
This requires extra work on the slave as it must write a binlog for changes made from the replication stream. But we already require that because of other features we recently added. And even if we didn't, the CPU and IO overhead of writing a binlog is low for our servers.
It is critical for us that slave replication state be recoverable. This helps us in production and development. We add many features to replication and must be sure that they work. One of our tests uses a continuous stream of insert statements and two servers set up as a master and slave. The slave is killed and restarted at random points in time during the test. At the end of the test we confirm that the contents of the tables on the slave and master match. We have found and fixed many replication bugs because of this test, and that has made our servers much easier to manage.
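The shape of that test can be illustrated with a toy simulation in Python. This is only a sketch under simplifying assumptions, not our test harness: tables are modeled as lists of row ids, `run_crash_test` and its parameters are made-up names, and a "kill" is modeled as losing whatever had not yet been made durable. The `atomic` flag switches between a slave that commits the row and the replication position together and one that saves the position separately afterwards.

```python
import random


def run_crash_test(num_inserts, num_kills, atomic, seed=0):
    """Replay inserts on a simulated slave that is killed at random points.

    Returns True if the slave table matches the master table at the end.
    """
    rng = random.Random(seed)
    master_table = list(range(num_inserts))
    # Positions at which the slave is killed once while applying that insert.
    kill_points = set(rng.sample(range(num_inserts), num_kills))

    slave_table = []   # durable table contents on the slave
    saved_pos = 0      # durable replication position on the slave
    while saved_pos < num_inserts:
        row = master_table[saved_pos]
        killed = saved_pos in kill_points
        kill_points.discard(saved_pos)
        if atomic:
            if killed:
                # Killed before the commit: neither the row nor the position
                # is durable, so the restart retries the same insert.
                continue
            slave_table.append(row)
            saved_pos += 1
        else:
            slave_table.append(row)
            if killed:
                # Row committed but position not saved: the restart applies
                # the same insert a second time.
                continue
            saved_pos += 1
    return slave_table == master_table


atomic_ok = run_crash_test(num_inserts=100, num_kills=5, atomic=True)
separate_ok = run_crash_test(num_inserts=100, num_kills=5, atomic=False)
```

With the position committed atomically the tables match after every restart. With the position saved separately, each kill leaves a duplicate row on the slave, which is the kind of mismatch the final comparison catches.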