To replicate from MySQL to another data store, you need a description of each row changed in MySQL. Row-based replication in MySQL 5.1 almost provides that. What it lacks is a library that a MySQL client can use to decode the contents of a binlog event. Hopefully, MySQL will provide such a library.
Other problems must also be solved for this to work, including:
- Support for incremental changes in the alternate data store. Files in Hadoop are written, closed and then read. Once closed, they are read-only. To replicate from MySQL into Hadoop, a new file must be created for each batch of replicated transactions.
- Support for schema changes in the alternate data store. The amount of data that must be changed in the alternate store depends on several factors, including whether rows are self-describing, whether all rows must have the same schema and whether the schema change impacts an indexed column.
- Support for updates and deletes in the alternate data store. If the store is a file, inserts can be handled by appending the new data to the end of the file, but updates and deletes require more work.
- Support for indexes. Indexes are not needed for query processing if you expect to scan all rows for each query. Primary key indexes are needed if you hope to process replication changes without doing full scans.
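To make the list above concrete, here is a minimal sketch in Python of how these pieces might fit together. Everything in it is an assumption for illustration: `RowChange`, `write_batch` and `replay` are hypothetical names, the JSON-lines file format is made up, and local files stand in for Hadoop. It is not a MySQL or Hadoop API. Each batch of changes goes into a new file that is never reopened for writing, and updates and deletes are applied by replaying the batches through an in-memory primary key index rather than by rewriting old files.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RowChange:
    """Hypothetical description of one changed row (not a MySQL binlog format)."""
    op: str                      # "insert", "update" or "delete"
    table: str
    pk: int                      # primary key value of the changed row
    row: Optional[dict] = None   # full row after the change; None for deletes


def write_batch(directory, batch_no, changes):
    """Write one batch of changes to a brand-new file, one JSON object per line."""
    path = os.path.join(directory, f"batch-{batch_no:08d}.jsonl")
    # Mode "x" fails if the file already exists, so a closed batch is never rewritten.
    with open(path, "x") as f:
        for change in changes:
            f.write(json.dumps(asdict(change)) + "\n")
    return path


def replay(directory):
    """Rebuild the current rows by reading every batch file in batch order."""
    tables = {}  # table name -> {primary key -> row}; this dict is the PK index
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            for line in f:
                change = json.loads(line)
                rows = tables.setdefault(change["table"], {})
                if change["op"] == "delete":
                    rows.pop(change["pk"], None)
                else:
                    # Inserts and updates both store the latest row image.
                    rows[change["pk"]] = change["row"]
    return tables


def demo():
    """Two batches: two inserts, then an update and a delete."""
    with tempfile.TemporaryDirectory() as d:
        write_batch(d, 1, [
            RowChange("insert", "t", 1, {"id": 1, "v": "a"}),
            RowChange("insert", "t", 2, {"id": 2, "v": "b"}),
        ])
        write_batch(d, 2, [
            RowChange("update", "t", 1, {"id": 1, "v": "c"}),
            RowChange("delete", "t", 2),
        ])
        return replay(d)


if __name__ == "__main__":
    print(demo())
```

The sketch sidesteps schema changes because each row is self-describing, and it keeps the whole index in memory, which would not hold for a large table. A real implementation would need a persistent primary key index or a periodic compaction step that merges old batches.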
Is anyone else trying to do this? Does anyone else want to do this?