To replicate from MySQL to another data store, you need a description of each row changed in MySQL. MySQL 5.1 almost provides that with row-based replication. It is missing a library that can be used by a MySQL client to decode the contents of a binlog event. Hopefully, MySQL will provide such a library.
There are other problems that you might want to solve for this to work and these include:
- Support for incremental changes in the alternate data store. Files in Hadoop are written, closed and then read. Once closed, they are read-only. To replicate from MySQL into Hadoop, a new file must be created for each batch of replicated transactions.
- Support for schema changes in the alternate data store. The amount of data that must be changed in the alternate stored depends on several factors including whether rows are self-describing, whether all rows must have the same schema and whether the schema change impacts an indexed column.
- Support for updates and deletes in the alternate data store. If this is a file, then inserts can be handled by appending the new data to the end of the file. But updates and deletes require more work.
- Support for indexes. Indexes are not needed for query processing if you expect to scan all rows for each query. Primary key indexes are needed if you hope to process replication changes without doing full scans.
Is anyone else trying to do this? Does anyone else want to do this?


Performing distributed processing no MySQL is tangential to the underlying data representation.
ReplyDeleteWe have a cluster of robots which index the blogosphere at about 5M blogs per hour. It's north of 200G per month.
We use about 20 machines to perform this job and shard our MySQL nodes on top of innodb.
We then have a distributed compute infra on top of this that reads the data out of MySQL and commits it.
It's basically a sharded database...
Sound familiar? It's basically similar to what you guys have developed for Adwords.
Kevin
It would make a lot of sense to have a mysql input and output format for Hadoop, so that you wouldn't need to dump your data out. The performance would be slower than hdfs, but faster dumping from mysql and loading into hdfs.
ReplyDeleteIf you write one, please contribute it back to Hadoop. Patches are very welcome!