Sunday, October 28, 2007

Hadoop and MySQL

A few DBMS vendors support parallel query processing on large SMP servers or on clusters of commodity servers. MySQL does not do that yet. But there is a way to use MySQL and parallel data processing -- maintain a copy of your data in Hadoop. This can be done today if you are willing to frequently dump all of your tables from MySQL and reload them into the Hadoop Distributed File System. You might not want to do this because full dumps are slow and the copy of the data in Hadoop will be stale.

To replicate from MySQL to another data store, you need a description of each row changed in MySQL. MySQL 5.1 almost provides that with row-based replication. It is missing a library that can be used by a MySQL client to decode the contents of a binlog event. Hopefully, MySQL will provide such a library.

There are other problems that you might want to solve for this to work and these include:
  • Support for incremental changes in the alternate data store. Files in Hadoop are written, closed and then read. Once closed, they are read-only. To replicate from MySQL into Hadoop, a new file must be created for each batch of replicated transactions.
  • Support for schema changes in the alternate data store. The amount of data that must be changed in the alternate stored depends on several factors including whether rows are self-describing, whether all rows must have the same schema and whether the schema change impacts an indexed column.
  • Support for updates and deletes in the alternate data store. If this is a file, then inserts can be handled by appending the new data to the end of the file. But updates and deletes require more work.
  • Support for indexes. Indexes are not needed for query processing if you expect to scan all rows for each query. Primary key indexes are needed if you hope to process replication changes without doing full scans.

Is anyone else trying to do this? Does anyone else want to do this?

2 comments:

  1. Performing distributed processing no MySQL is tangential to the underlying data representation.

    We have a cluster of robots which index the blogosphere at about 5M blogs per hour. It's north of 200G per month.

    We use about 20 machines to perform this job and shard our MySQL nodes on top of innodb.

    We then have a distributed compute infra on top of this that reads the data out of MySQL and commits it.

    It's basically a sharded database...

    Sound familiar? It's basically similar to what you guys have developed for Adwords.

    Kevin

    ReplyDelete
  2. It would make a lot of sense to have a mysql input and output format for Hadoop, so that you wouldn't need to dump your data out. The performance would be slower than hdfs, but faster dumping from mysql and loading into hdfs.

    If you write one, please contribute it back to Hadoop. Patches are very welcome!

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.