1. A few DBMS vendors support parallel query processing on large SMP servers or on clusters of commodity servers. MySQL does not do that yet. But there is a way to use MySQL and parallel data processing -- maintain a copy of your data in Hadoop. This can be done today if you are willing to frequently dump all of your tables from MySQL and reload them into the Hadoop Distributed File System. You might not want to do this because full dumps are slow and the copy of the data in Hadoop will be stale.

    To replicate from MySQL to another data store, you need a description of each row changed in MySQL. MySQL 5.1 almost provides that with row-based replication. It is missing a library that can be used by a MySQL client to decode the contents of a binlog event. Hopefully, MySQL will provide such a library.

    There are other problems that you might want to solve for this to work and these include:
    • Support for incremental changes in the alternate data store. Files in Hadoop are written, closed and then read. Once closed, they are read-only. To replicate from MySQL into Hadoop, a new file must be created for each batch of replicated transactions.
    • Support for schema changes in the alternate data store. The amount of data that must be changed in the alternate stored depends on several factors including whether rows are self-describing, whether all rows must have the same schema and whether the schema change impacts an indexed column.
    • Support for updates and deletes in the alternate data store. If this is a file, then inserts can be handled by appending the new data to the end of the file. But updates and deletes require more work.
    • Support for indexes. Indexes are not needed for query processing if you expect to scan all rows for each query. Primary key indexes are needed if you hope to process replication changes without doing full scans.

    Is anyone else trying to do this? Does anyone else want to do this?
    2

    View comments

  2. When not injecting MySQL with code, I try to make MySQL work better for my employer. I also campaign for features I want in MySQL and against features I don't want in MySQL. Support for parallel operations is an area that I am still uncertain about it.

    There are a few reasons why I don't want it supported in MySQL.
    • It will make the MySQL server much more complex and bugs accompany complexity.
    • MySQL doesn't need it for many workloads. It favors throughput over response time for queries on large data sets. That model works for many customers. The simplicity and scalability of replication further enable throughput.
    • There are alternatives for parallel query processing including Greenplum, if they open source all of their code, and Hadoop, if you don't need SQL and you can figure out how to get your data into it from MySQL.
    But what does it mean to support parallel operations in MySQL? MySQL can support parallel query processing within one mysqld process. I am wary of this because it will require significant changes to the optimizer and query execution code in MySQL, even if this is limited to queries on partitioned tables.

    There are more limited forms of support for parallel operations that require fewer changes to MySQL.
    • InnoDB already supports parallel IO because it issues prefetch requests during table and index scan and the requests are processed by a background thread.
    • Table and index scans on partitioned tables can be done in parallel.
    • Filesort can use async IO, real or simulated, to overlap IO with sorting.
    Finally, there is an alternative to the traditional approach to parallel query processing that is made possible by the MySQL Proxy and the ease with which MySQL supports scale out. Queries could be parallelized in the MySQL Proxy rather than in the MySQL server. Different parts of a query can be run on different servers and combined within the proxy to be returned to the user as if a single query were used. I hope this is done.
    1

    View comments

  3. How do you tune load problems on MySQL? You might get lucky and see that there are a few long running statements when running SHOW PROCESSLIST. You might be working on a small system where you know all of the people and/or SQL statements. Unfortunately, these approaches don't work for me most of the time. My first approach was to run SHOW PROCESSLIST many times and aggregate the results by account and by statement. This is a simple and occasionally effective profiler. Aggregating by statement worked better after normalizing the query text by replacing all literals with a a constant. Eventually we implemented support for account, table and index monitoring. The results are reported with SHOW USER_STATISTICS, SHOW TABLE_STATISTICS and SHOW INDEX_STATISTICS. The features are described here.

    I use the output from these commands in several ways. I generate nightly reports that rank accounts number of seconds running SQL statements (Busy_time in SHOW USER_STATISTICS). I look at that report on a weekly basis to see if there is a change in the most busy accounts. I also look at that report when new applications are launched to see if there are SQL statements that must be adjusted. I should generate reports that contain the change in values from day to day, but I have yet to do that. Many of my reports are generated with a mix of Bash, Awk and SQL statements and my reporting capabilities are limited.

    When there is an immediate problem, I use a tool that displays the change in values from SHOW USER_STATISTICS over intervals that I define so I can see the load per account over the last 10 minutes rather than since the server was started. It would be great to roll this kind of functionality into innotop. But there is not much point in doing that until the new monitoring code is in MySQL.

    Occasionally people ask if patches for this feature are available on a particular release. Unfortunately, they are limited to 5.0.37 and 4.0.26 for now, and all changes are in one patch. If that doesn't work for you then update the feature request for this here. And if you don't want to wait, I am sure you could hire someone from the MySQL community to port it for you.
    1

    View comments

Loading