Thursday, January 26, 2012

Error injection tests for InnoDB would be nice

I am trying to figure out why an InnoDB table was lost when a DDL statement failed. I think it was a RENAME TABLE statement. I have yet to find the root cause but I did find that InnoDB doesn't report some errors when RENAME fails so the user thinks that the table was renamed, the FRM file is renamed, and the ibd file is not renamed. This is only a problem for files not in the InnoDB system tablespace so --innodb_file_per_table=1 must be used. This is bug 64144.

As I wrote in a previous blog post, it is time to add error injection tests to InnoDB.

Thursday, January 12, 2012

Who wants to write a storage engine?

LevelDB is here. It might make an interesting storage engine for MySQL as it has many performance benefits at the cost of a few limitations: no multi-statement transactions and MyISAM-like concurrency. I doubt we will ever get a production quality LevelDB storage engine for MySQL because the storage engine API is hard, so hard that such projects require funding. This is unfortunate.

LevelDB might be a great fit for MongoDB. MongoDB doesn't need multi-statement transactions. Both are limited by 1-writer or N-reader concurrency, but writes to database files are much faster with LevelDB because it doesn't do update in place. So LevelDB doesn't lose as much performance for IO-bound workloads by doing 1-writer or N-readers and my guess is that this could make MongoDB much better at supporting IO-bound workloads.

Friday, December 2, 2011

Marketing a bug in 3 easy steps


  1. File a request for crash recovery tests and wait a few months
  2. File a request for error injection tests during InnoDB DDL and wait a few months
  3. Lose a table during alter table because untested error handling is incorrect and blog about it

Friday, November 18, 2011

Great work, bug #12704861 was fixed!

MySQL Community Server 5.1.60 has been released and I am very happy because the release notes state that bug #12704861 has been fixed. I know this bug quite well. As my readers are very busy let me provide all of the details that have been made available to the community:
InnoDB Storage Engine: Data from BLOB columns could be lost if the server crashed at a precise moment when other columns were being updated in an InnoDB table. (Bug #12704861)

Wednesday, October 19, 2011

Unexplained jumps in Seconds_Behind_Master

I am trying to understand why a server would go from 0 to 45 and then back to 0 seconds of replication lag as reported by the Seconds_Behind_Master column in SHOW SLAVE STATUS output. This occurs over a few seconds so there isn't a statement that runs for 45 seconds on the slave. I then compared consecutive SET TIMESTAMP values in the binlog and the absolute value of the differences is at most 2 seconds.

Has anyone else been confused by this? I filed bug 62839 and think there is a race condition in the code that computes the value for Seconds_Behind_Master. If I am correct about this then we need to fix the problem. Replication lag is a big problem for many of us and reporting a value that is much larger than the actual value is bad PR.

Thursday, October 13, 2011

Multi-master, NoSQL and MySQL

The MySQL family has been innovating rapidly. New features need names and sometimes those names are confusing. Describing something as multi-master or a NoSQL solution has confused me.

Multi-master requires one of conflict prevention, conflict resolution or faith. MySQL Cluster provides both conflict prevention and resolution as described in these great posts. Regular MySQL has minimal support for conflict prevention (auto-increment-offset can prevent insert conflicts) and thus requires faith that the application does the right thing. Regular MySQL gets conflict prevention via synchronous replication when used with Galera.

There has been talk of adding support to replicate from multiple masters into one slave. We have yet to agree on a name for this. It has been called fan-in, multi-source and multi-master. I hope multi-master isn't reused for this.


You can now use HandlerSocket and the memcache API to access MySQL. But taking away SQL doesn't make this a NoSQL solution. And in the case of MySQL Cluster it already has many of the properties of a NoSQL solution, getting the memcache API is just gravy. The new APIs will allow some workloads to get more read throughput from MySQL. They won't do much for write throughput because those bottlenecks are in InnoDB and independent of SQL/NoSQL. Note that the memcache plugin for InnoDB has tuning options that are good for benchmarks but might not be good for production. There are options that reduce the frequency at which transactions are committed and started. These options allow the transaction start and commit bottlenecks to be avoided at the cost of stale reads and async commits -- see daemon_memcached_r_batch_size and daemon_memcached_w_batch_size.

I don't think the lack of SQL access is what makes many of the NoSQL products compelling. The big thing for me is that they can reduce TCO because you will spend less time managing a DBMS that has these features:
  • no sharding required or sharding and resharding are easy
  • no failover via multi-master or failover is automated
  • much less downtime on schema change or no schema changes required
Eventual consistency is supported by some of the NoSQL products. Given their simpler data models I expect it to be supported by more. This feature is critical for a service with users distributed around the world as response time can be reduced when users access a local database.

Regardless of the API (SQL or NoSQL), regular MySQL doesn't have any of the features listed above. MySQL Cluster already has all of the features listed above. But there is still hope. Tungsten has automated failover, Galera continues to get better, official MySQL can get automated failover once global transaction IDs are supported. Semi-sync replication is supported and I think a few things can be done to make that much more useful for HA systems. There are tools to do online schema changes but I think that is better done in the MySQL server.

Thursday, August 11, 2011

Read Amplification Factor

Flash devices are described in part by their write amplification factor (or WAF). When the OS writes a page once the device might write it more than once and this multiple is the write amplification factor. The WAF isn't always described in marketing and even if it were the value you get in production is workload dependent.

Variations of the log-structured merge tree have been used by many new storage servers including HBaseBigtable,  Cassandra and leveldb. These servers append changes (delete, insert, update) to the end of a file rather than in place. To find one row by key value with an LSM the server might have to read from from multiple files or multiple locations within one file to fine one. I have been calling this the read penalty because a workload is very likely to do more disk reads when using an LSM than when using an update-in-place engine but I think that read amplification factor (or RAF) might be a better phrase. If a workload does 100 disk reads on InnoDB and 120 disk reads on an LSM then the RAF is 1.2. The RAF matters even for many write-optimized servers because an update intensive workload requires many random disk reads. Although an LSM can avoid some of the reads when the update is a replace or when the operation is commutative and doesn't require an immediate result. For example an update that increments a row can log +1 when the request doesn't need to return the old value.

Many LSM implementations use a bloom filter to reduce the RAF. The bloom filter prevents some reads from files known not to have data for a given key. A bloom filter only works for point lookups. It cannot be used for a range scan and the RAF for a workload will be at its worst when you map a relational schema directly to HBase (1 row in InnoDB --> 1 row in HBase). Fortunately many of the LSM implementations support schemas in which more data is consolidated into one row and in many cases something that requires a range scan in a SQL RDBMS will use a point lookup in HBase.

There are new products (TokuDBAcunu, maybe RethinkDB) that claim to be better than an LSM in part because their RAF is much closer to one for both point lookups and range scans. By closer to one I mean that there is (almost) no read penalty. This should be easy to verify with a production workload.

While there are very interesting performance models described in the literature I use a very simple one when considering the read amplification factor. In my model all levels of a tree-structured index are in RAM except for the lowest level. In this model a point lookup with an update-in-place DBMS does at most one disk read from an index leaf page excluding access to external/overflow pages for LOB columns and other special cases. For something that claims to be better than an update-in-place DBMS I want to know how many index leaf pages are read in the worst and average cases.
 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.