Thursday, August 11, 2011

Read Amplification Factor

Flash devices are described in part by their write amplification factor (or WAF). When the OS writes a page once the device might write it more than once and this multiple is the write amplification factor. The WAF isn't always described in marketing and even if it were the value you get in production is workload dependent.

Variations of the log-structured merge tree have been used by many new storage servers including HBaseBigtable,  Cassandra and leveldb. These servers append changes (delete, insert, update) to the end of a file rather than in place. To find one row by key value with an LSM the server might have to read from from multiple files or multiple locations within one file to fine one. I have been calling this the read penalty because a workload is very likely to do more disk reads when using an LSM than when using an update-in-place engine but I think that read amplification factor (or RAF) might be a better phrase. If a workload does 100 disk reads on InnoDB and 120 disk reads on an LSM then the RAF is 1.2. The RAF matters even for many write-optimized servers because an update intensive workload requires many random disk reads. Although an LSM can avoid some of the reads when the update is a replace or when the operation is commutative and doesn't require an immediate result. For example an update that increments a row can log +1 when the request doesn't need to return the old value.

Many LSM implementations use a bloom filter to reduce the RAF. The bloom filter prevents some reads from files known not to have data for a given key. A bloom filter only works for point lookups. It cannot be used for a range scan and the RAF for a workload will be at its worst when you map a relational schema directly to HBase (1 row in InnoDB --> 1 row in HBase). Fortunately many of the LSM implementations support schemas in which more data is consolidated into one row and in many cases something that requires a range scan in a SQL RDBMS will use a point lookup in HBase.

There are new products (TokuDBAcunu, maybe RethinkDB) that claim to be better than an LSM in part because their RAF is much closer to one for both point lookups and range scans. By closer to one I mean that there is (almost) no read penalty. This should be easy to verify with a production workload.

While there are very interesting performance models described in the literature I use a very simple one when considering the read amplification factor. In my model all levels of a tree-structured index are in RAM except for the lowest level. In this model a point lookup with an update-in-place DBMS does at most one disk read from an index leaf page excluding access to external/overflow pages for LOB columns and other special cases. For something that claims to be better than an update-in-place DBMS I want to know how many index leaf pages are read in the worst and average cases.

Wednesday, August 10, 2011

This Percona patch saved me

If you use InnoDB and have long-running select statements then the InnoDB undo space can grow large because purge can't advance beyond the longest open transaction. If this is blocked for too-long then pages to be purged might leave the InnoDB buffer cache and the purge thread will have to do many disk reads. As the InnoDB purge process is single-threaded it might not be able to catch up and a server can use too much disk space for a very long time.

I experienced this problem at work.  I modified innochecksum to report the number of pages by type and it was easy to see that undo accounted for most of the ibdata1 file (a few hundred GB). As InnoDB doesn't shrink ibd files there is no way to get this space back on a running server but I would like to prevent the problem in the future.

The problem was solved by using the multi-threaded purge patch from XtraDB for MySQL 5.1. The patch isn't too much code so I felt comfortable testing it and was able to review it. Since deploying it I no longer have to worry about servers using too much space for undo because of purge lag.

Tuesday, August 2, 2011

Technical debt

Last time I checked there was one test in the MySQL test suite (mtr) that covers one case for crash recovery. Perhaps there is a private test suite. Given that I modify InnoDB and replication code and that I frequently debug crashes at work I wish there were more tests. I added many tests in the Facebook patch for crash recovery to confirm that recovery works for the replication slave, replication master and InnoDB. While doing so I found at least one bug in rpl_transaction_enabled. While working on global transaction IDs Justin found a few bugs in official MySQL that prevented recovery after the slave crashed. These were fixed  in official MySQL 5.1 so there is a lot of value in having tests like this. But there is a lot more that should be tested including:

  • crash recovery during DDL. There are windows where recovery is not possible given that many DDL commands are not atomic between InnoDB and the FRM file.
  • crash recovery during DDL on a partitioned table. I know from debugging crashes that a file named ddl.log is written. 
  • crash recovery on a replication master. The master uses XA to keep the binlog and InnoDB in sync. But there are not tests to confirm that the right thing is done after a crash at each step of the two-phase commit process. The Facebook patch has such tests so I trust that the code in official MySQL but these tests should be in mtr.
  • crash recovery on a replication slave. Unfortunately, there are known race conditions in replication slave state updates that might be fixed in 5.6.3 so all of the possible failures cannot be tested. But updates to the replication state index files can be crash proof and I think they are now after the Google team reported the problems while testing global transaction IDs.
  • crash recovery for InnoDB. I don't think I have to write anything about this.

MySQL has extremely useful DEBUG macros for making the server crash at specific points in the code. This makes it very easy to add deterministic crash tests. The test suite was updated within the last year to allow the server to be crashed without failing a test. Unfortunately these features are rarely used in the official test suite.

I hope this changes. Tests with deterministic failures are just the start. The next step is a test with randomized failures. Unfortunately it won't be as easy to add this to mtr as such a test is much easier to write in Perl or Python. We have a test suite at work where a random workload is run against the master and the master is killed at random times. A slave replicates from that master and after each kill tables on the master and slave are compared.

And now some of the recovery bugs that I have encountered including problems that have been fixed:

And open bugs and feature requests
 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.