Flash devices are described in part by their write amplification factor (or WAF). When the OS writes a page once the device might write it more than once and this multiple is the write amplification factor. The WAF isn't always described in marketing and even if it were the value you get in production is workload dependent.
Variations of the log-structured merge tree have been used by many new storage servers including HBase, Bigtable, Cassandra and leveldb. These servers append changes (delete, insert, update) to the end of a file rather than in place. To find one row by key value with an LSM the server might have to read from from multiple files or multiple locations within one file to fine one. I have been calling this the read penalty because a workload is very likely to do more disk reads when using an LSM than when using an update-in-place engine but I think that read amplification factor (or RAF) might be a better phrase. If a workload does 100 disk reads on InnoDB and 120 disk reads on an LSM then the RAF is 1.2. The RAF matters even for many write-optimized servers because an update intensive workload requires many random disk reads. Although an LSM can avoid some of the reads when the update is a replace or when the operation is commutative and doesn't require an immediate result. For example an update that increments a row can log +1 when the request doesn't need to return the old value.
Many LSM implementations use a bloom filter to reduce the RAF. The bloom filter prevents some reads from files known not to have data for a given key. A bloom filter only works for point lookups. It cannot be used for a range scan and the RAF for a workload will be at its worst when you map a relational schema directly to HBase (1 row in InnoDB --> 1 row in HBase). Fortunately many of the LSM implementations support schemas in which more data is consolidated into one row and in many cases something that requires a range scan in a SQL RDBMS will use a point lookup in HBase.
There are new products (TokuDB, Acunu, maybe RethinkDB) that claim to be better than an LSM in part because their RAF is much closer to one for both point lookups and range scans. By closer to one I mean that there is (almost) no read penalty. This should be easy to verify with a production workload.
While there are very interesting performance models described in the literature I use a very simple one when considering the read amplification factor. In my model all levels of a tree-structured index are in RAM except for the lowest level. In this model a point lookup with an update-in-place DBMS does at most one disk read from an index leaf page excluding access to external/overflow pages for LOB columns and other special cases. For something that claims to be better than an update-in-place DBMS I want to know how many index leaf pages are read in the worst and average cases.
Thursday, August 11, 2011
Subscribe to:
Post Comments (Atom)


The RAF of a log structured system could be improved by using compaction as an opportunity to improve locality for hot data.
ReplyDeleteThere is some talk about that in Cassandra JIRA https://issues.apache.org/jira/browse/CASSANDRA-1608
I also wonder if the IOs saved on writes offset the IOs lost on reads? Or does the update in place engine win because the memory it uses to coalesce writes also tends to be the hot data that needs to be in cache.
Figuring out how to do compaction should be an interesting DBMS research topic for a while and there is a lot of low-hanging fruit. I think leveldb already has something like that. Lots of new problems to figure out like when to drop tombstones and drop old versions. If this is only done during major compactions then you might need 10X the disk space for a frequently modified dataset. AFAIK, official HBase only drops old versions to enforce max-TTL and max-versions during major compactions.
ReplyDeleteCombined with listening to the Facebook Hbase talk at Oscon data, I a) understand what you talk about and b) learned something new.
ReplyDeleteWill now digest this to figure out how to apply it. (Maybe I need to use HBase or Tokudb now...)
Henrik
For rookies like you and I, Cassandra is easier to setup in production than HBase (one process per node for it versus setting up HDFS, HBase and ZooKeeper for HBase). TokuDB should be familiar for you and I think it is free to try.
ReplyDeleteWith Acunu's stratified b-trees, don't they still have to read multiple files? From glancing at one of their slide decks, it seemed like they're also using bloom filters to eliminate some lookups. On top of this, they're building pointers so they can jump into the correct section of the file and scan from that point onwards (like a skip list).
ReplyDeleteMark - i strongly suspect Tokutek/Rethinkdb etc. claims are not true. The Quora thread on this topic guided me to Fractional Cascading - and FC is a way to reduce CPU cost with fragmented indices (via chained indices). It's not a way to do IO pruning the way Bloom filters do in BigTable. It is likely that Tokutek etc. do a better job of using L1/L2 caches (their strategy for doing incremental checkpoints/merges seems better than HBase) - but that's completely orthogonal to the read path.
ReplyDeleteHaving observed Bradley of Toku in technical discussions I suspect the claim is valid. Eventually I will run more perf tests for it with IO-bound workloads to confirm my beliefs.
ReplyDeleteNote that read performance can be substantially altered by a good row cache (something that HBase does not have at the moment). The trouble with benchmarks is that they will mask the contributions of individual techniques. My main question is whether cascaded indices themselves are a good way to reduce read IOs across fragmented indices (as separate from a good final product that layers on additional techniques like row caches/filters/range-headers etc.)?
ReplyDeleteI am also curious about the RAF for range scans (within a row or across rows) in HBase. Bloom filters are not used for range scans. The row cache can't be used as the block index points to blocks, not rows.
ReplyDeleteI might be able to do more than talk/write about this if only I had some spare time.
But a row cache can be easily added on top. I believe Cassandra has one.
ReplyDeletePart of my skepticism about fractional cascading is that once row-caches are thrown into the mix - the whole trick of maintaining pointers across indices is further called into question (even for the in-memory case).
Hi Joydeep,
ReplyDeleteI'm a new-ish engineer at Tokutek, and I wrote one of the Quora answers (http://b.qr.ae/pfLEW4) you mentioned. But I wrote most of it before I joined the company, and I know a lot more about our technology now. Fractional cascading is important, but doesn't tell the whole story. As I understand this thread, we're concerned with the case where all internal nodes are in-memory, and it's only the I/O for the leaf nodes that matter. In this case, it doesn't cost any I/O to look at internal nodes. This is a more real-world perspective than you'll find in the literature, and one in which Fractal Tree indexes exhibit a very low read amplification factor.
I have been meaning to expand on my Quora answer in a blog post comparing Fractal Tree indexes with some other write-optimized data structures (LSM-trees, stratified B-trees), but I've been too busy. I know there's a lot of interest in seeing this comparison. I'll be sure to talk about read amplification and this sort of real-world analysis, and I hope to get the post out in the next few weeks, so keep an eye on our blog!
Leif - welcome to Tokutek. I look forward to more technical blog posts about it.
ReplyDeleteThanks. As promised, see http://www.tokutek.com/2011/09/write-optimization-myths-comparison-clarifications/ and http://www.tokutek.com/2011/10/write-optimization-myths-comparison-clarifications-part-2/
ReplyDeleteThanks - I responded on http://www.tokutek.com/2011/09/write-optimization-myths-comparison-clarifications. We need more traffic on your blog posts.
ReplyDeleteHi Mark and all,
ReplyDeleteThanks for the pointer. Write amplification is an interesting NAND flash-specific phenomenon, but it's somewhat orthogonal to the actual data structure used. For example, with Stratified Trees, the write amplification can be as large as O(log n) in the worst case, but that's before the device itself does anything interesting. Your notion of "RAF" is an interesting one. If I understand the definition correctly (the number of read IOs performed for a workload compared to a B-tree), then Acunu's Stratified Trees often achieve a RAF < 1. One of the major reasons for this is that, for range-query-heavy workloads, the data structure is extremely efficient - as the HotStorage paper (http://media.acunu.com/library/hotstorage11.pdf) demonstrates, the structure often beats CoW trees by orders of magnitude in both range query and write performance.
Thanks,
Andy.
And now another use of "read amplification" --> http://www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf
ReplyDeleteI think the notion of read amplification there is more natural - it's the ratio of number of reads issued to the device versus the number of exogenous read requests (rather than comparing to a specific data structure, eg a B-tree). Essentially, read and write amplification are just talking about insert/query tradeoffs: the best reference for this is www.cs.au.dk/~gerth/slides/soda03.pdf and the follow up work (Streaming B-trees, Stratified B-trees, X-boxes, dynamic external hashing, and so on).
ReplyDelete