Wednesday, March 24, 2010

Durable, not durable and really not durable

A lot of interesting work is being done today on SQL and NoSQL servers. This generates a lot of interesting discussions about CAP, ACID and BASE. Be careful what you read.

The D in ACID stands for durability. A DBMS (SQL or NoSQL) is either durable or it is not. But there are several ways to be not durable. Two popular ways are lose-last-N-transactions (not durable) and lose-unspecified-amount-of-data (really not durable). One of these is much better than the other. Alas this distinction is frequently ignored when describing really-not-durable servers.
  • lose-last-N-transactions - not durable servers provide a configuration that enables better performance by allowing the last N transactions to be lost during a crash. InnoDB does this when innodb_flush_log_at_trx_commit=2. I am not a MySQL Cluster expert but I think that global checkpoints provide the same property. Cassandra can run in this mode. I think that HBase must run in this mode. Some filesystems provide a similar option. The key point is that the system will quickly recover to a consistent point in time after a crash and the time to which it recovers will be not too far in the past.
  • lose-unspecified-amount-of-data - really not durable servers can be great for batch and read-only workloads. I am skeptical about using them for OLTP. A server specific version of fsck must be run after a software or hardware crash. The really not durable servers that I am familiar with do not document this behavior prominently (the amount of data that might be lost after crash recovery and the amount of time required to run the recovery tool on a large database file). I suspect some of their users are not aware of the problems that await them.

15 comments:

  1. Presumably you consider some engines (such as MyISAM) to fall into the latter "really not durable" category.

    Although as MyISAM also doesn't support atomicity, durability is difficult to define, as it may end up preserving *part of* an operation that you thought should be atomic.

    ReplyDelete
  2. I think I can define 'not durable' and 'really not durable'. I have attempted above. A big difference between them is the scope of non-durability when new changes are done.

    For 'not durable' systems, only recent changes are not durable.

    For 'really not durable' systems, any change done (now or 2 years ago) might be lost when recent changes are pending.

    For specific 'really not durable' servers I think it is possible to reason about the scope of damage. But they tend not to document such things.

    ReplyDelete
  3. You are kind of correct about MySQL Cluster and GCP. It will loose the last X number of transaction _if_ all nodes fail. In a well-configured cluster this is does not happen.

    ReplyDelete
  4. I missed that, but the requirement is both on the cluster configuration and on the datacenter hardware. Regardless, cluster is one of those things I wish I new more about.

    ReplyDelete
  5. Yes, true, data center hardware too. We can recommend ways to limit this (2x NICs in each box, etc...), but as you well know you cannot prevent 100%

    Come to our UC tutorial, we will teach you Cluster! ;)

    ReplyDelete
  6. I don't arrive until Monday evening. I will miss all of the great tutorials. The first thing I want to know is how many disk reads per second MySQL Cluster can drive on a single node. On my HW, the peak for MyISAM 80k, for PBXT is 40k and for InnoDB is 20k.

    ReplyDelete
  7. I think some of my colleagues are still giving some good talks around cluster during the week and we will all be around for a chat.

    That is an interesting question, especially since you can have multiple IO threads for disk data in 7.0 and minor disk optimizations in recently 7.0 releases. When combined with ndbmtd I would hope "a lot" but I don't know how much benchmarking has been done.

    ReplyDelete
  8. Jonas has just told me that he has run some ndb benchmarks after reading this and he will post them on his blog at some point soon :)

    ReplyDelete
  9. AFAIK NDB started out as an in-memory database to meet the performance requirements for telco customers. It has added support for some disk based data since then. How will it adapt with the arrival of fast persistent storage (ssd, flash)? Is that fast enough for them that they can put most of their data there and will NDB support that? My question about the max IOPs per node from NDB is relevant in that case.

    ReplyDelete
  10. Yes, NDB was born in a telco and later acquired by MySQL. Disk data was added around the time of MySQL 5.1 but not everything can go on disk (yet, work in progress) such as indexed columns.

    Jonas has just put up his blog post at: http://jonasoreland.blogspot.com/2010/03/fast-io-and-ndb.html

    I hope this answers some of your questions.

    ReplyDelete
  11. > I think that HBase must run in this mode.

    HBase allows fully durable crash consistency or 'not durable' (lose last N) semantics.

    Of course - 'fully durable' is defined as committing to memory of K(typically =3) machines. That is 'really really durable' for most purposes - but if u lose power instantaneously to entire data center - then clearly it's not durable (assuming all K memory copies are in same data center). In this power loss case - the value of 'N' changes to something much larger than a regular crash (and is indirectly settable by choosing the block size of the commit log).

    > Cassandra can run in this mode.

    I am hoping the nosql types dont read ur blog. But durability is not a substitute for consistency. Yeah my data is durable - but when i read it - i can get really really stale data (unless i make my reads really really expensive and less available). But hey - take solace - it's there somewhere! And it won't happen most of the time.

    yeah - whatever.

    ReplyDelete
  12. He is back.

    I <3 Cassandra and HBase. This post was inspired by a NoSQL server that looks an awful lot like sharded MyISAM to me.

    ReplyDelete
  13. @Mark - when will you review Cassandra/Dynamo? :-)

    ReplyDelete
  14. I need to read the Dynamo paper again and then try to understand your critique. I also want to know what their peak single-node performance is for CPU-bound and IO-bound loads. So, perhaps never given my current schedule.

    ReplyDelete
  15. One of the things that scares me about NoSQL users is that they think they have something fairly durable because they write to multiple nodes, but no one ever looks to see that all the servers live in the same rack (lose power to the rack, potentially lose everything). Or they all live in the cloud. If you can abstract the problem into some layer of the stack you can't see, it no longer exists!

    Oh, obligatory Postgres note: it essentially can operate in all three modes. Default is durable. You can switch to synchrnous_commit_mode = off to get not durable (similar to the innodb setting you showed) which can be done per statement if desired. You can also run the server with fsync = off. This mode still pretends to be durable; if you crash you go through recovery, but there is no guarantee as to what data might have been lost. Not recommended for important data, but during server setup or on a replicating slave... ok.

    I think most rdbms have settings to control this behavior (Oracle definitely has some), it just isn't talked about very much, because rdbms people assume it's better to preserve data than lose it.

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.