I think that MySQL+memcached is still the default choice and I don't think it is going away in the high-scale market. But some high-scale applications either don't need all of the features of a SQL RDBMS or are willing to go without those features to scale. This isn't a blanket endorsement of NoSQL as the definition of NoSQL is weak. I am referring to the NoSQL systems that support high-scale.
I don't believe all of the bad press that MySQL receives from high-scale applications. I know that some problems with MySQL are self-inflicted (seriously, I know this). It is hard to diagnose many problems for which the primary symptom is a slow MySQL server so it is also hard to identify self-inflicted problems. I also don't think that some NoSQL systems will provide a different scale-out experience than MySQL given that some NoSQL systems scale-out by sharding (just like MySQL) and that I can deploy MySQL like NoSQL (disallow joins and secondary indexes, use HANDLER statements)
I also wonder whether affordable SSD/Flash reduces the need to migrate from MySQL to NoSQL. Many MySQL deployments that were IO bound when it was difficult to get more than a few thousand IOPs on a commodity server can now get 10,000 to 100,000 IOPs in that server at commodity prices.
MySQL and NoSQL are also at significantly different stages. MySQL is mature and maturity has its benefits. MySQL has amazing support and documentation. There are client libraries for almost every language that you should use. There are even bindings for languages you shouldn't use. The MySQL C API is easy to use. The JDBC driver is awesome, even if support for JDBC makes it much more complex than needed. There is a lot of MySQL expertise that can be hired or rented (MySQL, Monty Program, Percona, Open Query, Pythian, FromDual) and there is some innovation (not enough companies, but they are doing amazing things) from third-parties such as InfiniDB, InfoBright and TokuDB.
What happened?
NoSQL systems are improving faster than MySQL, MySQL has focused on features for the enterprise RDBMS market in the past two releases and the changes we need from MySQL are hard to implement. Change is hard because MySQL is a complex server that supports many features. Change is also much harder than it should be because of the MySQL coding style. Parts of it are not modular and features are entangled. Some of the difficulty could be overcome were there interest from external contributors. There are external contributors willing and able to improve server code but they are working on other projects like NoSQL. The MySQL effort is also split (or diluted) between official MySQL, Drizzle and MariaDB.
What really happened?
I don't know. It may have been better for the business of MySQL to focus on the enterprise market. I can describe some of the problems that need to be fixed in MySQL to make things easier for me. I think other high-scale applications share these problems:
- Multi-master - high-scale applications have users around the world. Latency is reduced by distributing databases and application servers around the world. Databases are rarely sharded by location so the data store must support multi-master deployments with conflict resolution and eventual consistency for some database tables. There is no support for conflict resolution in MySQL. It might be possible to do something with the output of row-based replication.
- SQL - this is a problem that MySQL cannot fix. SQL makes it easy to make mistakes. Mistakes include insert, update and delete statements that lock all rows in a table. Alas, the EXPLAIN statement in MySQL does not support insert, update and delete statements. Another serious mistake is a query that has a lousy response time when the database buffer cache is cold because it does many random disk reads. The EXPLAIN statement in MySQL does not provide an estimate for the worst-case number of random disk IOs and many people who write SQL don't know how to interpret it to get an estimate. Worst-case performance is critical for queries run during web requests.
- Write-optimization - several NoSQL systems are write-optimized including Cassandra, HBase and Bigtable. A write-optimized system makes it possible to use more indexes than an update-in-place data store. With more indexes it is more likely that there can be an index defined for every popular query and the index reduces the number of disk reads that must be done to evaluate the query. This improves worst-case query response time and reduces the need to use memcached or a huge database buffer cache. Write-optimization has finally arrived for MySQL with the availability of TokuDB. I hope that RethinkDB provides a GA version in the future.
- Monitoring - without good monitoring you will either spend too much time fixing performance problems or never find them and buy too much hardware. I suspect that monitoring in MySQL is much better than anything in a NoSQL system but MySQL is missing features that make it easy to understand current and new sources of workload. I need to aggregate the overhead (CPU time, disk operations, rows read, ...) by database user, table and statement. It is extremely hard or not possible to do this by database user and table. It became possible in MySQL 5.1 to do this by statement for short periods of time by using the slow query log in MySQL 5.1. Prior to MySQL 5.1, the slow query log was limited to queries that ran for at least two seconds. The alternative is to use tcpdump with mk-query-digest. Despite all of the work that has gone into the performance schema, MySQL has yet to support anything like my favorite feature -- user and table monitoring.
- Crash proof slaves - replication slaves are not crash proof. The slave commits transactions to a storage engine and then updates a state file to maintain the replication offset. There is nothing to keep the state file and storage engine in sync. Until recently there wasn't even an option to force the state file to disk after it was updated. Unplanned hardware reboots are frequent when there are hundreds or thousands of slaves. If you are clever and use the right version of InnoDB it is possible in some cases to figure out the correct offset for the slave after a crash and repair it manually. This isn't a good use of DBA time. Otherwise DBAs must waste their time and network bandwidth to restore the slaves. The Google patch published two different fixes for this: rpl_transaction_enabled and global transaction IDs. MySQL is working on a fix.
- Automated failover - for MySQL deployments that have many slaves connected to one master it isn't possible to automate failover when a master crashes. Tungsten and DRBD might make this better.
- Resharding - sharding is an excellent way to scale MySQL. Sharding usually requires resharding. Resharding is hard and must be done with minimal downtime. It might be possible to build a tool that uses row-based replication output to reshard a database in the background with little downtime. No such tool exists today.
- Replication lag - a slave with replication lag is useless for OLTP scaleout. The replication thread is single-threaded. MySQL is working on support for parallel execution on a slave. Until then we need to improve mk-slave-prefetch (Domas can you hear me).
- Schema change - these are frequently needed for growing high-scale applications. Long running schema changes in MySQL require downtime unless first done on a slave (assuming you have a spare slave and that slave can become the master after the change). Users don't like downtime. I think it is possible to do many of these on a master with minimal downtime using the output from row-based replication. Alas, there is no tool for that today.
Do your homework when evaluating a NoSQL system as they differ greatly from each other:
- Crash safety - most NoSQL systems are crash safe but a few are not. I would limit the use of systems that are not crash safe to supporting batch workload. Unplanned server reboots are frequent for high-scale applications when a large number of servers is used. At least two prominent members of the NoSQL family are not crash safe. That should be documented in bold text on their project pages. It is not.
- Sharding - some NoSQL systems do sharding. BigTable and others do not. With sharding it is possible to support transactions and multiple-indexes on a table within the scope of a shard. That then requires support for resharding. It also requires that queries on secondary indexes to be run on all shards while queries on primary indexes can be limited to run on one shard which may limit the ability to use secondary indexes.
- Index types - Many NoSQL systems are limited to hash indexes. You can't do range scans on hash indexes. I wonder whether this leads to data redundancy when every query must be resolved by one index lookup.
- Secondary indexes - NoSQL systems like BigTable not only do not support transactions, they also do not support secondary indexes. You can explicitly maintain a secondary index but there is no support to make multiple-changes atomic and there can be a failure between the primary and secondary index updates which results in data drift. It is also difficult to do consistent reads between the two.
- Consistent reads - consistency is usually the responsibility of the client and done by using per-row timestamps.
- Single-node performance - I know that performance != scale-out but scale-out is not a substitute for lousy single-node performance in the high-scale application market. It might be acceptable to use 5X as many nodes because your data store is slow when you end up using 40 nodes. This becomes a show-stopper when you end up using thousands of nodes. One NoSQL system has accepted this compromise. While I know it has many other use cases I think that will limit the use of it for high-scale applications.
- Network efficiency - MySQL reduces use of the network because all query evaluation is done at the server. All NoSQL systems evaluate predicates implied by the index access. Only some NoSQL systems evaluate non-indexed predicates. This can result in more data returned to the client.
- Technology or solution - MySQL is more mature than the NoSQL systems. A lot of work remains to grow NoSQL from technology into solution with support for audit, backup, monitoring and all of the other things required to scale in a large company.


Hi Mark. Thanks for the mention for TokuDB. (As you mentioned, TokuDB v3.0 with full ACID support went GA last week.)
ReplyDeleteHow do the NoSQL systems maintain lots of indexes? Are they throughput-optimized with long latency?
Some of your points aren't really about SQL, but are really about the implementation. Let me add to your list: the query system in MySQL is very limited. Why don't we have anything besides nested join queries yet? (Merge sort, hash join, etc.)
For OLTP, it is OK to restrict MySQL to nested-loops join. That also means the optimizer is less likely to choose the wrong join method for a query. I would try to use something other than MySQL for complex query processing.
ReplyDeleteThe write-optimized NoSQL systems do not provide secondary indexes. A user can explicitly maintain one by updating a separate table (the secondary index table) when the primary index is updated. That has lots of failure scenarios.
The write-optimized stores are log structured. Changes are written with sequential IO to new files and eventually something in the background merges the files.
Hi Mark,
ReplyDeleteWhat do you think about NDBAPI? It's clearly not a SQL, but data in data nodes can be accessed using SQL if you like. It can re-shard online, handle transactions well, have secondary indexes and so on. IMHO, it appears NDBAPI could be the most advanced NoSQL solution at this moment.
I have yet to look at NDBAPI, but we are very interested in using a non-SQL interface for simple queries that fetch data by primary key. I am sure that NDBAPI is more advanced, but for now we will start with HANDLER. We added a new command HANDLER_READ_OPEN_CLOSE in the Facebook patch for MySQL 5.1 and will continue to focus on improving HANDLER -- http://bazaar.launchpad.net/~mysqlatfacebook/mysqlatfacebook/5.1/revision/3366
ReplyDeleteMySQL Monitoring - what about so called Merlin (or MySQL Enterprise Monitor)?
ReplyDeleteI have not used it, but that and many other utilities are needed to run data servers in production. I suspect that MySQL has a big advantage in that area.
ReplyDeleteHey ! Thanks for a nice & thoughtful article.
ReplyDeleteJust a minor annoyance (for me): when you write "At least two prominent members of the NoSQL family are not crash safe", I think you should tell which projects those are.
Maybe you don't want to "bad mouth" them, which I understand and appreciate, but then again, if it's true, it's not bad mouthing. And if it wasn't not true, you wouldn't write it...
Anyway, just to let you know that as a reader, I ended up with a piece of uncertainty and doubt in my head, and that being explicit here would be good.
You are right. I should not spread FUD. MongoDB and Tokyo Cabinet are not crash safe.
ReplyDeleteFor details on MongoDB see http://www.mongodb.org/display/DOCS/Durability+and+Repair and http://blog.mongodb.org/post/381927266/what-about-durability. Last time I checked, MongoDB used mmap to read/write database files so a crash can occur in the middle of changing a page.
For more on Tokyo Cabinet see http://1978th.net/tokyocabinet/spex-en.html and in particular pay close attention to "To avoid data missing or corruption, it is important to close every database file when it is no longer in use". There are more details in http://torum.net/2010/01/how-to-recover-a-tokyo-cabinet-database-file.
I don't think it is a good idea to run online services using data servers with this behavior. I don't think that replication is a substitute for this. I feel the same way about MyISAM and MongoDB is similar to MyISAM -- must run fsck after a crash, writes cannot be concurrent with reads, writes cannot be concurrent with other writes.
Here is a classic quote from http://blog.boxedice.com/2009/07/25/choosing-a-non-relational-database-why-we-migrated-from-mysql-to-mongodb
"Corruption will only really occur if you kill the database process whilst it is in the middle of an operation."
Much thanks for the added info, and for this quote which totally made my day !
ReplyDeleteRegarding re-sharding on the fly;
ReplyDeletePaul Tuckfield at Youtube described a nice system where they wrote fairly simple scripts to parse the replication binlogs for slaves and split them between shards, allowing resharding while replicating.
I'm sure it's not that simple in practice but it sounds like a good one...
We need better tools to support that. Lots of clever things can be done with the output from row-based replication to support non-blocking schema changes and data migration.
ReplyDeleteSurprisingly simple solution for rebalancing shards: mk-archiver. Not the fastest, but it works.
ReplyDelete