Multi-master requires one of conflict prevention, conflict resolution or faith. MySQL Cluster provides both conflict prevention and resolution as described in these great posts. Regular MySQL has minimal support for conflict prevention (auto-increment-offset can prevent insert conflicts) and thus requires faith that the application does the right thing. Regular MySQL gets conflict prevention via synchronous replication when used with Galera.
There has been talk of adding support to replicate from multiple masters into one slave. We have yet to agree on a name for this. It has been called fan-in, multi-source and multi-master. I hope multi-master isn't reused for this.
You can now use HandlerSocket and the memcache API to access MySQL. But taking away SQL doesn't make this a NoSQL solution. And in the case of MySQL Cluster it already has many of the properties of a NoSQL solution, getting the memcache API is just gravy. The new APIs will allow some workloads to get more read throughput from MySQL. They won't do much for write throughput because those bottlenecks are in InnoDB and independent of SQL/NoSQL. Note that the memcache plugin for InnoDB has tuning options that are good for benchmarks but might not be good for production. There are options that reduce the frequency at which transactions are committed and started. These options allow the transaction start and commit bottlenecks to be avoided at the cost of stale reads and async commits -- see daemon_memcached_r_batch_size and daemon_memcached_w_batch_size.
I don't think the lack of SQL access is what makes many of the NoSQL products compelling. The big thing for me is that they can reduce TCO because you will spend less time managing a DBMS that has these features:
- no sharding required or sharding and resharding are easy
- no failover via multi-master or failover is automated
- much less downtime on schema change or no schema changes required
Regardless of the API (SQL or NoSQL), regular MySQL doesn't have any of the features listed above. MySQL Cluster already has all of the features listed above. But there is still hope. Tungsten has automated failover, Galera continues to get better, official MySQL can get automated failover once global transaction IDs are supported. Semi-sync replication is supported and I think a few things can be done to make that much more useful for HA systems. There are tools to do online schema changes but I think that is better done in the MySQL server.


Excellent summary of where we are and a topic close to my heart.
ReplyDeleteThe eventual consistency theorem says that your favorite NoSQL solution is consistent to the outside world when W + R > N (writes + reads > number of replicas). I've come to the conclusion this kind of general system is easy to build for a key-value store, but probably impossible for an RDBMS that will update multiple rows in multiple tables in one transaction. On the other hand synchronous replication solutions like Galera and NDB kind of do W=N and R=1, so we can achieve the same there.
The thing that nobody is really looking into yet is sharding. (Well, Drizzle, but it won't be done in the next release after all.) NDB again does all of this already. We have the spider engine, perhaps that and galera could be the NDB for InnoDB? Even with spider, I'm not aware of any solution that will re-shard automatically for you.
Online schema changes are important (NDB does it, of course) but I'm not sure if it is fair to compare that against NoSQL which are schemaless :-) You can use MySQL schemaless too and problem is solved.
We had drinks at Oscon 2008 around these topics. That discussion is still a motivator for me when I've been researching Galera :-)
Hi Mark, just wondering - what do you think of NoSQL as compared to transparent sharding solutions like http://www.scalebase.com (I work there) or http://www.dbshards.com?
ReplyDeleteLiran - I am glad that companies are building products for MySQL. But I don't know much about the sharding products.
ReplyDeleteMark, I like your list of features where NoSQL is better. These are common needs for large and growing data sets that access single "objects" where an object is more than a row but a lot less than a schema with unconstrained joins, full referential integrity and multi-statement transactions.
ReplyDeleteIt seems as if the DBMS world is headed toward a trifecta of common data management solutions: Dynamo/Bigtable for large data sets, relational for business (including SaaS) apps, and specialized data warehouses.
I don't believe the effort to make RDBMS have capabilities like geo-replication or linear scaling of data are going to be very successful because there's too much conflict between the underlying assumptions of the available solutions we can implement today. If you build a BigTable system you are just not going to write reports using anything like standard SQL tools. Until the underlying computer science changes this is the way it is.
Also, a lot of the efforts to put automated shards into MySQL (or other DBMS) miss the real issue for business applications, which is to make shards explicit in the application, e.g., not allowing updates or selects to cross shards as well as doing things like backup/restore/migration without disturbing other shards on the same system. This is a constant need for SaaS applications that I work with. It's a little surprising more people do not focus on it.
Hi Mark,
ReplyDeleteExcellent summary. I have some questions and opinions though. I would like to ask a question about the fan-in model. Apart from conflicts don't you think that the slave would be bombarded with the updates? In a write intensive application the writes of multiple masters will be propagated to the slave. The amount of work the slave will have to do would be enormous, and may cause the slave to lag and not respond to queries. Yes I can think of an application of this concept where the slave is just to hold the data from multiple masters. Your comments please?
Secondly there are many NoSQL solutions. Facebook is using Hbase and keeping in mind the amount of data they have Hbase looks promising. As Hbase is built on HDFS which has a single point of failure, I wonder if someone has addressed this issue.
Regarding sharding, I think there are some companies trying to build products. I don't remember the name but there was one company that was offering such solution. But they wanted an agent running on application servers in order to apply sharding mechanism, which I think is not a good idea, as agent failure may render the application useless. Also different companies may shard their data differently, do you think its possible to come out with a generic solution for sharding?
I agree that whether fan-in is useful depends on the workload. I am sure it will work for some customers and maybe enough that it is worth the cost of building and supporting.
ReplyDeleteIn my opinion, Multi-Master MySQL, NoSQL, and MySQL are actually different systems good for different purposes. A valid comparison though is between Multi-Master replication and MySQL Cluster.
ReplyDeleteIn regard of High Availability, the better choice in my opinion is the Multi-Master rather than MySQL cluster, taking into account the following factors:
(a) System hardware cost - MySQL Cluster is much more costly since it requires high RAM and better connections between nodes.
(b) The complexity of system management - Multi-Master is easier to manage because it is easy for the DBA to set-up a duplicate system for testing purposes to perform failover. Since in a MySQL Cluster setup, this can be much more expensive and system management cannot be thoroughly explored.
(c) System performance – here, I believe that MySQL Cluster is the better choice as it performs better.
So, basically it is important to determine what most important factor is and choose a system type accordingly.