Digg has begun to write about their reasons for migrating from MySQL to Cassandra. They provide an excellent summary and then describe a performance problem fixed by the migration. I think Cassandra and a few other members of the NoSQL family are amazing technology but I don't think a migration was needed to fix this performance problem. A better index on the Diggs table would have done that. Others have said the same thing. Maybe I don't have all of the details. I can only go on what was written in the blog.
You can learn more about the power of indexes at the MySQL conference.
The Diggs table was the source of the problem:
CREATE TABLE Diggs (It supported an important query that was too slow. A simple form of this query is:
id INT(11),
itemid INT(11),
userid INT(11),
digdate DATETIME,
PRIMARY KEY (id),
KEY user (userid),
KEY item (itemid)
) ENGINE=InnoDB;
SELECT digdate, idThis query requires too much random IO because it isn't index only. The query can use either the index on itemid or the index on userid. In both cases it will scan more entries than it needs to from the secondary index and then lookup the remaining columns from the primary index. Each lookup on the primary index can do one disk seek. On my test server the plan for this query is:
FROM Diggs
WHERE userid in (10, 20, 30) AND itemid = 50
ORDER BY digdate DESC, id DESC LIMIT 4;
id select_type table type possible_keys key key_len ref rows ExtraAfter running the query I ran SHOW SESSION STATUS LIKE "Handler_read%" and the result from that is below. The query scanned 5000 entries from the secondary index and would have done more than 5000 disk seeks in the worst case to lookup columns from the primary key index.
1 SIMPLE Diggs ref user,item item 5 const 8960 Using where; Using filesort
Variable_name ValueThe query is much faster for a table with different indexes
Handler_read_first 0
Handler_read_key 3
Handler_read_next 5000
Handler_read_prev 0
Handler_read_rnd 0
Handler_read_rnd_next 0
CREATE TABLE DiggsFast (
id INT(11),
itemid INT(11),
userid INT(11),
digdate DATETIME,
PRIMARY KEY (itemid,userid,digdate,id),
UNIQUE KEY (id)
) ENGINE=InnoDB;
The query has a better plan:
id select_type table type possible_keys key key_len ref rows ExtraIt also is much better in reality. The output from SHOW SESSION STATUS LIKE "Handler_read%" is listed below. With a better index the query scans 150 entries from 5 range scans of the index. It should do about 5 disk seeks in the worst case. It is also index only so it doesn't have to lookup other columns after the index scan. Although that doesn't matter much in this case because the query uses the primary key index which has all columns for an InnoDB table. This query will be much faster than the previous one (5 disk seeks versus 5000).
1 SIMPLE DiggsFast range PRIMARY PRIMARY 8 NULL 149 Using where; Using index; Using filesort
Note that this query uses the first two columns in the primary index for the predicates on itemid and userid. InnoDB stores all columns in the primary key index entries so any query that uses the PK index is index only.
Variable_name ValueUPDATE
Handler_read_first 0
Handler_read_key 5
Handler_read_next 150
Handler_read_prev 0
Handler_read_rnd 0
Handler_read_rnd_next 0
I created another variant of the Diggs table that uses a secondary index for the query. InnoDB includes all columns from a PK index in the secondary index to serve as the pointer to the row. Note there is a difference between being 'in the index' and being indexed.
CREATE TABLE DiggsFast2 (From the query plan:
id INT(11),
itemid INT(11),
userid INT(11),
digdate DATETIME,
KEY itemuserdig (itemid,userid,digdate),
PRIMARY KEY (id)
) ENGINE=InnoDB;
id select_type table type possible_keys key key_len ref rows ExtraAnd SHOW SESSION LIKE "Handler_read%"
1 SIMPLE DiggsFast2 range itemuserdig itemuserdig 10 NULL 150 Using where; Using index; Using filesort
digdate id
Variable_name Value
Handler_read_first 0
Handler_read_key 5
Handler_read_next 150
Handler_read_prev 0
Handler_read_rnd 0
Handler_read_rnd_next 0


Love it. So many of my customers have no clue that MySQL can do composite indexes. Though, I thought it was a limitation that your composite index had to be in same order as the predicates in your where clause. Was that fixed recently?
ReplyDelete@Matthew - I have never heard of that issue. Both TokuDB and InnoDB have optimizations that reduce the IO overhead from maintaining secondary indexes so we should use them when possible. But in this case I was able to use the PK index.
ReplyDelete@Matthew: This was a limitation many, many years ago. I don't recall off the top of my head when it was fixed, but rest assured, there is no need to do that.
ReplyDelete@Matthew - I think what you are talking about is composite indexes having to be in the right order for a sort. So if you whish to select on column a and b and sort your resultset on column c, the index would have to be either (a,b,c) or (b,a,c) to avoid sorts using temp tables
ReplyDeleteHey thanks for nice post, use of composite indexes in mysql is nice idea to fix such issues.
ReplyDelete@Matthew - I think you're referring to the fact that in order for MySQL to take advantage of an indexes, one or more of the index columns must be used from left-to-right. The actual order of the 'where' criteria can be arbitrary.
ReplyDeletee.g. If we have an index (ColumnA, ColumnB), then then queries with the following WHERE clauses could make use of the index:
WHERE ColumnA=X AND ColumnB=Y
WHERE ColumnB=Y AND ColumnA=X
WHERE ColumnA=X
..but the following would NOT:
WHERE ColumnB=Y
..because that doesn't use the leftmost column of the index. MySQL is clever enough (usually) to "rearrange" the WHERE clause to match the index column order.
An analogy might be postcodes (zip codes) and house numbers. They're used rather like an index with the columns (Postcode, HouseNumber). If I'm looking for Bill Smith, No. 73, ABC1 234, then I can quickly locate that adrress: the Postcode ABC1 234 quickly narrows the search to a small geographic region, then 73 locates an individual address in that region. Even if you gave me just the postcode (the left hand column), I can still use it to narrow down my search dramatically, then knock on all the doors until I find Bill Smith (i.e. an index scan on the house numbers within the postcode). But if you just give me the right-hand column, housenumber=73, then that doesn't help. I'd have to visit every No.73 in the country to find Bill Smith, and since I don't have a list of where just the No.73's are, I just have to visit every address in the country, check if it's 73, and ask for Bill Smith! ("Table scan"). (It might be cheaper to scan the index to find the No.73s, rather than scan a whole table, but AFAIK MySQL simply rejects an index if the leftmost column is not used.)
Matthew -- there was never any limitation on that order. What you're thinking of is the prefix-based nature of indexing, such that if you have an index on
ReplyDelete(userid,itemid)
you can use it for queries that require an index on userid + itemid (as in the example) AND for queries that only require an index on userid, but NOT for queries that only need an index on itemid.
An index on (userid,itemid) gives you index usage in both the following queries:
SELECT digdate, id FROM Diggs
WHERE userid in (10, 20, 30) AND itemid = 50
ORDER BY digdate DESC, id DESC LIMIT 4;
SELECT digdate, id FROM Diggs
WHERE userid in (10, 20, 30)
ORDER BY digdate DESC, id DESC LIMIT 4;
But NOT in this query:
SELECT digdate, id FROM Diggs
WHERE itemid = 50
ORDER BY digdate DESC, id DESC LIMIT 4;
If you wanted an index on all 3 types of queries, you'd want 2 indexes:
(userid, itemid)
(itemid)
many people think you need 3:
(userid,itemid)
(itemid)
(userid)
but this is wrong.
Basically MySQL can use the prefix of an index, so an index on (userid, itemid) also gives you (using no additional space/cpu/RAM) an index on (userid).
Mark -- so basically what you're saying is that Digg could have saved a lot of migration time if they just would hire a good MySQL DBA. I agree.
ReplyDeleteSorry, but what if:
ReplyDelete1. there also exists queries on 'id'? They will have no index anymore and becomes slow.
2.they have lots of insert/updates? Larger index will slow this down.
@Donald - to whom are you responding. The DiggsFast table above in the original post has an index on the id column.
ReplyDeleteYou should elaborate on your concern about IO maintenance overhead. Your statement is generic.
this is very true. with clustered indices, covering indices and materialized views (wich can be simulated by application code) you can shift read load to write load which is very powerful.
ReplyDeleteyou can alos turn off acid with mysql and with any other dbms by putting the db on a ram disk and snapshotting it.
Isn't your index useless past userid? Last I checked IN() was a range (considering it gets optimized by things such as the range access method) -- and you can't traverse further through the btree index past a range condition.
ReplyDeleteI do remember some odd things with IN() not being a true range though, so that could contribute to this as well. I think I'd need to crack open the source code to be sure.
@Mark - couple of questions
ReplyDelete- u've already alluded to this - wouldn't it be cheaper to have the covering index be a secondary index. wondering why suggest PK index at all?
- assuming Digg swapped out their current secondary indices with a covering seconary index - how much more expensive would inserts become?
i wonder if there's some published data point about (random) secondary index update performance in innodb versus Cassandra/LSM-trees etc. that would make it easier to compare these systems.
i also wonder why projects (like Digg) that don't like mysql's write performance (presumably based on their writeup) prefer to try 'nosql' (as opposed to trying out tokudb for example)?
@Joydeep - I updated the post with results for the table DiggsFast2 that uses a secondary index. Some problems go away if you can use the PK index. For example, if the SELECT list has a LOB column then that column cannot be in the index. This isn't a problem for queries that use the PK index with InnoDB as all columns are in the PK index.
ReplyDeleteInnoDB has a great optimization to reduce the overhead of secondary index maintenance. See http://mysqlha.blogspot.com/2008/12/innodb-insert-performance.html for more details. From SHOW INNODB STATUS output I have seen this reduce IO by a factor of 5 for secondary index maintenance.
Last time we checked, TokuDB from http://tokutek.com did better than InnoDB. I suspect it still does. Someone with hundreds of servers running MySQL might be reluctant to run software that requires a license, but I imagine that using TokuDB on 10 nodes might be more cost-effective than migrating from MySQL to something else.
I think we still have a lot to learn about the IO performance tradeoffs between update-in-place and systems with write-optimization like LSM or log structure. Many of the discussions are subjective and we need more things quantified.
@Darius - the output from SHOW STATUS suggests the IN-list is handled efficiently. The value of Handler_read_key is 2 more than the size of the IN-list. I tested that for IN-lists of size 3 and 5.
ReplyDeleteThe prefix of the index columns stop being used for index lookups after 1 range condition. In this case there is equality on itemid and userid and no range predicate. The digdate column can still be used from the index to keep the query index only, although that doesn't matter much for an access to the PK index of an InnoDB table as all columns are there.
Were this a secondary index, it would matter. For InnoDB, all PK columns are included in the PK index. I updated the blog post with an example of that.
@Mark, You say "Some problems go away if you can use the PK index." These are the problems for which TokuDB developed clustering keys (http://tokutek.com/2009/05/introducing_multiple_clustering_indexes/). Although clustering keys are not required for this specific problem, because the number of columns is small, clustering keys can work on tables with any schema (including those that have BLOB columns).
ReplyDelete@Mark - you mention that TokuDB was faster at indexing than InnoDB last time you checked. I'm happy to report we're even faster now; a customer recently experienced an 80x indexing speedup over InnoDB: http://tokutek.com/2010/03/80x-insertion-speedup-for-profile-technologys-facebook-application/. Fast indexing is where we shine and as you point out, fast indexing means fast queries.
ReplyDeleteAlso, our pricing scheme (http://tokutek.com/downloads/tokudb-price-list.pdf) is based on the data stored, not the number of servers, so the cost of a 100-node server is the same as a 10-node server, for the same amount of data.
Thanks for the great post!
And the digg discussion continues to go nowhere. The former Digg guru still does not understand that their performance problem was caused by their lousy schema -- http://stu.mp/2010/03/nosql-vs-rdbms-let-the-flames-begin.html
ReplyDeleteI'm always amazed at how some obviously smart folks, very clever and good at things like PHP, Perl, Ruby, have a some sort of black hole in their knowledgebase as far as your every day database tuning problem.
ReplyDeleteThis is a nice post, but this is pretty basic for anyone even moderately familiar with databases. This is not to say all startups, thrown much too fast into having to massively scale their architecture, are in this situation - take Skype for example, which uses PostrgeSQL as a back-end. Note the SQL in the name. And this is also not to say, map/reduce and other technologies have their place.
Long before these technologies were around, apps have been solving these scalability problems, consider how much information Visa processes and stores for example, or PayPal, or eBay.
To me, this is a typical developer flaw, i.e. look what those guys are doing, we need some of that, i.e. SOAP, NOSql, CORBA, J2EE, Spring, EJB or whatever the latest fad is, when there is no real sound technical reason.
I saw other comment over at the stu.mp site that jumped out at me:
"Assuming $500 a disk (15k disks range from $300 to $800 on Newegg) that’s $7,500 just for disks."
Golly. We've been saying "disks are cheap" back when they weren't. But disks are cheaper than massively modifying your architecture, burning up time and money that could be spent elsewhere to better ends, and apparently so much you can't go back (never heard of rollbacks either I guess?), as was the case with Digg.
Learning pains. Every new batch of developers goes through this. The next bunch that come through will make the same mistakes, with the Digg developers shaking their heads perhaps.
I should spend more time learning about physical database design. I suspect the same is true of others. There are fundamental topics that we should not ignore while also keeping up with the new topics.
ReplyDeleteI thought it was pretty funny that Mr. Stump thinks it takes a $50k server to get a few thousand queries per second out of MySQL. I have a lot of consulting clients who spend way less on hardware and get tens of thousands of QPS. And they are doing things right, not cutting corners or risking failure or using MyISAM (but I repeat myself.)
ReplyDeleteBut it does take many expensive disks when each query does many disk reads because the are not index only.
ReplyDeleteThis problem is common for anyone who works with a "friends" based system. How do you scale when you are doing a query for someone with 40,000 friends (or more)? Even if the table is restructured so the query is index only, isn't it a bit ridiculous to have an IN clause with 40,000 ids in it? So, instead, the computations are done at the time someone diggs an article (write time) instead of when someone else wants to see who dugg the article (read time). I'm pretty sure there's not a way to do this in mysql as efficiently as it can be done in Cassandra. From the article - "Thanks to Cassandra’s excellent write performance and batch operations, every column is inserted at once, atomically, in under a second." The downside is you end up wasting a lot of disk space because a lot of that data may never be read, but the performance gains have to be considerable, no? It's like you're replacing SELECT count(0) FROM Diggs where user_id IN (......) and item_id = 345 with SELECT friend_diggs FROM Friend_Diggs WHERE user_id = 1535 and item_id = 345. Instead of finding and counting a whole bunch of rows, you're just looking up a value from one row.
ReplyDeleteNow going and replacing the entire mysql system with nosql still seems questionable in terms of the potential benefit vs time spent refactoring and migrating. Thanks for pointing out the advantage of using the primary key in InnoDB instead of a secondary index. Forgot about that. It does prevent you from using auto_increment, but that's a whole other can of worms. Cheers.
Thanks for the interesting comment.
ReplyDeleteSometimes the IN-lists are so large that the max query size limit is reached. I don't enjoy that.
We need a robust write-optimized store to reduce the cost of index maintenance for MySQL. The Innodb insert buffer makes a big difference. TokuDB from http://tokutek.com makes an even bigger difference.
I think that Cassandra is fascinating and hope to read more interesting stories about successful deployments.