Was there no backup server to replace the failed db server? Twitter previously suffered a long outage from a long recovery when using Joyent after a storage server running ZFS had internal corruption. I guess that even ZFS needs a filesystem repair tool.
Sheeri's post got me to thinking about this. I don't think we are singling out Twitter. It is just that their production problems have gotten a lot of publicity and post-mortems are an effective way for DBAs to learn. We should thank them for sharing details as most of us would not be able to do that.
The reason for the database server crash was too many connections. How might that cause a crash? I can only guess. This problem should not cause a crash.
- max_connections was set too high and the process ran of of address space from all of the thread stacks or file descriptors for all of the sockets for the connections
- max_connections limit was reached and someone killed the mysqld process rather than connecting as a user with SUPER privileges and killing many of the connections
One feature that InnoDB needs is the ability to change the rate of background IO. The limits are currently set with the assumption that the server can do 100 IOPs max and most servers today can do much more than that. The upcoming InnoDB patch at code.google.com will have a variable by which the server IO capacity can be set to allow the background IO rate to be tuned.
I hope we get more details because we can learn from this.


Coldstarting a disk-read-bound mySQL master instance.
ReplyDeleteHow many spindles can be added before mySQL cannot feed them with enough read requests? 8? 16? 32?
If coldstart takes hours, then there should be a standby server for failures like this.
ReplyDeleteI know that 8 isn't a problem and that 16 should be fine, but I don't have access to a server with 32 spindles, so I don't get to observe this. InnoDB has performance problems when there are many concurrent connections, but I suspect that the problem does not exist when those connections are blocked on IO.
InnoDB also issues prefetch requests, but there is only one background thread to handle the requests (unless you use the Google patch which supports as many as you want). Prefetch requests are more likely to be issued for scans than for index lookups. This could make attempts to warm up the cache extremely slow if many concurrent sessions do table and index scans.
If you are able to do full table scans to warm up the cache, then it can be faster to warm up the OS buffer cache first by reading the *.ibd files with external processes before MySQL is started or while it is warming up.
They haven't used Sun/ZFS since last December, they're using RHEL/ext3. (or at least they were as of about a month ago, haven't checked recently)
ReplyDelete>If you are able to do full table
ReplyDelete>scans to warm up the cache, then it
>can be faster to warm up the OS
>buffer cache first by reading the
>*.ibd files with external processes
> before MySQL is started or while it
> is warming up.
Even with innodb_flush_method=O_DIRECT?
> Even with innodb_flush_method=O_DIRECT?
ReplyDeleteNo, and probably not for anything else that doesn't use the OS buffer cache.
Note: I'm not the DBA...
ReplyDeleteThere are read-slaves for fail-over, but it isn't trivial to keep them warm enough. Synthetic queries help some, but without actual traffic hitting the current active set, the fail-over slave's cache is always colder than we'd like.
Full table scans aren't helpful, as the tables are much larger than memory. Instead, we try to get the indexes for critical tables in memory, but this doesn't help enough.
I think we should look into continuously scanning the last few days, but there's always the unpredictable long tail into the past that slows things down.
We don't use OS caching -- we need all the memory for mySQL.
If anyone has experience above 16 spindles, I'd be happy to hear of it. Especially how you've set up your RAID controller.
While the idea of an intervention is kind of funny, it really isn't neccessary. Over the years I have seen countless instances like twitter where you just shake your head and think "if they were using postgres/oracle they wouldn't have these problems". It's not that those systems dont have thier issues, but mysql puts up walls at a much lower level than you'd get from an oracle. Never seems to be a consideration from people once they start down the mysql path though.
ReplyDeleteWith raw IO, you may suffer from the bottleneck of having only 1 thread for prefetch read requests and 1 thread for dirty page writes. The Google patch has code to support a configurable number of reader and writer threads. As InnoDB has yet to add equivalent features to the official release, you or someone might have to backport it if you want it.
ReplyDeleteIf you are also using Linux, then you might be doing extra IO for paging because of all the memory allocated to the InnoDB buffer cache. A fix is to put swap on tmpfs. Don from SmugMug and Kevin Burton have documented the problem and workarounds.
Mark,
ReplyDeleteThe DBAs avoid swap. They either run swap-less or monitor it very very carefully.
Although mySQL only has one read thread -- I'd guess that it's using OS-level async IO. I'd hope that this could keep > 16 spindles saturated...
The hardware path is so murky, that it's hard to make decisions there for the moment.
The background IO threads do not use async IO. Some reads and writes are done directly by the user session threads in InnoDB. But all prefetch reads and most dirty page writes are done by the one read or write thread using blocking IO.
ReplyDeleteMark,
ReplyDeleteDo you have an ETA on multiple background thread patch? I would like to test it on a thousands of IOPS system to see if it helps us.
Support for multiple background IO threads is in the big Google patch. Most of it is in os0file.c, srv0srv.c and srv0start.c.
ReplyDeleteNote that Sun also has a PostgreSQL business:
ReplyDeletehttp://www.sun.com/software/products/postgresql/index.jsp
Yes, they do and PostgreSQL is getting a lot better for OLTP.
ReplyDelete