Thursday, May 29, 2008

Twitter and us

Twitter appears to depend on MySQL and Sun. So much so that a failed (MySQL?) database server was worthy of downtime, a blog post and several hours of recovery. Is it time to stage an intervention before PostgreSQL or Oracle get the account? Should we send in Percona, Proven Scaling and the other expert consultants?

Was there no backup server to replace the failed db server? Twitter previously suffered a long outage from a long recovery when using Joyent after a storage server running ZFS had internal corruption. I guess that even ZFS needs a filesystem repair tool.

Sheeri's post got me to thinking about this. I don't think we are singling out Twitter. It is just that their production problems have gotten a lot of publicity and post-mortems are an effective way for DBAs to learn. We should thank them for sharing details as most of us would not be able to do that.

The reason for the database server crash was too many connections. How might that cause a crash? I can only guess. This problem should not cause a crash.
  • max_connections was set too high and the process ran of of address space from all of the thread stacks or file descriptors for all of the sockets for the connections
  • max_connections limit was reached and someone killed the mysqld process rather than connecting as a user with SUPER privileges and killing many of the connections
Why did the server require several hours to recover? According to this post, they use InnoDB. So the long recovery time is either file system repair after a server crash, but that should be fast if they use Sun and ZFS, right? So I will guess that the problem was a long crash recovery time. The most likely culprit for slow crash recovery is undo of a long running insert/update/delete or DDL statement, and the way to avoid that is to not do those statements when you can't afford the slow recovery. InnoDB doesn't do parallel crash recovery, Oracle does, sometimes it pays to pay. The other culprit for slow crash recovery is that the server doesn't have sufficient random IO bandwidth to handle the recovery workload, and that workload is rarely tested. InnoDB has a few features that allow the amount of pending IO to be limited. You can set innodb_max_dirty_pages_pct. Oracle more features in that area that allow the DBA to control the expected time to recover after a crash.

One feature that InnoDB needs is the ability to change the rate of background IO. The limits are currently set with the assumption that the server can do 100 IOPs max and most servers today can do much more than that. The upcoming InnoDB patch at code.google.com will have a variable by which the server IO capacity can be set to allow the background IO rate to be tuned.

I hope we get more details because we can learn from this.

14 comments:

  1. Coldstarting a disk-read-bound mySQL master instance.

    How many spindles can be added before mySQL cannot feed them with enough read requests? 8? 16? 32?

    ReplyDelete
  2. If coldstart takes hours, then there should be a standby server for failures like this.

    I know that 8 isn't a problem and that 16 should be fine, but I don't have access to a server with 32 spindles, so I don't get to observe this. InnoDB has performance problems when there are many concurrent connections, but I suspect that the problem does not exist when those connections are blocked on IO.

    InnoDB also issues prefetch requests, but there is only one background thread to handle the requests (unless you use the Google patch which supports as many as you want). Prefetch requests are more likely to be issued for scans than for index lookups. This could make attempts to warm up the cache extremely slow if many concurrent sessions do table and index scans.

    If you are able to do full table scans to warm up the cache, then it can be faster to warm up the OS buffer cache first by reading the *.ibd files with external processes before MySQL is started or while it is warming up.

    ReplyDelete
  3. They haven't used Sun/ZFS since last December, they're using RHEL/ext3. (or at least they were as of about a month ago, haven't checked recently)

    ReplyDelete
  4. >If you are able to do full table
    >scans to warm up the cache, then it
    >can be faster to warm up the OS
    >buffer cache first by reading the
    >*.ibd files with external processes
    > before MySQL is started or while it
    > is warming up.

    Even with innodb_flush_method=O_DIRECT?

    ReplyDelete
  5. > Even with innodb_flush_method=O_DIRECT?

    No, and probably not for anything else that doesn't use the OS buffer cache.

    ReplyDelete
  6. Note: I'm not the DBA...

    There are read-slaves for fail-over, but it isn't trivial to keep them warm enough. Synthetic queries help some, but without actual traffic hitting the current active set, the fail-over slave's cache is always colder than we'd like.

    Full table scans aren't helpful, as the tables are much larger than memory. Instead, we try to get the indexes for critical tables in memory, but this doesn't help enough.

    I think we should look into continuously scanning the last few days, but there's always the unpredictable long tail into the past that slows things down.

    We don't use OS caching -- we need all the memory for mySQL.

    If anyone has experience above 16 spindles, I'd be happy to hear of it. Especially how you've set up your RAID controller.

    ReplyDelete
  7. While the idea of an intervention is kind of funny, it really isn't neccessary. Over the years I have seen countless instances like twitter where you just shake your head and think "if they were using postgres/oracle they wouldn't have these problems". It's not that those systems dont have thier issues, but mysql puts up walls at a much lower level than you'd get from an oracle. Never seems to be a consideration from people once they start down the mysql path though.

    ReplyDelete
  8. With raw IO, you may suffer from the bottleneck of having only 1 thread for prefetch read requests and 1 thread for dirty page writes. The Google patch has code to support a configurable number of reader and writer threads. As InnoDB has yet to add equivalent features to the official release, you or someone might have to backport it if you want it.

    If you are also using Linux, then you might be doing extra IO for paging because of all the memory allocated to the InnoDB buffer cache. A fix is to put swap on tmpfs. Don from SmugMug and Kevin Burton have documented the problem and workarounds.

    ReplyDelete
  9. Mark,

    The DBAs avoid swap. They either run swap-less or monitor it very very carefully.

    Although mySQL only has one read thread -- I'd guess that it's using OS-level async IO. I'd hope that this could keep > 16 spindles saturated...

    The hardware path is so murky, that it's hard to make decisions there for the moment.

    ReplyDelete
  10. The background IO threads do not use async IO. Some reads and writes are done directly by the user session threads in InnoDB. But all prefetch reads and most dirty page writes are done by the one read or write thread using blocking IO.

    ReplyDelete
  11. Mark,

    Do you have an ETA on multiple background thread patch? I would like to test it on a thousands of IOPS system to see if it helps us.

    ReplyDelete
  12. Support for multiple background IO threads is in the big Google patch. Most of it is in os0file.c, srv0srv.c and srv0start.c.

    ReplyDelete
  13. Note that Sun also has a PostgreSQL business:

    http://www.sun.com/software/products/postgresql/index.jsp

    ReplyDelete
  14. Yes, they do and PostgreSQL is getting a lot better for OLTP.

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.