Was there no backup server to replace the failed database server? Twitter previously suffered a long outage while hosted on Joyent, when a storage server running ZFS had internal corruption and recovery took a long time. I guess that even ZFS needs a filesystem repair tool.
Sheeri's post got me thinking about this. I don't think we are singling out Twitter; it is just that their production problems have gotten a lot of publicity, and post-mortems are an effective way for DBAs to learn. We should thank them for sharing details, as most of us would not be able to do that.
The reported cause of the database server crash was too many connections. How might that cause a crash? Reaching the connection limit should return an error to new clients, not crash the server, so I can only guess:
- max_connections was set too high and the process ran out of address space from all of the thread stacks, or ran out of file descriptors from all of the connection sockets (see the worked example after this list)
- the max_connections limit was reached and someone killed the mysqld process rather than connecting as a user with the SUPER privilege and killing some of the connections (sketched after this list)
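
To make the first guess concrete, here is a hypothetical my.cnf fragment; the values are mine, not Twitter's:

```
[mysqld]
# 10,000 connections x 256 KB per thread stack ~= 2.5 GB of address
# space for stacks alone, enough to exhaust a 32-bit mysqld process.
max_connections = 10000
thread_stack    = 256K
# Each connection also needs at least one file descriptor for its
# socket, so the fd limit must scale with max_connections as well.
open_files_limit = 20000
```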
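And the second guess, as SQL. MySQL reserves one extra connection slot for a user with the SUPER privilege, so a DBA can still get in after max_connections has been reached; the thread id here is made up:

```sql
-- Connect as a SUPER user, find the runaway connections...
SHOW PROCESSLIST;
-- ...and kill them by thread id instead of killing mysqld.
KILL 12345;
```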
One feature that InnoDB needs is the ability to change the rate of background IO. The limits are currently hardcoded on the assumption that a server can do at most 100 IOPS, and most servers today can do much more than that. The upcoming InnoDB patch at code.google.com will have a variable by which the server's IO capacity can be set, allowing the background IO rate to be tuned.
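As a sketch of how that tuning might look, assuming the patch exposes the setting as a system variable named innodb_io_capacity (the name used in the Google patches) and that the server can sustain about 1000 IOPS:

```
[mysqld]
# Let InnoDB schedule background IO (dirty page flushing, insert
# buffer merges) against the real device capacity rather than the
# hardcoded assumption of 100 IOPS.
innodb_io_capacity = 1000
```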
I hope we get more details because we can learn from this.