Don't tell us, show us with realistic benchmark results and customer case studies. A post about using OpenSolaris at SmugMug was hugely popular. We need more of the same. The post was popular because the content was very technical.
DTrace is great. But examples seem to be limited to producing information I can already get from MySQL. Show me what I can't get from community MySQL branches such as OurDelta and Percona that have user, index and table monitoring. Where is the repository of DTrace scripts for MySQL?
I think that ZFS can make some MySQL installations run better and faster. The features I am most interested in are compression and the ability to disable the InnoDB doublewrite buffer. Copy-on-write filesystems make it more likely that compression does not kill performance, so ZFS compression can be useful in the real world. Write throughput on a busy Innodb server should be much higher when the doublewrite buffer is disabled. When it is enabled, writes are done in two phases (write up to 128 16kb pages to a log file, sync, write up to 128 16kb pages in place using random IO, sync). But I have a few questions.
- ZFS does not support O_DIRECT. Nor does it support the hints provided by directio. It supports O_DSYNC and O_SYNC so that synchronous writes can be done. When direct IO is used, filesystem prefetching on reads is not done and bad decisions about what to cache in the filesystem buffer cache are avoided. Linux has had problems for MySQL using InnoDB and a large buffer cache. Is Solaris is smarter about managing the ZFS buffer cache? Are there other ways to limit filesystem read prefetching?
- ZFS uses copy-on-write rather than update-in-place. This can fragment files. InnoDB generates prefetch requests with the assumption that logically adjacent blocks within a 1MB extent are physically adjacent. Most of the benchmarks I have seen for ZFS + InnoDB are on files that have not been aged and are unlikely to have fragmentation. Is this a problem? Are there tools to measure and fix fragmentation?
- Group commit is broken for Innodb when the binlog (replication log) is enabled. If the master is run with innodb_flush_trx_log_at_commit=1, then the Innodb commit log is flushed to disk per commit so the writes for the log will be smaller. What value should be used for recordsize and the Innodb transaction log?
Who wants to go through the hassle of building MySQL on OpenSolaris? I have seen too many blog posts that describe too many problems.
Is anyone running MySQL and Solaris in a public cloud? Sun's service isn't open to new users at this time. OpenSolaris can be used with Amazon EC2, but you have to register to get the OpenSolaris AMI. MySQL is extremely popular in private clouds and there is a huge opportunity to provide services and tools that make it easier to run MySQL in this setup.


Couldn't agree with you more.
ReplyDeleteZFS doesn't support O_DIRECT!!!???????
ReplyDeleteouch.
Solaris doesn't support O_DIRECT -- http://docs.sun.com/app/docs/doc/816-5167/open-2?a=view
ReplyDeleteInstead it provides directio() -- http://docs.sun.com/app/docs/doc/816-5168/directio-3c?a=view, but that is advisory and implemented for UFS but not for ZFS.
directio() is an overloaded term to mean synchronous writes, no buffering, no single writer lock, no COW etc.. ZFS implements parts of what directio() means.
ReplyDeleteZFS supports turning of caching (primarycache attribute in later versions of nevada). ZFS has no single writer lock problem (it uses range locks). The copy-on-write penalty is reduced by streaming sequential writes to disk. You can also turn off prefetch in ZFS. Check out the ZFS Best Practices guide.
Regarding the fragmentation, this is moot for ssds :-) For regular disks there are couple of ideas floating around this. Fortunately, to "defragment" a file in ZFS, all you need to do is to copy it to another file!
@neel -- that behavior for directio() is 'generally' provided 'when possible'. It is not guaranteed. I don't want anyone to think they don't need fsync() when using directio().
ReplyDeleteWhy does UFS still have a single RW lock rather than range locks? Was it that it was simpler to disable the lock when directio() was used rather than to add internal byte range locks? As an aside, the versions of Linux that I use appear to have a single-file lock when O_DIRECT is used with ext2.
Can you set the size of the ZFS cache?
I still want tools for monitoring and measuring the amount of fragmentation. Do they exist?
The copy on write penalty is incurred on reads that were logically sequential but become physically random. Innodb prefetches an extent at a time and allocates an extent at a time. Copy on write means that frequently written blocks with the 1MB extent are not adjacent on disk.
Don't forget, the page size of ZFS matters too. If the InnoDB page size is 16Kb and ZFS' page size is 128Kb, you're going to have lots of performance problems.
ReplyDelete@Sheeri - ZFS must be a fun opportunity for consultants. There is a whole new set of details to get right and new things to learn. For example, what is the relationship between recordsize and compression? From what I have read, whether compression is used should not change the value to use for recordsize -- but I wish the answer to this were easier to find and explained with a few more details.
ReplyDeleteMark,
ReplyDeleteGood notes about fragmentation. This is generally the problem with testing of ZFS and also some of SSD benchmarks I've seen.
People just assume these behave like the standard drives and filesystem so the time is not significant component - like writes to the same logical locations of the disk do the same thing over time.
With ZFS the good question becomes how copy on write fragmentation affects you... with SSD you also deal with a lot of book keeping due to requirement to erase the whole block.
Now that ZFS+SSD is here, someone needs to run the benchmarks for MySQL. Is using the ZFS ZIL on SSD faster than Linux RAID with battery backed disk cache? Does it help to put the Innodb transaction log on SSD with Linux? How much of a benefit is provided by using SSD for the ZFS L2ARC as a read cache?
ReplyDeleteMark,
ReplyDeleteYes, you can set the size of the ZFS cache. You can also fully turn it off, or cache just the metadata. ZFS also supports a seconday cache (L2ARC) that can be hosted on faster disks like ssds.
Unfortunately, I do not know of any tools that let you measure the fragmentation or its impact. Richard Elling has done some really cool graphs of the space map, but I am not sure it will give you what you are looking for.
One way to look at the COW & databases is that the original record can be marked "free" (and reused) when you write to it. i.e you could do in-place writes in _some_ circumstances and still follow the COW principle. This has some effect on the snapshots, but there are some ideas floating around. Hopefully the ZFS engineers will find some time to provide this functionality.
@Sheeri - ZFS supports multiple record (block) sizes. For Innodb, you can set the zfs record size to 16k. Then you do not have to read-modify-write.
ReplyDeletegreat post, Mark!
ReplyDeleteAs someone who ran a major MySQL on Solaris installation, I couldn't agree more with you. of areas where Sun/MySQL collaboration hasn't delivered synergies that were expected.
A repository for DTrace scripts will be awesome. ZFS not supporting O_DIRECT is indeed shocking.
You're right about potential performance issues that ZFS' copy-on-write can cause. Combined with InnoDB's non-optimized read-ahead algorithms this can be troubling for certain real-world applications. Also seems like the benefits of using InnoDB's clustered index for read optimization would go to hell for a write intensive application.
I never got to benchmark MySQL on ZFS mostly because every time we will take a step towards implementing it, we were told to wait as X bug is getting fixed in Y release.
I think more than anything Sun needs a sheepherder to prioritize projects that focus on integration efforts between OpenSolaris, ZFS, DTrace and MySQL.
It seems the behavior of Sun towards OpenSolaris is the same as it used to be towards MySQL before acquisition.
Frank
For random i/o where iosize <= recordsize, ZFS will read one recordsize chunk of data, and not prefetch beyond that. At least, that's what my tests show, and its probably the behaviour you want.
ReplyDeleteMy focus has so far been on mysql scalability where the database fits in memory, so I've only occasionally looked at i/o. Early on I did do a series of tests to further reduce i/o, which led me to turn off log-bin and innodb_support_xa. At that time I was using UFS (Solaris) which incurred an extra i/o for its journal, so switched to ZFS.
Although i/o-impoverished, I subsequently purloined an SSD. My setup now is to use two separate ZFS pools: one for data with 16K recordsize, and one for logs with 128K (default) recordsize. I haven't [yet] studied changing recordsize for the log environment.
I'm not using any RAID option either, so its not exactly a model of what somebody should do for a production environment.
I have various DTrace scripts, but they're not really in the sort of shape that you would package up. In many cases I find my scripts growing organically as I hack/modify/tweak/adapt them while drilling down on specific problems.
The performance of the Query Cache was one that I looked at for a while--how often hash functions were being called, how many records were having to be searched on a hash chain on average, how often were relatively long strings being hashed (e.g. 300-byte SQL text) etc.
Another area where I've used DTrace is to try to identify the hottest objects getting locked underneath functions like buf_page_get_gen [as you know, the implementation of InnoDB's RW Locks and Mutexes aren't particularly scalable].
Building MySQL on Solaris hasn't been a hassle for me for a while, but that's probably more due to having a recipe that now works, and I've avoided parts that have given trouble in the past e.g. Falcon. There have been a number of gnuisms and fragments of dubious portability in the code, which take a while to resolve.
As various ISVs may well attest to, there is a general support problem around building software: ISVs want to keep to a minimum the number of different builds they have to create, may well have to support quite old platforms, and then find that this limits the toolchains they can use. Its easy for me to use Studio 12 or gcc 4.3.1 but I don't have to support the result.
Mark,
ReplyDeleteA repository of dtrace scripts for MySQL is a great idea! For now, the DTraceToolkit is an excellent dtrace repository, and although there are no MySQL specific scripts yet, many of them are applicable. It is available in the SUNWDTraceTookit package in OpenSolaris (look in /opt/DTT) as well as at the http://opensolaris.org/os/community/dtrace/dtracetoolkit/ page. In addition to being generally useful, these scripts are great templates for building your own scripts.
You should also have a look at http://forge.mysql.com/worklog/task.php?id=4541 which lists the Dtrace probes that are being added to MySQL 6.x. A patch that adds the new probes is at http://lists.mysql.com/commits/56041 but keep in mind that this is a work in progress.
"The 5.1 release cycle has not been plagued by performance problems as 5.0 was and Sun just happens to have a talented team of performance engineers working on this."
ReplyDelete5.0 went GA in October 2005. How is it that Sun is unable or unwilling to release 5.1 to GA?
(I wrote a bit about this in September: http://www.intelligententerprise.com/blog/archives/2008/09/infobright_kick.html)
Seth
@Seth -- MySQL isn't the only DBMS vendor that has had years between RDBMS releases. Even some big closed-source vendors have done this. Except for Oracle which gets things done on schedule.
ReplyDeleteMySQL acquired a few experienced release/project managers with the Sun integration. I have seen them in action -- they are very good. Future releases will have fewer delays.
About Directio for Reads, the problem UFS had was that as soon as 2 consecutive _pages_ where touched, UFS would prefetch a full cluster, typically 1 MB. This is clearly suboptimal and a systemic problem when DB blocks is greater than 8K. In that case any read will touch 2 pages and trigger 1MB of read I/O. So yes ZFZ is more cleaver here and Directio is not required to avoid readhead for DB (IMO). It's also possible to tuned prefetching out with kernel variable zfs_prefetch_disable (see per Evil Tuning guide).
ReplyDeleteDirectio is a catchall to bypass filesystems deficiencies for databases. It would be more compelling to see data, showing for instance, how much CPU times was lost to managing an extra copy. Note that, in Opensolaris we do have the option to control if data should install in the ZFS cache or not (zfs property primarycache).
About fragmentation, we do have the option to play with the recordsize which is where the tradeoff between more OLTP performance vs more on-disk contiguity is made.
Looks great information. thanks again.
ReplyDelete