Friday, June 5, 2009

Buffered versus direct IO for InnoDB

I learn a lot about InnoDB when people ask me questions. In this case, someone asked whether fsync was used for writes done when InnoDB is running with innodb_flush_method=O_DIRECT. It is still used. It does not need to be used and I am not sure whether this has a measurable impact on performance.

Buffered IO

You can configure InnoDB to use direct IO for data files or for transaction log files but not for both at the same time. I added a new value for innodb_flush_method, allsync, to change that. When innodb_flush_method=allsync is used, the behavior for O_DIRECT and O_DSYNC (described below) are implied. I have yet to find a significant performance benefit from that change. I may not have the right hardware.

Data files are opened with O_DIRECT when innodb_flush_method is set to O_DIRECT. fsync is still used in this case, but it doesn't need to be.

Transaction log files are opened with O_SYNC when innodb_flush_method is set to O_DSYNC. fsync is not used in this case. Writes to the transaction log are done in multiples of OS_FILE_LOG_BLOCK_SIZE (set to 512 in os0file.h). The filesystem block size is likely to be much larger than 512 and each 512 byte (or N*512 byte) write done by InnoDB requires a file system block write to the disk. According to the man page and actual behavior on Linux, buffering can be done so that a sequence of 512 byte writes with a 4kb file system block size does not require a disk read on every write.

It might be good to use a larger value for OS_FILE_LOG_BLOCK_SIZE (1024?) and a smaller file system block size for the file system that stores the transaction log (but not for the file system that stores the data files). This is more likely to be useful on SSD and dependent on your workload.

Update

I used sysbench fileio to determine whether there is a performance impact from calling fsync after writes on a file opened with O_DIRECT. There is an impact, but it remains to be seen whether that translates to a performance impact for InnoDB. To test this, I used a server with:
  • 2 CPU cores
  • 2 disks, SATA, 7200 RPM
  • 2 disk SW RAID 0, 1MB RAID stripe, XFS file system
  • SATA write cache enabled
I used this sysbench command line with fsf set to (0, 1) and nt to (1,2,4,8,16):
sysbench --test=fileio --file-num=1 --file-total-size=4G --file-test-mode=rndwr --file-extra-flags=direct --file-fsync-freq=$fsf --num-threads=$nt --max-requests=0 --max-time=60 run
Writes per second was always higher when fsync was not done and the SATA write cache enabled:
  • 503 vs 532 for --num-threads=1
  • 501 vs 438 for --num-threads=2
  • 510 vs 447 for --num-threads=4
  • 495 vs 451 for --num-threads=8
  • 502 vs 464 for --num-threads=16
They were also higher when fsync was not done and the SATA write cache disabled:
  • 109 vs 49 for --num-threads=1
  • 160 vs 58 for --num-threads=2
  • 189 vs 70 for --num-threads=4
  • 209 vs 77 for --num-threads=8
Update2

I get ~5% speedup on tpcc-mysql (1294 vs 1237 transactions per second) when fsync calls are disabled with innodb_flush_method set to O_DIRECT.  The server for this test is described in the previous section. The SATA write cache was disabled for the test. These my.cnf parameters were used:

  • innodb_buffer_pool_size=1200M
  • innodb_log_file_size=1900M
  • innodb_flush_log_at_trx_commit=2
  • innodb_io_capacity=250
  • innodb_read_io_threads=2
  • innodb_write_io_threads=2
  • innodb_max_dirty_pages_pct=20
  • innodb_ibuf_max_pct_of_buffer=10
  • skip_innodb_ibuf_reads_sync
  • innodb_doublewrite=0
  • innodb_file_per_table
  • allow_view_trigger_sp_subquery
  • skip_innodb_readahead_random
  • skip_innodb_readahead_sequential
  • innodb_flush_method=O_DIRECT
  • innodb_check_max_dirty_foreground

10 comments:

  1. I thought InnoDB does not expose the sync method on TXN logs to outside, and flush method is only applies to data files.

    In the fast I did experiment txn logs with different options by patching, but did not reveal any performance .. may be did not test it with right work load..

    ReplyDelete
  2. Stewart says that on Linux, fsync() is not necessary but general POSIX could have your data sitting in userspace.

    Just out of curiosity I ran some tests on the overhead of having fsync() in addition to O_DIRECT, and when the write size is larger than absolutely tiny, it becomes insignificant.
    So there's no need to add logic to not call fsync() on Linux when O_DIRECT is enabled.

    ReplyDelete
  3. I still think there might be a benefit from disabling the call to fsync. InnoDB holds a mutex when it is called, so the extra time from calling fsync may increase contention on that mutex. But I need to test that on an SMP server with a write-intensive server, innodb_file_per_table and innodb_flush_log_at_trx_commit=1. But I need to test that.

    There is a lot of bookkeeping code in Innodb devoted to counting the number of pending and completed fsync's per file. If that could be removed, then this might be significant.

    But I need a big server with a lot of IO capacity to figure it out.

    ReplyDelete
  4. Updated with results from sysbench fileio to show that in some cases, not doing fsync after write is faster.

    ReplyDelete
  5. I was searching for the exact data for InnoDB.. Thanx... Keep it up the blog...

    ReplyDelete
  6. First - it looks like you are running a non standard version of MySQL...say the Percona build?

    Second - I have noticed that a lot of configs that have O_DIRECT are disabling the dobulewrite buffer, is this a good move? Can you guarantee that your OS isn't writing half pages? What is the performance benefit?

    ReplyDelete
  7. I always run non-standard builds. But it is probably better for most people to use Percona or the new InnoDB plugins.

    Unless your filesystem (ZFS?) guarantees that partial page writes do not occur, then you should leave the doublewrite buffer on.

    ReplyDelete
  8. Mark, thanks for the reply.

    No ZFS, just ext3. I'm looking for ways to squeeze IO out of a system running MySQL-5.0.79-enterprise-log. (soon will be switching to 5.0.86-percona-highpref-b16-log).

    In doing so I'm trying to understand where the IO time is being spent. Looking at innotop buffer I/O I see almost 100 times the amount of writes than I do reads, this can't be good would you agree?

    I guess I'm asking what your practices are when trying to diagnosis a performance issue that is clearly an IO bottleneck, maybe a blog post? :)

    ReplyDelete
  9. I will try to describe how to use the extra monitoring I added to SHOW INNODB STATUS. When your working set is cached or almost cached, 100X more writes than reads is OK. Otherwise you might have a problem that can be fixed by tuning.

    ReplyDelete
  10. hello, I find it quite interesting topic, the truth the first part of the article is a bit confusing, but then you explain better, thank you for sharing this information, I served for my college project

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.