Whether Innodb can saturate the IO capacity of a server depends on that capacity and on the workload. I am writing about IO bound workloads and will start with a few generalizations. First, a read-only workload with many concurrent connections is the most likely to saturate the IO capacity. Second, Innodb is unlikely to use all of the IO capacity of a server with many disks. Third, Innodb can be slower than expected when it does not saturate the IO capacity of a server.
These are not profound claims. The details make them interesting. I will explain how IO is done by Innodb. While running Tokutek's insert benchmark, performance was slower and more variable than I expected. So I read the code, changed the code, reran the tests and then started over.
The blog entry favors details over clarity. There are a lot of details to present and I did not want to write a book. Baron and others have already done a great job with that.
The IO problem was apparent to me when a server that could do 1000 IOPs would do no more than 400 IOPs on the iibench benchmark. With changes described below, I was able to get the server to do more than 900 IOPs. In addition to not getting enough IO activity, the other problems were:
- the insert buffer reached its max size of ~250,000 pages. This consumed half of the buffer cache memory. Once the insert buffer is full, new entries are not added to it, so it provides no performance gain, yet it continues to occupy half of the buffer cache.
- the max dirty pages percent was frequently above the limit.
Sources of IO requests
There are several sources of information that you should use to monitor the IO performance of Innodb: iostat, vmstat, SHOW STATUS and SHOW INNODB STATUS. You can use this information to determine how much IO your system is doing and the sources of that IO. The most common sources of IO are:
- 16kb page reads are done synchronously by a user's connection
- N * 16kb page prefetch reads are submitted by users' connections and processed by a background thread.
- 16kb page writes are submitted by users' connections and processed by a background thread. A user's connection waits for these to complete when there are not enough free blocks in the buffer pool.
- N * 512 byte transaction log writes are done by a background thread. Users' connections may wait for these to complete.
- 16kb page reads are submitted by a background thread to process pending changes from the insert buffer. SHOW INNODB STATUS displays the size of the insert buffer in pages and the number of reads that have been done to merge entries from the insert buffer to secondary indexes.
- 16kb page writes are submitted by a background thread to flush dirty pages and processed by another background thread.
- a background thread performs IO to physically remove deleted rows. This is rarely a problem for me so I won't say much more about this.
- one thread uses synchronous IO to process the read requests submitted for prefetches. Percona and Google have patches to use multiple threads for this.
- one thread uses synchronous IO to process the write requests for dirty pages. Percona and Google have patches to use multiple threads for this. This is less of a problem when buffered IO is used as writes to the OS buffer cache are usually fast.
- group commit is broken for Innodb in MySQL 5. This can limit a server to 100 commits per second when innodb_flush_log_at_trx_commit=1 is used and disk write/fsync is not cached.
The Google patch adds the average time per IO request to the FILE I/O section. In the output below, the average time per read is high, so using multiple read threads should help. The IO times for the log and write threads are low.
I/O thread 1 state: waiting for i/o request (log thread) reads 0 writes 254629 requests 254629 io secs 2.4 io msecs/request 0.01 max_io_wait 35.97
I/O thread 2 state: waiting for i/o request (read thread) reads 25670511 writes 0 requests 2887087 io secs 24374 io msecs/request 8.44 max_io_wait 543.92
I/O thread 6 state: waiting for i/o request (write thread) reads 0 writes 6765463 requests 4864961 io secs 4357 io msecs/request 0.89 max_io_wait 2310.88
The FILE I/O section also reports on the total number of reads and writes done and the average over the last N seconds:
39013958 OS file reads, 40682454 OS file writes, 5268691 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 13.33 writes/s, 1.33 fsyncs/s
The insert buffer section reports on the size in pages (size 1), the number of entries that have been written back to a secondary index (2549108 merged recs) and the number of reads that have been done to merge entries (1653924 merges).
INSERT BUFFER AND ADAPTIVE HASH INDEX
Ibuf: size 1, free list len 10115, seg size 10117,
2549116 inserts, 2549108 merged recs, 1653924 merges
The buffer pool section lists the number of dirty pages, the total number of pages read and written and the rate of page reads and writes over the last N seconds.
BUFFER POOL AND MEMORY
Buffer pool size 524288
Free buffers 1
Database pages 509011
Modified db pages 873
Pages read 91731122, created 1310762, written 25496758
0.00 reads/s, 0.33 creates/s, 0.00 writes/s
The transactions section lists the status of the purge thread that removes deleted rows.
TRANSACTIONS
Trx id counter 0 344120133
Purge done for trx's n:o < 0 344093914 undo n:o < 0 0
Background IO architecture
This is done once per second from srv_master_thread()
- flush transaction log to disk
- reserve space in in-memory log buffer
- log_free_check() -> log_check_margins() -> log_flush_margin()
- flush the too old dirty pages from the buffer pool
- log_free_check() -> log_check_margins() -> log_checkpoint_margin()
- there might be no pages that are too old
- depending on how old the dirty pages are, the page flush might be done synchronously. This code is also run directly by users' connections, so we want the background thread to make sure that no pages are too old rather than have a user's connection wait on page flushes.
- merge insert buffer entries for up to 5 index pages
- this is done using synchronous read requests and only when the server has not done much IO
- flush up to 100 dirty pages if there are too many
- when this is done the next iteration of the 1-second loop starts immediately rather than sleeping for 1 second
This is done once per 10 seconds from srv_master_thread()
- flush up to 100 dirty pages if there are too many
- this is only done if the server has not done many IOs recently
- merge insert buffer entries for up to 5 index pages
- this is done using synchronous IO
- adaptive checkpointing - this is available from Percona. I can guess at what it does but I have yet to look at the code.
- innodb_io_capacity - this is in patches from Google and Percona. Background IO in Innodb has rate limits based on the assumption that the server can do 100 IOPs. This option changes the assumed maximum to the value set in a my.cnf parameter. For example, it will change the number of page reads done for insert buffer merges from 5 per second to 5% of innodb_io_capacity.
- innodb_read_io_threads, innodb_write_io_threads - this is in patches from Google and Percona. It supports multiple background IO threads for prefetch reads and dirty page writes.
I made many other changes to boost the IOPs rate on a busy server. I need more time to evaluate them.
- support more background IO requests - Innodb supports 256 pending requests per background thread type. That is, there can be at most 256 pending prefetch read requests and 256 pending write requests. When innodb_io_capacity is set to a value much larger than 100, it is likely that at times more than 256 requests will be submitted. This is much more likely if the other changes described here are also made. When the max number of pending requests has been submitted, other requests for async IO will block.
- limit async background IO requests - the main background thread is not supposed to block for long on any of the tasks that it must do. It uses async IO requests to flush the buffer cache and can use them to merge insert buffer entries. It should stop submitting requests when the limit for the background read or write threads has been reached. Otherwise, it will block waiting for an available IO request slot.
- merge insert buffer entries more frequently - Innodb merges entries for up to 5 index pages per second, but only when the server is not doing much IO. Otherwise, entries are merged for up to 5 index pages once per 10 seconds. With the innodb_io_capacity patch, the number of index pages updated is 0.05 * innodb_io_capacity rather than 5, but that is still too few. When the insert buffer is full (uses half of the buffer cache memory), inserts are no longer done to it, but it is still flushed slowly. This can kill performance. The server must do much more IO to avoid this state. I am evaluating these changes:
- change the 1-second background IO loop to merge entries for up to 0.5 * innodb_io_capacity pages when the insert buffer uses more than 25% of the buffer cache memory
- make a similar change for the 10-second background IO loop
- use async read IO requests handled by the read prefetch threads rather than sync read IO requests handled by the main background thread
- flush neighbor pages less frequently - in many cases when Innodb flushes a page it also flushes all other dirty pages in the same 64-page extent. While this can reduce random IO on writes it also may increase the number of dirty pages that are flushed and it will take much longer to flush all of the old dirty pages since so many not-so-old pages will be flushed at the same time.
- change buf_LRU_buf_pool_running_out - this function assumes that the buffer pool is running out of free pages when about 90% of the pages are dirty. If you have configured your server with innodb_max_dirty_pages_pct=90 then this will frequently be true. The workaround is to use a smaller value for innodb_max_dirty_pages_pct or to change the code.
- avoid blocking a user's connection - a connection may block near the start of a transaction in log_checkpoint_margin(). It might block waiting for old dirty pages to be flushed to disk. This can be avoided when the main background thread is able to flush dirty pages as fast as needed. That is possible when more background IO threads are used (innodb_write_io_threads), innodb_io_capacity is set to an appropriate value, and the background IO threads each support enough IO requests for the value of innodb_io_capacity.
- make IO request scheduling faster - the algorithm for IO request scheduling used by background IO threads is O(N*N) where N is the number of IO slots. The number of IO slots is 256 today, but might need to be increased to the value of innodb_io_capacity. O(N*N) is a bit more of a problem for N >= 1000 than it is for N = 256. It is probably a good idea to fix this.