Today is my first day at Facebook. I am thrilled to be here. This is a great opportunity for me to work on hard problems with talented people. I expect to continue making MySQL faster.
You can friend me on Facebook. I have started a Facebook group where MySQL can be discussed. I will start posting there soon, but it is a group, so you are welcome to join and post as well.
I will be at the Velocity Conference on Tuesday evening. Will anyone else who deploys MySQL be in town?
-
The default table type was changed from MyISAM to InnoDB on a few production servers. There were good reasons for doing this. Besides, what could possibly go wrong? Sometimes the status quo has value even when you can't articulate it.
MyISAM tables are created in several cases:
- CREATE TABLE (...) engine=MyISAM
- CREATE TABLE (...) ; // when the default storage engine is MyISAM
- implicit temporary tables created for ORDER BY and GROUP BY processing that are too large are converted from HEAP to MyISAM
Users began reporting ER_LOCK_TABLE_FULL errors during long-running CREATE TABLE AS SELECT statements. These statements did not specify the storage engine type and began using InnoDB. From reading the code, this error is raised when buf_LRU_buf_pool_running_out returns TRUE during the insert. It returns TRUE when less than 25% of the memory allocated for innodb_buffer_pool_size is available for the buffer pool. Memory from this allocation can be used elsewhere, including for lock structures.
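Here is a minimal standalone sketch of that condition, using my own struct and field names rather than the InnoDB source: the buffer pool is considered to be running out when less than 25% of the pages allocated for innodb_buffer_pool_size are still usable as frames.

#include <stdbool.h>

struct buf_pool_view {
    unsigned long max_pages;   /* pages allocated for innodb_buffer_pool_size */
    unsigned long free_pages;  /* frames on the free list */
    unsigned long lru_pages;   /* frames on the LRU list holding data pages */
};

bool buf_pool_running_out_sketch(const struct buf_pool_view *p)
{
    /* Frames that are on neither list were consumed by other allocations,
       such as lock structs. */
    return p->free_pages + p->lru_pages < p->max_pages / 4;
}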
I need to investigate this to determine whether the locks were used for the created table or the selected table. I don't think the locks have to be used in either case. Locks on rows in the created table might not be needed because the table is not visible to other sessions until the CTAS statement completes (right?). In this case it was CREATE TEMPORARY TABLE ... SELECT, so the table will never be visible to others. Share locks on rows in the selected table might have been taken to guarantee deterministic replay in replication. But this statement was run on a slave, so that should not be needed. I have dealt with this particular problem in the past.
When innodb_locks_unsafe_for_binlog is set in my.cnf, then share locks are not obtained on rows read from the selected table. InnoDB really should be clever enough to do this when the binlog is not open.
Lock structs do not appear to be allocated for the inserted table regardless of the value of innodb_locks_unsafe_for_binlog, so I have yet to determine the source of the memory allocations from the buffer pool. The binary we use has behavior similar to innodb_locks_unsafe_for_binlog, but based on whether or not the binlog is open.
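A rough sketch of that rule, with stand-in names rather than the real code: share locks on the rows read by the SELECT half of the statement are only needed when the statement will be replayed from a binlog.

#include <stdbool.h>

struct trx_view {
    bool binlog_is_open;              /* this server writes a binlog */
    bool locks_unsafe_for_binlog;     /* my.cnf: innodb_locks_unsafe_for_binlog */
};

/* Should INSERT ... SELECT / CREATE TABLE ... SELECT take share locks on
   the rows it reads from the selected table? */
bool need_share_locks_on_select(const struct trx_view *t)
{
    if (t->locks_unsafe_for_binlog)
        return false;                 /* the user opted out of the locks */
    return t->binlog_is_open;         /* locks only matter for binlog replay */
}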
Still a mystery
Memory is allocated from the buffer pool for:
- Buffer pool frames
- Row locks (trx->lock_heap) - but row locks should not be allocated for this statement given the fix or if innodb_locks_unsafe_for_binlog were used.
- Recovery (recv_sys->heap) - this looks like code that is only run during crash recovery when the server is started.
- Adaptive hash index (btr_search_sys->hash_index) - this should use no more than a fixed amount of memory as the number of pages to be indexed is fixed.
Notes from 5.0.77
A call path to buf_LRU_get_free_block is:
mem_heap_create_block -> buf_frame_alloc -> buf_block_alloc
And mem_heap_create_block is called by (search for MEM_HEAP_BUFFER)
- mem_heap_create_func
- mem_heap_add_block
- mem_heap_create_in_buffer
- used for trx->lock_heap, recv_sys->heap, ha_create
- mem_heap_create_in_btr_search
- used for ha_create
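To tie this together, here is a rough sketch, with stand-in names rather than InnoDB source, of why these heaps compete with data pages: a heap created with the MEM_HEAP_BUFFER flag takes whole buffer pool frames for its blocks instead of calling malloc.

#include <stdlib.h>

#define SKETCH_PAGE_SIZE 16384        /* UNIV_PAGE_SIZE with the default 16kb pages */
#define MEM_HEAP_DYNAMIC 0            /* heap blocks come from malloc */
#define MEM_HEAP_BUFFER  1            /* heap blocks come from buffer pool frames */

/* Stand-in for buf_block_alloc/buf_LRU_get_free_block in this sketch. */
static void *take_buffer_pool_frame(void) { return malloc(SKETCH_PAGE_SIZE); }

/* Allocate one block for a heap such as trx->lock_heap or recv_sys->heap. */
void *heap_block_alloc(int heap_type, size_t len)
{
    if (heap_type == MEM_HEAP_BUFFER && len <= SKETCH_PAGE_SIZE)
        return take_buffer_pool_frame();   /* this path competes with data pages */
    return malloc(len);
}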
-
InnoDB supports multiple page sizes. The default page size is 16kb and is compiled into the binary/plugin. The valid page sizes are 8kb, 16kb, 32kb and 64kb. However, there is a known bug for page sizes > 16kb with row_format=COMPACT. Also, there are few deployments with a page size other than 16kb, so the amazing reliability of InnoDB may degrade for other page sizes until the bugs are found and fixed.
To use an alternate page size, edit innobase/include/univ.i to change these:
/* The universal page size of the database */
#define UNIV_PAGE_SIZE (2 * 8192)
/* The 2-logarithm of UNIV_PAGE_SIZE: */
#define UNIV_PAGE_SIZE_SHIFT 14
Does InnoDB support 4kb pages? Maybe. If you edit the fields listed above for a 4kb page size and run MySQL, you will quickly get a core dump. The problem is that there are other per-page objects that are too big for a 4kb page. The first problem I encountered is fixed by editing innobase/include/trx0rseg.h to reduce the number of rollback segment slots per page. The default value is:
#define TRX_RSEG_N_SLOTS 1024
To be really safe, I set it to 256 and the segfault was fixed. I haven't done testing beyond this so your mileage may vary.
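For reference, the edits described above look roughly like this for a 4kb page. The UNIV_PAGE_SIZE and UNIV_PAGE_SIZE_SHIFT values follow from the 16kb defaults shown earlier; 256 for TRX_RSEG_N_SLOTS is just the value I used, not an official recommendation.

/* innobase/include/univ.i */
#define UNIV_PAGE_SIZE (2 * 2048)     /* 4kb pages */
#define UNIV_PAGE_SIZE_SHIFT 12       /* log2(4096) */
/* innobase/include/trx0rseg.h */
#define TRX_RSEG_N_SLOTS 256          /* fewer rollback segment slots per 4kb page */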
InnoDB had a 900 byte per-page overhead when I last measured this. 900 bytes per 4kb page is much worse than per 16kb page. I am not sure when this will get fixed.
-
I have run many tests on mid-size servers to show how patches make InnoDB faster for IO-bound workloads. Those servers could do more than 1000 IOPS and the patches make a big difference. But do the patches help on small servers? I ran tpcc-mysql on my home server with 2 SATA disks, 2 CPU cores and 2GB RAM. There were two interesting results:
- The v4 Google patch makes things faster.
- Only the v4 Google patch enforced innodb_max_dirty_pages_pct=20. InnoDB is not good at enforcing this limit on write-intensive workloads, and the problem has not received much press. If you are running a critical OLTP server and want it to recover quickly after a crash, then this limit must be enforced. Code was added to the v4 Google patch to delay user sessions when the limit has been exceeded: they must flush dirty pages before they are allowed to make more pages dirty. This is enabled by the my.cnf parameter innodb_check_max_dirty_foreground, and a sketch of the idea follows this list. Nobody has reviewed this code (hint, hint).
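The sketch below uses made-up names rather than the patch internals: before a user session can dirty another page, it first flushes dirty pages itself whenever the dirty percentage is above the limit.

struct pool_state {
    unsigned long total_pages;
    unsigned long dirty_pages;
    unsigned long max_dirty_pct;      /* innodb_max_dirty_pages_pct */
    int check_in_foreground;          /* innodb_check_max_dirty_foreground */
};

/* Stand-in for the flush a user session would perform in this sketch. */
void flush_some_dirty_pages(struct pool_state *p)
{
    if (p->dirty_pages > 0)
        p->dirty_pages--;             /* pretend one page was flushed */
}

/* Called by a user session before it makes another page dirty. */
void maybe_flush_in_foreground(struct pool_state *p)
{
    while (p->check_in_foreground &&
           p->dirty_pages * 100 > p->total_pages * p->max_dirty_pct)
        flush_some_dirty_pages(p);    /* the session pays for its own writes */
}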
tpcc-mysql was run with 20 warehouses and 8 users with a 600 second warmup and 3600 second measurement period.
Results
The 5077 binaries were faster than the v4 Google patch at innodb_max_dirty_pages_pct=20 because they did not enforce that limit. I have included a result for the v4 Google patch at innodb_max_dirty_pages_pct=50 to provide an additional reference point.
Binary        TpmC  innodb_max_dirty_pages_pct  Avg %dirty pages
v4-5037       1294  20                          25.1
percona-5077  2112  20                          60.0
mysql-5077    2836  20                          72.1
v4-5037       3024  50                          52.8
v4-5037       3854  80                          72.0
percona-5077  2930  80                          73.7
mysql-5077    3373  80                          82.8
Throughput over time for v4-5037: (chart not reproduced here)
Configuration
Notes:
- tests were run twice for each binary with innodb_max_dirty_pages_pct set to 20 and 80
- the Percona highperf b13 5.0.77 binary uses more memory for the same value of innodb_buffer_pool_size, so I had to reduce that value to 1G
The my.cnf settings for the v4 Google patch:
innodb_buffer_pool_size=1200M
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_io_capacity=250
innodb_read_io_threads=2
innodb_write_io_threads=2
innodb_max_dirty_pages_pct=80 or 20
innodb_ibuf_max_pct_of_buffer=10
skip_innodb_ibuf_reads_sync
innodb_doublewrite=0
innodb_file_per_table
allow_view_trigger_sp_subquery
skip_innodb_readahead_random
skip_innodb_readahead_sequential
innodb_flush_method=O_DIRECT
innodb_check_max_dirty_foreground
The my.cnf settings for Percona highperf b13 5.0.77:
innodb_log_file_size=1900M
innodb_buffer_pool_size=1000m
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT
innodb_io_capacity=250
innodb_read_io_threads=2
innodb_write_io_threads=2
innodb_max_dirty_pages_pct=80
innodb_ibuf_max_size=120M
innodb_ibuf_active_contract=1
innodb_ibuf_accel_rate=200
innodb_doublewrite=0
innodb_read_ahead=0
innodb_adaptive_checkpoint=1
The my.cnf settings for 5.0.77:
innodb_buffer_pool_size=1200M
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_max_dirty_pages_pct=80
innodb_doublewrite=0
innodb_file_per_table
innodb_flush_method=O_DIRECT
-
I learn a lot about InnoDB when people ask me questions. In this case, someone asked whether fsync was used for writes done when InnoDB is running with innodb_flush_method=O_DIRECT. It is still used. It does not need to be used and I am not sure whether this has a measurable impact on performance.
Buffered IO
You can configure InnoDB to use direct IO for data files or for transaction log files, but not for both at the same time. I added a new value for innodb_flush_method, allsync, to change that. When innodb_flush_method=allsync is used, the behaviors for O_DIRECT and O_DSYNC (described below) are both implied. I have yet to find a significant performance benefit from that change. I may not have the right hardware.
Data files are opened with O_DIRECT when innodb_flush_method is set to O_DIRECT. fsync is still used in this case, but it doesn't need to be.
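A minimal standalone sketch of that pattern (the file name and sizes are placeholders, and this is not InnoDB source): open the data file with O_DIRECT, write one page, then call fsync, which is the call that appears to be redundant.

#define _GNU_SOURCE                   /* O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("ibdata1.sketch", O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    void *page;
    if (posix_memalign(&page, 4096, 16384))   /* O_DIRECT needs aligned buffers */
        return 1;
    memset(page, 0, 16384);

    if (pwrite(fd, page, 16384, 0) != 16384)  /* the write goes straight to the device */
        return 1;
    fsync(fd);                        /* still issued today; arguably unnecessary with O_DIRECT */

    free(page);
    close(fd);
    return 0;
}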
Transaction log files are opened with O_SYNC when innodb_flush_method is set to O_DSYNC. fsync is not used in this case. Writes to the transaction log are done in multiples of OS_FILE_LOG_BLOCK_SIZE (set to 512 in os0file.h). The filesystem block size is likely to be much larger than 512 and each 512 byte (or N*512 byte) write done by InnoDB requires a file system block write to the disk. According to the man page and actual behavior on Linux, buffering can be done so that a sequence of 512 byte writes with a 4kb file system block size does not require a disk read on every write.
It might be good to use a larger value for OS_FILE_LOG_BLOCK_SIZE (1024?) and a smaller file system block size for the file system that stores the transaction log (but not for the file system that stores the data files). This is more likely to be useful on SSD and dependent on your workload.
Update
I used sysbench fileio to determine whether there is a performance impact from calling fsync after writes on a file opened with O_DIRECT. There is an impact, but it remains to be seen whether that translates to a performance impact for InnoDB. To test this, I used a server with:
- 2 CPU cores
- 2 disks, SATA, 7200 RPM
- 2 disk SW RAID 0, 1MB RAID stripe, XFS file system
- SATA write cache enabled
sysbench --test=fileio --file-num=1 --file-total-size=4G --file-test-mode=rndwr --file-extra-flags=direct --file-fsync-freq=$fsf --num-threads=$nt --max-requests=0 --max-time=60 run
Writes per second (no fsync vs fsync) were higher in all but one case when fsync was not done. With the SATA write cache enabled:
- 503 vs 532 for --num-threads=1
- 501 vs 438 for --num-threads=2
- 510 vs 447 for --num-threads=4
- 495 vs 451 for --num-threads=8
- 502 vs 464 for --num-threads=16
And with the SATA write cache disabled:
- 109 vs 49 for --num-threads=1
- 160 vs 58 for --num-threads=2
- 189 vs 70 for --num-threads=4
- 209 vs 77 for --num-threads=8
I get a ~5% speedup on tpcc-mysql (1294 vs 1237 TpmC) when fsync calls are disabled with innodb_flush_method set to O_DIRECT. The server for this test is described in the previous section. The SATA write cache was disabled for the test. These my.cnf parameters were used:
- innodb_buffer_pool_size=1200M
- innodb_log_file_size=1900M
- innodb_flush_log_at_trx_commit=2
- innodb_io_capacity=250
- innodb_read_io_threads=2
- innodb_write_io_threads=2
- innodb_max_dirty_pages_pct=20
- innodb_ibuf_max_pct_of_buffer=10
- skip_innodb_ibuf_reads_sync
- innodb_doublewrite=0
- innodb_file_per_table
- allow_view_trigger_sp_subquery
- skip_innodb_readahead_random
- skip_innodb_readahead_sequential
- innodb_flush_method=O_DIRECT
- innodb_check_max_dirty_foreground
-
Jeremy has an article in LinuxMag. This is worth reading.
-
Is synchronous replication possible in MySQL? Yes. Is it possible without major surgery to the existing code? Probably (or hopefully). Notes on an approach are at code.google.com.
The MySQL replication team may be working on this now for MySQL 6.0. They have spent a lot of time recently making replication flexible to support semi-sync and other new features. I assume they plan to support sync replication as well.
-
InnoDB prefetches blocks when it detects multiple accesses to blocks within an extent. Unfortunately, there are no metrics in the server to determine whether it is effective. There are also weak metrics in the server to determine how frequently it is done -- counters incremented each time the readahead code prefetches one or more blocks rather than once per prefetch request.
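A sketch of the trigger with a made-up threshold (the real value and bookkeeping live in the InnoDB readahead code): when enough pages within a 64-page extent have been accessed recently, the remaining pages of the extent are prefetched.

#include <stdbool.h>

#define PAGES_PER_EXTENT 64           /* an InnoDB extent is 64 16kb pages (1MB) */
#define PREFETCH_THRESHOLD 13         /* hypothetical trigger value for this sketch */

/* recent_access[i] is true if page i of the extent was touched recently. */
bool should_prefetch_extent(const bool recent_access[PAGES_PER_EXTENT])
{
    int hits = 0;
    for (int i = 0; i < PAGES_PER_EXTENT; i++)
        if (recent_access[i])
            hits++;
    return hits >= PREFETCH_THRESHOLD;   /* if true, read the remaining pages */
}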
There are cases where prefetch improves performance. A query that does a full table scan was run with prefetch enabled and disabled. It was 35% slower with prefetch disabled.
Percona and Matt have written about potential performance problems from this feature. There isn't much data to indicate when this feature should be enabled. I have published data for a few IO-bound benchmarks. On these tests, the prefetching done by InnoDB reduces performance. The tests run were:
- insert benchmark
- tpcc-mysql with data cached by InnoDB
- tpcc-mysql with too much data to cache
Update -- my peers just reminded me to add support for per-session (dynamic) usage of the new my.cnf parameters (innodb_readahead_random, innodb_readahead_sequential).