1. A lot has been written about the performance problems in InnoDB on CPU-bound workloads and servers with more than 4 cores. Not as much has been written about IO performance. For many workloads InnoDB will not use the full IO capacity of a server that has a lot of it (a lot meaning much more than 100 IOPs).

    Whether InnoDB can saturate the IO capacity of a server depends on that capacity and on the workload. I am writing about IO-bound workloads and will start with a few generalizations. First, a read-only workload with many concurrent connections is the most likely to saturate the IO capacity. Second, InnoDB is unlikely to use all of the IO capacity of a server with many disks. Third, InnoDB can be slower than expected when it does not saturate the IO capacity of a server.

    These are not profound claims. The details make them interesting. I will explain how IO is done by InnoDB. While running Tokutek's insert benchmark, performance was slower and more variable than I expected. So I read the code, changed the code, reran tests and then started over.

    The blog entry favors details over clarity. There are a lot of details to present and I did not want to write a book. Baron and others have already done a great job with that.
     
    The IO problem was apparent to me when a server that could do 1000 IOPs would do no more than 400 IOPs on the iibench benchmark. With changes described below, I was able to get the server to do more than 900 IOPs. In addition to not getting enough IO activity, the other problems were:
    • the insert buffer reached its max size of ~250,000 pages. This used half of the buffer cache memory, and once the insert buffer is full it is no longer used, so there is no performance gain, yet it continues to use half of the buffer cache.
    • the percentage of dirty pages was frequently above the innodb_max_dirty_pages_pct limit.
    DTrace would have made my investigation much easier. oProfile is great for finding where a program spends CPU time. But for this debugging I needed to find where mysqld was waiting or spending wall clock time.

    Sources of IO requests

    There are several sources of information that you should use to monitor the IO performance of Innodb: iostat, vmstat, SHOW STATUS and SHOW INNODB STATUS. You can use this information to determine how much IO your system is doing and the sources of that IO. The most common sources of IO are:
    • 16kb page reads are done synchronously by a user's connection
    • N * 16kb page prefetch reads are submitted by users' connections and processed by a background thread.
    • 16kb page writes are submitted by users' connections and processed by a background thread. The user's connection waits for these to complete when there are not enough free blocks in the buffer pool.
    • N * 512 byte transaction log writes are done by a background thread. Users' connections may wait for these to complete.
    • 16kb page reads are submitted by a background thread to process pending changes from the insert buffer. SHOW INNODB STATUS displays the size of the insert buffer in pages and the number of reads that have been done to merge entries from the insert buffer to secondary indexes.
    • 16kb page writes are submitted by a background thread to flush dirty pages and processed by another background thread.
    • a background thread performs IO to physically remove deleted rows. This is rarely a problem for me so I won't say much more about this.
    Performance may be worse than expected because:
    • one thread uses synchronous IO to process the read requests submitted for prefetches. Percona and Google have patches to use multiple threads for this.
    • one thread uses synchronous IO to process the write requests for dirty pages. Percona and Google have patches to use multiple threads for this. This is less of a problem when buffered IO is used as writes to the OS buffer cache are usually fast.
    • group commit is broken for Innodb in MySQL 5. This can limit a server to 100 commits per second when innodb_flush_log_at_trx_commit=1 is used and disk write/fsync is not cached.
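    To make the 100 commits per second ceiling concrete, here is the arithmetic as a small C program. The 10 millisecond write+fsync latency and the batch size of 10 are assumptions for illustration, not measurements.
    #include <stdio.h>

    int main(void) {
        double fsync_ms = 10.0;          /* assumed latency of one uncached write+fsync */
        double commits_per_fsync = 1.0;  /* broken group commit: one commit per fsync */
        printf("max commits/second without group commit: %.0f\n",
               1000.0 / fsync_ms * commits_per_fsync);

        commits_per_fsync = 10.0;        /* working group commit batches concurrent commits */
        printf("with group commit batching %.0f commits per fsync: %.0f\n",
               commits_per_fsync, 1000.0 / fsync_ms * commits_per_fsync);
        return 0;
    }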
    A guide to SHOW INNODB STATUS output

    The Google patch adds average time per IO request to the FILE I/O section. Average time per read is high so using multiple read threads should help. IO times for the log and write threads are low.
    I/O thread 1 state: waiting for i/o request (log thread) reads 0 writes 254629 requests 254629 io secs 2.4 io msecs/request 0.01 max_io_wait 35.97
    I/O thread 2 state: waiting for i/o request (read thread) reads 25670511 writes 0 requests 2887087 io secs 24374 io msecs/request 8.44 max_io_wait 543.92
    I/O thread 6 state: waiting for i/o request (write thread) reads 0 writes 6765463 requests 4864961 io secs 4357 io msecs/request 0.89 max_io_wait 2310.88
    The FILE I/O section also reports on the total number of reads and writes done and the average over the last N seconds:
    39013958 OS file reads, 40682454 OS file writes, 5268691 OS fsyncs
    0.00 reads/s, 0 avg bytes/read, 13.33 writes/s, 1.33 fsyncs/s
    The insert buffer section reports on the size in pages (size 1), the number of entries that have been written back to a secondary index (2549108 merged recs) and the number of reads that have been done to merge entries (1653924 merges).
    INSERT BUFFER AND ADAPTIVE HASH INDEX
    -------------------------------------
    Ibuf: size 1, free list len 10115, seg size 10117,
    2549116 inserts, 2549108 merged recs, 1653924 merges
    The buffer pool section lists the number of dirty pages, the total number of pages read and written and the rate of page reads and writes over the last N seconds.
    BUFFER POOL AND MEMORY
    ----------------------
    Buffer pool size   524288
    Free buffers       1
    Database pages     509011
    Modified db pages  873
    Pages read 91731122, created 1310762, written 25496758
    0.00 reads/s, 0.33 creates/s, 0.00 writes/s
    The transactions section lists the status of the purge thread that removes deleted rows.
    TRANSACTIONS
    ------------
    Trx id counter 0 344120133
    Purge done for trx's n:o < 0 344093914 undo n:o < 0 0
    Background IO architecture

    The following is done once per second from srv_master_thread():
    • flush transaction log to disk
      • log_buffer_flush_to_disk()
    • reserve space in in-memory log buffer
      • log_free_check() -> log_check_margins() -> log_flush_margin()
    • flush dirty pages that are too old from the buffer pool
      • log_free_check() -> log_check_margins() -> log_checkpoint_margin()
      • there might be no pages that are too old
      • depending on how old the dirty pages are, the page flush might be done synchronously. This code is also run directly by users' connections, so we want the background thread to make sure that no pages are too old rather than have a user's connection wait on page flushes.
    • merge insert buffer entries for up to 5 index pages
      • ibuf_contract_for_n_pages()
      • this is done using synchronous read requests and only when the server has not done much IO
    • flush up to 100 dirty pages if there are too many
      • buf_flush_batch()
      • when this is done the next iteration of the 1-second loop starts immediately rather than sleeping for 1 second
    The following is done once every 10 seconds:
    • flush up to 100 dirty pages if there are too many
      • buf_flush_batch()
      • this is only done if the server has not done many IOs recently
    • merge insert buffer entries for up to 5 index pages
      • ibuf_contract_for_n_pages()
      • this is done using synchronous IO
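    Here is a simplified sketch in C of that loop. The stub names mirror the functions listed above, but the signatures and everything else are placeholders -- this only shows the shape of the 1-second and 10-second work, not the real srv_master_thread().
    #include <unistd.h>

    /* Stand-ins for the InnoDB functions named above. */
    void log_buffer_flush_to_disk_stub(void) {}
    void log_free_check_stub(void) {}
    void ibuf_contract_for_n_pages_stub(int n_pages) { (void) n_pages; }
    int  buf_flush_batch_stub(int max_pages) { (void) max_pages; return 0; }
    int  server_is_io_busy(void) { return 0; }      /* placeholder heuristic */
    int  too_many_dirty_pages(void) { return 0; }   /* placeholder heuristic */

    void srv_master_thread_sketch(void) {
        for (;;) {
            int i;
            for (i = 0; i < 10; i++) {                 /* ten 1-second iterations */
                int n_flushed = 0;
                log_buffer_flush_to_disk_stub();       /* flush the transaction log */
                log_free_check_stub();                 /* may flush dirty pages that are too old */
                if (!server_is_io_busy())
                    ibuf_contract_for_n_pages_stub(5); /* merge up to 5 insert buffer pages */
                if (too_many_dirty_pages())
                    n_flushed = buf_flush_batch_stub(100); /* flush up to 100 dirty pages */
                if (n_flushed == 0)
                    sleep(1);                          /* otherwise the next iteration starts now */
            }
            if (!server_is_io_busy())                  /* the 10-second work */
                buf_flush_batch_stub(100);
            ibuf_contract_for_n_pages_stub(5);
        }
    }

    int main(void) {
        srv_master_thread_sketch();
        return 0;
    }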
    Room for improvement - ready now

    • adaptive checkpointing - this is available from Percona. I can guess at what it does but I have yet to look at the code.
    • innodb_io_capacity - this is in patches from Google and Percona. Background IO in Innodb has rate limits based on the assumption that the server can do 100 IOPs. This option changes the assumed maximum to the value set in a my.cnf parameter. For example, it will change the number of page reads done for insert buffer merges from 5 per second to 5% of innodb_io_capacity.
    • innodb_read_io_threads, innodb_write_io_threads - this is in patches from Google and Percona. It supports multiple background IO threads for prefetch reads and dirty page writes.
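    A rough illustration of what the innodb_io_capacity patch changes: background IO budgets that were hard-coded for a 100 IOPs server become percentages of innodb_io_capacity. The 5% figure for insert buffer merges comes from the description above; treating the flush batch as 100% of innodb_io_capacity is my assumption about the patch, not something I have verified in the code.
    #include <stdio.h>

    int main(void) {
        unsigned long io_capacity = 1000;  /* my.cnf: innodb_io_capacity */

        unsigned long ibuf_merge_pages  = io_capacity * 5 / 100;    /* was hard-coded to 5 */
        unsigned long flush_batch_pages = io_capacity * 100 / 100;  /* was hard-coded to 100 (assumption) */

        printf("insert buffer pages merged per pass: %lu\n", ibuf_merge_pages);
        printf("dirty pages flushed per batch: %lu\n", flush_batch_pages);
        return 0;
    }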
    Room for improvement - work in progress

    I made many other changes to boost the IOPs rate on a busy server. I need more time to evaluate them.
    • support more background IO requests - InnoDB supports 256 pending requests per background thread type. That is, there can be at most 256 pending prefetch read requests and 256 pending write requests. When innodb_io_capacity is set to a value much larger than 100, it is likely that at certain times more than 256 requests will be submitted. This is much more likely if other changes described here are also made. When the max number of pending requests has been submitted, other requests for async IO will block.
    • limit async background IO requests - the main background thread is not supposed to block for long on any of the tasks that it must do. It uses async IO requests to flush the buffer cache and can use them to merge insert buffer entries. It should stop submitting requests when the limit for the background read or write threads has been reached. Otherwise, it will block waiting for an available IO request slot.
    • merge insert buffer entries more frequently - InnoDB merges entries for up to 5 index pages per second but only when the server is not doing much IO. Otherwise, entries are merged for up to 5 index pages once per 10 seconds. With the innodb_io_capacity patch, the number of index pages merged is 0.05 * innodb_io_capacity rather than 5, but that is still too few. When the insert buffer is full (uses half of the buffer cache memory) then inserts are no longer done to it, but it will still be flushed slowly. This can kill performance. The server must do much more IO to avoid this state. I am evaluating these changes (a sketch of the first one follows this list):
      • change the 1-second background IO loop to merge entries for up to 0.5 * innodb_io_capacity pages when the insert buffer uses more than 25% of the buffer cache memory
      • make a similar change for the 10-second background IO loop
      • use async read IO requests handled by the read prefetch threads rather than sync read IO requests handled by the main background thread
    • flush neighbor pages less frequently - in many cases when Innodb flushes a page it also flushes all other dirty pages in the same 64-page extent. While this can reduce random IO on writes it also may increase the number of dirty pages that are flushed and it will take much longer to flush all of the old dirty pages since so many not-so-old pages will be flushed at the same time.
    • change buf_LRU_buf_pool_running_out - this function assumes that the buffer pool is running out of free pages when about 90% of the pages are dirty. If you have configured your server with innodb_max_dirty_pages_pct=90 then this will frequently be true. The workaround is to use a smaller value for innodb_max_dirty_pages_pct or to change the code.
    • avoid blocking a user's connection - a connection may block near the start of a transaction in log_checkpoint_margin(). It might block waiting for old dirty pages to be flushed to disk. This can be avoided when the main background thread is able to flush dirty pages as fast as needed. And that is possible when more background IO threads are used (innodb_write_io_threads), innodb_io_capacity is set to an appropriate value and the background IO threads each support enough IO requests for the value of innodb_io_capacity.
    • make IO request scheduling faster - the algorithm for IO request scheduling used by background IO threads is O(N*N) where N is the number of IO slots. The number of IO slots is 256 today, but might need to be increased to the value of innodb_io_capacity. O(N*N) is a bit more of a problem for N >= 1000 than it is for N = 256. It is probably a good idea to fix this.
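    This is a sketch of the insert buffer merge heuristic I am evaluating (the "merge insert buffer entries more frequently" item above). The names are placeholders, not the names used in the InnoDB source, and the thresholds are the ones proposed above.
    #include <stdio.h>

    /* insert buffer pages to merge in one pass of the 1-second loop */
    unsigned long ibuf_pages_to_merge(unsigned long ibuf_pages,
                                      unsigned long buf_pool_pages,
                                      unsigned long io_capacity) {
        unsigned long n = io_capacity * 5 / 100;  /* default with the io_capacity patch */

        /* proposed: merge much more aggressively when the insert buffer
           uses more than 25% of the buffer pool */
        if (ibuf_pages * 4 > buf_pool_pages)
            n = io_capacity / 2;
        return n;
    }

    int main(void) {
        /* example: 12GB buffer pool = 786432 16kb pages, insert buffer at 250,000 pages */
        printf("pages to merge this second: %lu\n",
               ibuf_pages_to_merge(250000, 786432, 1000));
        return 0;
    }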

  2. I ran the insert benchmark using InnoDB. The rate of rows inserted per second degraded from 20,000 to 2,000 during the test. I don't think another storage engine would do much more than 20,000 rows per second on that workload and similar hardware, as much of the overhead is above the storage engine. But InnoDB degrades by 10X over time. A storage engine whose performance did not degrade would end up 10X faster than InnoDB.

    Would 10X faster be enough to get you to switch? Do you need something that is faster for insert-only workloads or must it also be faster for updates and deletes?

    Papers have been written that describe both the theory and practice of building systems that support high rates of inserts, updates and deletes. Some of this technology may eventually appear in a MySQL storage engine:
    • Log-Structured Merge-Tree provides a framework for evaluating performance.  
    • Bigtable describes a similar approach for avoiding random IO during updates. 
    • ROSE describes how to combine these techniques with compression for modern CPUs (check out the authors)
    • Graefe describes improvements that can be made for b-tree indexes

  3. There may be a simple way to improve InnoDB performance on SMP servers. The benefit is not as great as that obtained by using the smpfix from Google or Percona, but the change is much simpler. It is also portable so it makes MySQL faster on non-x86 platforms.

    I want to hear from people who test this with real workloads on servers with 8+ cores or who do any type of testing on platforms other than Linux/x86. The patch for MySQL 5.0 is at code.google.com.

    The change is to replace the mutex_t used in the InnoDB rw_lock_struct with a pthread_mutex_t. Calls to lock, unlock, create and destroy rw_lock_struct::mutex in sync0rw.c must also be updated.

    InnoDB implements a mutex (mutex_t) and a read-write lock (rw_lock_struct). Both of these spin when a lock cannot be granted. On my platforms, the code spins for about 4 microseconds and then the thread waits on a condition variable. rw_lock_struct uses mutex_t to protect its internal state. I think that InnoDB is faster on SMP when pthread_mutex_t is used in place of mutex_t for rw_lock_struct::mutex. The following describes the overhead from the use of the InnoDB mutex when there is contention. A thread that must sleep waiting for a lock does:
    • spin for a few microseconds trying to get the lock
    • reserve a slot in the sync array (one lock/unlock of the global sync array pthread_mutex_t)
    • reset an event (lock/unlock the event pthread_mutex_t)
    • wait on the event (lock/unlock the global sync array pthread_mutex_t, lock the event pthread_mutex_t, wait on a pthread_cond_t)
    There are 4 pthread_mutex_lock calls and 3 pthread_mutex_unlock calls on this codepath and 2 of the lock calls are for a global mutex which can be another source of mutex contention. All of this can be replaced with the pair pthread_mutex_lock/pthread_mutex_unlock when rw_lock_struct::mutex is changed to use a pthread mutex.
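    This sketch shows the two codepaths side by side. The struct and the helpers are simplified placeholders for the InnoDB sync array and event code, not the real implementation; the point is only to count the pthread calls needed before the thread goes to sleep.
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t mutex;   /* protects the event state */
        pthread_cond_t  cond;
        int             is_set;
    } os_event_sketch_t;

    pthread_mutex_t sync_array_mutex = PTHREAD_MUTEX_INITIALIZER;  /* the global sync array mutex */

    /* InnoDB-style path after the spin fails: 4 pthread_mutex_lock and
       3 pthread_mutex_unlock calls before the pthread_cond_wait, and 2 of
       the lock calls are on the global sync array mutex. */
    void innodb_style_wait(os_event_sketch_t *event) {
        pthread_mutex_lock(&sync_array_mutex);    /* reserve a slot in the sync array */
        pthread_mutex_unlock(&sync_array_mutex);

        pthread_mutex_lock(&event->mutex);        /* reset the event */
        event->is_set = 0;
        pthread_mutex_unlock(&event->mutex);

        pthread_mutex_lock(&sync_array_mutex);    /* register the wait */
        pthread_mutex_unlock(&sync_array_mutex);

        pthread_mutex_lock(&event->mutex);        /* wait on the event */
        while (!event->is_set)
            pthread_cond_wait(&event->cond, &event->mutex);
        pthread_mutex_unlock(&event->mutex);
    }

    /* pthread path: one lock and one unlock; the kernel and libc handle
       spinning, queuing and waking. */
    void pthread_style_wait(pthread_mutex_t *m) {
        pthread_mutex_lock(m);
        /* critical section */
        pthread_mutex_unlock(m);
    }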

    Of course, you shouldn't take my word for it so I will provide a few results. These were measured on an 8-core x86 server that used Linux 2.6. Three mysqld binaries were tested:
    • base - MySQL 5.0.37 and the Google patch excluding the smpfix changes
    • smpfix+tcmalloc - MySQL 5.0.37 and the Google patch including the smpfix changes and linked with tcmalloc
    • pthread_mutex - base with rw_lock_struct::mutex changed to use pthread_mutex_t
    Results for sysbench --test=oltp --oltp-read-only. This displays transactions per second for sysbench run with 1, 2, 4, 8, 16, 32 and 64 concurrent users.
     


    Results for sysbench --test=oltp --oltp-read-write. This displays transactions per second for sysbench run with 1, 2, 4, 8, 16, 32 and 64 concurrent connections.
    Results for concurrent queries. Each query is a primary key - foreign key join between tables that each have 2M rows. Too long means it ran for tens of minutes and I killed it. This displays the time in seconds to complete the query for 1, 2, 4, 8 and 16 concurrent users.

    Binary           1 user  2 users  4 users  8 users  16 users
    base                2.6      3.9      8.1    182.5  Too long
    smpfix+tcmalloc     2.6      3.7      4.9      7.6      15.2
    pthread_mutex       2.5      3.7      9.1     27.8      58.6

    Results for concurrent inserts. Each user does a sequence of insert statements to a different table. Too long means it ran for tens of minutes and I killed it. This displays the time in seconds to complete the inserts for 1, 2, 4, 8 and 16 concurrent users.

    Binary           1 user  2 users  4 users  8 users  16 users
    base               15.5     32.4     78.2  Too long  Too long
    smpfix+tcmalloc    12.6     21.5     40.5     112.4     232.9
    pthread_mutex      13.5     23.8     76.0     378.7  Too long

  4. I am in the process of upgrading from MySQL 5.0.37 to 5.0.67. Many bugs were fixed between the two versions, so this should be a good thing. I think that 5.0.67 will be better than 5.0.37 but how do I judge that?

    I can look at the lists of open bugs. Open bugs are not a problem for 5.0.37 as we know what they are and have figured out how to avoid them. Known and unknown bugs may be a problem for 5.0.67 (or for 5.1), so we need a framework to evaluate them. Lists of open bugs help me determine what features I should not use. Deciding what I can use requires more work.

    It would help to see aggregated metrics on bug counts and support tickets per version and per feature. Giuseppe provided some insight into the 5.1 release but I realize this data might not be published.

    I get some email from MySQL/Sun telling me to upgrade to 5.1 now and use events, partitioning and RBR. I am not sure if I can use that as the basis for an upgrade decision. As long as the email doesn't become a deluge, this is not a problem. It would be bad if I had to setup spam filters for things related to MySQL.

    Otherwise, a lot of testing and monitoring is done to determine whether an upgrade can be done. This includes:
    • crash rate monitoring
    • testing queries for correctness and performance
    • performance testing with benchmarks
    • additional QA testing
    • data drift testing
    • long-term monitoring
    Crash rate monitoring determines the rate at which servers crash. If you never restart a server on purpose, then you can monitor for crashes by watching the uptime of mysqld. Otherwise, you need to grep the database error log. I suspect that this is not done as much as it should be. Something more substantial than mysqld_safe for running mysqld would help here. mysqld_safe does its best to keep mysqld running, including restarting it after a crash, but it does not track or report restarts.

    Testing should be done to determine that query performance does not degrade during an upgrade and that query results are correct. During an upgrade from MySQL 4.0 to 5.0, many queries had parse errors because the parser stopped parsing ambiguous join conditions. Also, numeric literals changed from type Double to Decimal and that changed the results for some queries. We use a tool named the Query Tester to test for these problems. It samples queries from production using SHOW PROCESSLIST and then runs the query on two servers (one running the old version of MySQL and one running the new version). The results are compared for correctness and performance. This tool has made it much easier to evaluate upgrades and has found many (but unfortunately not all) problems.

    I also run performance benchmarks to look for obvious performance problems. One of the tests is the time required to reload a database using the output from mysqldump. This found a major performance regression in the parser for MySQL 5.0.37 that has since been fixed. I also run sysbench and a few custom benchmarks. Eventually I will run dbt-2.

    We depend on the regression tests provided by MySQL. It is great that 5.1 has many more of these. We have also written many stress tests using a separate framework. Much of our focus has been on replication. Some stress tests can't be done for replication because slave replication state is not crash proof. But it is getting closer. We have a few tests that crash the slave at random places to confirm that it does the right thing during recovery. Many of the tests are run with valgrind. This has found several problems in the Google patch and a few minor problems elsewhere.

    Eventually, the new binary is used on a few slave servers to confirm that it is stable. At that point queries are done to confirm that the data on the slave matches the data on the master. During the upgrade from MySQL 4.0 to 5.0 this was not easy to guarantee. Most of the problems were from the change for numeric literals from type Double to Decimal. A few WHERE clauses had to be rewritten to compensate for this and it took a lot of work to identify the statements in the replication log that caused the problem. This is one reason to use statement based replication or change row-based replication to include the original SQL statement. I am not sure how we could have identified the problem using only the output from RBR.

    Once the new binary is used in production, long-term monitoring of the server can begin to confirm that it doesn't misbehave. You need to include as many variables as possible so that you can debug problems after the fact. It is good to know that there are no memory leaks and that query response time doesn't degrade significantly. A likely problem is that query performance will change significantly for a few queries. Depending on the complexity of your applications, you might not be able to spot this before using the new binary in production. If you have SHOW USER_STATISTICS you can monitor performance per account and determine which accounts have this problem. If you do query profiling by sampling queries from SHOW PROCESSLIST or use the Query Analyzer, then you may be able to find the problem queries.

    Another form of monitoring that should be done is searching the database error log for new error messages. I am not aware of good tools for this. A combination of ssh, Python and Perl can solve it.

    One thing that is missing is error monitoring. MySQL does not count and export errors via SHOW STATUS. I want to know the number of errors that have occurred per account and globally. I think this can be done in a future SHOW USER_STATISTICS patch. Not all errors indicate a problem in MySQL, but I still want to monitor this. There may be applications that have a high rate of lock wait timeouts or deadlocks.

  5. Stop reading now if you don't like handwaving.

    I have taken the Tokutek challenge and run iibench. This test matches behavior that occasionally matters to me: reloading all tables of a database via INSERT statements, and ALTER TABLE, which requires a copy of the table to be made and all indexes to be rebuilt. The iibench workload has one table with a primary key index and two secondary indexes. 1B rows are inserted in primary key order. The other indexed columns have random values.

    How fast should this run on InnoDB?

    This analysis assumes that the table is large enough so that all data does not fit in the buffer cache.  On the typical DBMS that updates rows in place and implements a NO FORCE and STEAL policy for the buffer manager, each insert is likely to require 1 read and 1 write per index. The read is done for the leaf block that must be updated and the write is done to flush a dirty page from the buffer cache to get space for the read block.

    The primary key leaf blocks should not require reads as the inserts are done in primary key order. The writes for the row should be amortized over many inserts for both index-organized and heap tables.

    Each insert therefore needs 4 IOs once the table is much larger than the buffer cache: a read and a write for each of the two secondary indexes. If I run with a server that can do 1000 IOPs, then I expect to insert 250 rows/second.

    Alas, Heikki is clever. InnoDB has an insert buffer. IO for secondary index maintenance after UPDATE and INSERT statements is deferred when leaf blocks are not in the buffer cache. Instead, a change record is written to the insert buffer in (index-id, key) order. This clusters changes for the same leaf blocks. Changes are applied from the insert buffer by a background thread.

    On real servers that I watch in production this provides a 4:1 or 8:1 reduction in IO for secondary index maintenance. But what impact does it have on iibench? If the insert buffer reduces IO by 8:1, then the row insert rate should be 8X faster -- 2000 rows/second rather than 250.

    The insert buffer works when it is large enough to buffer multiple changes to an index leaf block. We can construct a scenario where the insert buffer does not help, but it works for the iibench case. After 1B rows have been inserted into the iibench test table, the secondary indexes use ~6M pages and each index entry requires ~50 bytes. The insert buffer can use half of the InnoDB buffer pool which is several hundred thousand pages for a buffer pool that is close to 10GB. From output in SHOW INNODB STATUS, about 50% of the space allocated for the insert buffer was used for entries. The result of this is that the insert buffer can store ~50M entries when the buffer pool is near 10GB for the iibench test. When there are 50M insert buffer entries for 6M index pages, there will be multiple entries per leaf page so the insert buffer should help performance. I want to use (# insert buffer entries / # index pages) as the estimate for the IO reduction rate but tests I have run suggest that there are other factors.
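    To make the handwaving explicit, here is the arithmetic in one place. Every number is an estimate taken from above; nothing here is measured.
    #include <stdio.h>

    int main(void) {
        double iops = 1000.0;              /* what the server can do */
        double secondary_indexes = 2.0;
        double ios_per_index = 2.0;        /* 1 leaf read + 1 dirty page write */
        double ios_per_insert = secondary_indexes * ios_per_index;   /* = 4 */

        printf("without the insert buffer: %.0f rows/second\n", iops / ios_per_insert);

        double ibuf_reduction = 8.0;       /* 4:1 to 8:1 observed in production */
        printf("with an %.0f:1 insert buffer reduction: %.0f rows/second\n",
               ibuf_reduction, iops / ios_per_insert * ibuf_reduction);

        /* insert buffer capacity estimate for this test */
        double ibuf_bytes = 5e9;           /* ~half of a ~10GB buffer pool */
        double used_fraction = 0.5;        /* ~50% of the insert buffer space holds entries */
        double entry_bytes = 50.0;
        double entries = ibuf_bytes * used_fraction / entry_bytes;   /* ~50M */
        double leaf_pages = 6e6;           /* secondary index pages after 1B rows */
        printf("insert buffer entries: ~%.0fM, entries per leaf page: ~%.1f\n",
               entries / 1e6, entries / leaf_pages);
        return 0;
    }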

    The test ran for 382431 seconds (6373 minutes or 106 hours and 14 minutes or 4.42 days). The results are close to what my handwaving suggests can be achieved. The row insert rate near the end of this test was ~1800/second. The results are from a server with:
    • 8 CPU cores
    • 16GB RAM
    • 10 disks using SW RAID 0
    • InnoDB with the patch to use multiple background IO threads
    • innodb_read_io_threads=4, innodb_write_io_threads=4 (a Google patch)
    • innodb_log_files_in_group=3, innodb_log_file_size=1300M
    • innodb_flush_log_at_trx_commit=2
    • innodb_io_capacity (another Google patch)
    • innodb_doublewrite=0
    • innodb_buffer_pool_size=12G
    • innodb_max_dirty_pages_pct=90


  6. We have begun to use MySQL 5.1 and one of my colleagues, Yuan Wei, will describe the project at the MySQL User Conference. By we I mean my place of work, but I don't speak for them so this isn't a press release (although they do provide a great blog service).

    The project maintains a persistent log of row changes. We modified MySQL to generate a log that is similar to the row-based replication log except the output is much easier to parse and is text-based. Another tool parses this log on a slave and appends a protocol buffer to a file in GFS for each entry in the log. This is guaranteed to not lose data. The file in GFS can be used by applications that want to be notified when particular rows have changed. It can also be used to maintain a copy of the MySQL database in Bigtable or the open-source equivalent. There is little delay from the time the row is modified on a master to the time the protocol buffer is written to a file in GFS.

    You should go to Yuan's talk to get all of the details. The useful parts of this will be published.

    This was based on MySQL 5.1 so it could reuse and extend support for log tables that is new in 5.1.  I also want to start using 5.1 in some form to get experience with it -- I am interested in the InnoDB plugin and partitioning. And as much as I claim that I don't want new features, I want foreign keys. While that won't be in 5.1, it might be easier to backport it to 5.1 than to 5.0.

  7. MySQL 5.1 is GA and according to Monty it is a better 5.0 than 5.0. This is a strong endorsement as he has high standards. How will the external development community extend it?
    • make partitioning support multiple engine types in one table (compressed MyISAM and InnoDB)
    • make row-based replication binlog events optionally include the original SQL statement
    • make the --base64-output=DECODE-ROWS option for mysqlbinlog include column names
    What else can we do?

    I have done some QA and performance testing with it and it was solid. I look forward to using the InnoDB plugin with it. Maria and Falcon would get more early users if they were to release plugins for 5.1. That would also validate the handler plugin interface: it is truly pluggable when a new storage engine does not require a new MySQL release.

  8. Why did malloc fail in InnoDB for me? The error message claims that ~14.2 GB would have been allocated by InnoDB, but innodb_buffer_pool_size=12GB and the server has 16 GB of RAM. Once I resolved the problem at startup (a large temporary allocation is done during crash recovery), the size of the mysqld virtual address space is ~13.7 GB.

    What is the overhead for innodb_buffer_pool_size in InnoDB? By overhead I mean the amount of memory used in Innodb that is determined by the size of the buffer pool. I ignore memory allocations that are determined by other my.cnf variables.

    The overhead after startup is ~8% of innodb_buffer_pool_size or 1462 bytes per 16 KB page on a 64-bit Linux 2.6 server. The overhead at startup is ~21% or 3537 bytes per page because of a temporary large allocation used for crash recovery.  The overhead for early 5.0 versions (<= 5.0.30 I think) is described by Percona. Domas has also written about this.  And Peter Z got this right in his post on basic optimization, but I wanted to provide a few more details.

    This lists the large allocations that depend on the size of the InnoDB buffer pool. They were measured when innodb_buffer_pool_size=12GB. The overhead excluding crash recovery code that runs at startup is 1462 bytes per 16kb page.
    • 12 GB - buf_pool->frame_mem - memory to cache disk pages
    • 708 MB - (944 bytes / 16kb page) - buf_pool->blocks - one instance of buf_block_t for each cached disk page. The size of the struct is 560 bytes and 3 os_event_struct (120 bytes each) instances are allocated separately for each buf_block_t. Assuming the 120 byte allocation is rounded up to 128 bytes, then the overhead for this is (560 + 3*128) == 944 bytes.
    • 6 MB - (8 bytes / 16kb page) - buf_pool->blocks_of_frames
    • 12 MB - (16 bytes / 16kb page) - buf_pool->page_hash
    • 195 MB - (259 bytes / 16kb page) - btr_search_sys->hash_index
    • 30 MB - (40 bytes / 16kb page) - lock_sys->rec_hash
    • 146 MB - (195 bytes / 16kb page) - dict_init
    • 1556 MB - (2075 bytes / 16kb page) - recv_sys_init - this is a temporary allocation used at startup and then deallocated.
    With an 8% overhead, the limit on a 64 GB server for innodb_buffer_pool_size is 59.25 GB. You should not use a value larger than this. You can exceed this limit if you have swap space, but that is likely to hurt performance and result in excessive page-in and page-out activity.

    This limit assumes that no memory is used by the rest of MySQL. You should use less than the limit to save space for the process image, thread stacks, sort areas, in-memory tables used for group by and order by processing, the MyISAM key cache, the OS buffer cache for the binlog and the OS buffer cache for MyISAM data blocks. Don't forget that InnoDB will do a large (0.12 * innodb_buffer_pool_size) temporary allocation at startup to run crash recovery.
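    Here is that sizing rule spelled out. The 8% steady-state overhead and the ~12% temporary crash recovery allocation are the figures measured above; the program just applies them.
    #include <stdio.h>

    int main(void) {
        double ram_gb   = 64.0;
        double overhead = 0.08;   /* steady-state overhead, ~8% of innodb_buffer_pool_size */
        double recovery = 0.12;   /* temporary allocation at startup for crash recovery */

        /* largest pool where pool + steady-state overhead fits in RAM */
        printf("limit for innodb_buffer_pool_size: %.2f GB\n",
               ram_gb / (1.0 + overhead));

        /* a more conservative limit that also leaves room for crash recovery */
        printf("limit if crash recovery must also fit in RAM: %.2f GB\n",
               ram_gb / (1.0 + overhead + recovery));
        return 0;
    }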

    The per buffer pool page overhead is high. Can it be lower? There is one buf_block_t instance per page in the buffer pool and:
    • each buf_block_t has an InnoDB mutex_t and an InnoDB rw_lock_t. 
    • each rw_lock_t has a mutex_t. 
    • each rw_lock_t and mutex_t have an os_event_t (condition variable).
    • each mutex_t has a pthread_mutex_t. 
    So the overhead per buf_block_t is 1 rw_lock_t, 2 mutex_t and 3 os_event_t. And the overhead in terms of pthread objects is 5 pthread_mutex_t and 3 pthread_cond_t.
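    This sketch shows how those objects nest. The structs are simplified placeholders -- the real definitions have many more fields -- but the counts match the description above.
    #include <pthread.h>

    typedef struct {                  /* roughly InnoDB's os_event_struct */
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        int             is_set;
    } os_event_sketch_t;

    typedef struct {                  /* roughly InnoDB's mutex_t */
        volatile int        lock_word;
        pthread_mutex_t     os_mutex;
        os_event_sketch_t  *event;    /* allocated separately */
    } ib_mutex_sketch_t;

    typedef struct {                  /* roughly InnoDB's rw_lock_struct */
        ib_mutex_sketch_t   mutex;    /* protects the rw-lock state */
        os_event_sketch_t  *event;    /* allocated separately */
    } ib_rwlock_sketch_t;

    typedef struct {                  /* one per page in the buffer pool */
        ib_mutex_sketch_t   mutex;
        ib_rwlock_sketch_t  lock;
        /* frame pointer, hash and list links, etc. omitted */
    } buf_block_sketch_t;

    /* per page: 2 mutex_t, 1 rw_lock_t and 3 os_event_t, which adds up to
       5 pthread_mutex_t and 3 pthread_cond_t */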

    The overhead can be reduced to 2% by using 64kb pages instead of 16kb pages. This might be a good thing to do based on the 5 minute rule. You have to change UNIV_PAGE_SIZE and UNIV_PAGE_SIZE_SHIFT, build MySQL and then run a lot of tests to get this. To get mysql-test to pass you need to increase the default values for innodb_log_file_size and innodb_buffer_pool_size -- I multiplied them by 4, and you need to fix an assertion in srv0start.

    The size of buf_block_t would be smaller if the fields used to wake and sleep threads were moved from buf_block_t into the thread control block and other changes were made to the synchronization code that I will not describe here. InnoDB doesn't really have a thread control block. For threads handling user connections, trx_struct can be used. This saves about 1/3 of the overhead. buf_block_t can be at least 56 bytes smaller by using integer rather than long integer for some fields and removing at least one pointer from non-debug builds.

    InnoDB synchronization could be simplified to use one pthread_cond_t per mutex_struct and one per rw_lock_struct in place of os_event_t. This would save ~200 bytes per page.

    rw_lock_struct and mutex_struct use long integers for some fields for which integers are sufficient. This would save 16 bytes for each mutex_struct and 32 bytes for each rw_lock_struct, which is 64 bytes per buf_block_t (two mutexes and one rw-lock per page).

    The InnoDB adaptive search index is ~20% of the overhead or 1.6% of innodb_buffer_pool_size.

    The InnoDB data dictionary is ~10% of the overhead or 0.8% of innodb_buffer_pool_size. I don't think the size of the InnoDB data dictionary needs to grow linearly with the size of the InnoDB buffer cache.

  9. I wrote this to remember what vmstat output means.

    I am running a benchmark with MySQL and Linux on a server with 16GB of RAM. I used innodb_buffer_pool_size=12G. The process size is 13.6GB. What else uses all of the memory? Domas investigated the problem and one of the causes is that the per buffer pool page overhead in InnoDB is high.

    But a 13.6GB process shouldn't be a problem on a 16GB box, should it? Well, the Linux VM tends to generate too many page-ins and page-outs in this case. Don, Kevin, Dathan, Peter, Peter, and Peter have written about this. In my case, the sum of page-ins and page-outs was more than 500 per second. This is reported in the si and so columns (449, 413 below). The sum of si and so is ~900 and assuming that means ~900kb in 4kb blocks, then there were more than 200 reads and writes per second for this activity (the arithmetic is spelled out after the sample output). I hope that does not translate to 200 IOPs of disk activity.
    procs -----------memory---------- ---swap-- -----io----
    r  b   swpd   free   buff  cache   si   so    bi    bo
     1  3 530064 128388  13024 2434336  449  413   636 17316
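    The arithmetic from the paragraph above, assuming si and so are reported in kb per second:
    #include <stdio.h>

    int main(void) {
        double si_kb = 449.0, so_kb = 413.0;   /* from the vmstat sample */
        double page_kb = 4.0;
        printf("swap traffic: %.0f kb/s, ~%.0f page-ins + page-outs per second\n",
               si_kb + so_kb, (si_kb + so_kb) / page_kb);
        return 0;
    }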
    The first problem is to understand the numbers. I wanted to understand the output of vmstat, vmstat -a and vmstat -sa. The Linux vmstat man page is not exactly clear.
       Memory
           swpd: the amount of virtual memory used.
           free: the amount of idle memory.
           buff: the amount of memory used as buffers.
           cache: the amount of memory used as cache.
           inact: the amount of inactive memory. (-a option)
           active: the amount of active memory. (-a option)
    Better and possibly accurate definitions for these are (sizes are in 1 kb blocks)
    • swpd - amount of swap space in use
    • free - unallocated physical memory
    • buff - size of cache used for filesystem metadata blocks
    • cache - size of cache for filesystem data blocks
    • inact - amount of memory the kernel considers to be inactive and eligible to be written to swap space or to the filesystem. Inactive blocks may be dirty.
    • active - see inact
    RedHat provides more details on this here, here and here to explain vmstat output. And a related post on tuning Oracle on RedHat Linux is here.
    # vmstat -sa
         16226644  total memory
         16100852  used memory
         13576424  active memory
          2340580  inactive memory
           125792  free memory
            36956  buffer memory
          2417564  swap cache
           530104  total swap
           420700  used swap
           109404  free swap
     What do these mean? All numbers are for 1k blocks by default.
    • total memory - physical memory
    • used memory - allocated physical memory, the free column in vmstat output is total memory - used memory
    • active memory - amount of memory the kernel considers to be in active use
    • inactive memory - see active memory, may include dirty pages
    • buffer memory - size of the cache for file-system metadata
    • swap cache - RedHat describes this as the read-write cache for swap but the value matches the cache column in vmstat output and I think this is the size of the page cache used for files.
    • total swap - total swap space
    • used swap - used swap space, this is the swpd column in vmstat output
    • free swap - unused swap space
    So how can I fix my problem?
    • reduce the size of the InnoDB buffer cache. I know that there was no page-in and page-out activity when I ran the benchmark with innodb_buffer_pool_size=8G
    • tune Linux: echo 0 > /proc/sys/vm/swappiness. This eventually removed all page-in and page-out activity and increased performance by 50%.
    • use Solaris as some people have reported that this problem doesn't occur there
    • use O_DIRECT, but Kevin reported that writes still go through the page cache and this will eventually lead to page-out activity. For unmodified InnoDB there is a good reason to not use O_DIRECT as one thread and synchronous IO is used to write dirty pages. This will become a bottleneck on a busy server and isn't effective on a server with many disks and many CPU cores. This is not a performance problem when buffered IO is used as the writes usually complete fast to the OS buffer cache.

  10. There is a post about the use of SET commands that don't replicate to allow an InnoDB table to be created on a master and a MyISAM table to be created on slaves.

    The problem is:
    set storage_engine=innodb;
    create table foo (i int);
    If the default storage engine is MyISAM, then the master will create an InnoDB table and the slave will create a MyISAM table. That is fine. But what happens when you use the binlog and a backup to prepare a new master? The new master will create a MyISAM table and that can be a disaster if you don't catch it.

    I think this feature is a bug. When a new master is prepared from a backup and binlog files, it will have the wrong table type because of this feature. The binlog must reflect the state of the master. Things done to make slaves differ must be done on the slave. And please, make it explicit when binlog events are not generated for a statement that changes persistent database state via set sql_log_bin=0 or a no_replicate clause on the SQL statement.

    There are too many special cases today in replication that increase the TCO of a MySQL deployment.