1. Why did malloc fail in InnoDB for me? The error message claims that ~14.2 GB would have been allocated by InnoDB, but innodb_buffer_pool_size=12GB and the server has 16 GB of RAM. Once I resolved the startup problem (a large temporary allocation is done during crash recovery), the size of the mysqld virtual address space was ~13.7 GB.

    What is the overhead for innodb_buffer_pool_size in InnoDB? By overhead I mean the amount of memory used by InnoDB that is determined by the size of the buffer pool. I ignore memory allocations that are determined by other my.cnf variables.

    The overhead after startup is ~8% of innodb_buffer_pool_size, or 1462 bytes per 16 KB page, on a 64-bit Linux 2.6 server. The overhead at startup is ~21%, or 3537 bytes per page, because of a large temporary allocation used for crash recovery. The overhead for early 5.0 versions (<= 5.0.30 I think) is described by Percona. Domas has also written about this. And Peter Z got this right in his post on basic optimization, but I wanted to provide a few more details.

    This lists the large allocations whose size depends on the size of the InnoDB buffer pool. They were measured with innodb_buffer_pool_size=12GB. The overhead, excluding the crash recovery allocation done at startup, is 1462 bytes per 16kb page.
    • 12 GB - buf_pool->frame_mem - memory to cache disk pages
    • 708 MB - (944 bytes / 16kb page) - buf_pool->blocks - one instance of buf_block_t for each cached disk page. The size of the struct is 560 bytes and 3 os_event_struct (120 bytes each) instances are allocated separately for each buf_block_t. Assuming the 120 byte allocation is rounded up to 128 bytes, then the overhead for this is (560 + 3*128) == 944 bytes.
    • 6 MB - (8 bytes / 16kb page) - buf_pool->blocks_of_frames
    • 12 MB - (16 bytes / 16kb page) - buf_pool->page_hash
    • 195 MB - (259 bytes / 16kb page) - btr_search_sys->hash_index
    • 30 MB - (40 bytes / 16kb page) - lock_sys->rec_hash
    • 146 MB - (195 bytes / 16kb page) - dict_init
    • 1556 MB - (2075 bytes / 16kb page) - recv_sys_init - this is a temporary allocation used at startup and then deallocated.
    With an 8% overhead, the limit on a 64 GB server for innodb_buffer_pool_size is 59.25 GB. You should not use a value larger than this. You can exceed this limit if you have swap space, but that is likely to hurt performance and result in excessive page-in and page-out activity.

    This limit assumes that no memory is used by the rest of MySQL. You should use less than the limit to save space for the process image, thread stacks, sort areas, in-memory tables used for group by and order by processing, the MyISAM key cache, the OS buffer cache for the binlog and the OS buffer cache for MyISAM data blocks. Don't forget that InnoDB will do a large (0.12 * innodb_buffer_pool_size) temporary allocation at startup to run crash recovery.
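
    To make this arithmetic easy to redo, here is a minimal sketch in C that scales the per-page byte counts listed above to a 12 GB buffer pool and computes the 64 GB limit. The byte counts are the ones I measured for my build; they will differ for other versions, and nothing below comes from the InnoDB source.
    /* Scale the per-page overhead numbers from the list above to a 12 GB
       buffer pool.  The byte counts are measured values for one build and
       will differ across InnoDB versions. */
    #include <stdio.h>

    int main(void) {
        const double page_size  = 16.0 * 1024;                 /* 16 KB pages */
        const double pool_bytes = 12.0 * 1024 * 1024 * 1024;   /* 12 GB pool */
        const double pages      = pool_bytes / page_size;
        const double gb         = 1024.0 * 1024 * 1024;

        /* bytes per page, taken from the list above */
        const double steady   = 944 + 8 + 16 + 259 + 40 + 195; /* 1462 */
        const double recovery = 2075;                          /* startup only */

        printf("steady-state overhead: %.2f GB (%.0f bytes/page)\n",
               steady * pages / gb, steady);
        printf("pool + overhead:       %.2f GB\n",
               (pool_bytes + steady * pages) / gb);
        printf("pool + overhead at startup: %.2f GB\n",
               (pool_bytes + (steady + recovery) * pages) / gb);
        /* the limit quoted above for a 64 GB server, using the ~8% figure */
        printf("limit for a 64 GB server: %.2f GB\n", 64.0 / 1.08);
        return 0;
    }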

    The per buffer pool page overhead is high. Can it be lower? There is one buf_block_t instance per page in the buffer pool and:
    • each buf_block_t has an InnoDB mutex_t and an InnoDB rw_lock_t. 
    • each rw_lock_t has a mutex_t. 
    • each rw_lock_t and mutex_t have an os_event_t (condition variable).
    • each mutex_t has a pthread_mutex_t. 
    So the overhead per buf_block_t is 1 rw_lock_t, 2 mutex_t and 3 os_event_t. And the overhead in terms of pthread objects is 5 pthread_mutex_t and 3 pthread_cond_t.
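
    To make the counting concrete, here is a simplified sketch of how the objects nest. These are not the real InnoDB declarations: the struct and field names are illustrative, most fields are omitted, and the sketch just follows the accounting above.
    /* Illustrative only -- not the InnoDB declarations.  Shows which
       synchronization objects hang off each buffer pool block. */
    #include <pthread.h>
    #include <stdio.h>

    typedef struct {                    /* os_event_t: wakes waiting threads */
        pthread_mutex_t mutex;          /* 1 pthread_mutex_t */
        pthread_cond_t  cond;           /* 1 pthread_cond_t */
        int             is_set;
    } os_event_sketch_t;

    typedef struct {                    /* mutex_t */
        pthread_mutex_t    os_mutex;    /* "each mutex_t has a pthread_mutex_t" */
        volatile long      lock_word;
        os_event_sketch_t *event;       /* allocated separately */
    } mutex_sketch_t;

    typedef struct {                    /* rw_lock_t: has its own mutex_t */
        mutex_sketch_t     mutex;
        os_event_sketch_t *event;       /* allocated separately */
        long               reader_count;
    } rw_lock_sketch_t;

    typedef struct {                    /* buf_block_t: one per cached page */
        mutex_sketch_t   mutex;         /* mutex_t #1 -> os_event_t #1 */
        rw_lock_sketch_t lock;          /* mutex_t #2 -> os_event_t #2 and #3 */
        /* ... frame pointer, LRU links, hash links, ... */
    } buf_block_sketch_t;

    int main(void) {
        /* 2 mutex_t contribute 2 pthread_mutex_t; 3 os_event_t contribute
           3 pthread_mutex_t and 3 pthread_cond_t */
        printf("per block: %d pthread_mutex_t, %d pthread_cond_t\n", 2 + 3, 3);
        return 0;
    }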

    The overhead can be reduced to 2% by using 64kb pages instead of 16kb pages. This might be a good thing to do based on the 5 minute rule. You have to change UNIV_PAGE_SIZE and UNIV_PAGE_SIZE_SHIFT, build MySQL and then run a lot of tests. To get mysql-test to pass you need to increase the default values for innodb_log_file_size and innodb_buffer_pool_size (I multiplied them by 4) and fix an assertion in srv0start.
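
    For reference, both constants are defined in univ.i in the InnoDB include directory. The snippet below sketches the 64kb values, assuming the 16kb defaults are (2 * 8192) and 14; the file varies across versions, so check your source tree rather than treating this as a patch.
    /* univ.i -- the two constants to change for 64kb pages.  The defaults
       for 16kb pages are (2 * 8192) and 14 in the source I looked at. */
    #define UNIV_PAGE_SIZE        (8 * 8192)   /* must remain a power of 2 */
    #define UNIV_PAGE_SIZE_SHIFT  16           /* log2 of UNIV_PAGE_SIZE */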

    The size of buf_block_t would be smaller if the fields used to wake and sleep threads were moved from buf_block_t into the thread control block, along with other changes to the synchronization code that I will not describe here. InnoDB doesn't really have a thread control block. For threads handling user connections, trx_struct can be used. This saves about 1/3 of the overhead. buf_block_t can also be at least 56 bytes smaller by using integers rather than long integers for some fields and removing at least one pointer from non-debug builds.

    InnoDB synchronization could be simplified to use one pthread_cond_t per mutex_struct and one per rw_lock_struct in place of os_event_t. This would save ~200 bytes per page.

    rw_lock_struct and mutex_struct use long integers for some fields for which integers are sufficient. Fixing that would save 16 bytes per mutex_struct and 32 bytes per rw_lock_struct, or 64 bytes per buf_block_t (2 mutex_struct and 1 rw_lock_struct).
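
    A quick way to see the savings is to compare sizeof for a few fields declared as ulint (InnoDB's typedef for unsigned long, 8 bytes on 64-bit Linux) versus unsigned int. The field names below are made up and only stand in for the counters and flags in those structs.
    /* Illustrates long vs. int field sizes on 64-bit Linux.  The field
       names are invented; only the sizes matter. */
    #include <stdio.h>

    typedef unsigned long ulint;        /* 8 bytes on 64-bit Linux */

    struct counters_as_ulint {
        ulint waiters;
        ulint lock_word;
        ulint writer_count;
        ulint reader_count;
    };

    struct counters_as_uint {
        unsigned int waiters;
        unsigned int lock_word;
        unsigned int writer_count;
        unsigned int reader_count;
    };

    int main(void) {
        printf("4 fields as ulint:        %zu bytes\n", sizeof(struct counters_as_ulint));
        printf("4 fields as unsigned int: %zu bytes\n", sizeof(struct counters_as_uint));
        return 0;
    }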

    The InnoDB adaptive search index is ~20% of the overhead or 1.6% of innodb_buffer_pool_size.

    The InnoDB data dictionary is ~10% of the overhead or 0.8% of innodb_buffer_pool_size. I don't think the size of the InnoDB data dictionary needs to grow linearly with the size of the InnoDB buffer cache.

  2. I wrote this to remember what vmstat output means.

    I am running a benchmark with MySQL and Linux on a server with 16GB of RAM. I used innodb_buffer_pool_size=12G. The process size is 13.6GB. What else uses all of the memory? Domas investigated the problem and one of the causes is that the per-page overhead in the InnoDB buffer pool is high.

    But a 13.6GB process shouldn't be a problem on a 16GB box, should it? Well, the Linux VM tends to generate too many page-ins and page-outs in this case. Don, Kevin, Dathan, Peter, Peter, and Peter have written about this. In my case, the sum of page-ins and page-outs was more than 500 per second. This is reported in the si and so columns (449 and 413 below). The sum of si and so is ~900, and assuming that means ~900kb per second in 4kb blocks, there were more than 200 reads and writes per second for this activity. I hope that does not translate to 200 IOPS of disk activity.
    procs -----------memory---------- ---swap-- -----io----
    r  b   swpd   free   buff  cache   si   so    bi    bo
     1  3 530064 128388  13024 2434336  449  413   636 17316
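
    Spelling out the arithmetic: si and so report KB per second swapped in and out, and Linux swaps 4kb pages.
    /* Convert the si and so values from the vmstat sample above into pages
       per second, assuming 4kb pages. */
    #include <stdio.h>

    int main(void) {
        const double si_kb = 449, so_kb = 413;   /* KB/s from the sample above */
        const double page_kb = 4;
        printf("swap traffic: %.0f KB/s = %.0f pages/s\n",
               si_kb + so_kb, (si_kb + so_kb) / page_kb);
        return 0;
    }
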
    The first problem is to understand the numbers. I wanted to understand the output of vmstat, vmstat -a and vmstat -sa. The Linux vmstat man page is not exactly clear.
       Memory
           swpd: the amount of virtual memory used.
           free: the amount of idle memory.
           buff: the amount of memory used as buffers.
           cache: the amount of memory used as cache.
           inact: the amount of inactive memory. (-a option)
           active: the amount of active memory. (-a option)
    Better and possibly more accurate definitions for these are (sizes are in 1 kb blocks):
    • swpd - amount of swap space in use
    • free - unallocated physical memory
    • buff - size of cache used for filesystem metadata blocks
    • cache - size of cache for filesystem data blocks
    • inact - amount of memory the kernel considers to be inactive and eligible to be written to swap space or to the filesystem. Inactive blocks may be dirty.
    • active - see inact
    RedHat provides more details on this here, here and here to explain vmstat output. And a related post on tuning Oracle on RedHat Linux is here.
    # vmstat -sa
         16226644  total memory
         16100852  used memory
         13576424  active memory
          2340580  inactive memory
           125792  free memory
            36956  buffer memory
          2417564  swap cache
           530104  total swap
           420700  used swap
           109404  free swap
     What do these mean? All numbers are in 1 kb blocks by default.
    • total memory - physical memory
    • used memory - allocated physical memory, the free column in vmstat output is total memory - used memory
    • active memory - amount of memory the kernel considers to be in active use
    • inactive memory - see active memory, may include dirty pages
    • buffer memory - size of the cache for file-system metadata
    • swap cache - RedHat describes this as the read-write cache for swap but the value matches the cache column in vmstat output and I think this is the size of the page cache used for files.
    • total swap - total swap space
    • used swap - used swap space, this is the swpd column in vmstat output
    • free swap - unused swap space
    So how can I fix my problem?
    • reduce the size of the InnoDB buffer cache. I know that there was no page-in and page-out activity when I ran the benchmark with innodb_buffer_pool_size=8G
    • tune Linux: echo 0 > /proc/sys/vm/swappiness. This eventually removed all page-in and page-out activity and increased performance by 50%.
    • use Solaris as some people have reported that this problem doesn't occur there
    • use O_DIRECT, but Kevin reported that writes still go through the page cache and this will eventually lead to page-out activity. For unmodified InnoDB there is a good reason to not use O_DIRECT, as one thread doing synchronous IO is used to write dirty pages. That will become a bottleneck on a busy server and isn't effective on a server with many disks and many CPU cores. It is not a performance problem when buffered IO is used, as the writes usually complete quickly into the OS buffer cache.

  3. There is a post about the use of SET commands that don't replicate to allow an InnoDB table to be created on a master and a MyISAM table to be created on slaves.

    The problem is:
    set storage_engine=innodb;
    create table foo (i int);
    If the default storage engine is MyISAM, then the master will create an InnoDB table and the slave will create a MyISAM table. That is fine. But what happens when you use the binlog and a backup to prepare a new master? The new master will create a MyISAM table, and that can be a disaster if you don't catch it.

    I think this feature is a bug. When a new master is prepared from a backup and binlog files, it will have the wrong table type because of this feature. The binlog must reflect the state of the master. Things done to make slaves differ must be done on the slave. And please, make it explicit when binlog events are not generated for a statement that changes persistent database state, whether via set sql_log_bin=0 or a no_replicate clause on the SQL statement.

    There are too many special cases today in replication that increase the TCO of a MySQL deployment.

  4. It almost never crashes for me now, but you might not be able to make the same claim. Well, if you use MyISAM rather than InnoDB you might be able to make the claim, but in that case you really need it to not crash, as MyISAM might not recover to a transaction-consistent state. The fix is not in the official release. This is almost a repeat of a recent post, but one thing has changed. We deployed the fix for bug 32149 and eliminated the cause of the majority of our crashes from software bugs. We still get crashes on a daily basis, but hardware is now the cause.

    What can you do?
    • update bug 32149 and ask for a fix
    • file a support ticket and ask for a fix
    • include a request for a fix of this bug in the holiday card you send to Marten
    But whatever you do, don't hold up the release of GA for 5.1.

    This is a very simple fix and is available here and here.

  5. In the spirit of one of my favorite technology analysts, I will review a product without using it. In case you missed it, MySQL has released the Query Analyzer as part of MySQL Enterprise. It rocks. We have (or had) a tool that provides similar features and it has been great for keeping critical servers healthy. The tool generated reports of the top 10 queries over the past 1, 3 and 7 days to find the queries that used the most time. Occasionally, there would be a sudden change in the list of the top 10 queries, and that almost always was the result of an application bug. Most of the problems were from queries that were run too frequently rather than queries with bad plans.

    Our tool was at first a total hack: Bash scripts sampled SHOW PROCESSLIST, Python stripped literals from queries and then Awk scripts aggregated the results. I ran it from my desktop on demand. Yet this tool found many problems that resulted in a significant reduction in load on servers. It was eventually replaced with something much better, but the replacement still isn't as pretty as the Query Analyzer.

    I like that the Query Analyzer makes query analysis available to more users. I also like that it doesn't rely on SHOW PROCESSLIST sampling, as such sampling misses more short running queries than you might expect. It is good that it doesn't (or won't) require deployment of the MySQL Proxy to gather workload data, as that would limit its use:
    • Many of us can't get applications to switch the port they use so that they connect to the proxy instead of the server, at least not quickly. 
    • It takes a long time to qualify software for use and I have yet to do that for the Proxy. Any software deployed between clients and the db server is yet another point of failure and opportunity for misconfiguration.
    • How does the proxy impact performance? Is it still single-threaded? Does it invoke Lua for every query or query result? Where are benchmark results for dbt2 or sysbench for the Proxy?
    What can we expect next? It would be great if the Query Analyzer supported SHOW USER_STATISTICS and SHOW TABLE_STATISTICS output. But that can't happen until those commands are supported by the server. They are available in builds from Percona and OurDelta.

  6. Recently there has been more energy devoted to the synergy between MySQL and Sun. Some of this has been great. For example, the 5.1 release cycle has not been plagued by performance problems as 5.0 was and Sun just happens to have a talented team of performance engineers working on this. Some of this has been double plus ungood (blog posts that are nothing more than a reference to another blog post). And some of this has potential, but I wish there were more there. The technical marketing push for MySQL + Sun (ZFS, DTrace, Solaris) needs more investment on the technical side and less on the marketing side.

    Don't tell us, show us with realistic benchmark results and customer case studies. A post about using OpenSolaris at SmugMug was hugely popular. We need more of the same. The post was popular because the content was very technical.

    DTrace is great. But examples seem to be limited to producing information I can already get from MySQL. Show me what I can't get from community MySQL branches such as OurDelta and Percona that have user, index and table monitoring. Where is the repository of DTrace scripts for MySQL?

    I think that ZFS can make some MySQL installations run better and faster. The features I am most interested in are compression and the ability to disable the InnoDB doublewrite buffer. Copy-on-write filesystems make it more likely that compression does not kill performance, so ZFS compression can be useful in the real world. Write throughput on a busy InnoDB server should be much higher when the doublewrite buffer is disabled. When it is enabled, writes are done in two phases (write up to 128 16kb pages sequentially to the doublewrite buffer, sync, write the same pages in place using random IO, sync). But I have a few questions.

    • ZFS does not support O_DIRECT. Nor does it support the hints provided by directio. It supports O_DSYNC and O_SYNC so that synchronous writes can be done. When direct IO is used, filesystem prefetching on reads is not done and bad decisions about what to cache in the filesystem buffer cache are avoided. Linux has had problems for MySQL using InnoDB and a large buffer cache. Is Solaris smarter about managing the ZFS buffer cache? Are there other ways to limit filesystem read prefetching?
    • ZFS uses copy-on-write rather than update-in-place. This can fragment files. InnoDB generates prefetch requests with the assumption that logically adjacent blocks within a 1MB extent are physically adjacent. Most of the benchmarks I have seen for ZFS + InnoDB are on files that have not been aged and are unlikely to have fragmentation. Is this a problem? Are there tools to measure and fix fragmentation?
    • Group commit is broken for InnoDB when the binlog (replication log) is enabled. If the master is run with innodb_flush_log_at_trx_commit=1, then the InnoDB commit log is flushed to disk per commit, so the writes for the log will be smaller. What recordsize value should be used for the InnoDB transaction log?

    Who wants to go through the hassle of building MySQL on OpenSolaris? I have seen too many blog posts that describe too many problems.

    Is anyone running MySQL and Solaris in a public cloud? Sun's service isn't open to new users at this time. OpenSolaris can be used with Amazon EC2, but you have to register to get the OpenSolaris AMI. MySQL is extremely popular in private clouds and there is a huge opportunity to provide services and tools that make it easier to run MySQL in this setup.
