Monday, October 27, 2008

New home in Launchpad for my patch branches

My Launchpad branches for features from the big Google patch have moved. They are now in the Patch for MySQL project. Hopefully, this is of limited interest as the purpose of these branches is to break down the big Google patch into usable pieces and then hope that others incorporate those into the official and community MySQL projects.
I also committed a few changes (revisions 2695 - 2698 for 5.0, revisions 2750 - 2753 for 5.1) to fix the performance regression the the faster rw-mutex patch. The server is ~25% faster than unmodified MySQL with the fix for the regression test provided by Percona.

Saturday, October 25, 2008

Crashes from Innodb long semaphore waits

In the unlikely case that there was a mysqld crash on servers that I use, a common cause was a long semaphore wait in InnoDB. MySQL bug 32149 has been open for this since November, 2007. We fixed this in our local branch, so this is rarely a problem for us now. The fixes are available in branches of MySQL 5.0 and 5.1 from Launchpad.
Does anyone else get this crash? The symptoms can be found in the database error log:
  • it contains the text long semaphore wait
  • it has text similar to the output SHOW INNODB STATUS
  • connections wait too long for the adaptive hash latch (from btr0sea.c)
  • a connection holds the adaptive hash latch -- search for holds adaptive hash latch

Friday, October 24, 2008

Hey! You! Get off of my cloud!

Failures of public services generate a lot of publicity. Every EC2 blip generates too many blog posts, including this one, about how the cloud cannot be trusted to run business critical functions. It is a difficult debate to have because availability numbers for cloud services are easy to get (there are companies devoted to reporting this) but public numbers for internal deployments are rare. What is the typical uptime for a deployment of server software from the large providers of server software?

We could start by reporting uptime numbers for MySQL deployments. But there is a big problem here. Who is willing to report a lousy number?

Saturday, October 18, 2008

MySQL Conference proposals that I want to see

These are proposals that I would love to see for the 2009 MySQL Conference:
  1. MySQL Cluster Data Warehouse Edition. HP converted Tandem NonStep into HP NewView. MySQL Cluster can do the same. I am sure this will happen, but probably not by next year.
  2. Running a cluster of database servers on 1 big SMP. Sun has the hardware and software to make this possible. The new Sun T2+ server has 32 cores and 256 hardware threads. Solaris has lots of features for partitioning a larger server. The use of SMP as a cluster has been written about before. This is a good fit for MySQL as it will be a while before MySQL can effectively use a large SMP and multi-threaded MySQL Cluster will be limited to small (less than 10) number of threads.
  3. MySQL performance on Intel SSD. Actually, this should be open to any SSD vendor that provides commodity pricing and great random-write performance. This platform significantly raises the bar for what we expect. Sun has written about using Flash as a cache between memory and spinning disk. It will be interesting to see how this turns out with respect to price performance.
  4. How we pushed changes from Drizzle back to mainline. Dare to dream. For now I will settle for the changes to compile with -Wall and remove compiler warnings.
  5. High-throughput reporting performance: MySQL, ZFS, Flash and Sun Thumper. Thumpers have been popular for running Greenplum Postgres, but have not garnered as much press for MySQL. I distinguish reporting from data warehouse workloads. In this case there will be many concurrent complex queries. The queries have enough selectivity that nested loops join is preferred over hash join. 48 disks in a box should give you 5000 to 10,000 IOPs. 

Friday, October 17, 2008

Performance regression in the SMP fix patch

The SMP fix patch has a performance regression. This was reported by Percona. They are very good at what they do. I get many ideas for improvements from comments by them and from Domas. The regression may be an issue for workloads with many concurrent updates. We will fix it.

Wednesday, October 15, 2008

Who needs hash join -- Intel SSD will soon be here

Intel SSD will soon be here. Does MySQL still need hash join?

SSD can be a huge deal for MySQL data warehouse deployments. Disk-bound joins between large tables can be slow because nested-loops join uses so many random reads. Hash join is one way to fix this. Decreasing the time for a random read is another.

Intel SSD can do a random read in less than 0.1 milliseconds. It also has great random write performance which has been an issue for SSD and the price is reasonable. Decreasing random read latency by 50X to 100X will make disk-bound joins much faster.

This should also help with:
  • disk-bound aggregation when an indexed MyISAM temp table is used
  • old-style external sort when the sort buffer contains (key, row pointer)

Tuesday, October 14, 2008

Faith-based upgrades

MySQL documentation states that replicating from a 5.0 master to a 5.1 slave should work. This is very different from stating that it does work. That section of the manual should enter the 5.1 no-use case competition.
In general (but not always), slaves running MySQL 5.1.x can be used with older masters, but not the reverse.
Most of us upgrade to a new version by running it on slaves for a long time before putting it on a master. Can we put it on critical slaves? After reading the documentation I am not sure.

An Enterprise database must claim that it supports replication from one release to the next. MySQL seems to support it in practice -- I am able to replicate from 5.0 to 5.1. Am I taking a risk by using 5.1 slaves or does the documentation need an update?

Saturday, October 11, 2008

More background IO threads for InnoDB in MySQL 5.1

I have ported code from the Google patch to support more background IO threads and make the InnoDB background IO limit configurable. More details on this are here. The code in in launchpad. More details on launchpad are here.

These changes can improve performance if your servers are IO bound and have many disks. They are more likely to be an issue if you use O_DIRECT for InnoDB or a remote file system such as NFS.

The 5.1 branch in launchpad has these patches:
show_patches.patch    SHOW PATCHES    1.0    Jeremy Cole    N/A    
innodb_io_threads.patch    INNODB IO THREADS    1.0    mcallaghan@google.com    GPL    More background IO threads
innodb_fast_mutex.patch    INNODB FAST MUTEX    1.0    benhandy@google.com    GPL    Faster Innodb mutex
innodb_io_capacity.patch    INNODB IO CAPACITY    1.0    mcallaghan@google.com    GPL    Background IO rate limits
show_global_mutex_stats.patch    SHOW GLOBAL MUTEX STATUS    1.0    mcallaghan@google.com    GPL    Displays mutex contention statistics

Friday, October 10, 2008

InnoDB faster mutex patch ported to MySQL 5.1

I have ported the Google patch for faster InnoDB mutexes to MySQL 5.1. The Google patch is for the InnoDB plugin. This is for the version of InnoDB in the 5.1 branch.

This improves performance on servers with more than 4 cores. It provides the following on supported platforms (x86, modern gcc)
  • uses atomic ops for the InnoDB mutex
  • replaces the InnoDB rw-mutex
  • disables the InnoDB memory heap and memory heap mutex

The code is in Launchpad
  • more details on my launchpad branch are here 
  • more details on Bazaar are here

Monday, October 6, 2008

More background IO threads for InnoDB

I extracted code to support a configurable number of background IO threads for InnoDB. The code is available in a branch of MySQL 5.0. This adds 3 my.cnf variables:
  • innodb_read_io_threads - the number of threads for read prefetch requests made by InnoDB. Only 1 thread was used prior to this patch.
  • innodb_write_io_threads - the number of threads for writing dirty pages. Only 1 thread was used prior to this patch. If InnoDB is run with O_DIRECT then it usually helps to make this value larger than 1.
  • innodb_max_merge_io - the maximum number of requests for adjacent pages to merge into one large IO operation. The default is 64. You don't need to change this, but if you do, please post the results.
The my.cnf variable innodb_file_io_threads is already ignored by Innodb on Unix-like systems when async IO is not used -- which includes Linux.

The output from the FILE I/O section of SHOW INNODB STATUS has been extended to include statistics on IO operations. There is also one line per IO thread. The added output includes:
  • reads, writes - the number of read or write requests done by this thread. One request is for one 16kb Innodb page except for the log thread.
  • requests - the number of read or write system calls used. As reads and writes for adjacent pages may be merged into one IO, the number of requests is always less than or equal to the number of reads and writes.
  • io secs - the number of seconds to process the reads and writes.
  • io msecs/request - average time for an IO request in milliseconds, (io secs * 1000) / (requests)
  • max_io_wait - the maximum time in milliseconds for an IO operation

--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread) reads 0 writes 0 requests 0 io secs 0.000000 io msecs/request 0.000000 max_io_wait 0.000000
I/O thread 1 state: waiting for i/o request (log thread) reads 0 writes 30 requests 30 io secs 0.000570 io msecs/request 0.019000 max_io_wait 0.043000
I/O thread 2 state: waiting for i/o request (read thread) reads 414 writes 0 requests 36 io secs 0.004601 io msecs/request 0.127806 max_io_wait 0.210000
I/O thread 3 state: waiting for i/o request (read thread) reads 337 writes 0 requests 26 io secs 0.003806 io msecs/request 0.146385 max_io_wait 0.298000
I/O thread 10 state: waiting for i/o request (write thread) reads 0 writes 1137 requests 75 io secs 0.022748 io msecs/request 0.303307 max_io_wait 5.213000
I/O thread 11 state: waiting for i/o request (write thread) reads 0 writes 259 requests 11 io secs 0.005803 io msecs/request 0.527545 max_io_wait 2.524000

The state of the 5.0 branch after the latest patch:

show patches;
File    Name    Version    Author    License    Comment
show_patches.patch    SHOW PATCHES    1.0    Jeremy Cole    N/A    
innodb_io_threads.patch    INNODB IO THREADS    1.0    mcallaghan@google.com    GPL    More background IO threads
innodb_fast_mutex.patch    INNODB FAST MUTEX    1.0    benhandy@google.com    GPL    Faster Innodb mutex
innodb_io_capacity.patch    INNODB IO CAPACITY    1.0    mcallaghan@google.com    GPL    Background IO rate limits
innodb_disallow_writes.patch    INNODB DISALLOW WRITES    1.0    jtolmer@google.com    GPL    Prevent writes to Innodb

InnoDB background IO

I ported code from the v2 Google patch to a branch of MySQL 5.0 on launchpad. Two new my.cnf variables have been added:
  • innodb_io_capacity - specifies the IO capacity of the server in IOPs. The default inherited from existing InnoDB behavior is 100. This is used to determine the rate of background IO. Prior to this patch, background IO was rate limited assuming the server could do 100 IOPs. It can help to increase the background IO rate on servers with many disks. This makes it easier for the background IO threads to keep up with a server during high rates of row changes. When this is increased the rates of the following may increase:
    • flushing dirty pages
    • applying changes from the insert buffer
    • physically removing deleted rows
  • innodb_extra_dirty_writes - the default for this is ON. When ON, dirty pages may be flushed to disk if the server is idle even when innodb_max_dirty_pages_pct has not been reached. When this is set to off, dirty pages may be written to disk less frequently.
A few other changes have been included in this:
  • the order of output for SHOW INNODB STATUS has been changed so that transactions are printed last. The output from SHOW INNODB STATUS is limited in size and I prefer that the active transaction list get truncated rather than the other sections.
  • A new section has been added to SHOW INNODB STATUS output that displays the busy-wait time for InnoDB spin locks and statistics on the work done by the main background IO thread.
  • The main background IO thread synchronously forces the InnoDB transaction log to disk less frequently when innodb_flush_log_at_trx_commit != 1. I think this code might need to be changed so that the log flush is not done at all rather than getting done asynchronously in that case. Regardless, the main background IO thread forces the log to disk once per second in all cases.
The new variables can be displayed by SHOW VARIABLES.
show variables like 'Inno%';
Variable_name   Value
innodb_extra_dirty_writes       ON
innodb_io_capacity      1000
Example output added to SHOW INNODB STATUS
----------
BACKGROUND THREAD
----------
srv_master_thread loops: 1 1_second, 0 sleeps, 0 10_second, 0 background, 0 flush
srv_master_thread log flush: 0 sync, 0 async
srv_wait_thread_mics 0 microseconds, 0.0 seconds
spinlock delay for 5 delay 20 rounds is 4 mics

Thursday, October 2, 2008

Innovation: MySQL Cluster, Oracle Exadata, H-Store

MySQL Cluster continues to be ahead of the game, but we don't always hear about it as Cluster developers are either too busy or too modest.

Oracle announced Exadata last week. It provides extremely high-performance disk access and runs software on the storage servers to perform selection and projection at the storage server. Oracle has discovered the value of condition pushdown. It is only a matter of time before they figure out the value of batch key access and query fragment evaluation. I wonder if you can buy the Exadata hardware without Oracle software? And I wonder when MySQL will use Cluster as the basis for a data warehouse server.

Mikael Ronstrom described work in progress to support multi-threaded Cluster servers. This has many things in common with H-Store which is a radically different design for a DBMS to scale on multi-core architectures. Well, it is different when you don't consider Cluster.

Features from the Google patch at Launchpad.net

I have published a few features from the Google patch in bzr branches at launchpad.net. There is one branch for 5.0 and one for 5.1. More features will be published over time. The 5.0 branch has:
  • show patches - provides SHOW PATCHES to display installed patches. Courtesy of  Jeremy Cole at Proven Scaling and others at Percona
  • innodb disallow writes - provides SET GLOBAL INNODB_DISALLOW_WRITES to temporarily block any file system changes by InnoDB (so you can take a backup). Courtesy of Justin Tolmer at Google.
  • innodb fast mutex - makes the InnoDB mutex and rw-mutex scale better on SMP and disables the InnoDB malloc heap as glibc (ptmalloc) or mtmalloc are more than good enough. Courtesy of Ben Handy at Google.
The 5.1 branch has:
  • show patches - provides SHOW PATCHES to display installed patches. Courtesy of  Jeremy Cole at Proven Scaling and others at Percona
  • show global mutex stats - provides SHOW GLOBAL MUTEX STATUS to display mutex contention stats for mutexes other than ones internal to InnoDB. Courtesy of Mark Callaghan at Google.
 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.