1. Falcon has a busy-wait loop that isn't very busy once an optimizing compiler removes it. This makes Falcon much slower on highly concurrent, CPU-bound workloads. Fixing this can double performance in some cases.
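
    The failure mode is easy to show. Here is a minimal sketch, not the actual SyncObject::backoff code: an empty spin loop has no side effects, so the optimizer may delete it, while a compiler barrier in the body keeps it alive.

      // Hypothetical sketch, not the real SyncObject::backoff.
      // The empty loop has no observable side effects, so an
      // optimizing compiler (gcc -O2) may remove it entirely and
      // the "backoff" then takes zero time.
      static void backoff_broken(int spins) {
        for (int i = 0; i < spins; i++) {
          // empty body
        }
      }

      // A compiler barrier gives the loop body a side effect the
      // optimizer must preserve, so the spin really happens.
      static void backoff_fixed(int spins) {
        for (int i = 0; i < spins; i++)
          __asm__ __volatile__("" ::: "memory");
      }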

    The binaries used in the graphs that follow are:
    • 5131_inno - 5.1.31 + tcmalloc + InnoDB 1.0.3 plugin + InnoDB
    • 6010_falcon_orig - 6.0.10-alpha + tcmalloc + Falcon
    • 6bzr_falcon_orig - mysql-6.0-falcon in launchpad + tcmalloc + Falcon
    • 6010_falcon_fix - 6010_falcon_orig with fix in SyncObject::backoff
    • 6bzr_falcon_fix - 6bzr_falcon_orig with fix in SyncObject::backoff
    Results with and without the fix are in the graphs.
    One odd result for Falcon is that the 6.0.10 code is much faster than the launchpad code for concurrent joins, but the launchpad code is much faster for sysbench.

    To put this in context, there is still a big gap between InnoDB and Falcon, as can be seen in the throughput in queries per hour for concurrent joins and in transactions per second for sysbench.

    A bug is open for this.

  2. Paul's article on PBXT performance prompted me to test it again on one of my benchmark servers. I use a 16-core server with a modified version of sysbench. The modified sysbench does 2X the work per transaction of normal sysbench, so be careful when comparing these results to yours.

    I have tested PBXT before and it scaled great but had more overhead than InnoDB. Well, it still scales great, and the overhead problem has been fixed. PBXT is very fast on this benchmark, faster than any enhanced version of InnoDB that I can produce at this point. Granted, this is just one benchmark, but it is a great achievement.

    Results are here. This displays sysbench throughput (transactions per second) for 1 to 64 concurrent users. All servers are compiled with gcc -O2. The versions listed on the chart are:
    • 5037goog - MySQL 5.0.37 + the Google patch + tcmalloc, using InnoDB
    • 5131pbxt - MySQL 5.1.31 + tcmalloc + the latest PBXT code, using PBXT
    • 5131inno - MySQL 5.1.31 + tcmalloc + the 1.0.3 plugin for InnoDB, using InnoDB
    • 5075myisam - MySQL 5.0.75 + tcmalloc using MyISAM
    Most or all of the early work on InnoDB, PBXT and perhaps MyISAM was done by one person. Is that what it takes to produce a great storage engine?

    Update: results from Falcon compared to the InnoDB 1.0.3 plugin.
    • 5131_inno - 5.1.31 + tcmalloc + InnoDB 1.0.3 plugin + InnoDB
    • 6bzr_falcon - mysql-6.0-falcon from launchpad + tcmalloc + Falcon
    • 6010_falcon - 6.0.10-alpha + tcmalloc + Falcon
    • 609_falcon - 6.0.9-alpha + tcmalloc + Falcon
    Update: results from the anonymous DBMS p.

    Update: details on the test

    sysbench command line:
    • sysbench with changes from me to support --oltp-secondary-index
    • --test=oltp
    • --oltp-read-only
    • --oltp-table-size=2000000
    • --max-time=60
    • --max-requests=0
    • --mysql-table-engine=$e
    • --db-ps-mode=disable
    • --mysql-engine-trx=yes
    • --oltp-secondary-index
    • --num-threads={1 .. 64}
    my.cnf settings:
    • table_cache=2048
    • join_buffer_size=64K
    • record_buffer=128K
    • sort_buffer=2M
    • skip-name-resolve
    • max_heap_table_size=100M
    • tmp_table_size=100M
    • max_tmp_tables=100
    • max_connections=500
    • thread_cache_size=100
    • pbxt-record-cache-size=1G
    • pbxt-index-cache-size=1G
    • innodb_flush_log_at_trx_commit = 2
    • innodb_log_files_in_group=2
    • innodb_log_buffer_size=10M
    • innodb_additional_mem_pool_size=100M
    • innodb_lock_wait_timeout=50
    • innodb_data_file_path = innodb_data1:200M:autoextend
    • innodb_buffer_pool_size=1000M


  3. There are 3 talks at the MySQL Conference that cover technology in the Google patch:
    • High Availability and Scalability Patches from Google - describes changes made to improve SMP performance and make replication easier to manage. The changes include the SMP patches, global transaction IDs, hierarchical replication and binlog event checksums.
    • MySQL Row Change Event Extraction and Publish - describes changes made to generate a row-change log on a replication slave that records DDL and all row changes from the replication workload. The log is complete, correct and easy to parse.
    • MySQL Performance on EC2 - describes MySQL performance when deployed in a cloud, or using features typically used in a cloud (virtualization, network attached storage). It will describe the performance bottlenecks and improvements from the Google patch.
    I have a session at the MySQL Camp organized by Sheeri. The session is from 9am to noon on Monday and is described as a hackfest. We will modify, build, test and debug MySQL for the (small) features of your choice. I can also show how features can be extracted from the big Google patch. So bring source code and a laptop.

    I have a talk at the Percona Performance Conference on Thursday at 10:50am. The title is TBD, but I have proposed the following: Life of a dirty page: the disk IO architecture of InnoDB and what can be done to improve it.

  4. I have a few more numbers for the impact of the Google patches on database reload performance with InnoDB. More numbers and details are here. Results are listed for:
    • MySQL 5.0.37 + Google v1 patch. The v1 patch fixes a serious performance regression in the parser that was later fixed by MySQL in 5.0.54.
    • MySQL 5.0.37 + Google v2 patch. The InnoDB 1.0.3 plugin has performance and code similar to this.
    • MySQL 5.0.37 + Google v3 patch. The v3 patch has a few changes beyond the v2 patch to reduce mutex contention in InnoDB.
    This lists the time to reload a production server with ~130GB of real data for a complex schema.
    • 49931 seconds for MySQL 5.0.37 + Google v1 patch
    • 8298 seconds for MySQL 5.0.37 + Google v2 patch
    • 7248 seconds for MySQL 5.0.37 + Google v3 patch
    The speedups are about 6.0X (49931/8298) and 6.9X (49931/7248) for the v2 and v3 Google patches compared to the v1 patch. MySQL 5.0.37 + the Google v1 patch is about as fast as recent official versions of MySQL 5.0.

  5. A system displays linear scale-up when it has twice the throughput with twice the resources. I spend a lot of time testing MySQL on SMP servers to validate the performance features we add to it. I rarely get linear scale-up from it. Performance usually reaches a maximum or declines when the server is saturated. Yet here is a result that displays linear scale-up up to a point. I never get results like this. How was it done?
    1. Determine the maximum throughput (MaxTP) for the server with N clients
    2. ReducedMaxTP = MaxTP * 0.75
    3. PerClientTP = ReducedMaxTP / N
    4. Run tests for 1 to N concurrent clients where each client generates PerClientTP transactions per second
    At this point, the graph of throughput versus concurrent clients should display linear scale-up. That is, for X concurrent clients the system should perform X * PerClientTP transactions per second.
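
    A quick worked example with invented numbers: suppose MaxTP is 1000 transactions per second (TPS) at N = 50 clients. Then ReducedMaxTP = 750 and PerClientTP = 15. With 10 clients the test runs at 150 TPS, with 20 clients at 300 TPS, and so on up to 750 TPS at 50 clients. The line is straight by construction because no client is allowed to demand more than its share of the reduced maximum.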

    But what does it mean? I think this is the result of a test with a lot of think time. I also want to know how the system performs with less think time.

    Update:

    Using this approach I can demonstrate linear scale-up on CPU-bound workloads for a 16-core server with sysbench readonly. This requires the Google or Percona patches. The results are here.

  6. I published the V3 Google patch. This has:
    • global transaction IDs
    • batch key access
    • much better SMP performance for SHOW USER_STATS
    • more changes to improve SMP performance. Changes were made to reduce contention on log_sys->mutex and buffer_pool->mutex.
    • checksums on binlog events
    It probably has many other features, but I have yet to go through my release notes to remember everything that has been changed. Community and official MySQL developers are the intended audience for this. I am thrilled that Percona has been able to enhance some of the code released in previous patches, that some of these changes made it into the InnoDB 1.0.3 plugin, and that semi-sync replication will be in MySQL 6.0 (after a lot of work by MySQL to implement it the right way).

    Update 1 -- Domas asked about the overhead from using crc32 rather than adler32 for checksums. I did not measure any difference. Using checksums reduces performance by 3% to 7%. Details on the change and performance results are here.
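
    For reference, both checksums ship in zlib with the same calling convention, so the swap is mechanical. A minimal sketch of checksumming an event payload (the buffer contents are made up and this is not code from the patch):

      // Sketch of checksumming an event payload with zlib's crc32
      // and adler32; not code from the Google patch.
      #include <stdio.h>
      #include <string.h>
      #include <zlib.h>

      int main() {
        const unsigned char event[] = "binlog event bytes";  // stand-in payload
        uInt len = (uInt) strlen((const char*) event);
        uLong c = crc32(crc32(0L, Z_NULL, 0), event, len);      // seed, then sum
        uLong a = adler32(adler32(0L, Z_NULL, 0), event, len);  // same pattern
        printf("crc32=%lx adler32=%lx\n", c, a);
        return 0;
      }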

    Update 2 -- the latest SMP changes by Ben Handy improve sysbench readwrite performance by 20% at high levels of concurrency. My results are here.

    Update 3 -- we backported batch key access from MySQL 6. Details are here.

  7. A post with links that work is here.

  8. InnoDB just got faster on benchmarks and in the rate at which important features get into code we can deploy today. They added the SMP patch from Google to the 1.0.3 plugin for MySQL 5.1.31. Ben Handy will describe this at the MySQL Conference. More results for an older version of the patch are here.

    Pardon my chart generation problems. These links were broken.

    But don't take my word for it. Let me show you. This has results for sysbench readonly on a 16-core server for MySQL 5.1.31 with the InnoDB 1.0.2 plugin, the InnoDB 1.0.3 plugin and the InnoDB 1.0.3 plugin with tcmalloc. The y-axis is the throughput in transactions per second and more is better.

    This has results for loading tables concurrently on a 16-core server for the same 3 binaries. The y-axis is the time to finish the load and less is better.

    This has results for concurrent joins for the same 3 binaries. The y-axis is the time to finish the queries and less is better.

  9. This will be presented at the MySQL Conference by Justin, the author and an expert on MySQL replication. It is a great feature. You should go to the talk.

    It is almost time for another big Google patch. This one has global transaction IDs, which is suddenly a hot topic. Continuent supports something similar with a very different implementation and no changes to the MySQL source. There are now two methods to make slave failover easy. Failover with master-master can be automated. But once you need more than two servers for read scale-out, your options are master-master-master-master-master-master-master-... or 1 master with many slaves. The former is a bad idea as each master in the chain adds replication delay. The latter makes slave failover hard.

    Slave failover is the process of changing the master used by a slave when a new server becomes the master. If this is a planned change, it can be done with a short downtime. When it is an unplanned change, it is not fun. The problem is that a slave uses the filename and offset of the master's binlog. When a new server becomes the master, it may use different filenames. But even if it doesn't, the filename/offset the slave used on the old master cannot be used on the new master.

    There are a few solutions to this:
    • switch and hope - run CHANGE MASTER to switch the slave to the new master and pretend that lost or extra transactions don't matter. Transactions are lost on the slave when it was behind the new master relative to the replication stream of the old master. The slave will have extra transactions when it was ahead of the new master relative to the replication stream of the old master.
    • check and lose - don't run CHANGE MASTER on slaves that did not stop replication on the old master at the same point as the new master. These slaves must be restored from a backup taken on the new master and the backup/restore must complete before another failover is done.
    • check and fix - promote the slave with the most transactions from the old master to be the new master. Manually apply transactions from it to other slaves when the other slaves were behind.
    I am not fond of any of these solutions for unplanned failover. A new alternative is to use global transaction IDs (or to motivate Percona to port/enhance this and then use it).

    With global transaction IDs, every binlog event has a global ID. The ID is a 64-bit integer that is incremented for each group of binlog events and is maintained on failover. Slaves store both the filename/offset of their current master and the global ID. On failover, a slave can connect to the new master using the global ID. This requires no manual intervention. That solves one problem.
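
    As an illustration only, not code from the patch, the bookkeeping can be as simple as a counter bumped once per group of events plus, on each candidate master, an index from global ID to local binlog position:

      // Illustrative sketch of global transaction IDs; not the
      // code from the Google patch.
      #include <stdint.h>
      #include <map>
      #include <string>

      struct BinlogPos { std::string file; uint64_t offset; };

      // On the master: one 64-bit ID per group of binlog events,
      // preserved when events are relayed via log-slave-updates.
      static uint64_t next_group_id = 0;
      uint64_t assign_group_id() { return ++next_group_id; }

      // On a candidate master: map each global ID to the local
      // binlog position where that group is stored.
      static std::map<uint64_t, BinlogPos> id_to_pos;

      // On failover a slave presents the last global ID it applied
      // and the new master streams from the next group after it.
      bool resume_position(uint64_t last_applied_id, BinlogPos* pos) {
        std::map<uint64_t, BinlogPos>::const_iterator it =
            id_to_pos.upper_bound(last_applied_id);
        if (it == id_to_pos.end()) return false;  // slave is caught up
        *pos = it->second;
        return true;
      }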

    Slaves that may become new masters should be run with log-slave-updates to record binlog events received from the master in their binlog. These events are stored with the global ID from the master on which the event originated. If there is a master failover, the new master can serve binlog events for global IDs that originated on the old master. Thus, a slave that is behind can connect to the new master and get the events it is missing. That solves the other problem.

    Finally, a backup taken from the old master (using InnoDB hot backup) can be restored and then replicate from the new master, as long as the backup includes the global ID from the old master. I think SHOW MASTER STATUS was augmented to include the global ID.

    As an added bonus, checksums were added to binlog events.

  10. I have updated my MySQL 5.0 branch in launchpad to use more from Google Perftools, including CPU profiling and integration with exported tcmalloc functions. It took some time for me and others to figure out how to get this done. You might want to stop reading now if you don't build your own MySQL binaries.
      For some binaries, tcmalloc makes MySQL much faster on workloads with a lot of concurrency running on big SMP servers. I claim this is true for some binaries, not all, and whether it is true seems to be determined by whether the other obvious bottlenecks have been fixed in the binary. For example, tcmalloc improves performance much more for the high-performance XtraDB binaries than for official versions of MySQL.

      The first step is to use the tcmalloc library. I use tcmalloc_minimal rather than tcmalloc and suggest you do the same. To use tcmalloc without rebuilding you may be able to set LD_PRELOAD. I prefer to link against it. To do that, run configure and then build mysqld:

      ./configure --with-mysqld-ldflags=-L/path/to/google-perftools \
      --with-mysqld-libs=-ltcmalloc_minimal

      I confirmed that this really linked tcmalloc by running 'nm mysqld | grep -i tcmalloc'. In some cases it wasn't linked because libtool put -lc before -ltcmalloc_minimal on the command line, a result of odd dependencies listed in the *.la files local to my environment. So if something goes wrong, that may be the problem.

      Google Perftools provides hierarchical profiling. If you aren't using DTrace or have not compiled your Linux kernel and all system libraries with sufficient flags to get hierarchical profiles from oProfile, then you need perftools. Details on it are here and here. Unlike tcmalloc, support for CPU profiling is not transparent. You need to call an exported function from libprofiler at thread start time. I have modified MySQL to do that for all threads started in mysqld.cc which includes some background threads and all user threads. I chose not to do this for the background threads started by InnoDB. Code and more details (see README.patch) are in my 5.0 branch revision 2702.
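
      The shape of the change is roughly this. It is a sketch, not the exact code from my branch; ProfilerRegisterThread is the real libprofiler export, while the hook function name is mine:

        // Sketch of per-thread CPU profiler registration; not the
        // exact code from my 5.0 branch (see README.patch there).
        #include <google/profiler.h>  // <gperftools/profiler.h> in newer releases

        // Call at the top of every thread started from mysqld.cc so
        // libprofiler's SIGPROF-driven samples include that thread.
        static void register_thread_for_profiling() {
          ProfilerRegisterThread();
        }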

      Finally, I provided access to a few features from perftools (see revision 2703 for the code; a sketch of the underlying calls follows the list):
      • RELEASE MEMORY is a SQL command that returns cached memory from tcmalloc back to the OS
      • SHOW MEMORY STATUS is a SQL command that prints the data returned from MallocExtension::GetStats. This is a lot of data on the state of memory allocated by tcmalloc.
      • tcmalloc_max_thread_cache_size is a my.cnf parameter that sets the size of the memory cache shared by all threads.
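
      The SQL commands are thin wrappers over the MallocExtension interface. A sketch of those underlying perftools calls, not my actual handler code:

        // Sketch of the perftools calls behind the new SQL commands;
        // not the handler code from revision 2703.
        #include <google/malloc_extension.h>

        // RELEASE MEMORY: return tcmalloc's cached free pages to the OS.
        void release_memory() {
          MallocExtension::instance()->ReleaseFreeMemory();
        }

        // SHOW MEMORY STATUS: fill buf with a human-readable dump of
        // allocator state from MallocExtension::GetStats.
        void show_memory_status(char* buf, int len) {
          MallocExtension::instance()->GetStats(buf, len);
        }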
