1. There are several types of performance monitoring. Some of them are made easier when workload metrics are summarized by database user, table and index. The Google patch added SHOW TABLE_STATISTICS, SHOW INDEX_STATISTICS and SHOW USER_STATISTICS to MySQL several years ago. Since then they have been ported and improved in the Facebook patch, MariaDB and Percona Server. I think that Eric Bergen has also ported it. I guess you could say the community has spoken.

    I like these features because they have been extremely useful to me. They have little impact on performance and, more than anything else, they are easy to use. Most of what you need to know about them can be determined by listing the columns in the tables, which I do below.

    The Facebook patch has these columns for user_statistics:
    USER_NAME
    BINLOG_BYTES_WRITTEN
    BYTES_RECEIVED
    BYTES_SENT
    COMMANDS_DDL
    COMMANDS_DELETE
    COMMANDS_HANDLER
    COMMANDS_INSERT
    COMMANDS_OTHER
    COMMANDS_SELECT
    COMMANDS_TRANSACTION
    COMMANDS_UPDATE
    CONNECTIONS_CONCURRENT
    CONNECTIONS_DENIED_MAX_GLOBAL
    CONNECTIONS_DENIED_MAX_USER
    CONNECTIONS_LOST
    CONNECTIONS_TOTAL
    DISK_READ_BYTES
    DISK_READ_REQUESTS
    DISK_READ_SVC_USECS
    DISK_READ_WAIT_USECS
    ERRORS_ACCESS_DENIED
    ERRORS_TOTAL
    MICROSECONDS_CPU
    MICROSECONDS_WALL
    QUERIES_EMPTY
    ROWS_DELETED
    ROWS_FETCHED
    ROWS_INSERTED
    ROWS_READ
    ROWS_UPDATED
    ROWS_INDEX_FIRST
    ROWS_INDEX_NEXT
    TRANSACTIONS_COMMIT
    TRANSACTIONS_ROLLBACK

    The Facebook patch has these columns for table_statistics:
    TABLE_SCHEMA
    TABLE_NAME
    TABLE_ENGINE
    ROWS_INSERTED
    ROWS_UPDATED
    ROWS_DELETED
    ROWS_READ
    ROWS_REQUESTED
    ROWS_INDEX_FIRST
    ROWS_INDEX_NEXT
    IO_READ_BYTES
    IO_READ_REQUESTS
    IO_READ_SVC_USECS
    IO_READ_SVC_USECS_MAX
    IO_READ_WAIT_USECS
    IO_READ_WAIT_USECS_MAX
    IO_READ_OLD_IOS
    IO_WRITE_BYTES
    IO_WRITE_REQUESTS
    IO_WRITE_SVC_USECS
    IO_WRITE_SVC_USECS_MAX
    IO_WRITE_WAIT_USECS
    IO_WRITE_WAIT_USECS_MAX
    IO_WRITE_OLD_IOS
    IO_INDEX_INSERTS

    The Facebook patch has these columns for index_statistics:
    TABLE_SCHEMA
    TABLE_NAME
    INDEX_NAME
    TABLE_ENGINE
    ROWS_INSERTED
    ROWS_UPDATED
    ROWS_DELETED
    ROWS_READ
    ROWS_REQUESTED
    ROWS_INDEX_FIRST
    ROWS_INDEX_NEXT
    IO_READ_BYTES
    IO_READ_REQUESTS
    IO_READ_SVC_USECS
    IO_READ_SVC_USECS_MAX
    IO_READ_WAIT_USECS
    IO_READ_WAIT_USECS_MAX
    IO_READ_OLD_IOS
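
    These commands are easy to script. Below is a minimal sketch using MySQLdb that dumps all three result sets, assuming a server built with one of the patches above; the host and user are placeholders.

    import MySQLdb

    conn = MySQLdb.connect(host='dbhost', user='root')  # placeholder connection
    cursor = conn.cursor()

    for stmt in ('show user_statistics', 'show table_statistics',
                 'show index_statistics'):
      cursor.execute(stmt)
      # column names come from the cursor metadata, matching the lists above
      names = [d[0] for d in cursor.description]
      for row in cursor.fetchall():
        print dict(zip(names, row))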


  2. There have been interesting discussions in the PostgreSQL community about adding support for index only scans. On several occasions people were curious about how InnoDB supports this. A recent post by the InnoDB team is an excellent overview. A brief summary of that post and other material is:

    • records in the clustered (primary) index store hidden columns (DB_TRX_ID, DB_ROLL_PTR)
    • records in the non-clustered (secondary) index do not store hidden columns
    • records in clustered and non-clustered indexes have a delete-mark flag
    • records are not updated in the secondary index, they are delete-marked on delete, inserted on insert and delete-marked/inserted on update
    • delete-marked records are removed from indexes by the purge thread when it is safe to do so
    When a secondary index page is read, if the max transaction ID on the page is less than the max transaction ID for which all transactions are visible to the reading transaction (the low-water mark, up_limit_id), then the page can be used as-is and the page read is index-only. If this condition is not true, then for each entry read from the page the record is read from the clustered index to determine whether the index entry is visible. In that case the secondary index read is not index-only. This matters because, when a read is not index-only, there can be an additional random disk read to the clustered index for each entry read from the secondary index.

    The max transaction ID for which all transactions are visible to the reading transaction is described as the low-water mark and assigned to the up_limit_id field in the read view (read_view_struct). This is the max transaction ID for which there are no unresolved transactions when the reading transaction starts. If there is a long-open transaction when the reading transaction starts, then up_limit_id will be less than the transaction ID of the long-open transaction.

    I began to read the code for this today as I want to add a counter for the number of secondary index page reads that are and are not index-only. If you want to read the code too, the function lock_sec_rec_cons_read_sees determines whether all entries on a secondary index page are definitely visible to a transaction (read view).
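
    For illustration only, here is a sketch in Python of the check described above. This is not InnoDB source code; the names follow the post and clustered_record_visible is a hypothetical stand-in for the per-record visibility check against the clustered index.

    def secondary_page_is_index_only(page_max_trx_id, up_limit_id):
      # up_limit_id is the low-water mark of the read view: all transactions
      # with an ID below it are resolved and visible to the reader. If the
      # newest change on the secondary index page is older than that, the
      # page can be used as-is and the read stays index-only.
      return page_max_trx_id < up_limit_id

    def read_secondary_page(page, read_view, counters):
      # counters shows where the index-only vs not-index-only counter that I
      # want to add would be incremented
      if secondary_page_is_index_only(page.max_trx_id, read_view.up_limit_id):
        counters['index_only'] += 1
        return page.entries
      counters['not_index_only'] += 1
      # otherwise each entry must be checked against its clustered index
      # record, which may cost an extra random disk read per entry
      return [e for e in page.entries
              if clustered_record_visible(e, read_view)]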

    If you are interested in this topic, I recommend these books:


  3. I make bad jokes about Monty-style code. I don't like to read or modify it. I love to run it in production. It never crashes. It never leaks memory. I exaggerate a little bit but not too much. It is remarkably stable.

    This is an amazing accomplishment. Alas some of the senior developers who did that at MySQL have since left. Another senior developer left this week. I hope this trend does not continue.

    Everyone can be replaced. Smart people can be found. But smart and productive people are not as widely available and it takes a while to figure out how things are done in MySQL. I know because I have made a lot of mistakes while trying to make things better.

    Perhaps the Google response is appropriate. Everyone gets a 10% raise and a $1000 holiday bonus.

    I did not submit this to http://planet.mysql.com

    The Facebook database teams will describe how MySQL is used at Facebook. Join us on Tuesday, November 2, from 7pm to 9pm (PDT, UTC-7). The performance, operations and engineering teams will present work in progress to keep MySQL happy.

  5. Packt Publishing gave me a copy of MySQL 5.1 Plugins Development to review. The book is written by two people who are MySQL experts -- Sergei Golubchik and Andrew Hutchings. I know Sergei because he answers a lot of my questions on the MySQL Internals mail list. I wasn't in the best mood when I opened the e-book as I worked late last night. My mood was much better after reading a few pages. The book is amazing. The content, presentation and editing are excellent. The book is full of relevant examples that show you how to build plugins including source code and tips on compiling. The text carefully explains all of the steps required to build each of the plugins.

    I wish I had this book several years ago. It would have saved me a lot of trouble. But from the quality of the book I suspect that they have been working on it for several years.

    Thank you Sergei and Andrew. This is a great addition to the MySQL community.

    The target audience for this book is someone who wants to write code that will run within the mysqld process. This includes user defined functions (UDF) to be called by SQL statements, daemon plugins that can run code using a dedicated thread in the server process, information schema plugins to expose data in INFORMATION_SCHEMA, full-text parser plugins and storage engine plugins. There are examples for three storage engines. The first engine is simple and only supports read-only tables. The second example supports read-write tables without indexes and uses HTML as the file format. The final example builds a storage engine using Tokyo Cabinet.

    My team has put a lot of things into the mysqld process. Some of those changes were even useful. This book would have prevented us from making a few mistakes. It even has an example to export the output from getrusage via SHOW STATUS. I added a similar patch to MySQL, but I don't think my change was as nice.

  6. Congratulations!

    Can we please tone down the marketing? It is great to see a community project grow, especially in the MySQL family. But you are judged by successful deployments, not by the class libraries used by your project.

    Drizzle would only be more reliable than MySQL if it kept the replication log and InnoDB in sync during crash recovery, and it does not do that today. Official MySQL supports this on a master via sync_binlog. MySQL slaves do this today via rpl_transaction_enabled which is available in the Facebook and Google patches. I think it is also available in Percona Server, MariaDB and XtraDB.

    I have been told that Drizzle was designed for multi-core scalability. I am not sure what that means on real workloads or benchmarks. I know that there are some bottlenecks in InnoDB that have nothing to do with the code above it (MySQL, Drizzle). I also know that MySQL has made huge strides this year and MySQL 5.5 far exceeds my expectations. Alas, I cannot validate whether "designed for scalability" results in performance that is better than MySQL 5.5 as I cannot build Drizzle on my servers due to the dependency on gcc 4.2 or greater.

  7. I have broken my promise to stop writing about this. Sorry, but I had to correct my mistake. I ran three micro-benchmarks: get by primary key, get by secondary key and update by primary key. MySQL had a higher peak QPS for all of them. Alas, the results for get by primary key were skewed because pymongo, the Python driver for MongoDB, uses more CPU than MySQLdb, the Python driver for MySQL. The client host was saturated during the test and this limited peak QPS to 80,000 for MongoDB versus 110,000 for MySQL.

    I repeated one test using two 16-core client hosts with 40 processes per host. For that test the peak QPS on MongoDB improved to 155,000 while the peak for MySQL remained at 110,000. That is an impressive result. The results for get by secondary key and update by primary key are still valid as the server host saturated on those tests.

    Now I must consider rewriting the test harness in Java, C or C++, or adding MongoDB support to sysbench. I prefer Python, but in addition to under-reporting MongoDB peak performance the Python harness also under-reports MySQL performance. I am able to get 180,000 peak QPS using sysbench on one 16-core client host and mysqld on another, versus 110,000 using the Python equivalent.

  8. This is the end of my public performance comparison of MongoDB versus MySQL, at least for the next few weeks. For these tests I used a 16-core x86_64 client host and an 8-core x86_64 server host, MySQL 5.1.50 with the Facebook patch, MongoDB 1.7.0 and Python clients. Ping between the client and server hosts was ~200us.

    I tested 5 configurations:

    • inno.5150fb.b0.s0 - MySQL 5.1.50, the Facebook patch, binlog disabled
    • inno.5150fb.b1.s0 - MySQL 5.1.50, the Facebook patch, binlog enabled, sync_binlog=0
    • inno.5150fb.b1.s1 - MySQL 5.1.50, the Facebook patch, binlog enabled, sync_binlog=1
    • mongo.170.safe - MongoDB 1.7.0 and safe updates
    • mongo.170.unsafe - MongoDB 1.7.0 and unsafe updates
    Note that with unsafe updates the client does not wait for the server to respond. It sends requests as fast as it can, until the buffer between the client and server is full. When enough concurrent clients are used and the clients run for enough time, the buffer becomes full and unsafe updates no longer improve performance.
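
    To make the difference concrete, this is what the two modes look like with the old pymongo API used for these tests; the host, port and database name are placeholders.

    import pymongo

    conn = pymongo.Connection('mongo-host', 27017)
    db = conn['test']

    # unsafe update: fire-and-forget, the client does not wait for a reply
    db.c.update({'_id': 42}, {'$inc': {'k': 1}})

    # safe update: the driver issues getLastError and blocks until the server
    # acknowledges the write, so errors are visible to the client
    r = db.c.update({'_id': 42}, {'$inc': {'k': 1}}, safe=True)
    assert r['updatedExisting'] == True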

    InnoDB has much better throughput when the binlog is disabled. The clients update rows selected at random from 2M rows. InnoDB is able to handle more of that load concurrently than MongoDB, which uses a reader-writer lock that prevents concurrent updates within the database. Note that while InnoDB allows more of the update work to be done concurrently, it isn't perfect as there are several global mutexes in MySQL/InnoDB including LOCK_open, kernel_mutex and the InnoDB buffer pool mutex. When the binlog is enabled but sync_binlog is disabled, even more global mutexes are used. Finally, when the binlog is enabled and sync_binlog=1, group commit is not available and updates are rate limited by the performance of fsync. In this case fsync is fast because a HW RAID card with a battery-backed cache was used.
    This is output from PMP that demonstrates one source of mutex contention in mongod. The pileup occurs at mongo::MongoMutex, which apparently implements the reader-writer lock.

         48 mongo::connThread,thread_proxy,start_thread,clone
         37 pthread_cond_wait@@GLIBC_2.3.2,boost::condition_variable::wait,mongo::MongoMutex::lock,mongo::receivedUpdate,mongo::assembleResponse
         10 recv,mongo::MessagingPort::recv,mongo::MessagingPort::recv
          2 
          1 select,mongo::Listener::initAndListen,mongo::listen,mongo::_initAndListen,mongo::initAndListen,main,select
          1 select,mongo::Listener::initAndListen
          1 pthread_cond_wait@@GLIBC_2.3.2,mongo::FileAllocator::Runner::operator(),thread_proxy,start_thread,clone
          1 nanosleep,mongo::DataFileSync::run,mongo::BackgroundJob::thr,thread_proxy,start_thread,clone
          1 nanosleep,mongo::ClientCursorMonitor::run,mongo::BackgroundJob::thr,thread_proxy,start_thread,clone
          1 nanosleep
          1 mongo::webServerThread,thread_proxy,start_thread,clone
          1 mongo::SnapshotThread::run,mongo::BackgroundJob::thr,thread_proxy,start_thread,clone
          1 mongo::interruptThread,thread_proxy,start_thread,clone
          1 mongo::BtreeBucket::findSingle,mongo::ModSetState::createNewFromMods,mongo::_updateObjects,mongo::updateObjects,mongo::receivedUpdate,mongo::assembleResponse
    Source code for MongoDB queries is listed below. The code to setup MongoDB and MySQL is described in the previous posts.



    import pymongo, random, signal, time
    # SummaryStats, sigterm_handler and got_sigterm are helpers defined
    # elsewhere in the (unpublished) benchmark harness

    def query_mongo(host, port, pipe_to_parent, requests_per, dbname, rows, check, testname, worst_n, id):
      conn = pymongo.Connection(host, port)
      db = conn[dbname]
      signal.signal(signal.SIGTERM, sigterm_handler)

      gets = 0
      stats = SummaryStats(worst_n)

      while True:
        for loop in xrange(0, requests_per):

          # pick a random document and increment its counter column
          target = random.randrange(0, rows)

          s = time.time()
          try:
            # safe=True makes the client wait for the server to acknowledge
            r = db.c.update({'_id': target}, {'$inc': {'k': 1 }}, safe=True)
            assert r['updatedExisting'] == True
            assert r['ok'] == 1

            stats.update(s)

            gets += 1

          except:
            # the only expected failure is interruption by SIGTERM
            assert got_sigterm

    The my.cnf settings for MySQL, excluding the values of log_bin and sync_binlog which varied across the configurations above:
    innodb_buffer_pool_size=2000M
    innodb_log_file_size=100M
    innodb_flush_log_at_trx_commit=2
    innodb_doublewrite=1
    innodb_flush_method=O_DIRECT
    innodb_thread_concurrency=0
    innodb_max_dirty_pages_pct=80
    innodb_file_format=barracuda
    innodb_file_per_table
    innodb_deadlock_detect=0

    max_connections=2000
    table_cache=2000

    key_buffer_size=2000M

    innodb_doublewrite=0


  9. This continues the silly benchmark series and compares performance from concurrent clients that fetch by secondary key. The previous post compared fetch by primary key. The test setup was the same as before. Clients were run on a 16-core x86 server. The servers (mongod, mysqld) were run alone on 8-core and 16-core x86 servers. The tests were run for servers that were 1ms and 200us apart according to ping. The database server saturates earlier when the client is only 200us away. That is to be expected.

    InnoDB tables are clustered on the primary key and a query that fetches all columns by PK only has to read data from one leaf block of the PK index. When all columns are fetched by secondary key, both the secondary index leaf node and the PK index leaf node must be read. As all data was cached for this test, that does not make a big difference. Were data not cached, the extra IO used to read the PK index leaf node would be significant.
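
    One way to see the difference is to run EXPLAIN on the two access paths against the bm table created by the setup code below. This is a rough sketch; the connection parameters are placeholders.

    import MySQLdb

    conn = MySQLdb.connect(host='dbhost', user='root', db='test')
    cursor = conn.cursor()

    # fetch by primary key: a single lookup in the clustered (PRIMARY) index
    cursor.execute('explain select id, k, c, pad from bm where id = 5')
    print cursor.fetchall()   # expect key = PRIMARY

    # fetch by secondary key: the sidx entry is found first, then the clustered
    # index must also be read to get the non-indexed columns (c, pad)
    cursor.execute('explain select id, k, c, pad from bm where sid = 5')
    print cursor.fetchall()   # expect key = sidx and no "Using index"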

    This displays throughput on the 16-core server. MongoDB saturates earlier on the 16-core server than on the 8-core server. From vmstat output this appears to be mutex contention on the server but I cannot provide more details on where that occurs as the mongod binary I downloaded has been stripped.
    This displays throughput on the 8-core server. Peak QPS for MongoDB is much better than on the 16-core server.
    This displays response time for the 16 core server.
    The test was repeated using hosts that were 200us apart according to ping. The database host was an 8-core server in this test. The peak QPS is similar to the previous tests but the servers saturate with fewer concurrent clients. The results are here and have been updated to include results for MongoDB 1.6.2 and 1.7.0.

    Source code to set up the MySQL table:

    import MySQLdb
    from warnings import filterwarnings
    # connect_mysql is a helper defined elsewhere in the benchmark harness

    def setup_mysql(host, db, user, password, engine, rows):
      filterwarnings( 'ignore', category = MySQLdb.Warning )
      conn = connect_mysql(host, db, user, password)
      conn.autocommit(True)
      cursor = conn.cursor()
      cursor.execute('drop table if exists bm')
      # id is the primary key, sid gets the secondary index (sidx) used by this test
      cursor.execute('create table bm (id int primary key, sid int, k int, c char(120), pad char(60), key sidx(sid)) engine=%s' % engine)

      vals = []
      for x in xrange(0, rows):
        sx = str(x)
        lsx = len(sx)
        # pad c to 120 chars and pad to 60 chars
        row = '(%d, %d, %d, "%s", "%s")' % (x, x, x, sx+'x'*(120-lsx), sx+'y'*(60-lsx))
        vals.append(row)

        # insert in batches of 1000 rows
        if len(vals) == 1000:
          r = cursor.execute('insert into bm values %s' % ','.join(vals))
          vals = []
          print '... row %d, result %s' % (x, r)

      # insert any leftover rows
      if vals:
        r = cursor.execute('insert into bm values %s' % ','.join(vals))
        vals = []
        print '... row %d, result %s' % (x, r)


    Source code to query the MySQL table:


    import random, time
    # connect_mysql and SummaryStats are helpers defined elsewhere in the
    # benchmark harness

    def query_mysql(host, db, user, password, pipe_to_parent, requests_per, rows, check, testname, worst_n, id):
      conn = connect_mysql(host, db, user, password)
      conn.autocommit(True)
      cursor = conn.cursor()

      gets = 0
      stats = SummaryStats(worst_n)

      while True:
        for loop in xrange(0, requests_per):

          # fetch one row at random by the secondary key (sid)
          target = random.randrange(0, rows)

          s = time.time()
          cursor.execute('select id, k, c, pad from bm where sid = %d' % target)
          sel_rows = cursor.fetchall()
          stats.update(s)

          # sid was loaded with the same values as id, so exactly one row
          # whose id equals the target is expected
          if len(sel_rows) != 1:
            print 'No rows for %d' % target
            assert False
          if sel_rows[0][0] != target:
            print 'id is %s and should be %s' % (sel_rows[0][0], target)
            assert False

          gets += 1

    Source code to set up the MongoDB collection:


    def setup_mongo(host, port, dbname, rows):
      conn = pymongo.Connection(host, port)
      conn.drop_database(dbname)
      db = conn[dbname]

      for x in xrange(0, rows):
        sx = str(x)
        lsx = len(sx)
        db.c.save({'_id':x, 'sid':x, 'k':x, 'c':sx+'x'*(120 - lsx), 'pad':sx+'y'*(60 - lsx)})
        if x % 1000 == 0:
          print '... row %d' % x

      db.c.create_index('sid')

    Source code to query the MongoDB collection:

    def query_mongo(host, port, pipe_to_parent, requests_per, dbname, rows, check, testname, worst_n, id):
      conn = pymongo.Connection(host, port)
      db = conn[dbname]

      gets = 0
      stats = SummaryStats(worst_n)

      while True:
        for loop in xrange(0, requests_per):

          target = random.randrange(0, rows)

          s = time.time()
          o = db.c.find_one({'sid': target})
          stats.update(s)

          assert o['_id'] == target
          if check:
            assert o['k'] == target
            sx = str(o['_id'])
            lsx = len(sx)
            assert o['c'] == sx+'x'*(120-lsx)
            assert o['pad'] == sx+'y'*(60-lsx)

          gets += 1


  10. This is yet another silly benchmark because the results are likely to be misused. The results probably do not matter to you. I like MongoDB. It has many things in common with MySQL (ease of use, rapid iteration, give customers what they want). It would be more interesting to me were it to use embedded InnoDB as the backend.

    Please read the update as I was able to get much better throughput from MongoDB once I realized that the client host was saturated on the CPU.

    I want to evaluate the performance claims. Several of the MongoDB versus the world benchmarks were run on Mac laptops. I love my Mac laptop, but I don't use it to confirm database server throughput.

    My first test uses concurrent clients to determine the load at which the server saturates. I run this test in two modes: cached and not cached. No disk IO should be done during the cached test. Most requests should require disk IO during the uncached test. I am not sure that I will run the uncached test for MongoDB: it memory-maps the database files and there is no way to limit the amount of memory it will use, so the uncached test requires a database much larger than RAM. That will take a long time to set up and I don't know whether many MongoDB deployments run with databases much larger than RAM.

    The test is interesting to me because it has found problems that limit throughput on workloads that I care about. Other tests should be done after understanding the results from this test. I suspect that for many people peak QPS on highly-concurrent workloads is not a priority.

    Each client does a sequence of calls (SELECT or HANDLER for MySQL, find_one for MongoDB) where each call fetches one row by specifying the value of an indexed column. The column has a primary key index for MySQL; for MongoDB the collection's create_index method was used. The id value to fetch is randomly selected.
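
    As a sketch of the two MySQL access paths, assuming a bm table with an integer primary key id like the one used elsewhere in this series; the connection parameters are placeholders.

    import MySQLdb

    conn = MySQLdb.connect(host='dbhost', user='root', db='test')
    cursor = conn.cursor()

    # SELECT path: fetch one row by primary key through the full SQL layer
    cursor.execute('select id, k, c, pad from bm where id = %d' % 42)
    print cursor.fetchall()

    # HANDLER path: open the table once, then read by index; this skips most
    # of the per-statement parse and optimize work
    cursor.execute('handler bm open')
    cursor.execute('handler bm read `PRIMARY` = (%d)' % 42)
    print cursor.fetchall()
    cursor.execute('handler bm close')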

    The code for the benchmark is a work in progress, so I have yet to publish it. I used Python, the MySQLdb driver for MySQL and the pymongo driver for MongoDB. The clients are separate processes forked via the Python multiprocessing module to avoid the penalty from the Python GIL. Some of the code used for MongoDB is listed at the end of this post.
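
    The harness is not published, but a minimal sketch of the parent side might look like the following, reusing the query_process function listed at the end of this post; run_seconds and the exact argument wiring are assumptions.

    import multiprocessing, time

    def run_clients(num_clients, run_seconds, host, port, requests_per,
                    dbname, rows, check):
      # one process per client to sidestep the Python GIL; each worker gets
      # its own pipe so the parent can signal the end of the run and collect
      # the per-worker counters via query_process's poll/recv/send protocol
      workers = []
      for i in xrange(num_clients):
        parent_end, child_end = multiprocessing.Pipe()
        p = multiprocessing.Process(
            target=query_process,
            args=(host, port, child_end, requests_per, dbname, rows, check, i))
        p.start()
        workers.append((p, parent_end))

      time.sleep(run_seconds)

      total_gets = 0
      for p, parent_end in workers:
        parent_end.send('stop')                  # query_process polls for this
        total_gets += parent_end.recv()['gets']  # and replies with its counter
        p.join()
      return total_gets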

    I only report peak QPS at this time. The results ignore average and worst-case response time. It is not good to ignore them, which is another reason why this might be a silly benchmark. Eventually the code will be updated to measure that.

    The clients and DBMS (mysqld, mongod) ran on separate 16-core servers. The test was run for 1, 2, 4, 8, 16, 24, ..., 152, 160 concurrent clients. Tests were run for MyISAM and InnoDB for MySQL. The tests used MySQL 5.1.50 unmodified (5150orig) and 5.1.50 with the Facebook patch (5150fb). There are a few changes in the Facebook patch that make a huge difference for peak QPS. I expect official MySQL to have these changes soon. Peak QPS was much higher for MySQL than for MongoDB. In order of peak QPS:
    1. Peak QPS on MySQL 5.1.50 with the Facebook patch exceeds 100,000 and QPS increases to 128 concurrent clients.
    2. Peak QPS on MySQL 5.1.50 unmodified is between 70,000 and 80,000 and QPS increases to 80 concurrent clients.
    3. Peak QPS on MongoDB 1.7.0 is 40,000 and QPS increases to 64 concurrent clients.
    MongoDB appears to saturate on mutex contention on the server at 64 concurrent clients. Alas, the binary is stripped and I cannot run PMP to determine where the problem occurs. My diagnosis is based on vmstat output.

    This displays throughput for MySQL 5.1.50 with the Facebook patch, MySQL 5.1.50 unmodified and MongoDB 1.7.0. All ran on a 16-core Nehalem server -- 16 cores with hyperthreading enabled.

    This displays average and 98th percentile response time for MySQL 5.1.50 with the Facebook patch and MongoDB 1.7.0. Both ran on a 16-core Nehalem server -- 16 cores with hyperthreading enabled. Results are from the same test as used for the previous graph.
    I repeated the tests using an 8-core Nehalem server. It is the same as the previous server except that hyperthreading was disabled. The graphs below display throughput. MySQL is able to handle more concurrent clients before saturating on the server.

    This is the response time graph for the 8-core server.
    The servers for the tests above were 1ms apart according to ping. I repeated the test for servers that have a 200us ping time but I don't want to inline another graph. It is linked in the next sentence. The graph of QPS for get by primary key shows an interesting and unexplained spike in QPS for MongoDB near 152 concurrent clients. Despite having a similar QPS at that point, MongoDB average response time was about 1.5X that for MySQL. This includes results for MongoDB versions 1.6.2 and 1.7.0.

    The code to set up the collection for MongoDB is:
    def setup_db(host, port, dbname, rows):
      conn = pymongo.Connection(host, port)
      conn.drop_database(dbname)
      db = conn[dbname]  
      for x in xrange(0, rows):
        sx = str(x)
        lsx = len(sx)
        db.c.save({'_id':x, 'k':x, 'c':sx+'x'*(120 - lsx), 'pad':sx+'y'*(120 - lsx)})
        if x % 1000 == 0:
          print '... row %d' % x
    And the code to query MongoDB:
    def query_process(host, port, pipe_to_parent, requests_per, dbname, rows, check, id):
      conn = pymongo.Connection(host, port)
      db = conn[dbname]
      gets = 0
      while True:
        for loop in xrange(0, requests_per):
          # fetch one document at random by primary key (_id)
          target = random.randrange(0, rows)
          o = db.c.find_one({'_id': target})
          assert o['_id'] == target
          if check:
            # optionally verify the padded column values
            assert o['k'] == target
            sx = str(o['_id'])
            lsx = len(sx)
            assert o['c'] == sx+'x'*(120-lsx)
            assert o['pad'] == sx+'y'*(120-lsx)

          gets += 1

        # after each batch, check whether the parent asked for results;
        # if so, report the counter and exit
        if pipe_to_parent.poll():
          msg = pipe_to_parent.recv()
          print 'Received: %s' % msg
          pipe_to_parent.send({'gets' : gets})
          pipe_to_parent.close()
          return
    The my.cnf settings for MySQL 5.1:
    plugin-load=innodb=ha_innodb_plugin.so;innodb_trx=ha_innodb_plugin.so;innodb_locks=ha_innodb_plugin.so;innodb_lock_waits=ha_innodb_plugin.so;innodb_cmp=ha_innodb_plugin.so;innodb_cmp_reset=ha_innodb_plugin.so;innodb_cmpmem=ha_innodb_plugin.so;innodb_cmpmem_reset=ha_innodb_plugin.so
    innodb_buffer_pool_size=2000M
    innodb_log_file_size=100M
    innodb_flush_log_at_trx_commit=2
    innodb_doublewrite=1
    innodb_flush_method=O_DIRECT
    innodb_thread_concurrency=0

    innodb_file_format=barracuda
    innodb_file_per_table

    max_connections=2000
    table_cache=2000
    key_buffer_size=2000M
