Sunday, March 28, 2010

Fast reads or fast scans?

MyISAM is frequently described and marketed as providing fast reads when it really provides fast index and table scans. This is a more narrow use case as fast reads implies great performance for most queries while fast scans implies great performance for single-table queries that are index only or do a full table scan.

MyISAM caches index blocks but not data blocks. There can be a lot of overhead from re-reading data blocks from the OS buffer cache assuming mmap is not used. InnoDB and PBXT are 20X faster than MyISAM for some of my tests. However, I suspect that mutex contention on the key cache is also a factor in the performance differences.

While there are many claims about the great performance of MyISAM. There are not as many examples that explain when it is fast. Alas, the same marketing technique is being repeated with NoSQL to the disadvantage of MySQL.
Tests were run on a server that reports 16 CPU cores. The full test configuration is described elsewhere. For this test I modified the sysbench oltp test to do a self-join query. I will publish the code soon. The schema for the test is:
CREATE TABLE sbtest (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  k int(10) unsigned NOT NULL DEFAULT '0',
  c char(120) NOT NULL DEFAULT '',
  pad char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k (k)
) ENGINE=InnoDB;
The self-join query uses a range predicate that selects a fixed number (1, 10, 100, 1000 or 10000) of rows. This is an example that selects 1000 rows.
SELECT t1.c, t2.c FROM sbtest t1, sbtest t2
WHERE t1.id between 245793 and 246792 and t2.id = 2000000 - t1.id
Tests were run using MySQL 5.1.45 for MyISAM, InnoDB plugin 1.0.6 and PBXT 1.1. Results are in queries per second for 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 concurrent clients. I do not report results for 512 and 1024 clients to avoid long lines in this post.

The performance of MyISAM is much worse compared to InnoDB and PBXT as the number of rows selected grows from 1 to 10,000.

Queries per second when the between predicate selects 1 row:
  6843  13157  24552  46822  62588  57023  46568  30582  18745 innodb
  6164  13627  25671  48705  63741  59217  48300  30964  18866 pbxt
  6354  12061  23373  44284  50778  49546  44412  30444  18827 myisam

Queries per second when the between predicate selects 10 rows:
  4240   8466  16387  33221  53902  39599  36214  28026  18084 innodb
  4802   8835  17688  35917  57461  47691  41578  29087  18558 pbxt
  3890   7129  12512  16450  12272  12304  12441  12448  11304 myisam

Queries per second when the between predicate selects 100 rows:
  1842   3455   7249  14842  20206  13875  13471  12942  12344 innodb
  2113   3522   7893  13411  18597  18905  18694  18123  12301 pbxt
  1608   2260   2263   1899   1371   1399   1451   1468   1442 myisam

Queries per second when the between predicate selects 1000 rows:
   380    654   1222   2023   2487   1866   1791   1794   1942 innodb
   303    641   1149   1699   2044   2069   2072   2063   2056 pbxt
   232    248    227    189    141    143    149    148    148 myisam

Queries per second when the between predicate selects 10000 rows:
    43     70    130    213    254    199    194    196    199 innodb
    49     69    123    182    213    216    216    216    216 pbxt
    24     24     23     19     14     14     15     15     15 myisam

MyISAM is at a disadvantage because it does not cache data blocks, so I changed the query to be index only and it is listed below. This did not make MyISAM faster. I think the bottleneck is contention on the key cache mutex.
SELECT t1.id, t2.id FROM sbtest t1, sbtest t2
WHERE t1.id between 245793 and 246792 and t2.id = 2000000 - t1.id
Queries per second for range 1000 using the index only query:
   457    706   1354   2146   2596   2044   1918   1887   1953 innodb
   576    837   1386   1681   2058   2094   2103   2095   2087 pbxt
   353    244    223    190    140    142    147    146    146 myisam

Results for MySQL 5.0.84 are similar to 5.1.45 for the range 1000 query:
   390    642   1241   2045   2547   1891   1825   1813   1930 innodb
   303    239    225    189    140    141    147    146    146 myisam

The query plan for the basic query:
explain  SELECT t1.c, t2.c
from sbtest t1, sbtest t2
where t1.id between 245793 and 246792 and t2.id = 2000000 - t1.id

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 1072
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: func
         rows: 1
        Extra: Using where; Using index
2 rows in set (0.01 sec)

The query plan for the index only join:
explain  SELECT t1.id, t2.id
from sbtest t1, sbtest t2
where t1.id between 1916457 and 1917456 and t2.id = 2000000 - t1.id
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 978
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: func
         rows: 1
        Extra: Using where; Using index
2 rows in set (0.00 sec)

6 comments:

  1. Thanks for your sharing this result. I'm excited at PBXT's score rather than MyISAM's.

    ReplyDelete
  2. MyISAM is faster for OLAP-kind applications that use queries against star-schema. In this case its efficiency in index and range scans wins.

    Sometimes, if you want to cache the data as well, you may use covering index, i.e. include all selected columns in the index itself. If columns are small (ints) it is not expensive.

    ReplyDelete
  3. @alz - what does a star-schema have to do with it? To me star-schema means joins between fact and dimension tables. From my results here, MyISAM performance is not good on joins.

    If MyISAM does good on single table and index scans used to populate the caches used by an OLAP server that runs on top of MySQL, then we have made the same claim -- MyISAM == fast scans.

    ReplyDelete
  4. Could you try the MyISAM tests with the patch available at http://peter.stardoll.com/?

    It should make the key_cache scale better with the number of threads connected. Also it would be interesting to see how it compares when doing scans over a larger portion of the index.

    Which malloc did you use?

    ReplyDelete
  5. @Peter - I am not sure if I will get to your patch. I might hack up a test to simulate the benefit by using multiple MyISAM tables with a key cache per table.

    I used malloc from glibc 2.5

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.