Tuesday, September 1, 2009

Blame it on MyISAM

I reviewed most of the changes from the v4 Google patch today. My head hurts now. During this review I checked whether bugs fixed in the patch have also been fixed in recent releases of official MySQL. I am happy that most of them have been fixed. But some changes will never be accepted, such as the one that added support for INF for FLOAT/DOUBLE columns.

The default value of sql_mode is the empty string. You probably want to change that before your applications come to depend on it. When it is the empty string, invalid values are coerced to valid values on INSERT and UPDATE and a warning is returned. Applications usually ignore the warnings. The coercion includes:
  • INT values that are too big are set to the maximum value of an INT. The same is done for BIGINT
  • INF is changed to MAX_DOUBLE or MAX_FLOAT for a DOUBLE/FLOAT column
  • varchar and LOB columns are truncated to not exceed the maximum length
  • invalid DATE values are accepted
What needs this behavior? MyISAM. For a storage engine that doesn't do rollback, one way to handle invalid data during an INSERT or UPDATE statement is to coerce it to valid values and proceed with the statement. I am not fond of this approach. An alternative for data warehouse workloads is to use an exception table to log rows with invalid data and avoid corrupting non-exception tables.

MyISAM has also made replication semantics and internals much more complex. For example, what is written to the binlog in this case, and has this behavior changed between releases?
begin;
insert into Innodb_table values (1);
insert into Myisam_table values (1);
rollback;
I think that MyISAM has its place. It does fast table scans, but InnoDB is much faster on just about everything else. I am just not thrilled with the impact it has had on MySQL. It can be used for tasks where a table or partition is loaded once and then made readonly after the insert. This is a good fit for data warehouse tasks. Although it would be better were multi-core performance improved and the key cache expanded to include data blocks. MyISAM can also be used for scratch tables on a slave.

Drizzle avoided these problems by limiting MyISAM to temporary tables.

11 comments:

  1. Innodb uses far more diskspace than it need, impacting IO bound loads. Not to mention idioticly slow count(*).

    ReplyDelete
  2. @Anon - disk space is cheap, disk IOPs have not been (until recently). If you have read my blog then you know my opinion on fast select count(*) for MyISAM -- http://mysqlha.blogspot.com/2009/08/fast-count-for-innodb.html. It is only fast for queries that are of little use, and there are ways to make it exact and fast in InnoDB (kill concurrency as MyISAM has done) or to make it inexact and fast. Besides, I see lots of application behavior that I want to get fixed, but I have yet to see an app that depends on fast select count(*).

    ReplyDelete
  3. Take disk IOPs on Amazon EC2, for example. Smaller data means better throughput. Totally different if you are not IO bound.

    I also would have a hard time without ALTER TABLE ORDER BY to reorganize the data on the disk according to a key that may not be the primary key.

    How is InnoDB on count(*) from table where=prefix of some secondary key?

    ReplyDelete
  4. All IO bounded-ness is not created equal. If you are IO bound on a table or index scan than compressing a table to half of its original size makes a big difference. If you are IO bound on random IO, then the benefit is much less.

    MyISAM is very fast for single-table queries. Add concurrency or join to a table without using an index and performance falls off of a cliff. Use a query that must go to the base table to fetch a few columns and performance drops off a cliff. You can avoid these performance problems by duplicating all columns from your base table in indexes, but at that point MyISAM doesn't use much less space.

    ReplyDelete
  5. i think your blog-post only scratches on the surface wrt to the problems myisam has caused for the mysql-server

    ReplyDelete
  6. For starters, there is the use of MyISAM for dictionary tables, so that updates to them are not transactional.

    ReplyDelete
  7. Is there real _multicore_ support in Mysql on Linux/RedHat? The whole day today i tried to answer this question. We are having a 16 core (4CPUxQuad core) system that seems to idle most of the time except one core. mysqld process is 100% and the "top" sais is 93% idling. MySQL docs say it is multicore, I wonder ... Maybe on Windows...
    Bogdan

    ReplyDelete
  8. I don't wonder. It is there, but 16 cores is not the ideal server for MySQL.

    You could benefit a lot from a few hours of good MySQL consulting. Hire someone from Pythian, Percona, MySQL, ProvenScaling, Ronald Bradford, ...

    ReplyDelete
  9. Where I work at GenieDB we've written a replicating storage engine, and we went for "fast an inexact" COUNT(*)s on tables; since there's some asynchrony in when a row is actually written to any one disk in a replicated system, or COUNT(*) may not reflect very recent writes. When we do real physical writes that add or remove rows, we update that table's row count (on that node). But writes may physically happen on any particular node in some unusual orders.

    This isn't a problem in practice, as applications really shouldn't be relying on COUNT(*). We're not doing transactions, so an application that does a COUNT(*) on a table would be getting outdated information anyway; other processes might add or remove rows even while the COUNT(*) query is returning results, so it's out of date already. We just take advantage of that to add a little more scope for out-of-datedness, in order to have very fast COUNT(*) so that administrators can quickly tell how large their tables are - for a table that's not currently changing it'll be perfectly accurate, anyway; the inexactness is purely a matter of time lag.

    This is a pragmatic decision we've made... it'll probably not be what *everybody* wants, but we figure this is the best approach on average :-)

    ReplyDelete
  10. That sounds good and you reinforce an important point. Even when count(*) is exact, it is exact as of a point in time in the past and might not be exact right now on a table that gets a lot of inserts and deletes.

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.