If you must, the following will make SELECT COUNT(*) FROM FOO fast for InnoDB:
- Install INSERT and DELETE triggers to maintain the row count for the InnoDB table in a metadata table. This will kill concurrency on the table, just like MyISAM.
- Modify the parser to accept SELECT ESTIMATED_COUNT(*) and evaluate this internally to use the row count estimate provided by the storage engine. Is the estimate good enough? If an exact answer is needed and the table is not locked after the exact value is computed, then the exact value will quickly become incorrect.


A decent trick for helping to maintain concurrency is to have several rows instead of just a single one in the metadata table. Then when you insert/delete you could use either a random row or a mod of the auto_increment to pick which row. To get the count(*) you would then sum them up.
ReplyDeleteThe best number of rows will depend upon how many insert/deletes you have going on.
I agree with you that this isn't normally all that interesting of a query though.
Amira (a storage engine project I worked on a few years ago) maintains a per-transaction, per-table count of how many rows were inserted and how many were deleted. It stored as part of the table metadata the exact count of how many rows are in the table. When a transaction is committed, this value would be updated. By doing that, it was always able to return instantly the precise number of rows in each table.
ReplyDeleteAs the row count deltas commute after insert and delete, you could do this without locking the count field. But that is a lot of work to make 'select count(*) from foo' fast. Applying that technology elsewhere would be interesting.
ReplyDeleteMark,
ReplyDeleteMight be better to do something like:
SET SQL_INNODB_ESTIMATED_COUNT=true;
SELECT COUNT(*) FROM FOO;
It has the downside that you need to reset this on the connection afterwards but it doesn't require a parser modification.
I have a list of things I need from InnoDB and this is certainly on there....
It's really amazing that these things have gone unfixed for so long.
Thank god for Percona and Google for bringing new blood to the table :)
Now that Oracle has bought Sun, and therefore MySQL, there may be more than just the Oracle engineers working on InnoDB (it seems like all the important fixes are still done by Heikki and other Oracle engineers). Hopefully MySQL will be able to give new blood to the InnoDB table.
ReplyDeleteAlso, I think you mean having an actual filter that uses an index, as opposed to "a WHERE clause" -- if all you needed was a WHERE clause, you can just say
SELECT COUNT(*) FROM foo WHERE 1=1;
But I don't believe that's the case.
Of course, it all depends on how fast you need it too, and what you need exactly.
ReplyDeleteAFAICS, the trick described by Mark (triggers, and a custom patch to perform essentially a query rewrite) seems like something you'd do in case you really need instant performance. If you are using MySQL for analytic queries, you could consider using Mondrian (http://mondrian.pentaho.org/) in conjunction with the aggregate designer (http://sourceforge.net/projects/mondrian/files/aggregation%20designer/). This generates aggregate tables (not just count, any aggregate) at multiple levels, and through Mondrian, queries are re-written to take advantage of them (of course, you need to schedule loading of these agg. tables)
Another solution to consider is flexvies, by Justin Swanhart (http://sourceforge.net/projects/flexviews/)
regards,
Roland
Obviously count(*) without a where is bad practice in application code. but for administrators it can be useful as a simple way to verify that tables are roughly consistent -- at least as much as you can do without requiring an expensive and time consuming checksum.
ReplyDeleteMost other RDBMs systems can do point in time (MVCC) count(*) relatively fast. I dont see why innodb couldnt as well.
There seems to be a general lack of understanding of the problem in these comments.
ReplyDelete"Most other RDBMs systems can do point in time (MVCC) count(*) relatively fast."
No, they can't. The problem is that with MVCC, count means different things to different people, because any two transactions can return *different* numbers for count(*) and *both* be correct. Welcome to the magic of concurrent transactional engines.
Now, Oracle is somewhat faster than say Postgres on this, because they do an index scan (when available) to retrieve the count rather than a sequential scan of the table; they can do this because Oracle has some level of row visibility in the indexes, while Postgres doesn't. Since Innodbs MVCC implementation is more akin to Oracle's, they should be able to do this as well (and maybe they already do), but you will never get a count as fast as myisam.
Also a note to Sheeri; correct, you must use a where clause that is indexed; this allows index scans to be used for the counting, which is almost always a win. Postgres implements this even without visibility, since it will often be faster to grab the rows from the index and then verify visibility with the heap.
BTW, Harrison's trick is one I've implemented before (or similar), and it's a pretty good way to get around the problem. I normally just have every insert/delete mark a +1 or -1, then use sums to get count, with periodic rollup to keep things small. Not ideal, but works.
I was hoping for a PostgreSQL response.
ReplyDeleteI don't care much for the MyISAM version of a fast count(*) query. I want the Oracle version that handles complex boolean clauses with bitmap indexes.
I expect that InnoDB chooses the smallest index to scan. InnoDB can determine row visibility from the secondary index scan. The count is done in the SQL layer above InnoDB (or any other storage engine). Guessing again, I think that empty rows (no columns) are returned by the storage engine for this query. But someone who works on the layer above the storage engines for MySQL should correct or confirm that.
Why does "Explain select count(1) from tableX" return the rowCount immediately, but running the actual query takes forever and a day?
ReplyDeleteExplain uses an estimate of the number of rows in the table.
ReplyDeleteIf you need to count number of rows but up to a certain limit (to improve performance) you can use the limited count solution. It worked fine for me. I needed it for showing list with a roller of max 10 pages.
ReplyDeleteRead more at http://www.mysqldiary.com/limited-select-count/
Instead of:
ReplyDeleteSELECT count(*) FROM table_X
Could a new indexed table field of lets say bit(1) be a solution?
SELECT count(bit) FROM table_X
or
SELECT count(*) FROM table_X WHERE bit = 1
Yes I know this will use up more memory, but I would rather have that then using some triggers updating fields in another table.