Tuesday, November 23, 2010

How are index-only scans implemented in InnoDB?

There have been interesting discussions in the PostgreSQL community about adding support for index only scans. On several occasions people were curious about how InnoDB supports this. A recent post by the InnoDB team is an excellent overview. A brief summary of that post and other material is:

  • records in the clustered (primary) index store hidden columns (DB_TRX_ID, DB_ROLL_PTR)
  • records in the non-clustered (secondary) index do not store hidden columns
  • records in clustered and non-clustered indexes have a delete-mark flag
  • records are not updated in the secondary index, they are delete-marked on delete, inserted on insert and delete-marked/inserted on update
  • delete-marked records are removed from indexes by the purge thread when it is safe to do so
When a secondary index page is read, if the max transaction ID on the page is less than the max transaction ID for which all transactions are visible to the reading transaction (low-water mark, up_limit_id), then the page can be used as is and the page read is index-only. If this condition is not true, then for any entry read from this page the record is read from the clustered index page to determine whether the index entry is visibile. In that case the secondary index read is not index only. Index only matters because when things are not index only there can be an additional random disk read to the clustered index for each entry read from the secondary index.

The max transaction ID for which all transactions are visible to the reading transaction is described as the low-water mark and assigned to the up_limit_field in the read view (read_view_struct). This is the max transaction ID for which there are no unresolved transactions when the reading transaction starts. If there is a long-open transaction when the reading transaction starts, then up_limit_id will be less than the transaction ID of the long-open transaction.

I began to read the code for this today as I want to add a counter for the number of secondary index page reads that are and are not index only. If you want to read the code too the function lock_sec_rec_cons_read_sees determines whether all entries on a secondary index page are definitely visible to a transaction (read view).

If you are interested in this topic, I recommend these books:

5 comments:

  1. "When a secondary index page is read, if the max transaction ID on the page is less than the max transaction ID visible to the reading transaction, then the page can be used as is and the page read is index-only."

    Should this say "...is less than the min transaction ID not visible to the reading transaction"? If not, can you explain further?

    ReplyDelete
  2. I think my explanation is confusing. I will update the post. See the definition of "up_limit_id" below.

    ---

    The code is:
    max_trx_id = page_get_max_trx_id(page_align(rec));
    return(ut_dulint_cmp(max_trx_id, view->up_limit_id) < 0);

    ut_dulint_cmp(a,b) returns -1 if a < b, 0 if equal, +1 if a > b

    The function comment is -- returns TRUE if certainly sees, or FALSE if an earlier version of the clustered index record might be needed

    And from read_view_struct:
    trx_id_t low_limit_id;
    /*!< The read should not see any transaction
    with trx id >= this value. In other words,
    this is the "high water mark". */

    trx_id_t up_limit_id;
    /*!< The read should see all trx ids which
    are strictly smaller (<) than this value.
    In other words,
    this is the "low water mark". */

    ReplyDelete
  3. Nice write up :)

    "records are not updated in the secondary index, they are delete-marked on delete, inserted on insert and delete-marked/inserted on update"

    I was not aware of this fact, but it's surely expected due to its structure. A secondary index on InnoDB is just like a composite index which ends with PRIMARY KEY value. So, any updates/deletes on PRIMARY KEY means replacement of the entry on the secondary index; the key value has changed. In that case, the record on clustered index must be sorted as well. OTOH, any updates on non-PK part within a clustered index doesn't cause any updates on secondary indexes.

    Cheers!

    ReplyDelete
  4. Heikki made some interesting and good design decisions here that we are just starting to figure out. InnoDB saves space by not having hidden columns per entry in the non-clustered index. I suspect that this doesn't generate many disk reads for the row from the clustered index but I have added counters to my patch for mysql to get more data.

    ReplyDelete
  5. I guess this is also high affected by transaction isolation level.
    MYSQL default REPEATABLE READ is too strict IMO. It should follow Oracle use READ-COMMITTED as default and let people raise to REPEATABLE READ if application logic require it.

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.