LevelDB is here. It might make an interesting storage engine for MySQL as it has many performance benefits at the cost of a few limitations: no multi-statement transactions and MyISAM-like concurrency. I doubt we will ever get a production quality LevelDB storage engine for MySQL because the storage engine API is hard, so hard that such projects require funding. This is unfortunate.

LevelDB might be a great fit for MongoDB. MongoDB doesn't need multi-statement transactions. Both are limited to 1-writer or N-readers concurrency, but writes to database files are much faster with LevelDB because it doesn't do update-in-place. So LevelDB doesn't lose as much performance for IO-bound workloads from 1-writer or N-readers concurrency, and my guess is that this could make MongoDB much better at supporting IO-bound workloads.
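To make the update-in-place contrast concrete, here is a toy sketch of an LSM write path (my own illustration, not LevelDB's actual code): writes land in an in-memory memtable and are periodically flushed as immutable sorted runs, so every file write is sequential. The class and parameter names are invented for the example; real LevelDB also writes a log before the memtable and merges runs via compaction, both omitted here.

```python
class ToyLSM:
    """Minimal LSM sketch: mutable memtable + immutable sorted runs.
    Nothing is ever updated in place on 'disk' (the runs list)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # mutable, in memory
        self.runs = []       # immutable sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: one sequential write of a sorted, immutable run
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:          # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None
```

Because a flush is one sequential write rather than scattered page updates, write-heavy IO-bound workloads pay far less in seeks, which is the performance benefit described above.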

  1. There are a few interesting design choices to make.

    Though it is easy to have a shared block cache, per-table write buffers are a different topic, so either tables have to coexist in a single read-write lock environment, or there are multiple transaction environments, one per table, each having its own logs, compaction threads, buffers, etc.

    One could pick something in the middle and have per-database environments.

    Of course, a global environment could work too for various workloads, and would have the most transactional consistency.

  2. How does this compare to Aria? It seems (after a quick glance) fairly similar.

  3. Aria is B-Tree, LevelDB is LSM-Tree, very different!

  4. MySQL storage engine API isn't hard. It's just arcane. It should be possible to wrap this quite quickly.

  5. Domas: And that's the only difference?

  6. Antony - you might be able to quickly produce a proof-of-concept that kind of works. That doesn't help much. If it were easy we would have more production-quality storage engines, but we don't. PBXT had great potential and a talented developer. Who runs that in production? No engines have emerged from the community. A few have emerged from funded startups, but that process has not been easy.

    I have read handler.h and I have maintained a read-only engine that was widely used in production. The API is difficult, messy, and poorly documented or not documented at all. The protocols are not documented for operations that occur across multiple calls. I have read a few times on internals@ that people should consult the MyISAM source to figure out how things are supposed to work.

  7. LevelDB has many concurrency issues that actually place it in the same league as Redis. It works nicely in a single thread. If someone wants to get the maximum from the available hardware, one needs to shard the DB and run several processes/threads in parallel, each with its own DB shard. That is not an easy task.
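    The sharding described here can be sketched as hash routing over independent stores. This is an illustration only: the dict-backed shards below stand in for separate LevelDB instances (each with its own writer), and the class and parameter names are invented for the example.

    ```python
    import hashlib

    class ShardedStore:
        """Route each key to one of N independent stores. The dicts
        stand in for separate per-process LevelDB instances, so writes
        to different shards never contend for the same writer lock."""

        def __init__(self, num_shards=4):
            self.shards = [{} for _ in range(num_shards)]

        def _shard_for(self, key):
            # stable hash so a key always maps to the same shard
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return self.shards[h % len(self.shards)]

        def put(self, key, value):
            self._shard_for(key)[key] = value

        def get(self, key):
            return self._shard_for(key).get(key)
    ```

    The hard parts the commenter alludes to - rebalancing shards, cross-shard reads, operating N processes - are exactly what this sketch leaves out.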

    Another LevelDB feature: a vast number of storage files. From an I/O performance point of view, fewer files means better performance. Unfortunately, there is a trade-off between the number of SST files and query latency variation: the larger a single storage file is, the more time it takes to compact, and LevelDB availability will be limited during that time due to its quite primitive internal concurrency control.

    This is probably a good fit for single-user storage, but currently we do not have any evidence that it can scale nicely beyond a couple dozen GB of data.
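    The compaction trade-off above comes from merge cost: compaction is a linear merge of sorted runs, so the bigger the input files, the longer the store is busy merging. A minimal sketch (illustrative names and record layout, not LevelDB's code):

    ```python
    import heapq

    def compact(runs):
        """Merge sorted (key, seq, value) runs into one sorted run,
        keeping only the newest version (highest seq) of each key.
        Cost is linear in total input size, which is why compacting
        larger files keeps the store busy for longer."""
        merged = []
        # order by key, then newest version first within a key
        for key, seq, value in heapq.merge(*runs, key=lambda r: (r[0], -r[1])):
            if not merged or merged[-1][0] != key:
                merged.append((key, seq, value))
        return merged
    ```

    Each input run must already be sorted the same way, which holds for SST-style immutable sorted files.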

  8. Another note:

    To the best of my knowledge, although LevelDB's design was inspired by the BigTable tablet design, it is a totally different product and is not used in the current BigTable.

  9. This comment has been removed by the author.

  10. The results here provide more details on the behavior that you describe:
    https://github.com/m1ch1/mapkeeper/wiki/MySQL-vs.-LevelDB

  11. I would like to see a LevelDB plugin as well but when I think about trying to use it in production my main concern is trying to allocate memory between InnoDB and LevelDB. But however I decide to split it will be wrong more often than not for dynamic workloads. MySQL needs a shared memory pool for storage engines. Until then, I would rather see an LSM based index added to InnoDB.

  12. I have tried TokuDB and it works great! I will check out LevelDB and then comment here. In the meantime, do you have anything to say about TokuDB?

  13. The development progress of the LevelDB storage engine can be tracked here: https://code.launchpad.net/~maria-captains/maria/mysql-5.6-leveldb and https://mariadb.atlassian.net/browse/MDEV-3841

    What do you think about its progress? There are many won't-fix issues.

Many startups depend on MySQL. Check out the list of keynote speakers from past and present MySQL User Conferences. There are things to make better and new features to add like better support for dynamic schemas (documents) and auto-sharding.
My kids watched the new Lego movie today and spent the rest of the day repeating "Everything is awesome". I spent a few hours reading MongoDB documentation to help a friend who uses it. Everything wasn't awesome for all of us. I try to be a pessimist when reading database documentation.
Pardon the rant but the winter storm has kept me in a hotel away from home for a few days. I exchanged email this week with someone pitching a solution to a problem I don't have (MySQL failover). But by "I" I really mean the awesome operations team with whom I share an office.
What is a modern database? We have some terms that wander between marketing and technical descriptions - NewSQL, NoSQL. We have much needed work on write-optimized database algorithms - Tokutek, LevelDB, RocksDB, HBase, Cassandra. We also get reports of amazing performance.
The UC schedule has been published and there are several talks from the database teams at Facebook.

Small Data and MySQL by Domas Mituzas - small data is another name for OLTP. Given the popularity of big data, we think that small data also deserves attention.
Big downtime gets a lot of attention in the MySQL world. There will be some downtime when you replace a failed master. With GTID in MariaDB and MySQL that time will soon be much smaller. There might be lost transactions if you use asynchronous replication.
I looked at the release notes for 5.6.14 and then my bzr tree that has been upgraded to 5.6.14. I was able to find changes in bzr based on bug numbers. However for the 5 changes I checked I did not see any regression tests.
Google search results for mariadb trademark are interesting. I forgot so much had been written about this in the past. Did the trademark policy ever get resolved? This discussion started in 2010.
There aren't many new files under mysql-test for 5.6.14. Is this compression or something else? Many bugs were fixed per the release notes.
I have been wondering what the Foundation has been up to. I had high hopes for it and even contributed money but it has been very quiet. Fortunately I learned that it has been busy making decisions, maybe not in public, but somewhere.
These are results for sysbench with a cached database and concurrent workload. All data is in the InnoDB buffer pool and I used three workloads (select only, handler only, update only) as described here.
I used linkbench to compare MySQL/InnoDB 5.1, 5.6 and 5.7. After a few improvements to linkbench and to the InnoDB my.cnf variables I was able to get much better QPS than before (about 1.5X better). I was ready to try 5.7 because it reduces contention from the per-index latch.
Several members of the small data team at FB will be at MySQL Connect this weekend. It would be interesting to learn that someone else has used Linkbench. I use it in addition to sysbench.
My co-workers will speak about big and small data at XLDB. Jeremy Cole and the Tokutek founders are also speaking.
The Facebook MySQL teams are presenting at MySQL Connect:

Lots and lots of small data by Harrison Fisk
MySQL 5.6 at Facebook by Yoshinori Matsunobu
Panel session on MySQL 5.6 with Mark Callaghan

A lot of work has been done by the teams this year and even more is planned for the future.
I used a simple workload with sysbench to determine the rate at which InnoDB can read blocks from disk. The workload is read-only and each query fetches 1 row by PK. The workload was IO-bound with a 2G InnoDB buffer pool and 32G database.
For some workloads the InnoDB performance model is very simple. It must flush dirty pages as fast as storage will allow. With a few more bug fixes InnoDB in MySQL 5.6 should be able to do that.
Percona published Linkbench results. It looks like they set innodb_io_capacity to a large value but did not increase innodb_lru_scan_depth. I think that is a mistake. This is a mistake that I have made and that I hope you can avoid.
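A sketch of the pairing suggested here; the values are illustrative assumptions, not settings from the post or from Percona's run:

```ini
# my.cnf sketch (illustrative values): raise the LRU scan depth
# along with io_capacity so the page cleaner can use the extra IOPS.
[mysqld]
innodb_io_capacity    = 10000
innodb_lru_scan_depth = 4000
```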
MySQL 5.6.11 is here with many useful bug fixes. Not so good news - you won't be able to read about those bugs beyond the brief text in the release notes as many of the bug reports are behind the support paywall.
I spent more time looking at the performance regressions in MySQL 5.6 for single-threaded read-only workloads and I can get 5.6 closer to 5.1 by disabling a few options when building MySQL 5.6 -- optimizer tracing & profiling.
This post describes MySQL 5.6 performance for a workload that is update-only and single-threaded. My previous post was for a read-only workload. I think there is a performance regression in MySQL 5.6 for a workload dominated by fast queries even when the performance schema is disabled.
LinkBench is a benchmark that models the social graph OLTP workload at a busy social networking site. It is implemented for MySQL but there is abstraction layer so it should be possible to implement drivers for other products like HBase, Cassandra, MongoDB and Postgres.
Peter Zaitsev wrote about the importance of single-threaded performance and expressed concern that there might be regressions in MySQL 5.6. Not much has been published on it so I will repeat tests that I ran for high-concurrency workloads using sysbench with IO-bound and cached workloads.
I read a post about the need to defragment MongoDB databases. I liked the title but prefer to avoid a DBMS that requires routine maintenance. Alas I don't know what the author means by routine. There is a big difference between having to defragment a database once a week versus once a quarter.
We hope to see you at the MySQL user conference. Once again Yoshi outdoes Domas (3 talks versus 2). Attend to learn more about their rivalry. Nizam has new results for making InnoDB compression better and he will describe his recent work at the UC.
I discussed my previous post on O_DIRECT_NO_FSYNC with the InnoDB team and they fixed my understanding of a few parts of the code that contribute to the stalls I have been reporting.  We also discussed a problem I have been ignoring.
I am speaking about MySQL at OpenWest. I have a keynote on May 2 and a Q&A session on May 3.
MySQL 5.6 has a new option for innodb_flush_method. When O_DIRECT_NO_FSYNC is used then fsync is not done after writes. This was a response to feature request 45892. Unfortunately this option as implemented and documented is not safe to use with XFS.
For this post on MySQL 5.6 performance I used sysbench for a cached & update-only workload. The previous post on an update-only but not cached workload is here. For this workload each query updates 1 row by primary key. The database size is ~32GB and the data was cached in a 64GB InnoDB buffer pool.
Online DDL for InnoDB is another great feature new in MySQL 5.6. I don't think it is ready to replace online schema change frameworks (from Facebook, Percona Toolkit and Openark) for workloads that require minimal downtime for busy tables - but work can be done on InnoDB to change my opinion.
The third performance test I am doing compares MySQL 5.6 (5.6.10) with MySQL 5.1 for an update-only workload with an IO-bound database. The configuration is similar to what I used for the IO-bound & read-only tests.
A lot of work was done to make InnoDB faster for MySQL 5.6. The results are especially obvious for IO-bound read-only workloads. I have many performance tests to run and started with the easy ones.
I just listened to a talk at the New England Database Summit and learned about a new DBMS that supports extremely high QPS for data in main memory sharded over several nodes with synchronous commit across multiple copies to support HA.
One of my peers described managing backups at scale. It didn't make it to planetmysql because of the way it was posted. It is worth reading.
But all things are not equal. A solution that avoids failover because synchronous replication is done between multiple masters can be awesome, but it also behaves differently and the awesomeness comes at a cost. For some users the benefit far outweighs the cost and Galera is amazing.
Regression tests for some bugs are not published in new MySQL releases. This was reported by the MariaDB project and Ronald. There are also delays in updating bzr after recent releases, but there have been delays in the past.
Someone who knows a lot about storage asked me whether we can spend IOPS to improve compression.

When deploying a pure-flash server I want a database engine that optimizes for compression as more compression means less flash must be purchased.
Do you have replication slaves that ignore --relay-log-space-limit? This bug has been open for a while on the bug system that isn't http://bugs.mysql.com. I was forced to fix it myself to avert a mutiny. I am amused that the variable named ignore_log_space_limit is part of the problem.