Thursday, April 30, 2009

Vendor lock in and MySQL documentation

Part of the sales pitch for MySQL is that there is less risk of vendor lock in. This is repeated frequently on their marketing here, here, here and here. The explanation is that the source code for MySQL is available with a GPL license and if you are unhappy with MySQL the company you can continue using MySQL the product and get support elsewhere.

Documentatation does not have a similar license. You can decide whether this creates the risk of vendor lock in. Details are here.
  • We cannot edit it.
  • We have limited rights to publish it.
Isn't it in the best interests of Sun/MySQL to address this issue and reassure potential customers?

Arjen, Sheeri, Baron and the lead for the MySQL docs team have also written about this.

Tuesday, April 28, 2009

InnoDB on IO bound workloads

I ran the iibench test using a server with 2 CPU cores, 2 disks in SW RAID 0 and 1 MB stripe, 2G RAM and XFS. If you just want a summary, it is that software changes can make InnoDB run much faster on the same hardware. There is a lot of opportunity -- but certainly not enough to catch TokuDB.

The binaries tested are:
There were two tests:
  1. Time to insert 50M rows into an empty table
  2. Time to insert several million rows into a table with 50M rows
I disabled the innodb doublewrite buffer for all tests as I want to compare the results to a server that doesn't use that level of safety.

The my.cnf parameters for 5.0.77 are:
innodb_buffer_pool_size=1G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
innodb_max_dirty_pages_pct=20
innodb_doublewrite=0
 The my.cnf parameters for the v3 Google patch are:
innodb_buffer_pool_size=1G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
innodb_io_capacity=250
innodb_read_io_threads=2
innodb_write_io_threads=2
innodb_max_dirty_pages_pct=20
innodb_ibuf_max_pct_of_buffer=10
innodb_ibuf_reads_sync=1
innodb_doublewrite=0
The my.cnf parameters for XtraDB are:
innodb_log_file_size=1900M
innodb_buffer_pool_size=1G
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
innodb_io_capacity=250
innodb_use_sys_malloc=0
innodb_read_io_threads=2
innodb_write_io_threads=2
innodb_max_dirty_pages_pct=20
innodb_ibuf_max_size=100M
innodb_ibuf_active_contract=1
innodb_ibuf_accel_rate=200
innodb_doublewrite=0
All performance results are anonymous for binaries X, Y and Z. Maybe I can monetize my performance testing effort by doing this. I probably need a new disk soon.

The first result is the time to insert 50m rows into an empty table measured in seconds. The difference is not that signficant. However, the results can be misleading. 5.0.77 has much more pending work at test end (dirty pages and insert buffer entries). That also made 5.0.77 much slower near the end of the test, but I will save the graphs for the next set of results. The test is run using the run_ib found in the link for iibench at the start of this page. The command line is:
bash run.sh 1 $path no root pw test no innodb 50000000 innodb yes yes $binary
And the results are:
  • 24478 seconds -- binary X
  • 21016 seconds -- binary Y
  • 37146 seconds -- binary Z
The second result is the time to insert 3,380,000 rows into a table that starts with 50m rows on a cold server (no entries in the insert buffer, no dirty pages, server restarted). Queries are continuously run by 4 threads concurrent with the inserts. The test is run using run_ib from the iibench link at the top of this page. The command line is:
bash run.sh 1 $path no root pw test no innodb 10000000 innodb no no $binary
And the results are:
  • 38673 seconds -- binary X
  • 11143 seconds -- binary Y
  • 21018 seconds -- binary Z
Finally, the graph for row insert rate over time. Note that the graphs for binaries Y and Z don't extend to the right because they inserted the 3.3M rows much faster.

Monday, April 27, 2009

Hack on Drizzle, get paid!

Did you know that Rackspace has a cloud offering? I didn't. The name is Mosso.

Someone from Rackspace/Mosso on the drizzle-discuss mailing list offered to hire a person full time to work on Drizzle. Curious? Find the post on the mailing list.

Drizzle has a lot of potential for making it easier to run a DBMS server on the cloud. There are a few things that need to be done differently from traditional MySQL replication. Drizzle has started over and has removed the code inherited from MySQL. Their focus is a clean API (right Jay). There should be a lot of interesting work that can be done.

Why must you insult us Matt?

Surely there must be a better way to get your point across. Well, I assume this is an insult because you don't appear to be too hippie-ish. But if you really are the expert in open-source business that you claim to be, you probably can do better than to describe part of the MySQL community as an open-source hippie commune that displays hippie-esque tendencies, unless they self-identify as that. Although that will be a fair description once we start holding the user conferences at Burning Man and Country Fair.

I haven't linked the blog in question because I don't want to promote it.

Sunday, April 26, 2009

Slides for MySQL User Conference 2009 Talks

Slides for my talks. Code described here is in the v3 Google patch.



Saturday, April 25, 2009

What is next?

In his keynote Baron reminded us that we need to focus on what we can do to improve community MySQL rather than wait for things to get done by the corporate owners. What will you do?

Many of us will continue to add high-end features to MySQL. It will great if those features make it into an official MySQL release. It will be a great business opportunity for the community if they do not.

In the short term, I have some things to do:
  • run IO bound tests (iibench) for PBXT and provide the results to the PBXT team with a comparison to the v3 Google patch
  • run IO bound tests (iibench) for XtraDB and provide the results to Percona with a comparison to the v3 Google patch
  • publish more documentation for features in the v3 Google patch (support for roles, more changes to improve IO performance, more details on row-change logging)
  • read the docs and evaluate embedded InnoDB 
Other people on my team also plan to share more details and code:
  • Ben is working on a backport of the pool-of-threads code to MySQL 5.0. While the backport itself to 5.0.37 might not help those using 5.1 or recent 5.0 versions, we have also fixed the SMP performance problems in the pool-of-threads code and that change is isolated to a single file. Others will be able to use it (but only on Linux as it uses epoll directly).
  • Justin may publish a patch for a recent version of 5.0 that only includes the changes for global group IDs, binlog event checksums and crash safe replication that works for all storage engines.
Much of the big patch is not easy to consume. But there is an answer to that. If you really want the feature then you can hire someone to port it to a recent 5.0 or 5.1 release. Rumor has it that Percona has done just that with several features. This makes me happy and proud.

InnoDB has continued to add valuable features to their releases. The most recent is embedded InnoDB. It can be used to do some very interesting things. But first I must read the docs. They have also added a lot of new functionality to the 5.1 branch via the InnoDB plugin. This includes fast index creation, compression and SMP performance improvements. This is a big deal as the standard response and practice from MySQL is that new features cannot go into a production branch.

Friday, April 24, 2009

New storage engines for MySQL -- rocket science or great engineering?

There were several new storage engine vendors at the MySQL Conference. I spoke with people from Virident, Tokutek and Schooner at length about their technology. Their products are impressive and I look forward to more details on performance from them including read-intensive and write-intensive workloads. Tokutek has already published performance results on an insert intensive workload and then worked with me to improve the InnoDB results and improve the test code so others can run it. The code and results are here.

One test I want all of them to run is to run a write-intensive workload so that InnoDB accumulates many dirty pages in the buffer pool and many entries in the insert buffer, kill mysqld and then determine how long it takes the server to perform crash recovery. This should be compared between InnoDB on commodity hardware, InnoDB on Virident and Schooner hardware and TokuDB on commodity hardware. I suspect that the results will be impressive for the new storage engines.

I use the term rocket science because a lot of vendors will have you believe that they have something special. In this case, I believe that each of the vendors really do have something special. But of course, more results will help us understand what they can do. Each of them have also chosen a path that doesn't require a huge investment on their part to build a product as they have limited their software and hardware investments to areas where they have a lot of value to add and that makes it more likely that they can deliver on their promises.
  • TokuDB is software-only. But this is really clever software. They have implemented a new algorithm that significantly reduces random IO for write-intensive workloads. There have been algorithms that do this. For example, Log-Structured Merge Trees. I have even published a paper on this at VLDB. But TokuDB may be much better than previously known approaches.
  • most of the hardware expense for Virident is isolated to one component that implements industry standard interfaces and can plug into commodity servers. Their software investment is focused on improving pieces of MySQL/InnoDB to leverage their hardware. They have been able to improve on the work of others in the InnoDB developer community.
  • Schooner uses mostly commodity hardware with value-added in the integration of that hardware. Their software investment in MySQL also appears to be focused on improving pieces of it rather than replacing it and they have been able to improve on the work of others.
TokuDB allows a small server to handle a much larger workload. This has many benefits including reduced power consumption and less need to shard or add shards to a large MySQL deployment. I think they also use much less disk space than InnoDB. Their technical staff explained the math behind the algorithms that justify their performance. Math is hard so this took some time, but eventually I kind of understood and I believe in their results. Their approach will enable many more optimizations in the future.

The servers from Virident and Schooner are optimized for InnoDB, so it should be easy for existing users to try them out. I expect ridculously high throughput results from both of them. Schooner hardware is easier to understand as they provide much better IO performance (and many other benefits). They also appear to have designed a balanced system so that peak and actual performance won't be too far away. Oracle has done this with the Exadata machine and it is very nice to see a similar effort from Schooner.

Virident uses NOR Flash to provide fast access with byte-level accessing (as opposed to reading a disk page at a time). This takes more time to understand. It is almost as if they have reduced the InnoDB page size to the size of a row, so much less data is transferred when reading rows randomly. MySQL loves to access rows randomly and reading less data means less effort is wasted and there should be less contention on shared resources with many-core and multi-core servers.

Thursday, April 23, 2009

Cool things you can almost do with replication

We added support for row-change logging to MySQL 5.0. The logged data is similar to row-based replication with changes to the output that make it much easier to parse. Gene Pang describes this work at 2pm at the conference.

What might be done with this data?
  • replicate row changes to a data store that is not MySQL (Teradata, HBase/Hypertable, memcached)
  • materialized view maintenance
  • change notification
And I talk at the Percona Performance Conference at 10:50am today on the InnoDB IO architecture.

Wednesday, April 22, 2009

Where is the Calpont code?

Calpont has a talk on their MPP column-store storage engine for MySQL at 2PM today. The talk title is  Open Source Columnar Storage Engine. It sounds interesting, especially if the source will be available as many people can try it out. But the source isn't available today. Where is the source?

Note, Calpont doesn't mind that I am asking about this in public.

Other questions I have include:
  1. Does it implement the condition pushdown interface?
  2. Will it implement the batch key access interface?

Really cool features in the Google patch

Justin and Ben talk today at 4:25pm on features for SMP performance and high availability.

Ben is an expert on InnoDB internals related to SMP performance. He designed and implemented the faster rw-mutex changes that are now in the 1.0.3 InnoDB plugin and MySQL 5.4 and have made MySQL much faster on SMP servers. More recently he changed InnoDB to significantly reduce mutex contention on the transaction log and buffer pool mutexes. This makes InnoDB 20% faster on sysbench and other read-write workloads. Right now he is finishing the backport of the pool-of-threads code from MySQL 6 to 5.0 and making it scale on SMP.

Justin is a replication expert. He added support for global transaction IDs to automate slave failover, made replication slaves crash-safe, added checksums for binlog events and fixed many bugs in replication.

They both have done very interesting work. And we don't just build these features, we also run them in production soon after adding them.

InnoDB on high IOPs servers (SSD)

How does InnoDB do on high IOPs servers? Thanks to SSD, many of us will soon have such servers. I will provide more details in my talks:
The summary is that InnoDB is very effective at using the read capacity of a high IOPs server. It has problems at using enough of the write capacity. Alas, this is InnoDB and the problem is easy to fix. Many of the fixes are in the v3 Google patch. Many are also in the Percona patches and builds. Although there may be one fix in the v3 Google patch that Percona has yet to implement.

Note that I may mention the P word a few times in my talk today and tomorrow.

Sunday, April 19, 2009

Hack MySQL at MySQL Camp on Monday

I host the MySQL Hackfest at MySQL Camp on Monday. Bring your laptop setup to build MySQL. I will have a few CDs with MySQL 5.0.37, the v3 Google patch for MySQL 5.0.37 and a few other releases of MySQL 5. Possible projects that can be started (but maybe not finished) include:
  • Parse the output from the row-change log and convert it to protocol buffers.
  • Apply the output from the row-change log to another data store (RDBMS, HBase, Hypertable)
  • Add a new SHOW command
  • Add a new SQL function

Thursday, April 16, 2009

Everything I know about running MySQL on Amazon EC2...

... I learned from excellent documentation written by Eric Hammond. For someone just starting out with EC2 (me), the documentation is great. It has made it very easy for me to try things out and repeat many of the performance tests I run for the v2 Google patch.

I needed to setup RAID on the local storage -- not a problem.

I needed to setup RAID over several EBS volumes -- not a problem. Just be sure to follow the instructions and use XFS rather than ext2 or ext3 (which don't allow concurrent writes to a file).

Clouds like EC2, whether internal or external, represent an interesting environment in which to deploy MySQL. Many of the requirements for a stable MySQL environment in the past might not be true anymore. For example, it isn't so easy to guarantee that the server is always shutdown cleanly (yes, MyISAM and replication state files, I am looking at you). But once a few of the problems are solved, then it becomes much easier for a few people to manage a large number of servers.

Tuesday, April 14, 2009

Talks at the (free) MySQL Camp

David Lutz will talk about Predicting Performance with Queueing Models at the (free) MySQL Camp on Thursday at 2PM. I will be there. Good performance testing includes tests and an explanation of the results. The test results are frequently much better when the explanation includes a performance model. I usually skip the model (math is hard). My published results would be more useful were I to include the models. I have a few books by Neil Gunther gathering dust at home that I really need to read.

The v3 Google patch now includes support for row-change logging. The code has a few bugs that we are fixing ASAP but it is ready for testing. There is a talk on the this at the regular conference. We can try it out at the MySQL Hackfest at the (free) MySQL Camp on Monday morning. Row-change logging generates one text line per changed row that describes the change. It is easy to parse and can be used to:
  • Maintain a copy of the MySQL table in another RDBMS or in a scalable data structure such as HBase or Hypertable.
  • Implement a change notification service.
  • Maintain materialized views.

Monday, April 13, 2009

Battle of the hot boxes

Two companies are in the process of launching MySQL appliances: Schooner Infotech and Virident. Both incorporate flash in some form and it will be interesting to find out what value they add. I also want to know how they overcome some of the performance limits in InnoDB for multi-core and high-IOPs servers. The problems on multi-core servers are well known. The problems on high-IOPs servers are slowly becoming understood and there are fixes in the v3 Google patch and in Percona binaries.

Do the appliances use official versions of MySQL? The schooner web site states that InnoDB 1.0.3 is used, which goes a long way towards improving performance on servers with 8+ cores.

I will describe some of the IO problems and improvements during my talks at the Percona Performance Conference and the MySQL Conference.

Friday, April 10, 2009

MySQL has a new storage engine for DB2?

MySQL has a new storage engine for DB2 on IBM System i. System i used to be known as iSeries which used to be known as AS/400.  With a name change rate of 1 per decade, it has at least 3 more decades as a viable system (2020 -- renamed to the i, 2030 -- renamed to i, 2040 -- renamed to '').

I would describe this as a hybrid engine as it provides an interface to an existing DBMS and is tightly integrated with the existing DBMS. By comparison, the Federated storage engine also provides an interface to an existing DBMS, but is not tightly integrated. The integration should greatly increase usability and performance. Performance can be further improved by adding support for condition pushdown for MySQL 5.1 and by implementing the batch key access interface when the engine is available in MySQL 6.

A few more comments:
  • My favorite line from the wikipedia page is on the System i everything is an object. Profound.
  • Where is the storage engine independent test suite that makes it easy to test new engine? The current suite is hardwired to use MyISAM and InnoDB for most tests.
  • Where is the virtual machine image that will allow us to try out System i with MySQL and the new engine using our Mac/Win/Linux boxes?

Thursday, April 9, 2009

An obscure bug

We have begun running a tool that determines whether tables on slaves and masters match. This tool is similar to mk-table-checksum with a few optimizations from the Google patch:
We soon found an InnoDB table with rows in the secondary index that were missing from the PRIMARY index. How can this be? My guess is that one or more disk writes to the PRIMARY index were lost. InnoDB stores an LSN and checksum on each page. That page for which writes were lost had a valid old version on disk. When that page is read, the checksum is still valid and there is no way to determine that the LSN is correct unless the last write for that page is recorded in the current transaction log or the doublewrite buffer.

ZFS would detect this error as it stores a page checksum separate from the disk page. RAID 10 might help. If writes for one of the copies was lost, then as long as your are lucky enough to read the page from the good side of the mirror, you will get the good data. But that might not be comforting. What else would help to detect and or correct this?

Monday, April 6, 2009

Master-master replication and crash recovery

The slave SQL thread executes binlog events from the relay log to keep a slave in sync with the master. Prior to row-based replication, binlog events were SQL statements. The slave SQL thread records its state in the relay-log.info file. The state includes the file offset of the next binlog event to execute, so it is important that this state be correct to avoid skipping a transaction or running a transaction multiple times on the slave. The slave SQL thread does the following in a loop:
  1. replay all binlog events from a transaction
  2. commit the transaction to the storage engines that participated
  3. write new state to the relay-log.info file
Unfortunately, if the mysqld server crashes after step 2 and before step 3 then it will run the last transaction twice (before the crash, after the crash). This may fail with a duplicate key error or it may not and leave the slave inconsistent with the master with little evidence left behind for the DBA to notice the problem.

This is bug 26540 if you want to express interest in a fix. This is also fixed in some cases for InnoDB by transactional replication which has made it into Percona. By some cases, I mean that it does not protect transactions that update MyISAM tables, at least not in the Google patch. I have not reviewed the Percona code.

This problem gets more interesting for master-master replication. In that case a server writes a binlog and a relay log and the update sequence is:
  1. replay all binlog events from a transaction
  2. XA prepare for the binlog
  3. XA prepare for InnoDB (assuming InnoDB is used)
  4. write the XID to the binlog (commit)
  5. commit the transaction to the storage engines that participated
  6. write new state to the relay-log.info file
In this configuration, the server uses internal XA to coordinate the update of the binlog and commit to the storage engines.  There are three interesting crashes that can occur:
  • before step 4 - This is not a problem. The prepared InnoDB transaction is rolled back during crash recovery and then run when the slave SQL thread starts.
  • between step 4 and step 5 - This is a problem. The prepared InnoDB transaction is committed during crash recovery but relay-log.info is not updated. Note that transactional replication does not correct the mismatch so the last transaction will be run again when the slave SQL thread starts. Running the same transaction multiple times may cause replication to halt or may corrupt your database.
  • between step 5 and step 6 - This problem is fixed by transactional replication.
So there is a new problem that we need to fix for servers that will soon run with --log-slave-updates (not because I will use master-master replication).  Some of the changes described in my previous post can fix this new problem, at least they do in our stress test framework.

Friday, April 3, 2009

Making replication more robust

I have been fortunate to work with very productive people. Two of them, Wei and Justin, are experts in MySQL replication internals. Wei implemented mirror binlogs, transactional replication and semi-sync replication. Our current expert, Justin, has made replication much more robust and implemented binlog event checksums and global transaction IDs. Global transaction IDs make hierarchical replication much easier to manage.

Justin has a talk at the MySQL Conference. He also has a new idea that may allow us remove the InnoDB specific code we added for transactional replication while preserving the functionality. As the InnoDB specific code has been very difficult to get right, I hope we can do it. The idea is to run slaves with --log-slave-updates set, enable another feature to only log updates from the replication stream but not from users connected to the slave, and use existing code in MySQL (internal XA) to keep the binlog and storage engines in sync on the slave. The slave SQL thread state that is currently stored in a separate file (relay-log.info) can be generated on demand from the last complete transaction stored in the binlog on the slave. In practice it will be retained but get regenerated when the slave is not shutdown cleanly.

This requires extra work on the slave as it must write a binlog for changes made from the replication stream. But we already require that because of other features we recently added. And even if we didn't, the CPU and IO overhead of writing a binlog is low for our servers.

It is critical for us that slave replication state be recoverable. This helps us in production and development. We add many features to replication and must be sure that they work. One of our tests uses a continuous stream of insert statements and two servers setup as a master and slave. The slave is killed and restarted at random points in time during the test. At the end of the test we confirm that the contents of the tables on the slave and master match. We have found and fixed many replication bugs because of this test and that has made our servers much easier to manage.
 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.