I can look at the lists of open bugs. Open bugs are not a problem for 5.0.37 as we know what they are and have figured out how to avoid them. Known and unknown bugs may be a problem for 5.0.67 (or for 5.1), so we need a framework to evaluate them. Lists of open bugs help me determine what features I should not use. Deciding what I can use requires more work.
It would help to see aggregated metrics on bug counts and support tickets per version and per feature. Giuseppe provided some insight into the 5.1 release but I realize this data might not be published.
I get some email from MySQL/Sun telling me to upgrade to 5.1 now and use events, partitioning and RBR. I am not sure that can be the basis for an upgrade decision. As long as the email doesn't become a deluge, this is not a problem. It would be bad if I had to set up spam filters for things related to MySQL.
Otherwise, a lot of testing and monitoring is needed to determine whether an upgrade can be done. This includes:
- crash rate monitoring
- testing queries for correctness and performance
- performance testing with benchmarks
- additional QA testing
- data drift testing
- long-term monitoring
Testing should be done to confirm that query performance does not degrade after an upgrade and that query results are correct. During the upgrade from MySQL 4.0 to 5.0, many queries had parse errors because the parser stopped accepting ambiguous join conditions. Also, numeric literals changed from type Double to Decimal and that changed the results for some queries. We use a tool named the Query Tester to test for these problems. It samples queries from production using SHOW PROCESSLIST and then runs each query on two servers (one running the old version of MySQL and one running the new version). The results are compared for correctness and performance. This tool has made it much easier to evaluate upgrades and has found many (but unfortunately not all) problems.
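A minimal sketch of that idea (not the actual Query Tester) might look like the following, assuming the MySQLdb driver and hypothetical host names old-db and new-db:

```python
import time
import MySQLdb

def sample_queries(host, user, passwd):
    """Grab currently running SELECT statements via SHOW FULL PROCESSLIST."""
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd)
    cur = conn.cursor()
    cur.execute("SHOW FULL PROCESSLIST")
    # The Info column (index 7) holds the statement text.
    queries = [row[7] for row in cur.fetchall()
               if row[7] and row[7].lstrip().upper().startswith("SELECT")]
    conn.close()
    return queries

def run_query(host, user, passwd, db, query):
    """Run one query, returning elapsed seconds and sorted result rows."""
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
    cur = conn.cursor()
    start = time.time()
    cur.execute(query)
    rows = sorted(cur.fetchall())
    conn.close()
    return time.time() - start, rows

def compare(query, db, user, passwd, old_host="old-db", new_host="new-db"):
    """Flag queries whose results differ or that got much slower."""
    old_time, old_rows = run_query(old_host, user, passwd, db, query)
    new_time, new_rows = run_query(new_host, user, passwd, db, query)
    if old_rows != new_rows:
        print("result mismatch: %s" % query)
    elif new_time > 2 * old_time:
        print("slower (%.2fs -> %.2fs): %s" % (old_time, new_time, query))
```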
I also run performance benchmarks to look for obvious performance problems. One of the tests is the time required to reload a database using the output from mysqldump. This found a major performance regression in the parser for MySQL 5.0.37 that has since been fixed. I also run sysbench and a few custom benchmarks. Eventually I will run dbt-2.
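The reload timing test can be as simple as wrapping the mysql command-line client, as in this sketch; the host, account and database names are placeholders:

```python
import subprocess
import time

def time_reload(dumpfile, host, user, passwd, db):
    """Reload a mysqldump file and return the elapsed wall-clock time."""
    start = time.time()
    f = open(dumpfile)
    subprocess.check_call(
        ["mysql", "-h", host, "-u", user, "-p" + passwd, db], stdin=f)
    f.close()
    return time.time() - start

# Compare the reload time on servers running the old and new binaries.
for host in ("old-db", "new-db"):
    secs = time_reload("dump.sql", host, "bench", "secret", "reload_test")
    print("%s: reload took %.1f seconds" % (host, secs))
```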
We depend on the regression tests provided by MySQL. It is great that 5.1 has many more of these. We have also written many stress tests using a separate framework. Much of our focus has been on replication. Some stress tests can't be done for replication because slave replication state is not crash-proof, but it is getting closer. We have a few tests that crash the slave at random places to confirm that it does the right thing during recovery. Many of the tests are run with valgrind. This has found several problems in the Google patch and a few minor problems elsewhere.
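A rough sketch of the crash-the-slave idea (not our real framework) is below. It assumes ssh access to a hypothetical host slave-db, an init script that restarts mysqld, and a load generator already writing to the master:

```python
import random
import subprocess
import time

def ssh(host, cmd):
    """Run a command on a remote host; returns the command's exit status."""
    return subprocess.call(["ssh", host, cmd])

def crash_and_recover(slave="slave-db"):
    # Let replication run for a random interval, then kill mysqld abruptly.
    time.sleep(random.uniform(10, 300))
    ssh(slave, "killall -9 mysqld")
    # Restart the server and give crash recovery time to finish.
    ssh(slave, "/etc/init.d/mysql start")
    time.sleep(60)
    # Confirm both replication threads restarted; a real test would also
    # compare table checksums against the master afterwards.
    check = ("test $(mysql -e 'SHOW SLAVE STATUS\\G' "
             "| grep -c 'Running: Yes') -eq 2")
    return ssh(slave, check) == 0
```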
Eventually, the new binary is used on a few slave servers to confirm that it is stable. At that point queries are run to confirm that the data on the slave matches the data on the master. During the upgrade from MySQL 4.0 to 5.0 this was not easy to guarantee. Most of the problems came from the change of numeric literals from type Double to Decimal. A few WHERE clauses had to be rewritten to compensate for this and it took a lot of work to identify the statements in the replication log that caused the problem. This is one reason to use statement-based replication or to change row-based replication to include the original SQL statement. I am not sure how we could have identified the problem using only the output from RBR.
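A minimal sketch of the data comparison, assuming the MySQLdb driver and CHECKSUM TABLE; on a busy server the real check also has to deal with tables that are still being written:

```python
import MySQLdb

def table_checksums(host, user, passwd, db):
    """Return a dict mapping table name to its CHECKSUM TABLE value."""
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
    cur = conn.cursor()
    cur.execute("SHOW TABLES")
    sums = {}
    for (table,) in cur.fetchall():
        cur.execute("CHECKSUM TABLE %s" % table)
        sums[table] = cur.fetchone()[1]
    conn.close()
    return sums

master = table_checksums("master-db", "check", "secret", "prod")
slave = table_checksums("slave-db", "check", "secret", "prod")
for table in sorted(master):
    if master[table] != slave.get(table):
        print("checksum mismatch on %s" % table)
```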
Once the new binary is used in production, long-term monitoring of the server can begin to confirm that it doesn't misbehave. The monitoring should capture as many status variables as possible so that you can debug problems after the fact. It is good to know that there are no memory leaks and that query response time doesn't degrade significantly. A likely problem is that query performance will change significantly for a few queries. Depending on the complexity of your applications, you might not be able to spot this before using the new binary in production. If you have SHOW USER_STATISTICS you can monitor performance per account and determine which accounts have this problem. If you do query profiling by sampling queries from SHOW PROCESSLIST or use the Query Analyzer, then you may be able to find the problem queries.
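As one illustration, the counters from SHOW GLOBAL STATUS can be sampled on an interval and appended to a log (sketch below, assuming the MySQLdb driver); a real setup would feed these into a time-series or graphing system:

```python
import time
import MySQLdb

def poll_status(host, user, passwd, logfile, interval=60):
    """Append a timestamped copy of SHOW GLOBAL STATUS to a log, forever."""
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd)
    cur = conn.cursor()
    log = open(logfile, "a")
    while True:
        cur.execute("SHOW GLOBAL STATUS")
        now = int(time.time())
        for name, value in cur.fetchall():
            log.write("%d %s %s\n" % (now, name, value))
        log.flush()
        time.sleep(interval)
```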
Another form of monitoring that should be done is searching the database error log for new error messages. I am not aware of good tools for this. A combination of ssh, Python and Perl can solve it.
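In that spirit, a minimal sketch: fetch the tail of the error log over ssh and report lines not seen before. The host name and log path are assumptions.

```python
import subprocess

seen = set()

def new_error_lines(host, logpath="/var/lib/mysql/db.err"):
    """Return error and warning lines that have not been reported before."""
    out = subprocess.check_output(["ssh", host, "tail -n 1000 " + logpath])
    fresh = []
    for line in out.decode("utf-8", "replace").splitlines():
        if ("ERROR" in line or "Warning" in line) and line not in seen:
            seen.add(line)
            fresh.append(line)
    return fresh

for line in new_error_lines("slave-db"):
    print(line)
```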
One thing that is missing is error monitoring. MySQL does not count and export errors via SHOW STATUS. I want to know the number of errors that have occurred per account and globally. I think this can be done in a future SHOW USER_STATISTICS patch. Not all errors indicate a problem in MySQL, but I still want to monitor this. There may be applications that have a high rate of lock wait timeouts or deadlocks.
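Until the server exports error counts, one stopgap (sketched here, assuming the MySQLdb driver) is to count errors at the application layer, keyed by account and MySQL error code, for example 1205 for a lock wait timeout and 1213 for a deadlock:

```python
import collections
import MySQLdb

# Counts errors by (account, MySQL error code).
error_counts = collections.defaultdict(int)

def execute_counted(cursor, account, query):
    """Run a query, recording any error before re-raising it."""
    try:
        cursor.execute(query)
    except MySQLdb.MySQLError as e:
        error_counts[(account, e.args[0])] += 1
        raise
```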