Many startups depend on MySQL. Check out the list of keynote speakers from past and present MySQL User Conferences. There are things to make better and new features to add like better support for dynamic schemas (documents) and auto-sharding. But existing deployments aren't in crisis mode and it is being used for new deployments. Maybe I have an agenda in writing this but anyone writing the opposite is likely to have their own agenda.
There is a problem in MySQL land that doesn't get enough attention. When startups using MySQL get really big they try to hire developers to work on MySQL internals (see Facebook, Google, Twitter, LinkedIn, etc.). And I mean try, because there is much more demand than supply in the US. There is a lot of talent but most of it is in other countries. While remote development has been successful in the MySQL community, most of that is for traditional software development. Hacking on MySQL at a fast growing startup is far from traditional. You occasionally need to make MySQL better right now to fix serious problems in production. You won't get 6-month development cycles unless you have a large team. This requires a strong relationship with the operations team. It also requires personal experience with operations, and that is harder to get while working remotely. Remote development can work. I am an example of that, but I am only 1 hour from the office by plane and I already had strong relationships with the operations team.
A great solution is for someone already working at the startup to begin hacking on MySQL. This is great because it grows the supply of MySQL internals expertise and MySQL will get better at a faster rate. My teams at Facebook and Google grew in that manner. The problem is that it can be very hard to grow the team size from 0 to 1. The first person won't have a mentor. If the first person is also new to mature DBMS code then they might be in for a surprise. In my case, 9 years split between Informix and Oracle increased my tolerance level, but not everyone has that background.
I think we can make it easier for the first person on the team by providing training and mentorship. A bootcamp in April before, during or after the UC is one way to do that, with remote mentorship every 2 weeks after that. The experts who can teach the class (like gurus from MariaDB) will be in town. Are people interested in this? I don't expect this to be free. If we expect professional training then we need to pay the professionals. But I hope that free training materials can eventually be produced from the effort.
Beyond getting paid for professional services there would be other benefits were MariaDB to lead the program. This can increase their community of users and hackers.
-
My kids watched the new Lego movie today and spent the rest of the day repeating "Everything is awesome". I spent a few hours reading MongoDB documentation to help a friend who uses it. Everything wasn't awesome for all of us. I try to be a pessimist when reading database documentation. If you spend any time near production then you spend a lot of time debugging things that fail. Being less than optimistic is a good way to predict failure.
One source of pessimism is database limits. MongoDB has a great page to describe limits. It limits index keys to less than 1025 bytes. But this is a great example that shows the value of pessimism. The documentation states that values (MongoDB documents) are not added to the index when the index key is too large. An optimist might assume that an insert or update statement fails when the index key is too large, but that is not the specified behavior.
As far as I can tell, prior to MongoDB 2.5.5 the behavior was to not add the document to the index when the indexed column exceeded 1024 bytes. The insert or update would succeed but the index maintenance would fail. Queries that used the index after this can return incorrect results.
A quick search of the interwebs shows that people were aware of the problem in 2011. I can reproduce the problem on my Ubuntu 12.10 VM. Why do we tolerate problems like this? Maybe this isn't a big deal and the real problem is that new risks (from a system I don't know much about) are worse than risks in software that I have been using for years. But corruption (either permanent or via incorrect query results) has been a stop the world bug for me -- as in you do nothing else until the problem has been fixed. Why have MongoDB users tolerated this problem for years?
While it is great that the bug appears to have been fixed, database vendors should understand that FUD takes much more time to go away. See all of the stories about transactions and MySQL that lived on long after InnoDB became viable. And note that some of the MySQL FUD was self-inflicted -- see the section of the MySQL manual on Atomic Operations.
I found this code in 2.4.9 that shows how key-too-large is handled. A careful reader might figure out that index maintenance isn't done. Nice optimization. I spoke to one user who doesn't like the old behavior but doesn't want to break apps with the new behavior that fails inserts/updates with too-large keys. Indexes on a prefix of a key would help in that case.
template< class V >
void BtreeBucket<V>::twoStepInsert(DiskLoc thisLoc,
                                   IndexInsertionContinuationImpl<V> &c,
                                   bool dupsAllowed) const
{
    if ( c.key.dataSize() > this->KeyMax ) {
        problem() << "ERROR: key too large len:" << c.key.dataSize()
                  << " max:" << this->KeyMax << ' '
                  << c.key.dataSize() << ' '
                  << c.idx.indexNamespace() << endl;
        return; // op=Nothing
    }
    insertStepOne(thisLoc, c, dupsAllowed);
}
/** todo: meaning of return code unclear clean up */
template< class V >
int BtreeBucket<V>::bt_insert(const DiskLoc thisLoc, const DiskLoc recordLoc,
                              const BSONObj& _key, const Ordering &order,
                              bool dupsAllowed,
                              IndexDetails& idx, bool toplevel) const
{
    guessIncreasing = _key.firstElementType() == jstOID && idx.isIdIndex();
    KeyOwned key(_key);
    dassert(toplevel);
    if ( toplevel ) {
        if ( key.dataSize() > this->KeyMax ) {
            problem() << "Btree::insert: key too large to index, skipping "
                      << idx.indexNamespace() << ' ' << key.dataSize()
                      << ' ' << key.toString() << endl;
            return 3;
        }
    }
-
Pardon the rant but the winter storm has kept me in a hotel away from home for a few days. I exchanged email this week with someone pitching a solution to a problem I don't have (MySQL failover). But by "I" I really mean the awesome operations team with whom I share an office. The pitch got off to a bad start. It is probably better to compliment the supposed expertise of the person to whom you are pitching than to suggest they are no different than COBOL hackers working on Y2K problems.
Unfortunately legacy is a bad word in my world. Going off topic, so is web scale. I hope we can change this. The suggestion that MySQL was a legacy technology was conveyed to me via email, x86, Linux and a laptop. Most of those have been around long enough to be considered legacy technology. DNA and the wheel are also legacy technology. Age isn't the issue. Relevance is determined by utility and efficiency.
Remember that utility is measured from a distance. It is easy to show that one algorithm can do one narrow operation much faster than existing products do. But an algorithm shouldn't be confused with a solution. A solution requires user education, documentation, skilled operations, trust, client libraries, backup, monitoring and more.
-
What is a modern database? We have some terms that wander between marketing and technical descriptions - NewSQL, NoSQL. We have much needed work on write-optimized database algorithms - Tokutek, LevelDB, RocksDB, HBase, Cassandra. We also get reports of amazing performance. I think there is too much focus on peak performance and not enough on predictable performance and manageability.
Building a DBMS for production workloads is hard. Writing from scratch is an opportunity to do a lot better than the products that you hope to replace. It is also an opportunity to repeat many mistakes. You can avoid some of the mistakes by getting advice from someone who has a lot of experience supporting production workloads. I worked at Oracle for 8 years, wrote some good code (new sort!) and fixed a lot of bugs but never got anywhere near production.
Common mistakes include insufficient monitoring and poor manageability. Monitoring should be simple. I want to know when something is running and when it is not (waiting on IO, locks). I also want to drill down by user and table -- user & table aren't just there for access control. I am SQL-centric in what follows. While there are frequent complaints about optimizers making bad choices, I can only imagine how much fun it will be to debug load problems when the query plan is hidden away in some external application.
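To illustrate the kind of per-table drill-down I mean, here is a sketch against the MySQL 5.6 performance schema (this is not the per-table/per-user tables from the FB patch, and it assumes the performance schema is enabled):

-- Rank tables by time spent on storage engine row operations.
-- Timer columns are in picoseconds.
SELECT object_schema, object_name, count_star,
       ROUND(sum_timer_wait / 1000000000000, 1) AS total_secs
FROM performance_schema.table_io_waits_summary_by_table
ORDER BY sum_timer_wait DESC
LIMIT 10;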
The best time to think about monitoring is after spending too much time debugging a problem. At that point you have a better idea about the data that would have made things easier. One example of missing monitoring was the lack of disk IO latency metrics in MySQL. In one case not having them made it much easier to miss that an oversubscribed NFS server was making queries slow via 50 millisecond disk reads.
Monitoring should be cheap so that it can always be enabled; from that I can understand the average costs and spot changes in load from the weekly push. But I also need to debug some problems manually, so I need to monitor sessions that I know are too slow (get the query plan for a running SQL statement) and to find sessions/statements that are too slow (dump things into the slow query log when certain conditions are met). Letting me do EXPLAIN for statements in my session is useful, but I really need to do EXPLAIN for statements running in production - if the optimizer uses sampling I want to see the plan they get, and if temp tables are involved I have no idea what will be in their temp tables. MariaDB (and MySQL) recently added support to show the query plan for statements that are currently running. This is even more useful when the query plan and performance metrics can be dumped into the slow query log when needed.
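Getting the plan for a statement that is already running looks roughly like this in MariaDB 10.0 (MySQL 5.7 has EXPLAIN FOR CONNECTION); the connection id 4242 is made up and would come from SHOW PROCESSLIST:

SHOW PROCESSLIST;             -- find the connection id of the slow session
SHOW EXPLAIN FOR 4242;        -- MariaDB: plan for what that connection is running now
-- EXPLAIN FOR CONNECTION 4242;   -- MySQL 5.7 equivalent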
The goal for monitoring performance problems is to eliminate the use of PMP (the poor man's profiler). When I want to understand why a system is running slower than expected I frequently look at thread stacks from a running server. I hope one day that pstack + awk is not the best tool for the job on a modern database. I was debugging a problem with RocksDB while writing this. The symptom was slow performance for a read-only workload with a database that doesn't fit in cache. I have seen the problem previously and was quickly able to figure out the cause -- the block cache used too many shards. Many problems are easy to debug when you have previously experienced them. Restated: many problems are expensive to debug for most users because they don't have full-time database performance experts.
The focus on peak performance can be at odds with manageability. The fastest way to peak performance is via tuning options, and too often these options are static. Restarting a database in production to change an option is a bad idea. Dynamic options are an improvement. Using adaptive algorithms in place of many options is even better. And if you add options, make sure the defaults are reasonable.
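As a small example of the difference, innodb_io_capacity is a dynamic option in modern MySQL, so it can be changed without bouncing mysqld, while a static option requires a my.cnf edit plus a restart. The value below is illustrative:

SET GLOBAL innodb_io_capacity = 4000;             -- takes effect immediately
SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity';  -- confirm the new value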
Predictable performance is part of manageability. How does your modern database behave when confronted with too much load? It helps when you can classify your workload as high-priority and best-effort and then shed load from the best-effort users. Alas this requires some way to distinguish users. In theory you want the hi-pri users to get their work done and then let best-effort users compete for the spare capacity. This requires some notion of SLA for the hi-pri users and spare capacity for the DBMS. These are hard problems and I have not used a great solution for load shedding.
This could be a much larger rant/document but I must return to my performance debugging.
-
The UC schedule has been published and there are several talks from the database teams at Facebook.
- Small Data and MySQL by Domas Mituzas - small data is another name for OLTP. Given the popularity of big data we think that small data also deserves attention.
- Asynchronous MySQL by Chip Turner - this describes the work done by Chip's team to implement an async MySQL client API. The feature is in the FB patch, widely used at FB and is integrated with HHVM.
- Performance Monitoring at Scale by Yoshinori - this will explain how to be effective when monitoring many servers with few people. It is easy to get distracted by false alarms.
- MySQL 5.6 at Facebook by Yoshinori - Yoshi will share many stories about what it took to get 5.6 into production. This included a bit of debugging and performance testing, bug fixes from upstream and a lot of work from the MySQL teams at FB.
- Global Transaction ID by Evan, Yoshinori, Santosh - at last global transaction IDs have arrived (for people not using the Google patch). Learn what it took to get this ready for production.
- InnoDB Defragmentation by Rongrong - learn about the work by Rongrong to reduce the amount of space wasted from fragmentation.
- MySQL Pool Scanner by Shlomo - MPS is one of the key tools created by our automation experts (aka operations gurus) that make it possible to manage many machines with few people.
-
Big downtime gets a lot of attention in the MySQL world. There will be some downtime when you replace a failed master. With GTID in MariaDB and MySQL that time will soon be much smaller. There might be lost transactions if you use asynchronous replication. You can also lose transactions with synchronous replication, depending on how you define "lose". I don't think this gets sufficient appreciation in the database community. If the higher commit latency from sync replication prevents your data service from keeping up with demand then update requests will time out and requested changes will not be done. This is one form of small downtime. Whether or not you consider this to be a lost transaction it is definitely an example of lousy quality of service.
My future project, MarkDB, might have a mode where it never loses a transaction. This is really easy to implement. Just return an error on calls to COMMIT.
-
I looked at the release notes for 5.6.14 and then my bzr tree that has been upgraded to 5.6.14. I was able to find changes in bzr based on bug numbers. However for the 5 changes I checked I did not see any regression tests. For the record, I checked the diffs in bzr for these bugs: 1731508, 1476798, 1731673, 1731284, 1730289.
I think this is where the MySQL Community team can step up and help the community understand this. Has something changed? Or did the tests move over here?
-
Google search results for mariadb trademark are interesting. I forgot so much had been written about this in the past. Did the trademark policy ever get resolved? This discussion started in 2010.
- http://openlife.cc/blogs/2010/november/leaving-monty-program-and-mariadb
- http://mariadb.com/kb/en/mariadb-trademark-policy
- http://blog.mariadb.org/mariadb-draft-trademark-policy-available
- http://monty-says.blogspot.com/2010/12/proposal-for-mariadb-trademark-policy.html
- http://blog.mariadb.org/mariadb-and-trademark
- http://www.skysql.com/about/legal/trademarks
-
There aren't many new files under mysql-test for 5.6.14. Is this compression or something else? Many bugs were fixed per the release notes.
diff --recursive --brief mysql-5.6.13 mysql-5.6.14 | grep "Only in"
Only in mysql-5.6.14/man: ndb_blob_tool.1
Only in mysql-5.6.14/mysql-test/include: have_valgrind.inc
Only in mysql-5.6.14/support-files: mysql.5.6.14.spec
Only in mysql-5.6.14/unittest/gunit: log_throttle-t.cc
Only in mysql-5.6.14/unittest/gunit: strtoll-t.cc
Only in mysql-5.6.13/packaging/rpm-uln: mysql-5.6-stack-guard.patch
Only in mysql-5.6.13/support-files: mysql.5.6.13.spec
-
I have been wondering what the Foundation has been up to. I had high hopes for it and even contributed money but it has been very quiet. Fortunately I learned that it has been busy making decisions, maybe not in public, but somewhere. And at Percona London we will be told why MariaDB forked from MySQL prior to 5.6 and reimplemented a lot of features.
In other news the Percona London lineup looks great and I appreciate that Oracle is part of it.
-
Someone I know used to make jokes about their plans to run MySQL 4.0 forever. It wasn't a horrible idea as 4.0 was very efficient and the Google patch added monitoring and crash-proof replication slaves. I spent time this week comparing MySQL 5.7.2 with 5.6.12 and 5.1.63. To finish the results I now have numbers for 4.1.22. I wanted to include 4.0 but I don't think it works well when compiled with a modern version of gcc and I didn't want to debug the problem. The result summary is that 4.1.22 is much faster at low concurrency and much slower at high concurrency. Of course we want the best of both worlds -- 4.1.22 performance at low concurrency and 5.7.2 performance at high. Can we get that?
I used sysbench for single-threaded and high concurrency workloads. The database is cached by InnoDB. All of the QPS numbers are in previous posts except for the 4.1.22 results. I only include the charts and graphs here as the differences between 4.1.22 and modern MySQL stand out.
Single thread results
For all of the charts the results for 4.1.22 are at the top. The first result is for a workload that fetches 1 row by primary key via SELECT. MySQL 4.1.22 is much better than modern MySQL.
The next result is for a workload that fetches 1 row by primary key via HANDLER. MySQL 4.1.22 is still the best but the difference is smaller than for SELECT.
The last result is for a workload that updates 1 row by primary key. The database uses reduced durability (no binlog, no fsync on commit). Modern MySQL has gotten much slower. Bulk loading a database with MySQL might be a lot slower than it was in 4.1.22.
Concurrent results
MySQL 4.1.22 looked much better than modern MySQL on the single-threaded results. It looks much worse on the high-concurrency workloads and displays a pattern that was well known back in the day -- QPS collapses once there are too many concurrent requests.
Here is an example of that pattern for SELECT by primary key.
This is an example of the collapse for fetch 1 row by primary key via HANDLER.
The final example of the collapse is for UPDATE 1 row by primary key. Note that 5.1.63 with and without the Facebook patch also collapses.
-
Many of my write-intensive benchmarks use reduced durability mode (fsync off, binlog off) because that was required to understand whether other parts of the server might be a bottleneck. Fortunately real group commit exists in 5.6 and 5.7 and it works great. Results here compare the performance between official 5.1, 5.6 and 5.7 along with Facebook 5.1. I included FB 5.1 because it was the first to have group commit and the first to use that in production. But the official version of real group commit is much better, as is the MariaDB version. Performance for the same workload without group commit is here.
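For reference, a sketch of the settings that separate the durable and reduced-durability configurations discussed here. The fsync behavior is dynamic, while turning the binlog off (skip_log_bin) requires a restart; the exact values below are illustrative rather than a copy of my test config:

-- Durable: fsync the InnoDB redo log and the binlog on every commit.
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;

-- Reduced durability: no fsync on commit. In the reduced-durability tests the
-- binlog was also disabled via skip_log_bin in my.cnf, which cannot change at runtime.
SET GLOBAL innodb_flush_log_at_trx_commit = 0;
SET GLOBAL sync_binlog = 0;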
I compared 5 binaries:
- orig5612.gc - MySQL 5.6.12 with group commit
- orig572.gc - MySQL 5.7.2 with group commit
- fb5163.gc - MySQL 5.1.63 and the FB patch with group commit
- fb5163.nogc - MySQL 5.1.63 and the FB patch without group commit
- orig5163 - MySQL 5.1.63 without group commit
This graph displays performance for an update-only workload. The clients are 8 sysbench processes that run on one host with mysqld on another host. The total number of clients tested was 8, 16, 32, 64, 128 and 256 where the clients are evenly divided between the sysbench processes. The test database was cached by InnoDB and 8 tables with 8M rows each were used. The clients were evenly divided between the 8 tables. Each transaction is an auto-commit UPDATE that changes the non-indexed column for 1 row found by primary key lookup. The binlog was enabled and InnoDB did fsync on commit. The test server has 24 CPUs with HT enabled and storage is fast flash. This table has all of the results from the test.
For comparison I include these results from tests that disabled the binlog and fsync on commit. In these tests performance of 5.1.63 collapses under concurrency. I did not debug the cause but using the binlog and fsync on commit improved performance. Note also that 5.6 and 5.7 can do ~70k TPS in a reduced durability configuration and ~50k in a durable configuration. So durability costs about 2/7 of the peak.
binary           8      16     32     64     128
fb5163.noahi     28543  41978  31758  14480  10208
orig5163.noahi   27714  47582  35231  14936  10308
fb5612.noahi     25935  47862  69679  75730  71983
orig5612.noahi   26920  51842  73288  78757  72966
orig5612.psdis   27138  50902  71711  77208  71674
orig5612.psen    26576  48000  69451  75729  70466
orig572.noahi    26089  48382  73190  83373  75456
orig572.psdis    25090  48368  71795  82348  75060
orig572.psen     25375  45751  69154  79023  70585
-
These are results for sysbench with a cached database and concurrent workload. All data is in the InnoDB buffer pool and I used three workloads (select only, handler only, update only) as described here. The summary is that MySQL sustains much higher update rates starting with 5.6 and that improves again in 5.7. Read-only performance also improves but to get a huge increase over 5.1 or 5.6 you need a workload with extremely high concurrency.
The tests used one server for clients and another for mysqld. Ping between the hosts takes ~250 microseconds. The mysqld host has 24 CPUs with HT enabled. Durability was reduced for the update test -- binlog off, no fsync on commit and host storage was fast for the writes/fsyncs that had to be done. The names used for the binaries are described here. Each test was repeated for 8, 16, 32, 64 and 128 concurrent clients.
You might also notice there is a performance regression in the FB patches for MySQL 5.6. I am still trying to figure that out. The regression is less than the one in 5.6/5.7 when the PS is enabled but I hope we can get per-table and per-user resource monitoring with less overhead.
SELECT by PK
binary           8      16     32      64      128
fb5163.noahi     31228  59099  128500  184677  192537
orig5163.noahi   32450  67192  126999  183784  193457
fb5612.noahi     28996  58856  118444  168934  175622
orig5612.noahi   33207  59713  124882  176286  184590
orig5612.psdis   27963  63654  123344  174108  180259
orig5612.psen    29192  57937  116613  160917  164578
orig572.noahi    30053  62649  121101  171441  180280
orig572.psdis    30835  62925  117282  165293  171528
orig572.psen     31869  58030  117433  156647  160074
HANDLER by PK
binary           8      16     32      64      128
fb5163.noahi     34613  73636  156946  239452  207223
orig5163.noahi   38014  83000  152202  223349  133286
fb5612.noahi     34552  83458  152313  243776  266989
orig5612.noahi   36064  84524  158341  246033  276491
orig5612.psdis   38537  71292  159299  242109  272497
orig5612.psen    34997  82608  151322  228510  249636
orig572.noahi    33260  73790  161773  242909  280488
orig572.psdis    34244  71770  151687  239340  272236
orig572.psen     37723  72841  153221  226125  248008
UPDATE by PK
binary           8      16     32     64     128
fb5163.noahi     28543  41978  31758  14480  10208
orig5163.noahi   27714  47582  35231  14936  10308
fb5612.noahi     25935  47862  69679  75730  71983
orig5612.noahi   26920  51842  73288  78757  72966
orig5612.psdis   27138  50902  71711  77208  71674
orig5612.psen    26576  48000  69451  75729  70466
orig572.noahi    26089  48382  73190  83373  75456
orig572.psdis    25090  48368  71795  82348  75060
orig572.psen     25375  45751  69154  79023  70585
-
I used sysbench to measure the performance for concurrent clients connecting and then running a query. Each transaction in this case is one new connection followed by a HANDLER statement to fetch 1 row by primary key. Connection create is getting faster in 5.6 and even more so in 5.7. But enabling the performance schema with default options significantly reduces performance. See bug 70018 if you care about that.
There are more details on my test setup in previous posts. For this test clients and server ran on separate hosts and ping takes ~250 usecs between them today. Eight sysbench processes were run on the client host and each process created between 1 and 16 connections to mysqld. The database is cached by InnoDB and the clients were divided evenly between the tables. Each table has 8M rows.
These are results in TPS for 8, 16, 32, 64 and 128 concurrent clients. Each transaction is connect followed by a HANDLER fetch. The binaries orig572.psen and orig5612.psen use the performance schema with default options for MySQL 5.7.2 and 5.6.12. Throughput is much worse compared to the same code without the PS. All binary names are explained here.
binary           8     16    32     64     128
fb5163.noahi     4041  8084  16323  16380  16066
orig5163.noahi   4026  7848  15587  15912  15741
fb5612.noahi     4004  7425  23125  24570  24688
orig5612.noahi   4027  7601  26155  28021  28091
orig5612.psdis   4008  7643  25640  27517  27631
orig5612.psen    4205  9366  21197  21456  21592
orig572.noahi    4172  9248  28613  39451  39721
orig572.psdis    4025  7612  27600  37963  38044
orig572.psen     4001  7870  18437  22982  23240
And this chart has data for some of the binaries.
-
I used sysbench to understand the changes in connection create performance between MySQL versions 5.1, 5.6 and 5.7. The test used single-threaded sysbench where each query created a new connection and then selected one row by PK via HANDLER. The database was cached by InnoDB and both the single client thread and mysqld ran on the same host. The tests were otherwise the same as described in a previous post.
The summary is that connection create has gotten faster in MySQL 5.6 and 5.7 but enabling the performance schema with default options reduces that by about 10% for a single threaded workload. Bug 70018 is open to reduce this overhead. The memory consumed per increment of max_connections by the PS might also be interesting to you.
binary           QPS
fb5163.noahi     2087
orig5163.noahi   2122
fb5612.noahi     2656
orig5612.noahi   2775
orig5612.psdis   2706
orig5612.psen    2468
orig572.noahi    2687
orig572.psdis    2611
orig572.psen     2427
-
This isn't a new message but single-threaded performance continues to get worse in 5.7.2. There have been regressions from 5.1 to 5.6 and now to 5.7. I skipped testing 5.5. On the bright side there is progress on a bug I opened for this and MySQL seems to be very interested in making things better. The regressions for UPDATE and SELECT are much worse than for HANDLER so I assume the optimizer accounts for much of the new overhead.
The performance schema with default instrumentation appears to have a higher overhead than the Facebook patch. The critical monitoring for the FB patch is per-user and per-table statistics. While it is always nice to reduce the size of the FB patch, and switching back to the PS for that would reduce it, I don't think that will happen until the PS becomes more efficient.
The graphs below have results for 5.1 at the top, 5.6 in the middle and 5.7 on the bottom. This makes it easier to see the regressions over time.
I tested 3 workloads using sysbench: select, handler and update. The select workload fetches all columns in one row by primary key using SELECT. The handler workload does the same using HANDLER instead of SELECT. The update workload updates a non-indexed column in 1 row by primary key. For all of the tests the database was cached by InnoDB. The tests used 1 sysbench process and 1 table (sbtest1) and all processes ran on the same server. Only one client connection (1 thread) was used during each test. Durability was reduced for the update test -- no binlog, no fsync on commit.
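For readers who haven't looked inside sysbench, the three transactions are roughly the statements below. Table and column names (sbtest1, id, c) match the sysbench schema and the literal values are made up; this is an approximation, not the exact statements sysbench generates:

-- select workload: fetch all columns of one row by primary key
SELECT * FROM sbtest1 WHERE id = 1000;

-- handler workload: the same lookup via the HANDLER interface
HANDLER sbtest1 OPEN;
HANDLER sbtest1 READ `PRIMARY` = (1000);
HANDLER sbtest1 CLOSE;

-- update workload: change a non-indexed column in one row by primary key
UPDATE sbtest1 SET c = 'updated-value' WHERE id = 1000;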
MySQL 5.1.63, 5.6.12 and 5.7.2 were tested in several configuration -- with/without the adaptive hash index (AHI below) and with/without the performance schema (PS below). When the PS is enabled only the default options are used. The results use the following binary names:
- fb5163.noahi - 5.1.63, Facebook patch, AHI off
- fb5163.ahi - 5.1.63, Facebook patch, AHI on
- orig5163.noahi - 5.1.63, AHI off
- orig5163.ahi - 5.1.63, AHI on
- fb5612.noahi - 5.6.12, Facebook patch, AHI off
- fb5612.ahi - 5.6.12, Facebook patch, AHI on
- orig5612.noahi - 5.6.12, AHI off, PS not compiled
- orig5612.ahi - 5.6.12, AHI on, PS not compiled
- orig5612.psdis - 5.6.12, AHI off, PS compiled but disabled
- orig5612.psen - 5.6.12, AHI off, PS compiled & enabled
- orig572.noahi - 5.7.2, AHI off, PS not compiled
- orig572.ahi - 5.7.2, AHI on, PS not compiled
- orig572.psdis - 5.7.2, AHI off, PS compiled but disabled
- orig572.psen - 5.7.2, AHI off, PS compiled & enabled
Fetch 1 row by SELECT via PK
binary           QPS
fb5163.noahi     10087
fb5163.ahi       10918
orig5163.noahi   10230
orig5163.ahi     10614
orig5612.noahi   9412
orig5612.ahi     9511
orig5612.psdis   9334
orig5612.psen    8702
orig572.noahi    9128
orig572.ahi      9607
orig572.psdis    8573
orig572.psen     8572
Fetch 1 row by PK via HANDLER
binary           QPS
fb5163.noahi     14560
fb5163.ahi       14832
orig5163.noahi   14679
orig5163.ahi     15535
orig5612.noahi   15068
orig5612.ahi     14638
orig5612.psdis   13840
orig5612.psen    14433
orig572.noahi    13960
orig572.ahi      14799
orig572.psdis    14434
orig572.psen     13697
Update 1 row by PK
binary           QPS
fb5163.noahi     7947
fb5163.ahi       8130
orig5163.noahi   8184
orig5163.ahi     8273
orig5612.noahi   6813
orig5612.ahi     6893
orig5612.psdis   6613
orig5612.psen    6395
orig572.noahi    6350
orig572.ahi      6306
orig572.psdis    6131
orig572.psen     5984
-
I used linkbench to compare MySQL/InnoDB 5.1, 5.6 and 5.7. After a few improvements to linkbench and to the InnoDB my.cnf variables I was able to get much better QPS than before (about 1.5X better). I was ready to try 5.7 because it reduces contention from the per-index latch. All tests below use reduced durability (no binlog, no fsync on commit) and more details on the my.cnf options are at the end of this page. The tests were very IO-bound as the databases were ~600GB at test start prior to fragmentation and the InnoDB buffer pool was 64GB.
The summary is that 5.7.2 has better performance than 5.6 and 5.1 and much less mutex contention. The test server has 32 cores with HT enabled and a fast flash device. InnoDB was doing about 40,000 page reads & writes per second.
- 11281 QPS -> MySQL 5.1.63
- 23079 QPS -> MySQL 5.6.12
- 24710 QPS -> MySQL 5.7.2
Mutex contention for 5.7.2
This was collected using the performance schema during an 1800 second test run with 64 client connections. The nsecs_per column is the average number of nanoseconds per attempt to lock the mutex or rw-lock. The seconds column is the total number of seconds attempting to lock it.
+---------------------------------------------+---------+--------+
| event_name |nsecs_per| seconds|
+---------------------------------------------+---------+--------+
| wait/synch/rwlock/innodb/index_tree_rw_lock | 19543.3 | 3456.1 |
| wait/synch/mutex/innodb/log_sys_mutex | 2071.3 | 385.8 |
| wait/synch/rwlock/innodb/hash_table_locks | 165.7 | 184.5 |
| wait/synch/mutex/innodb/fil_system_mutex | 328.3 | 113.6 |
| wait/synch/mutex/innodb/redo_rseg_mutex | 1766.4 | 84.9 |
| wait/synch/rwlock/sql/MDL_lock::rwlock | 430.2 | 73.9 |
| wait/synch/mutex/innodb/buf_pool_mutex | 264.5 | 72.7 |
| wait/synch/rwlock/innodb/fil_space_latch | 27216.1 | 53.8 |
| wait/synch/mutex/sql/THD::LOCK_query_plan | 167.0 | 50.7 |
| wait/synch/mutex/innodb/trx_sys_mutex | 394.9 | 41.5 |
+---------------------------------------------+---------+--------+
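The table above can be approximated with a query like the one below against the performance schema; this is a sketch, not necessarily the exact query I ran. Timer columns in events_waits_summary_global_by_event_name are in picoseconds, so nsecs_per and seconds are derived by dividing:

SELECT event_name,
       ROUND(avg_timer_wait / 1000, 1)          AS nsecs_per,
       ROUND(sum_timer_wait / 1000000000000, 1) AS seconds
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE event_name LIKE 'wait/synch/%'
ORDER BY sum_timer_wait DESC
LIMIT 10;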
Mutex contention for 5.6.12
This was collected using the performance schema during an 1800 second test run with 64 client connections. The total number of seconds stalled for index_tree_rw_lock and buf_pool_mutex is much higher compared to 5.7.2.
+---------------------------------------------+----------+---------+
| event_name                                  | nsecs_per|    secs |
+---------------------------------------------+----------+---------+
| wait/synch/rwlock/innodb/index_tree_rw_lock | 144148.0 | 24491.5 |
| wait/synch/mutex/innodb/buf_pool_mutex      |   5439.9 |  1531.8 |
| wait/synch/mutex/innodb/log_sys_mutex       |   1821.6 |   349.7 |
| wait/synch/rwlock/innodb/hash_table_locks   |    112.4 |   240.9 |
| wait/synch/mutex/innodb/fil_system_mutex    |    234.9 |    83.2 |
| wait/synch/rwlock/sql/MDL_lock::rwlock      |    373.1 |    66.1 |
| wait/synch/mutex/innodb/trx_sys_mutex       |    332.3 |    50.6 |
| wait/synch/mutex/sql/THD::LOCK_thd_data     |    159.6 |    46.7 |
| wait/synch/mutex/innodb/os_mutex            |    190.3 |    42.9 |
| wait/synch/mutex/sql/LOCK_table_cache       |    325.4 |    35.9 |
+---------------------------------------------+----------+---------+
my.cnf options
This lists the my.cnf options for 5.7.2 and 5.6.12:
table-definition-cache=1000
table-open-cache=2000
table-open-cache-instances=1
max_connections=2000
key_buffer_size=200M
metadata_locks_hash_instances=256
query_cache_size=0
query_cache_type=0
skip_log_bin
max_allowed_packet=16000000
innodb_buffer_pool_size=64G
innodb_log_file_size=1900M
innodb_buffer_pool_instances=8
innodb_io_capacity=16384
innodb_lru_scan_depth=2048
innodb_checksum_algorithm=CRC32
innodb_flush_log_at_trx_commit=2
innodb_thread_concurrency=0
innodb_flush_method=O_DIRECT
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
innodb_adaptive_hash_index=0
innodb_doublewrite=0
innodb_flush_neighbors=0
innodb_use_native_aio = 1
-
Several members of the small data team at FB will be at MySQL Connect this weekend. It would be interesting to learn whether someone else has used Linkbench. I use it in addition to sysbench. After some effort tuning InnoDB and a few changes to the source I was able to almost double the Linkbench QPS but I really need MySQL 5.7 as the per-index latch for InnoDB indexes is the primary bottleneck.
In addition to networking at conferences, I recently spent a day looking at networking in MySQL 5.1 and 5.6. A good overview is the output from strace -c -p $PID where $PID is a thread busy with sysbench read-only queries for a cached database. Below I describe the results from MySQL 5.1.63 and 5.6.12 using official MySQL and the Facebook patch. Each result is from a sample of about 10 seconds (give or take a few seconds).
Official MySQL 5.1.63
This strace output is from official MySQL 5.1.63. There are two interesting things in these results. The first is frequent calls to sched_setparam, all of which return an error. That is bug 35164 which was fixed in MySQL 5.6. Removing the calls in 5.1 improved performance by about 0.3% on my test server. That isn't a big deal but I am happy the code is gone. The second interesting result is the high number of calls to fcntl. I filed feature request 54790 asking for them to be removed. They were a big problem for performance on older Linux kernels that used a big kernel mutex for some of the fcntl processing. See this post for details on the impact. This is not a performance problem on the kernels I have been using recently.
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
40.30 0.964307 17 57447 17100 read
36.58 0.875208 22 40348 40348 sched_setparam
11.30 0.270419 8 34199 fcntl
9.12 0.218152 11 20174 write
2.51 0.060160 20 3021 621 futex
0.19 0.004601 29 156 sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00 2.392847 155345 58069 total
Facebook 5.1.63
This strace output is from the Facebook patch for MySQL 5.1.63. It still has the frequent errors from calls to sched_setparam. But instead of too many calls to fcntl it has too many calls to setsockopt. That was a good tradeoff on some Linux kernels as described in this post but it doesn't matter on recent kernels.
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
46.15 0.673638 42 15851 read
28.17 0.411157 26 15850 15850 sched_setparam
11.73 0.171289 11 15851 setsockopt
6.93 0.101176 13 7925 write
4.61 0.067359 26 2628 615 futex
2.41 0.035137 39 896 sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00 1.459756 59001 16465 total
MySQL 5.6.12
This result is the same for both official MySQL and the Facebook patch. Hooray, the calls to sched_setparam are gone! There are many calls to recvfrom that get errors. I assume these are the non-blocking calls that return no data. There are also many calls to poll. I prefer to see fewer calls to poll and hacked on MySQL to do blocking calls to recv. That made the poll calls go away but didn't have a significant impact on performance. Perhaps it will help in the future when other bottlenecks are removed.
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
52.61 1.743483 108 16120 poll
33.75 1.118461 59 19055 sendto
9.71 0.321903 6 54229 16121 recvfrom
3.92 0.130002 19 6846 1716 futex
0.01 0.000353 118 3 sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00 3.314202 96253 17837 total
Server-side network reads
The pattern for server-side network reads is to first do a non-blocking read and if that doesn't return the expected amount of data then a read with timeout is done. I don't have all of the history behind the design decision but my guess is that there had to be support for interrupting reads on shutdown. The implementation of read with timeout has changed over time and it can be hard to figure out some of the code unless you look at the preprocessor output.
- in official MySQL 5.1.63 read with timeout was usually implemented by doing a blocking read and then the alarm thread would unblock the thread when the timeout was reached or on shutdown. There was also an option to not use the alarm thread (see -DNO_ALARM). This suffered from frequent calls to fcntl which was a problem when Linux required a big kernel mutex to process them.
- in the Facebook patch for MySQL 5.1.63 the code was changed to set a read timeout on the socket via setsockopt and this worked with -DNO_ALARM.
- in MySQL 5.6 the recv call is used to read data from a socket and the code does non-blocking reads and then does a wait with timeout via poll until there is more data.
Reference
I frequently refer to my old blog posts to research problems so I pasted some of the network stack that is used during server-side socket reads. The call stack is:
my_net_read()
-- net_read_packet()
---- net_read_packet_header()
------ net_read_raw_loop()
---- net_read_raw_loop() to get body
And the interesting code for net_read_raw_loop() is listed below:
net_read_raw_loop has:

while (count)
{
  size_t recvcnt= vio_read(net->vio, buf, count);
  /* VIO_SOCKET_ERROR (-1) indicates an error. */
  if (recvcnt == VIO_SOCKET_ERROR)
  {
    /* A recoverable I/O error occurred? */
    if (net_should_retry(net, &retry_count))
      continue;
    else
      break;
  }
  /* Zero indicates end of file. */
  else if (!recvcnt)
  {
    eof= true;
    break;
  }
  count-= recvcnt;
  buf+= recvcnt;
}

vio_read is:

size_t vio_read(Vio *vio, uchar *buf, size_t size)
{
  ssize_t ret;
  int flags= 0;
  /* If timeout is enabled, do not block if data is unavailable. */
  if (vio->read_timeout >= 0)
    flags= VIO_DONTWAIT;
  /* this is VIO_DONTWAIT == MSG_DONTWAIT and with tracing all calls to
     mysql_socket_recv have read_timeout > 0 and use MSG_DONTWAIT */
  while ((ret= mysql_socket_recv(vio->mysql_socket,
                                 (SOCKBUF_T *)buf, size, flags)) == -1)
  {
    int error= socket_errno;
    /* The operation would block? */
    if (error != SOCKET_EAGAIN && error != SOCKET_EWOULDBLOCK)
      break;
    /* Wait for input data to become available. */
    if ((ret= vio_socket_io_wait(vio, VIO_IO_EVENT_READ)))
      break;
  }
  DBUG_RETURN(ret);
}

vio_socket_io_wait is:

int vio_socket_io_wait(Vio *vio, enum enum_vio_io_event event)
{
  int timeout, ret;
  DBUG_ASSERT(event == VIO_IO_EVENT_READ || event == VIO_IO_EVENT_WRITE);
  /* Choose an appropriate timeout. */
  if (event == VIO_IO_EVENT_READ)
    timeout= vio->read_timeout;
  else
    timeout= vio->write_timeout;
  /* Wait for input data to become available. */
  switch (vio_io_wait(vio, event, timeout))
  {
  case -1:
    /* Upon failure, vio_read/write() shall return -1. */
    ret= -1;
    break;
  case 0:
    /* The wait timed out. */
    ret= -1;
    break;
  default:
    /* A positive value indicates an I/O event. */
    ret= 0;
    break;
  }
  return ret;
}

vio_io_wait is:

int vio_io_wait(Vio *vio, enum enum_vio_io_event event, int timeout)
{
  int ret;
  short DBUG_ONLY revents= 0;
  struct pollfd pfd;
  my_socket sd= mysql_socket_getfd(vio->mysql_socket);
  memset(&pfd, 0, sizeof(pfd));
  pfd.fd= sd;
  /* Set the poll bitmask describing the type of events.
     The error flags are only valid in the revents bitmask. */
  switch (event)
  {
  case VIO_IO_EVENT_READ:
    pfd.events= MY_POLL_SET_IN;
    revents= MY_POLL_SET_IN | MY_POLL_SET_ERR | POLLRDHUP;
    break;
  case VIO_IO_EVENT_WRITE:
  case VIO_IO_EVENT_CONNECT:
    pfd.events= MY_POLL_SET_OUT;
    revents= MY_POLL_SET_OUT | MY_POLL_SET_ERR;
    break;
  }
  /* Wait for the I/O event and return early in case of error or timeout */
  switch ((ret= poll(&pfd, 1, timeout)))
  {
  case -1:
    break; /* return -1 on error */
  case 0:
    /* Set errno to indicate a timeout error. */
    errno= SOCKET_ETIMEDOUT;
    break;
  default:
    /* Ensure that the requested I/O event has completed. */
    DBUG_ASSERT(pfd.revents & revents);
    break;
  }
  DBUG_RETURN(ret);
}

And mysql_socket_recv is:

static inline ssize_t
inline_mysql_socket_recv(MYSQL_SOCKET mysql_socket, SOCKBUF_T *buf, size_t n, int flags)
{
  ssize_t result;
  /* Non instrumented code */
  result= recv(mysql_socket.fd, buf, IF_WIN((int),) n, flags);
  return result;
}
-
My co-workers will speak about big and small data at XLDB. Jeremy Cole and the Tokutek founders are also speaking. I hope to learn many interesting things there including my fate (1, 2, 3, 4), whether I am doing things right or wrong, and what database technology might be used by future extremely large science experiments. Oh, you probably missed it but we are doing it all wrong. See the slides/abstract from NEDS 2013 on "The Traditional Wisdom is All Wrong". Excessively strong claims without any attempt to understand web-scale data management don't make a great paper.
- Small Data at Peta Scale by Domas Mituzas and Harrison Fisk
- Beyond Hadoop - Building the Analytics Infrastructure at Facebook by Ravi Murthy
- The MySQL Ecosystem at Scale by Jeremy Cole
- Data Structures and Algorithms for Big Databases by Michael Bender and Bradley Kuszmaul
-
The Facebook MySQL teams are presenting at MySQL Connect:
- Lots and lots of small data by Harrison Fisk
- MySQL 5.6 at Facebook by Yoshinori Matsunobu
- Panel session on MySQL 5.6 with Mark Callaghan
A lot of work has been done by the teams this year and even more is planned for the future. We have a lot more people helping to make MySQL better. That is a nice change. Now we just need to get the new people to speak & write about their work in public.