I am running a benchmark with MySQL and Linux on a server with 16GB. I used innodb_buffer_pool_size=12G. The process size is 13.6GB. What else uses all of the memory? Domas investigated the problem and one of the causes is that the per buffer pool page overhead in Innodb is high.
But a 13.6GB process shouldn't be a problem on a 16GB box, should it? Well, the Linux VM tends to generate too many page-ins and page-outs in this case. Don, Kevin, Dathan, Peter, Peter, and Peter have written about this. In my case, the sum of page-ins and page-outs were more than 500 per second. This is reported in the si and so columns (449, 413 below). The sum of si and so is ~900 and assuming that means ~900kb and 4kb blocks, then there were more than 200 reads and writes per second for this activity. I hope that does not translate to 200 IOPs of disk activity.
procs -----------memory---------- ---swap-- -----io----The first problem is to understand the numbers. I wanted to understand the output of vmstat, vmstat -a and vmstat -sa. The Linux vmstat man page is not exactly clear.
r b swpd free buff cache si so bi bo
1 3 530064 128388 13024 2434336 449 413 636 17316
MemoryBetter and possibly accurate definitions for these are (sizes are in 1 kb blocks)
swpd: the amount of virtual memory used.
free: the amount of idle memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.
inact: the amount of inactive memory. (-a option)
active: the amount of active memory. (-a option)
- swpd - amount of swap space in use
- free - unallocated physical memory
- buff - size of cache used for filesystem metadata blocks
- cache - size of cache for filesystem data blocks
- inact - amount of memory the kernel considers to be inactive and eligible to be written to swap space or to the filesystem. Inactive blocks may be dirty.
- active - see inact
# vmstat -saWhat do these mean? All numbers are for 1k blocks by default.
16226644 total memory
16100852 used memory
13576424 active memory
2340580 inactive memory
125792 free memory
36956 buffer memory
2417564 swap cache
530104 total swap
420700 used swap
109404 free swap
- total memory - physical memory
- used memory - allocated physical memory, the free column in vmstat output is total memory - used memory
- active memory - amount of memory the kernel considers to be in active use
- inactive memory - see active memory, may include dirty pages
- buffer memory - size of the cache for file-system metadata
- swap cache - RedHat describes this as the read-write cache for swap but the value matches the cache column in vmstat output and I think this is the size of the page cache used for files.
- total swap - total swap space
- used swap - used swap space, this is the swpd column in vmstat output
- free swap - unused swap space
- reduce the size of the InnoDB buffer cache. I know that there was no page-in and page-out activity when I ran the benchmark with innodb_buffer_pool_size=8G
- tune Linux: echo 0 > /proc/sys/vm/swappiness. This eventually removed all page-in and page-out activity and increased performance by 50%.
- use Solaris as some people have reported that this problem doesn't occur there
- use O_DIRECT, but Kevin reported that writes still go through the page cache and this will eventually lead to page-out activity. For unmodified InnoDB there is a good reason to not use O_DRECT as one thread and synchronous IO is used to write dirty pages. This will become a bottleneck on a busy server and isn't effective on a server with many disks and many CPU cores. This is not a performance problem when buffered IO is used as the writes usually complete fast to the OS buffer cache.


I'm pretty sure O_DIRECT bypasses the page cache for both reads and writes. I suspect that in the referenced anecdote, the buffered I/O was due to the mysqldump/restore access to the archive file (not the database).
ReplyDelete@JVS - that is good to know. I am still hesitant to use O_DIRECT. Do you know anything about this problem -- http://dammit.lt/2008/08/11/notes-from-land-of-io.
ReplyDeleteFor LucidDB, we always use libaio+O_DIRECT, with a lot of both async read (for prefetch) and async write (eagerly as we fill persistent pages, and lazily for temp data pages like hash partitions). I've never seen serialization on ext3 for writes to a preallocated file. However, when it's necessary to grow a data file, then things do get very slow as writes fill in the holes left by ftruncate, so that's probably the same issue as what Domas mentions.
ReplyDeleteTo address this, we provide a data file preallocation utility along with LucidDB; it uses posix_fallocate to fill the holes. Unfortunately, this requires opening the file without O_DIRECT, so it's necessary to take the server down while it runs (OK for benchmarks, but not so good for production use). When we add tablespace support, this would be one reason to allow them to span multiple datafiles (so a new one can be preallocated and then brought online before the old one fills up).
Using XFS is a nicer option, but it lacks critical mass; dunno if ext4 is going to fix this.
BTW... it looks like the SplitLRU kernel patch has totally solved our paging problem. We're now able to allocate 27/32GB without paging AT ALL....
ReplyDeleteWe've had this in production for >=60 days now without any problems.
It appears to raise another problem though. When I first attempted to allocate 31/32GB the box would lock the InnoDB process and it wouldn't respond. After a few minutes the box would respond again.
I didn't have much time to debug the problem though so there might be a straight forward solution.
@JVS O_DIRECT does bypass the page cache but the MySQL binary logs are not opened with O_DIRECT (only the InnoDB files).
ReplyDelete@Kevin - By 'allocate 31 GB' do you mean that you set innodb_buffer_pool_size=31G?
ReplyDelete@JVS - my tests were on preallocated files (without any holes, etc).
ReplyDelete@Domas - in that case, there may be a big difference between libaio and non-libaio when using O_DIRECT. To verify that, I adapted one of our low-level unit tests which performs random async writes on a 1000-page file from 10 threads simultaneously (5000 writes for each thread; page size is 4KB).
ReplyDeleteThis is with libaio and O_DIRECT:
real 1m25.719s
user 0m0.996s
sys 0m1.256s
This is without libaio (using a threadpool and pwrite instead, but still with O_DIRECT):
real 1m49.371s
user 0m0.908s
sys 0m1.696s
Run on a dual-core laptop, so not a great disk configuration, but probably still meaningful. iostat showed significantly higher tps for libaio. Linux kernel 2.6.22-15-generic #1 SMP.
Back to Mark Callaghan for a technical post on Linux, mysql, vmstat, laying out what all those columns of vmstat output actually mean. -- Log Buffer #126
ReplyDelete