Ever seen this message?
Sep 7 16:36:18 server kernel: [5359680.372082] INFO: task jbd2/xvda2-8:103 blocked for more than 120 seconds. Sep 7 16:36:18 server kernel: [5359680.372096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 7 16:36:18 server kernel: [5359680.372103] jbd2/xvda2-8 D ffff8801ffc13780 0 103 2 0x00000000 Sep 7 16:36:18 server kernel: [5359680.372112] ffff8801f3cc2880 0000000000000246 0000000000000000 ffffffff8160d020 Sep 7 16:36:18 server kernel: [5359680.372122] 0000000000013780 ffff8801f3827fd8 ffff8801f3827fd8 ffff8801f3cc2880 Sep 7 16:36:18 server kernel: [5359680.372137] ffff8801f3827b20 00000001f3827b20 ffff8801f6d10ad0 ffff8801ffc13fd0 Sep 7 16:36:18 server kernel: [5359680.372151] Call Trace: Sep 7 16:36:18 server kernel: [5359680.372163] [<ffffffff810b4fa3>] ? lock_page+0x20/0x20
I originally thought it had something to do with LVM2, as it happened to me when I was trying to do an lvextend on a logical volume. Unfortunately, the error returned by LVM2 was not helpful; it told me I did not have enough space. FYI, I'm using an iSCSI server as my resource for Xen DOMU's on several DOM0's, with each DOMU having one or more logical volumes.
LVM2 told me there was not enough space (there was more than enough). However, when I looked in syslog, I saw the above error at the same time I was trying to do this. I figure "lvm2 bug..." or "xen bug..."
Then, an associate of mine mentioned the same problem. Different setup completely. He was using Xen, but each of his virtuals were running on their own separate local partition. Based on that, I narrowed the problem to Xen, and then on further research, not to Xen but the kernel itself.
The problem appears to be how long it takes to access the mass storage. Basically, Linux (and most other OS's) do not write directly to disk but, instead, write to memory (disk buffer) where the data is stored for a period of time, then written in one large swoop. This increases the efficiency of disk writes and, on machines with little disk access and a lot of memory, helps the computer react more rapidly. Information that is in the disk buffer but is not yet written to disk is called "dirty" (vs clean, where it is just there in case you try to read it again).
Basically, my understanding is that when data is written to disk, the kernel puts it into the disk buffer. Following is an example:
- If buffer too full for new data
- If buffer contains too much dirty data
- write dirty data to disk
- Allocate enough room for new input by throwing away oldest buffer data.
- Write new data to the buffer
It is that step 1.1.1 which causes problems. It could, in theory, wait for a gigabyte or two of data to be written to disk before accepting 1 bit from the program. If it takes longer than 120 seconds for that to occur, you get this error message.
The kernel will write data to disk during rest times (my term, when it is not doing anything else) when it thinks it has enough data in the buffer to be worthwhile, or when the dirty buffer data has been in there too long.
Note: The error message above states
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
This is not a good idea. The problem will still exist, and you will still be having errors, but you won't get any notification of them in the logs. That is not a fix; it just hides the probrelm from you.
On Linux, the way the disk buffers work is controlled by sysctl which is a command you can run to view or set various parameters. To see what parameters are available, you can:
ls -l /proc/sys/vm
or, you can see the list of all current values with the command
Calls to sysctl must be done as root since it can dramatically alter (and destroy) a server.
There are many parameters, but the ones which appear to be the most important are:
|vm.dirty_background_ratio||percentage of total system memory, the number of pages at which the pdflush background writeback daemon will start writing out dirty data|
|vm.swappiness||how aggressively memory pages are swapped to disk. If you do not want swapping, than lower this value. However, if your system process sleeps for a long time you may benefit with an aggressive swapping behavior by increasing this value|
|vm.dirty_ratio||percentage of total system memory, the number of pages at which a process which is generating disk writes will itself start writing out dirty data. This is nothing but the ratio at which dirty pages created by application disk writes will be flushed out to disk. A value of 40 mean that data will be written into system memory until the file system cache has a size of 40% of the server's RAM. So if you've 12GB ram, data will be written into system memory until the file system cache has a size of 4.8G.|
To view one of the values, execute sysctl with the name of the parameter. To set it, add the -w flag and give the form parameter=value. For example:
server> sysctl vm.swappiness vm.swappiness = 60 server> sysctl -w vm.swappiness=100 server> sysctl vm.swappiness vm.swappiness = 100
Note: using sysctl -w makes changes to the running kernel, and will be lost on the next reboot. To make changes permenant, you should edit /etc/sysctl.conf. Open /etc/sysctl.conf in your favorite text editor (you must be root) and find the parameter you want to modify (or add it at the bottom). For example, to make vm.swappiness=100 permenant, add
to sysctl.conf (or edit the existing entries), then save the file. On the next reboot, this value will be used, but to make the changes start now, you can either use sysctl -w or you can use the -p parameter to tell it to reload
First, you should try to figure out what is causing the issue. Is the buffer not being flushed fast enough? In that case, you would decrease vm_dirty_background_ratio or vm_dirty_ratio.
One convenient thing to do is look at syslog entries BEFORE the entry above. You may see NFS or iSCSI or even just plain old /dev/sd? haveing problems, waiting, and waiting. Or, it may be something else.
Is the disk caching memory and swapping too often. Decrease vm.swappiness which will decrease (or eliminating) swapping memory to the swap space.
Make small changes and see what happens. I do not have the solution down yet, and am actually just beginning to attack it since I don't know if the problem is on the iSCSI server (where I have a 15G buffer) or the individual DOM0's.
Various articles have suggested different values for sysctl, though it seems the common issue is setting vm.dirty_ratio to some lower value. This is appearantly even more true on systems with a lot of memory, where 40% of memory can be a whole lot (ie, my current iSCSI server has way overkill at 15G, meaning it can fill up 6G of RAM before thinking it must flush the cache). Since this is covered by the vm.dirty_ratio, they are talking some serious changes, down to 10, or even 5.
This will cause the kernel to flush the buffers if they ever get above 10% of total memory or, in my case, 1.5G. Actually, I'm thinking 5% would be better for me. See Bibliography 5 for a better explanation.
For a very thourough explanation of how the buffering system works, see Bibliography 7. It shows what each one of the parameters actually does in a nice, logical manner.
The following web sites helped me research this, and some of their content was stolen word-for-word (copy/paste).
- 1 - http://www.cyberciti.biz/faq/linux-kernel-tuning-virtual-memory-subsystem/
- 2 - http://www.cyberciti.biz/faq/making-changes-to-proc-filesystem-permanently/
- 3 - http://xen.1045712.n5.nabble.com/Task-blocked-for-more-than-120-seconds-td5714661.html
- 4 - http://blog.ronnyegner-consulting.de/2011/10/13/info-task-blocked-for-more-than-120-seconds/ - He talks about the issue. See his reply to Karty on 24 Jul at bottom
- 5 - http://ag-up.com/?p=1225 - short and sweet explaination of the problem
- 6 - http://h10025.www1.hp.com/ewfrf/wc/document?cc=us&lc=en&docname=c03735549 - HP article on it.
- 7 - http://www.westnet.com/~gsmith/content/linux-pdflush.htm - Probably more information than you wanted to know, but a really good explaination of how Linux does caching