I am experiencing hung tasks when trying to rmdir() on a cgroup. One task spins, and the others queue up behind it with the following:

INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
 ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
 ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
Call Trace:
 [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
 [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
 [<ffffffff81108a7c>] ? path_put+0x1d/0x22
 [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
 [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
 [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
 [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

The kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no tasks.

Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to the rmdir. It looks like what I am seeing here and indicates that some cgroup subsystem is busy, indefinitely.

I have not worked out how to reproduce it quickly. My only way is to complete a 'dd' command in the cgroup, but the problem is so rare that progress is slow.

Documentation/cgroups/memory.txt describes how force_empty can be required in some cases. Does this mean that with the patch above, these cases will now spin on rmdir(), instead of returning -EBUSY?

How can I produce a reliable test case requiring memory.force_empty to be used, to test this?

Or is it likely to be some other cause, and how best to find it?

Thanks
--
Mark
--
Hi.

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)

That commit did cause an rmdir bug, but it was fixed by commit 88703267.

Which cgroup subsystems were mounted on the hierarchy containing the directory you tried to rmdir()? If you mounted several subsystems on the same hierarchy, can you mount them separately to narrow down the cause?

Thanks,
Daisuke Nishimura.
--
On Fri, Aug 27, 2010 at 6:26 AM, Daisuke Nishimura

It would also be nice to see what your mounted cgroup hierarchy looks like (from the filesystem perspective) and what /proc/cgroups shows when the problem occurs.

Balbir
--
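The information requested here can be gathered in one step. The following is only a sketch: the function name `cgroup_snapshot` and its two optional arguments are made up for illustration (they default to the real /proc files and exist only so the function can be run against saved copies).

```sh
# cgroup_snapshot [MOUNTS_FILE] [CGROUPS_FILE]
# Print every cgroup mount (mount point and options) followed by the
# per-subsystem table. Both arguments are hypothetical conveniences.
cgroup_snapshot() {
    mounts_file=${1:-/proc/mounts}
    cgroups_file=${2:-/proc/cgroups}

    echo "== cgroup mounts =="
    # In /proc/mounts, field 2 is the mount point, field 3 the filesystem
    # type and field 4 the mount options (which name the subsystems).
    awk '$3 == "cgroup" { print $2, $4 }' "$mounts_file"

    echo "== /proc/cgroups =="
    cat "$cgroups_file"
}
```

Running `cgroup_snapshot` with no arguments on the affected machine, while the rmdir is stuck, should show which subsystems share the hierarchy.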
On Fri, 27 Aug 2010 09:56:39 +0900
It seems I can reproduce the issue on mmotm-0811, too.
try this.
Here, memory cgroup is mounted at /cgroups.
==
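#!/bin/bash -x
# Loop: create a memcg, fill page cache from inside it, leave, then remove it.
while sleep 1; do
    date
    mkdir /cgroups/test
    # Writing 0 to "tasks" attaches the writing process (this shell).
    echo 0 > /cgroups/test/tasks
    echo 300M > /cgroups/test/memory.limit_in_bytes
    cat /proc/self/cgroup
    # Write about 400MB so page cache is charged to the test cgroup.
    dd if=/dev/zero of=./tmpfile bs=4096 count=100000
    # Move the shell back to the root cgroup so "test" has no tasks.
    echo 0 > /cgroups/tasks
    cat /proc/self/cgroup
    rmdir /cgroups/test
    rm ./tmpfile
done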
==
It hangs at rmdir. I'm now investigating force_empty.
Thanks,
-Kame
--
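When leaving a loop like the one above running for hours, it may help to bound each rmdir so that a hang is logged instead of stalling the whole test. This is a sketch under assumptions: the helper name `rmdir_with_timeout` is invented here, it relies on the coreutils `timeout` command, and it only helps if the stuck rmdir is in an interruptible wait (later in this thread the memcg rmdir path is described as designed to accept signals).

```sh
# rmdir_with_timeout DIR [SECONDS]
# Run rmdir, sending SIGINT if it has not finished after SECONDS (default 60).
# Returns rmdir's own exit status, or 124 if timeout had to interrupt it.
rmdir_with_timeout() {
    dir=$1
    limit=${2:-60}
    status=0

    # "timeout -s INT" delivers SIGINT instead of the default SIGTERM.
    timeout -s INT "$limit" rmdir "$dir" || status=$?

    if [ "$status" -eq 124 ]; then
        echo "rmdir $dir still running after ${limit}s; sent SIGINT" >&2
    fi
    return "$status"
}
```

In the loop, `rmdir /cgroups/test` could be replaced with `rmdir_with_timeout /cgroups/test 120 || break`, leaving the cgroup in place for inspection.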
On Fri, 27 Aug 2010 11:35:06 +0900

Thank you very much for your information.

Some questions. Is "tmpfile" created on a normal filesystem (e.g. ext3) or on tmpfs? And how long is it likely to take to cause this problem? I've run it on a RHEL6-based kernel with ext3 for about one hour, but I cannot reproduce it yet.

Thanks,
Daisuke Nishimura.
--
On Fri, 27 Aug 2010 12:39:48 +0900 Hmm...I'll dig more. Maybe I need to use stock kernel rather than -mm... Thanks, -Kame --
On Fri, 27 Aug 2010 14:42:25 +0900

Sorry, my test only hangs on -mm + (other patches); I see no trouble on 2.6.34 or 2.6.36-rc1.

Where can I find the 2.6.33.6 (Fedora) kernel?

Thanks,
-Kame
--
You can get the SRPM from the mirrors, one place to find it would be http://download.fedora.redhat.com/pub/fedora/linux/updates/13/SRPMS/ -- Three Cheers, Balbir --
The test case I was running is similar to the above. With the Lustre filesystem the problem takes 4 hours or more to show itself. Recently I ran 4 threads for over 24 hours without it being seen, so I suspect some external factor is involved.

I also tried NFS, and did not see a problem after 8 hours or so, but this is inconclusive.

The combination of the Fedora kernel and the Lustre filesystem is not a satisfactory setup for tracing the bug. Until I can get a test case which is more readily reproducible, I'm not able to reasonably think about changing variables.

It is interesting you see the problem so readily on ext4; I will test that soon (it is currently a holiday weekend in the UK). I hope it will give me the test case I am looking for.

Thanks
--
Mark
--
I repeated the test above, but did not see a problem after many hundreds of loops. My test was with the same kernel from my original bug report (Fedora 2.6.33.6-147), using memory cgroup only and ext4 filesystem. So it is possible we are experiencing different bugs with similar symptoms. -- Mark --
On Wed, 1 Sep 2010 12:10:23 +0100 (BST) Thank you for confirming. But hmm...it's curious who holds mutex and what happens. -Kame --
Refer to my original email, where I was running multiple tests at once. This backtrace is from the tests which queue up:

Call Trace:
 [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
 [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
 [<ffffffff81108a7c>] ? path_put+0x1d/0x22
 [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
 [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
 [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
 [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

The one which spins has already managed to claim the mutex lock on the /cgroup directory, and no call trace is shown for this.

Is there a usable way to force a similar call trace for the spinning process?

Unfortunately I have not been able to reproduce the problem for some days now, so I think some network factor is able to influence this.

--
Mark
--
On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:

I have a system showing the failure case (but still do not have a way to reliably repeat it).

Here are the two processes:

23586 pts/0 RL+ 5059:18 /net/homes/mhills/tmp/soaked-cgroup
23685 pts/6 DL+ 0:00 /net/homes/mhills/tmp/soaked-cgroup

23586 spends almost all of its time in 'RL+' status; occasionally it is seen in 'DL+' status.

From my analysis before, both are blocked on rmdir(), but one is spinning, holding the lock on the /cgroup, and the other is waiting for the lock.

If I strace 23586 then the rmdir() fails with EINTR.

How best to capture information which might show why the process spins?

--
Mark
--
Any chance you can compile with debug cgroup subsystem and get information from there? -- Three Cheers, Balbir --
I can, I'd like to experiment with a custom kernel next. I am still finding the problem incredibly hard to reproduce, so I'd like to observe as much data as possible from the current case before rebooting. If I could capture some kind of stack trace in the kernel for the running process that would be great, any suggestions appreciated. Thanks -- Mark --
  echo l > /proc/sysrq-trigger

Another thing you can do is run something like:

  perf record -gp $pid

which will give you a profile of that task.
--
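As a complement to the two suggestions above, the kernel stack of a single task can also be sampled from /proc. This is only a sketch: the function name `sample_kstack` is made up, /proc/PID/stack needs root and a kernel built with CONFIG_STACKTRACE, and the output for a task that is actually running on a CPU may be empty or unreliable, so several samples are taken in the hope of catching it while it sleeps.

```sh
# sample_kstack PID [COUNT] [INTERVAL]
# Print the scheduler state and kernel stack of PID, COUNT times
# (default 10), sleeping INTERVAL seconds (default 1) between samples.
sample_kstack() {
    pid=$1
    count=${2:-10}
    interval=${3:-1}
    i=0

    while [ "$i" -lt "$count" ]; do
        echo "--- sample $i: $(date)"
        # "State:" line, e.g. "R (running)" or "D (disk sleep)".
        grep '^State:' "/proc/$pid/status" 2>/dev/null \
            || echo "(state unavailable)"
        # Kernel-side call chain of the task, if the kernel exposes it.
        cat "/proc/$pid/stack" 2>/dev/null \
            || echo "(stack unavailable)"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, `sample_kstack 23586 20 0.5 > kstack.log` would record twenty samples of the spinning process, assuming a `sleep` that accepts fractional seconds.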
Despite running this many times, I never 'catch' the process on a CPU,
This is very useful, thanks.
The report on the spinning process (23586) is dominated by calls from
mem_cgroup_force_empty.
It seems to show lru_add_drain_all and drain_all_stock_sync are causing
the load (I assume drain_all_stock_sync has been optimised out). But I
don't think this is as important as what causes the spin.
There are no tasks in the cgroup, but memory usage is non-zero and
constant. It seems mem_cgroup_force_empty is unable to empty the cgroup in
this case.
# cat /cgroup/soaked-23586/tasks
# cat /cgroup/soaked-23586/memory.usage_in_bytes
24576
# cat /cgroup/soaked-23586/memsw.usage_in_bytes
<hangs>
Here are the first few entries from the perf output, I can provide the
rest if needed, but all result from mem_cgroup_force_empty.
8.13% :23586 [kernel] [k] _raw_spin_lock_irqsave
|
--- _raw_spin_lock_irqsave
|
|--45.14%-- probe_workqueue_insertion
| insert_work
| |
| |--99.09%-- __queue_work
| | queue_work_on
| | schedule_work_on
| | schedule_on_each_cpu
| | |
| | |--50.59%-- lru_add_drain_all
| | | mem_cgroup_force_empty
| | | mem_cgroup_pre_destroy
| | | cgroup_rmdir
| | | vfs_rmdir
| | | do_rmdir
| | | sys_rmdir
| | | system_call_fastpath
| | | 0x3f504d27d7
| ...
On Fri, 10 Sep 2010 00:04:31 +0100 (BST)

I think this "cat" hang is because of the VFS lock.

Hmm, then either there are pages on the LRU which cannot be moved, or there is an accounting leak.

BTW, mem_cgroup's rmdir is designed to be able to receive SIGINT etc. Can't you stop rmdir with Ctrl-C or similar? Does this sequence work?

  rmdir -> hang -> Ctrl-C (or similar) -> cat .../memory.stat

And do you still use Fedora's kernel?

Thanks,
--
On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
I noticed you use FUSE, and there seems to be a problem in FUSE vs. memcg.
I wrote a patch (onto 2.6.36 but can be applied..)
Could you try this ? I'm sorry I don't use FUSE system and can't test
right now.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The memory cgroup charges every page which is added to the radix-tree and
assumes the page will be added to the LRU somewhere.
But there are pages which are on the radix-tree but not on the LRU. Then
force_empty cannot find them and cannot finish ->pre_destroy(), so rmdir
never completes.
This patch adds __GFP_NOMEMCGROUP to avoid charging such unnecessary,
out-of-control pages to the memory cgroup.
Note: This gfp flag can be used for shmem handling, which now uses
complicated heuristics.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/fuse/dev.c | 11 ++++++++++-
include/linux/gfp.h | 7 +++++++
mm/memcontrol.c | 2 +-
3 files changed, 18 insertions(+), 2 deletions(-)
Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
#include <linux/pipe_fs_i.h>
#include <linux/swap.h>
#include <linux/splice.h>
+#include <linux/memcontrol.h>
MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
struct pipe_buffer *buf = cs->pipebufs;
struct address_space *mapping;
pgoff_t index;
+ gfp_t mask = GFP_KERNEL;
unlock_request(cs->fc, cs->req);
fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
remove_from_page_cache(oldpage);
page_cache_release(oldpage);
- err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+ /*
+ * not-on-LRU pages are out of control. So, add to root cgroup.
+ * See mm/memcontrol.c for details.
+ */
+ if (buf->flags & ...
On Fri, 10 Sep 2010 11:16:46 +0900

The comment above says "not-on-LRU pages are out of control. So, add to root cgroup.". But this change means that we don't charge these pages at all. Should it be:

    if (gfp_mask & __GFP_NOMEMCGROUP)
        mm = &init_mm;

? Or should the comment be changed?

Thanks,
Daisuke Nishimura.
--
On Fri, 10 Sep 2010 13:05:39 +0900
yes....the comment is wrong.
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The memory cgroup charges every page which is added to the radix-tree and
assumes the page will be added to the LRU somewhere.
But there are pages which are on the radix-tree but not on the LRU. Then
force_empty cannot find them and cannot finish ->pre_destroy(), so rmdir
never completes.
This patch adds __GFP_NOMEMCGROUP to avoid charging such unnecessary,
out-of-control pages to the memory cgroup.
Note: This gfp flag can be used for shmem handling, which now uses
complicated heuristics.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/fuse/dev.c | 11 ++++++++++-
include/linux/gfp.h | 7 +++++++
mm/memcontrol.c | 2 +-
3 files changed, 18 insertions(+), 2 deletions(-)
Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
#include <linux/pipe_fs_i.h>
#include <linux/swap.h>
#include <linux/splice.h>
+#include <linux/memcontrol.h>
MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
struct pipe_buffer *buf = cs->pipebufs;
struct address_space *mapping;
pgoff_t index;
+ gfp_t mask = GFP_KERNEL;
unlock_request(cs->fc, cs->req);
fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
remove_from_page_cache(oldpage);
page_cache_release(oldpage);
- err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+ /*
+ * non-LRU pages are out of cgroup controls.
+ * See mm/memcontrol.c or Documentation/cgroup/memory.txt for details.
+ */
+ if (buf->flags & PIPE_BUF_FLAG_LRU)
+ mask |= __GFP_NOMEMCGROUP;
+
+ err = add_to_page_cache_locked(newpage, mapping, index, mask);
if (err) {
printk(KERN_WARNING ...
What makes you conclude that FUSE is in use? I do not think this is the case. Or do you mean that it is a problem that the kernel is built with FUSE support?

I _can_ test the patch, but I still cannot reliably reproduce the problem so it will be hard to conclude whether the patch works or not. Is there a way to build a test case for this?

Thanks for your help
--
Mark
--
On Fri, 10 Sep 2010 08:28:00 +0100 (BST)

I'm sorry, I'm not sure yet. But from your report, you have 6 pages of charge which cannot be found by force_empty(). And I found that FUSE's pipe copy code inserts a page into the radix-tree but does not move it onto the LRU.

So,
- There are remaining pages which are not on the LRU.
- FUSE's "move" code does something curious: add_to_page_cache() but no LRU insertion.
- You reported that you use Lustre FS.

That is why I asked you.

To test this, I would have to study FUSE to write a test module... Maybe adding a printk() where I added the gfp_mask modification in fuse/dev.c can show something, but...

We may have some other problem too, but this seems to be one of them.

Thanks,
-Kame
--
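The "6 pages" figure follows from the memory.usage_in_bytes value of 24576 reported earlier, assuming the usual 4096-byte page size on x86_64. A small sketch of the arithmetic (the helper name `usage_to_pages` is invented for illustration):

```sh
# usage_to_pages BYTES [PAGE_SIZE]
# Convert a memory.usage_in_bytes value into a page count.
# PAGE_SIZE defaults to the system page size reported by getconf.
usage_to_pages() {
    bytes=$1
    page_size=${2:-$(getconf PAGESIZE)}
    echo $((bytes / page_size))
}

usage_to_pages 24576 4096   # prints 6
```

On the affected machine the same helper could be fed directly from the cgroup, for example `usage_to_pages "$(cat /cgroup/soaked-23586/memory.usage_in_bytes)"`.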
Lustre does not use FUSE. But the client is a set of kernel modules, so

Okay, it sounds like perhaps I need to investigate Lustre; I will do this next week. But I think FUSE can be ruled out.

Thanks again
--
Mark
--
On Thu, 26 Aug 2010 16:51:55 +0100 (BST)

Please show how you reproduce it.

Hmm. I'm not sure whether the Fedora kernel has other features (of its own) beyond the stock kernel.

At first look, the mutex above is the one in do_rmdir(), not one in kernel/cgroup.c. So rmdir doesn't seem to reach the cgroup code...

Do you do another operation on the directory while rmdir is called?

Thanks,
-Kame
--
It sleeps in D state, but enters interruptible state periodically which is

I use a C program which creates a container and places itself in the container, then forks a dd process.

Quite a few: memory, blkio, cpuacct, cpuset.

Until I can get a more reproducible test case (see my previous mail), I

Interesting, I checked for that but not sure how I missed it. There is

In one case I did an 'ls -l' on the filesystem which coincided with a lock up, but I was not able to reproduce this.

--
Mark
--
