cgroup: rmdir() does not complete

From: Mark Hills
Date: Thursday, August 26, 2010 - 8:51 am

I am experiencing hung tasks when trying to rmdir() on a cgroup. One task 
spins, others queue up behind it with the following:

  INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  soaked-cgrou D ffff8800058157c0     0 27257  29411 0x00000000
  ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
  0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
  ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
  Call Trace:
  [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
  [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
  [<ffffffff81108a7c>] ? path_put+0x1d/0x22
  [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
  [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
  [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
  [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
  [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no 
tasks.

Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to 
the rmdir. It looks like what I am seeing here and indicates that some 
cgroup subsystem is busy, indefinitely.

I have not worked out how to reproduce it quickly. My only way is to 
complete a 'dd' command in the cgroup, but then the problem is so rare it 
is slow progress.

Documentation/cgroups/memory.txt describes how force_empty can be required 
in some cases. Does this mean that with the patch above, these cases will 
now spin on rmdir(), instead of returning -EBUSY? How can I produce a 
reliable test case requiring memory.force_empty to be used, to test this?

Or is it likely to be some other cause, and how best to find it?

Thanks

-- 
Mark
--

From: Daisuke Nishimura
Date: Thursday, August 26, 2010 - 5:56 pm

Hi.

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
That commit did cause an rmdir bug, but it was fixed by commit 88703267.
Which cgroup subsystems were mounted on the hierarchy containing the
directory you tried to rmdir()?
If you mounted several subsystems on the same hierarchy, can you mount them
separately to narrow down the cause ?
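
As a rough sketch of what "mount them separately" could look like, the
helper below prints one mount command per subsystem instead of running
anything. The /cgroup base directory, the function name and the example
subsystem list are assumptions for illustration only.

```shell
#!/bin/bash
# Print (do not run) the commands that would give each cgroup subsystem
# its own hierarchy. $1 is the base directory (an assumption); the
# remaining arguments are subsystem names.
print_separate_mounts() {
    base="$1"
    shift
    for subsys in "$@"; do
        echo "mkdir -p $base/$subsys"
        echo "mount -t cgroup -o $subsys none $base/$subsys"
    done
}

# Example subsystem list; adjust to whatever is actually in use.
print_separate_mounts /cgroup memory cpuset
```

Running the test against one hierarchy at a time should then show which
subsystem's directory is the one that refuses to go away.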


Thanks,
Daisuke Nishimura.
--

From: Balbir Singh
Date: Thursday, August 26, 2010 - 6:20 pm

On Fri, Aug 27, 2010 at 6:26 AM, Daisuke Nishimura

It would also be nice to see what your mounted cgroup (filesystem
perspective) looks like and what /proc/cgroups looks like when the
problem occurs.
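
A minimal sketch for collecting that information when the hang is seen.
The function name is hypothetical, and the two optional arguments exist
only so it can be pointed at saved copies of the files.

```shell
#!/bin/bash
# Show where cgroup hierarchies are mounted (and with which options),
# followed by the kernel's subsystem table.
cgroup_snapshot() {
    mounts="${1:-/proc/mounts}"
    cgroups="${2:-/proc/cgroups}"
    echo "== cgroup mounts =="
    # Field 3 of /proc/mounts is the filesystem type, field 2 the mount
    # point and field 4 the mount options (which name the subsystems).
    awk '$3 == "cgroup" { print $2, $4 }' "$mounts"
    echo "== $cgroups =="
    cat "$cgroups"
}
```

Calling `cgroup_snapshot` with no arguments reads the live files under
/proc.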

Balbir
--

From: KAMEZAWA Hiroyuki
Date: Thursday, August 26, 2010 - 7:35 pm

On Fri, 27 Aug 2010 09:56:39 +0900

It seems I can reproduce the issue on mmotm-0811, too.

try this.

Here, memory cgroup is mounted at /cgroups.
==
#!/bin/bash -x

while sleep 1; do
        date
        mkdir /cgroups/test
        echo 0 > /cgroups/test/tasks
        echo 300M > /cgroups/test/memory.limit_in_bytes
        cat /proc/self/cgroup
        dd if=/dev/zero of=./tmpfile bs=4096 count=100000
        echo 0 > /cgroups/tasks
        cat /proc/self/cgroup
        rmdir /cgroups/test
        rm ./tmpfile
done
==

It hangs at rmdir. I'm now investigating force_empty.

Thanks,
-Kame

--

From: Daisuke Nishimura
Date: Thursday, August 26, 2010 - 8:39 pm

On Fri, 27 Aug 2010 11:35:06 +0900
Thank you very much for your information.

Some questions.

Is "tmpfile" created on a normal filesystem (e.g. ext3) or tmpfs?
And how long is it likely to take to cause this problem?
I've run it on a RHEL6-based kernel with ext3 for about one hour, but
I cannot reproduce it yet.


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Thursday, August 26, 2010 - 10:42 pm

On Fri, 27 Aug 2010 12:39:48 +0900


Hmm...I'll dig more. Maybe I need to use stock kernel rather than -mm...


Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Thursday, August 26, 2010 - 11:29 pm

On Fri, 27 Aug 2010 14:42:25 +0900
Sorry, my test only hangs on -mm + (other patches);
no troubles on 2.6.34 and 2.6.36-rc1.

Where can I see the 2.6.33.6 (Fedora) kernel?

Thanks,
-Kame

--

From: Balbir Singh
Date: Monday, August 30, 2010 - 12:32 am

You can get the SRPM from the mirrors, one place to find it would be

http://download.fedora.redhat.com/pub/fedora/linux/updates/13/SRPMS/ 

-- 
	Three Cheers,
	Balbir
--

From: Mark Hills
Date: Monday, August 30, 2010 - 2:13 am

The test case I was running is similar to the above. With the Lustre 
filesystem the problem takes 4 hours or more to show itself. Recently I 
ran 4 threads for over 24 hours without it being seen -- I suspect some 
external factor is involved.

I also tried NFS, and did not see a problem after 8 hours or so, but this 
is inconclusive.

The use of the Fedora kernel and the Lustre filesystem is not 
satisfactory to trace the bug. Until I can get a test case which is more 
readily reproducible, I'm not able to reasonably think about changing 
variables.

It is interesting you see the problem so readily on ext4; I will test that 
soon (it is currently holiday weekend in the UK). I hope it will give me 
the test case I am looking for.

Thanks

-- 
Mark
--

From: Mark Hills
Date: Wednesday, September 1, 2010 - 4:10 am

I repeated the test above, but did not see a problem after many hundreds 
of loops.

My test was with the same kernel from my original bug report (Fedora 
2.6.33.6-147), using memory cgroup only and ext4 filesystem.

So it is possible we are experiencing different bugs with similar 
symptoms.

-- 
Mark
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 1, 2010 - 4:42 pm

On Wed, 1 Sep 2010 12:10:23 +0100 (BST)

Thank you for confirming.
But hmm... I'm curious who holds the mutex and what is happening.

-Kame

--

From: Mark Hills
Date: Thursday, September 2, 2010 - 2:45 am

Refer to my original email, where I was running multiple tests at once. 
This backtrace is from the tests which queue up:

  Call Trace:
  [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
  [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
  [<ffffffff81108a7c>] ? path_put+0x1d/0x22
  [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
  [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
  [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
  [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
  [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

The one which spins has already managed to claim the mutex lock on the 
/cgroup directory, and no call trace is shown for this. Is there a usable 
way to force a similar call trace for the spinning process?

Unfortunately I have not been able to reproduce the problem for some days 
now, so I think some network factor is able to influence this.

-- 
Mark
--

From: Mark Hills
Date: Thursday, September 9, 2010 - 3:01 am

On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:


I have a system showing the failure case (but still do not have a way to 
reliably repeat it)

Here are the two processes:

23586 pts/0    RL+  5059:18 /net/homes/mhills/tmp/soaked-cgroup
23685 pts/6    DL+    0:00 /net/homes/mhills/tmp/soaked-cgroup

23586 spends almost all of its time in 'RL+' status, occasionally it is 
seen in 'DL+' status.

From my analysis before, both are blocked on rmdir(), but one is spinning, 
holding the lock on the /cgroup, and the other is waiting for the lock. If 
I strace 23586 then the rmdir() fails with EINTR.

How best to capture information which might show why the process spins?

-- 
Mark
--

From: Balbir Singh
Date: Thursday, September 9, 2010 - 3:09 am

Any chance you can compile with debug cgroup subsystem and get
information from there? 

-- 
	Three Cheers,
	Balbir
--

From: Mark Hills
Date: Thursday, September 9, 2010 - 4:36 am

I can, I'd like to experiment with a custom kernel next.

I am still finding the problem incredibly hard to reproduce, so I'd like 
to observe as much data as possible from the current case before 
rebooting. If I could capture some kind of stack trace in the kernel for 
the running process that would be great, any suggestions appreciated.

Thanks

-- 
Mark
--

From: Peter Zijlstra
Date: Thursday, September 9, 2010 - 4:50 am

echo l > /proc/sysrq-trigger

another thing you can do is run something like: perf record -gp $pid
which will give you a profile of that task.
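
A further option, sketched here under assumptions, is to sample the
task's state and in-kernel stack from /proc a few times. /proc/<pid>/stack
needs root and a kernel built with CONFIG_STACKTRACE, and it may show
little while the task is actually running on a CPU. The function name and
the optional proc-root argument are hypothetical.

```shell
#!/bin/bash
# Sample a task's scheduler state and kernel stack.
# $1 = pid, $2 = number of samples (default 5),
# $3 = proc root (default /proc; overridable to read saved data).
sample_task() {
    pid="$1"
    count="${2:-5}"
    proc="${3:-/proc}"
    i=0
    while [ "$i" -lt "$count" ]; do
        # "State:" shows R (running), D (uninterruptible sleep), etc.
        grep '^State:' "$proc/$pid/status" || return 1
        # Kernel-side call chain; silently skipped if not readable.
        cat "$proc/$pid/stack" 2>/dev/null
        sleep 0.1
        i=$((i + 1))
    done
}
```

For example, `sample_task "$pid" 20` alongside `perf record -g -p "$pid"`
gives both point samples and a profile of the same task.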
--

From: Mark Hills
Date: Thursday, September 9, 2010 - 4:04 pm

Despite running this many times, I never 'catch' the process on a CPU, 

This is very useful, thanks.

The report on the spinning process (23586) is dominated by calls from 
mem_cgroup_force_empty.

It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
the load (I assume drain_all_stock_sync has been optimised out). But I 
don't think this is as important as what causes the spin.

There are no tasks in the cgroup, but memory usage is non-zero and 
constant. It seems mem_cgroup_force_empty is unable to empty the cgroup in 
this case.

  # cat /cgroup/soaked-23586/tasks
  # cat /cgroup/soaked-23586/memory.usage_in_bytes
  24576
  # cat /cgroup/soaked-23586/memsw.usage_in_bytes
  <hangs>

Here are the first few entries from the perf output, I can provide the 
rest if needed, but all result from mem_cgroup_force_empty.

     8.13%   :23586  [kernel]           [k] _raw_spin_lock_irqsave
             |
             --- _raw_spin_lock_irqsave
                |          
                |--45.14%-- probe_workqueue_insertion
                |          insert_work
                |          |          
                |          |--99.09%-- __queue_work
                |          |          queue_work_on
                |          |          schedule_work_on
                |          |          schedule_on_each_cpu
                |          |          |          
                |          |          |--50.59%-- lru_add_drain_all
                |          |          |          mem_cgroup_force_empty
                |          |          |          mem_cgroup_pre_destroy
                |          |          |          cgroup_rmdir
                |          |          |          vfs_rmdir
                |          |          |          do_rmdir
                |          |          |          sys_rmdir
                |          |          |          system_call_fastpath
                |          |          |          0x3f504d27d7
                |  ...
From: KAMEZAWA Hiroyuki
Date: Thursday, September 9, 2010 - 4:43 pm

On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
I think this "cat" hang is because of vfs's lock.

Hmm, then there are pages on the LRU which cannot be moved, or there is
an accounting leak.

BTW, mem_cgroup's rmdir is designed to be able to receive SIGINT etc...
Can't you stop rmdir with Ctrl-C or similar?

  rmdir -> hang -> Ctrl-C (or similar) -> cat .../memory.stat

Does that work? And do you still use Fedora's kernel?
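
The sequence above could be scripted roughly as follows, assuming GNU
timeout(1) is available and that the spinning rmdir really does react to
SIGINT. The function name and the default time limit are hypothetical;
the cgroup directory is passed in by the caller.

```shell
#!/bin/bash
# Try to remove a cgroup directory. If rmdir fails, or does not finish
# within $2 seconds and is interrupted with SIGINT, print memory.stat
# from that directory instead.
rmdir_or_stat() {
    dir="$1"
    secs="${2:-10}"
    status=0
    timeout -s INT "$secs" rmdir "$dir" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "removed $dir"
        return 0
    fi
    # timeout(1) reports 124 when the time limit was reached.
    echo "rmdir $dir did not complete (status $status)" >&2
    cat "$dir/memory.stat" 2>/dev/null
    return 1
}
```

If memory.stat can be read after the interrupt, its counters should hint
at what kind of pages are still charged to the group.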

Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 9, 2010 - 7:16 pm

On Fri, 10 Sep 2010 00:04:31 +0100 (BST)

I noticed you use FUSE, and it seems there is a problem in FUSE vs. memcg.
I wrote a patch (onto 2.6.36 but can be applied..)

Could you try this? I'm sorry, I don't use a FUSE system and can't test it
right now.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory cgroup catches all pages which are added to the radix-tree and
assumes the pages will be added to the LRU somewhere.
But there are pages which are on the radix-tree but not on the LRU. Then
force_empty cannot find them and cannot finish ->pre_destroy(), and so
rmdir operations.

This patch adds __GFP_NOMEMCGROUP and avoids registering unnecessary,
out-of-control pages with the memory cgroup.

Note: This gfp flag can be used for shmem handling, which now uses
      complicated heuristics.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/fuse/dev.c       |   11 ++++++++++-
 include/linux/gfp.h |    7 +++++++
 mm/memcontrol.c     |    2 +-
 3 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/swap.h>
 #include <linux/splice.h>
+#include <linux/memcontrol.h>
 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
 	struct pipe_buffer *buf = cs->pipebufs;
 	struct address_space *mapping;
 	pgoff_t index;
+	gfp_t mask = GFP_KERNEL;
 
 	unlock_request(cs->fc, cs->req);
 	fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
 	remove_from_page_cache(oldpage);
 	page_cache_release(oldpage);
 
-	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+	/*
+	 * not-on-LRU pages are out of control. So, add to root cgroup.
+ 	 * See mm/memcontrol.c for details.
+	 */
+	if (buf->flags & ...
From: Daisuke Nishimura
Date: Thursday, September 9, 2010 - 9:05 pm

On Fri, 10 Sep 2010 11:16:46 +0900
The comment above says "not-on-LRU pages are out of control. So, add to root cgroup.".
But this change means that we don't charge these pages at all.

Should it be:

	if (gfp_mask & __GFP_NOMEMCGROUP)
		mm = &init_mm;

?
Or, change the comment ?


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 9, 2010 - 9:11 pm

On Fri, 10 Sep 2010 13:05:39 +0900

yes....the comment is wrong.

Thanks,
-Kame
==

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory cgroup catches all pages which are added to the radix-tree and
assumes the pages will be added to the LRU somewhere.
But there are pages which are on the radix-tree but not on the LRU. Then
force_empty cannot find them and cannot finish ->pre_destroy(), and so
rmdir operations.

This patch adds __GFP_NOMEMCGROUP and avoids registering unnecessary,
out-of-control pages with the memory cgroup.

Note: This gfp flag can be used for shmem handling, which now uses
      complicated heuristics.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/fuse/dev.c       |   11 ++++++++++-
 include/linux/gfp.h |    7 +++++++
 mm/memcontrol.c     |    2 +-
 3 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/swap.h>
 #include <linux/splice.h>
+#include <linux/memcontrol.h>
 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
 	struct pipe_buffer *buf = cs->pipebufs;
 	struct address_space *mapping;
 	pgoff_t index;
+	gfp_t mask = GFP_KERNEL;
 
 	unlock_request(cs->fc, cs->req);
 	fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
 	remove_from_page_cache(oldpage);
 	page_cache_release(oldpage);
 
-	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+	/*
+	 * non-LRU pages are out of cgroup controls.
+ 	 * See mm/memcontrol.c or Documentation/cgroup/memory.txt for details.
+	 */
+	if (buf->flags & PIPE_BUF_FLAG_LRU)
+		mask |= __GFP_NOMEMCGROUP;
+
+	err = add_to_page_cache_locked(newpage, mapping, index, mask);
 	if (err) {
 		printk(KERN_WARNING ...
From: Mark Hills
Date: Friday, September 10, 2010 - 12:28 am

What makes you conclude that FUSE is in use? I do not think this is the 
case. Or do you mean that it is a problem that the kernel is built with 
FUSE support?

I _can_ test the patch, but I still cannot reliably reproduce the problem 
so it will be hard to conclude whether the patch works or not. Is there a 
way to build a test case for this?

Thanks for your help

-- 
Mark
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 10, 2010 - 12:33 am

On Fri, 10 Sep 2010 08:28:00 +0100 (BST)


I'm sorry, I'm not sure yet. But from your report, you have 6 pages of charge
which cannot be found by force_empty(). And I found that FUSE's pipe copy code
inserts a page into the radix-tree but does not move it onto the LRU.
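
The figure of 6 pages follows from the 24576 bytes reported in
memory.usage_in_bytes, assuming 4096-byte pages (the usual size on
x86-64). A small sketch of the conversion, with a hypothetical function
name:

```shell
#!/bin/bash
# Convert a memory.usage_in_bytes reading into a page count.
# $1 = bytes, $2 = page size (defaults to the running system's page size).
bytes_to_pages() {
    bytes="$1"
    page_size="${2:-$(getconf PAGESIZE)}"
    echo $(( bytes / page_size ))
}

bytes_to_pages 24576 4096   # prints 6
```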

So,
  - There are remaining pages which are not on the LRU.
  - FUSE's "move" code does something curious: add_to_page_cache() but no LRU.
  - You reported that you use Lustre FS.

That is why I asked you. To test this, I would have to study FUSE to write a
test module... Maybe adding printk() where I added the gfp_mask modification
in fuse/dev.c can show something but...

We may have some other problem too, but it seems this is one of them.

Thanks,
-Kame

--

From: Mark Hills
Date: Friday, September 10, 2010 - 12:51 am

Lustre does not use FUSE. But the client is a set of kernel modules, so 

Okay, it sounds like perhaps I need to investigate Lustre, I will do this 
next week. But I think FUSE can be ruled out.

Thanks again

-- 
Mark
--

From: KAMEZAWA Hiroyuki
Date: Thursday, August 26, 2010 - 6:25 pm

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)

Please show how you reproduce it.


Hmm. I'm not sure whether the Fedora kernel has other features (of its own)
compared with the stock kernel.

At first look, the mutex above is the one in do_rmdir(), not in kernel/cgroup.c.
So rmdir doesn't seem to reach cgroup code...
Do you do another operation on the directory while rmdir is called?

Thanks,
-Kame


--

From: Mark Hills
Date: Monday, August 30, 2010 - 2:25 am

It sleeps in D state, but enters interruptible state periodically which is 

I use a C program which creates a container and places itself in the 
container, then forks a dd process.


Quite a few: memory, blkio, cpuacct, cpuset.

Until I can get a more reproducible test case (see my previous mail), I 

Interesting, I checked for that but not sure how I missed it. There is 

In one case I did an 'ls -l' on the filesystem which coincided with a lock 
up, but I was not able to reproduce this.

-- 
Mark
--
