Re: [PATCH] fs: run emergency remount on dedicated workqueue

Previous thread: Invetsment by Victor Miller on Wednesday, May 26, 2010 - 7:53 am. (1 message)

Next thread: PROBLEM: getpid() returning same value as getppid() by Sébastien Paumier on Wednesday, May 26, 2010 - 7:45 am. (3 messages)
From: David Howells
Date: Wednesday, May 26, 2010 - 8:01 am

The following commit may be a problem for emergency_remount() [Alt+SysRq+U]:

	commit fa4b9074cd8428958c2adf9dc0c831f46e27c193
	Author: Tejun Heo <tj@kernel.org>
	Date:   Sat May 15 20:09:27 2010 +0200

	    buffer: make invalidate_bdev() drain all percpu LRU add caches

	    invalidate_bdev() should release all page cache pages which are clean
	    and not being used; however, if some pages are still in the percpu LRU
	    add caches on other cpus, those pages are considered in used and don't
	    get released.  Fix it by calling lru_add_drain_all() before trying to
	    invalidate pages.

	    This problem was discovered while testing block automatic native
	    capacity unlocking.  Null pages which were read before automatic
	    unlocking didn't get released by invalidate_bdev() and ended up
	    interfering with partition scan after unlocking.

	    Signed-off-by: Tejun Heo <tj@kernel.org>
	    Acked-by: David S. Miller <davem@davemloft.net>
	    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

The symptoms are a lockdep warning:

SysRq : Emergency Remount R/O

=============================================
[ INFO: possible recursive locking detected ]
2.6.34-cachefs #101
---------------------------------------------
events/0/9 is trying to acquire lock:
 (events){+.+.+.}, at: [<ffffffff81042cf0>] flush_work+0x34/0xec

but task is already holding lock:
 (events){+.+.+.}, at: [<ffffffff81042264>] worker_thread+0x19a/0x2e2

other info that might help us debug this:
3 locks held by events/0/9:
 #0:  (events){+.+.+.}, at: [<ffffffff81042264>] worker_thread+0x19a/0x2e2
 #1:  ((work)#3){+.+...}, at: [<ffffffff81042264>] worker_thread+0x19a/0x2e2
 #2:  (&type->s_umount_key#30){++++..}, at: [<ffffffff810b3fc8>] do_emergency_remount+0x54/0xda

stack backtrace:
Pid: 9, comm: events/0 Not tainted 2.6.34-cachefs #101
Call Trace:
 [<ffffffff81054e80>] validate_chain+0x584/0xd23
 [<ffffffff81052ad4>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff8101c264>] ? ...
From: Tejun Heo
Date: Thursday, May 27, 2010 - 2:57 am

Commit fa4b9074cd8428958c2adf9dc0c831f46e27c193 made s_umount depend
on keventd; however, emergency remount schedules works to keventd
which grabs s_umount creating a circular dependency.  Run emergency
remount on a separate workqueue to break it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Unless someone objects, Andrew, can you please take this patch?

Thanks.

 fs/super.c |   19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 69688b1..1ada607 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -575,6 +575,11 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	return 0;
 }

+/*
+ * For emergency remount
+ */
+static struct workqueue_struct *emergency_remount_wq;
+
 static void do_emergency_remount(struct work_struct *work)
 {
 	struct super_block *sb, *n;
@@ -605,13 +610,25 @@ void emergency_remount(void)
 {
 	struct work_struct *work;

+	if (!emergency_remount_wq)
+		return;
+
 	work = kmalloc(sizeof(*work), GFP_ATOMIC);
 	if (work) {
 		INIT_WORK(work, do_emergency_remount);
-		schedule_work(work);
+		queue_work(emergency_remount_wq, work);
 	}
 }

+static int __init emergency_remount_init(void)
+{
+	emergency_remount_wq = create_singlethread_workqueue("emerg-remount");
+	if (!emergency_remount_wq)
+		pr_warn("failed to create emergency remount workqueue\n");
+	return 0;
+}
+subsys_initcall(emergency_remount_init);
+
 /*
  * Unnamed block devices are dummy devices used by virtual
  * filesystems which don't use real block-devices.  -- jrs
--

From: Américo Wang
Date: Thursday, May 27, 2010 - 7:59 am

I have a stupid question, why using workqueue instead of
calling do_remount_sb() directly in emergency_remount()?
Avoid blocking emergency_remount()?

Thanks.

-- 
Live like a child, think like the god.
 
--

From: Tejun Heo
Date: Thursday, May 27, 2010 - 10:03 am

Umm... because it's called from interrupt handler?  Right?

-- 
tejun
--

From: Américo Wang
Date: Thursday, May 27, 2010 - 11:46 pm

Ah, this is true, sysrq can be both triggered by keyboard and
/proc/sysrq-trigger.

Thanks!
--

From: Andrew Morton
Date: Tuesday, June 1, 2010 - 4:46 pm

On Thu, 27 May 2010 11:57:23 +0200

For a while I thought you had the wrong commit ID, but I worked it out!

Please, always quote the patch title rather than a bare commit ID.  The
usual form is

    fa4b9074cd8428958c2adf9dc0c831f46e27c193 ("buffer: make
    invalidate_bdev() drain all percpu LRU add caches:)

The main reason for this is so that people can more reliably and simply

gaah.  Do we really want to add Yet Another Kernel Thread just for that
dopey sysrq-U thing?

I assume (coz you didn't tell us) that it generates a lockdep spew? 
Perhaps it'd be better to just suppress that somehow rather than this...

And if we _do_ end up adding a new kernel thread for this, maybe it
would be better to use that thread for lru_add_drain_all() rather than
within the dopey do_emergency_remount(), so as to reduce the likelihood
that we'll need to add even more kernel threads to solve the same
problem elsewhere?  But this would require a new kernel thread on each
CPU, grr.

Another possibility might be to change lru_add_drain_all() to use IPI
interrupts rather than schedule_on_each_cpu().  That would greatly
speed up lru_add_drain_all().  I don't recall why we did it that way
and I don't immediately see a reason not to.  A few things in core mm
would need to be changed from spin_lock_irq() to spin_lock_irqsave().

But I do have vague memories that there was a reason for it.

<It's a huge PITA locating the commit which initially added
lru_add_drain_all()>


It didn't tell us.

<looks in the linux-mm archives>

Nope, no rationale is provided there either.
--

From: Linus Torvalds
Date: Tuesday, June 1, 2010 - 4:57 pm

Absolutely. Also, I think it's usually more readable to quote just the 
first 12 hex digits of the SHA1 - that's still going to be perfectly 
unique in any practical situation, and makes it way easier to flow the 

I do have to agree that it's disgusting. Can't we use an existing thread 
(slow-work?) or something like that?

		Linus
--

From: Tejun Heo
Date: Tuesday, June 1, 2010 - 5:13 pm

Hello,



The dedicated workqueue can go away with cmwq.  As it's a temporary
measure until then, I wanted to keep it simple.  Would it be okay if I
note that the dedicated workqueue will go away soonish in the patch
description and comment?

Thanks.

-- 
tejun
--

From: Dave Young
Date: Tuesday, June 1, 2010 - 6:02 pm

Maybe this thread?




-- 
Regards
dave
--

From: Andrew Morton
Date: Tuesday, June 1, 2010 - 6:57 pm

Close.  There's some talk there of using smp_call_function() (actually
on_each_cpu()) within lru_add_drain_all(), but nobody seems to have
tried it.

--

Previous thread: Invetsment by Victor Miller on Wednesday, May 26, 2010 - 7:53 am. (1 message)

Next thread: PROBLEM: getpid() returning same value as getppid() by Sébastien Paumier on Wednesday, May 26, 2010 - 7:45 am. (3 messages)