Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation

Previous thread: [PATCH 06/13] mm: Preemptible mmu_gather by Peter Zijlstra on Thursday, April 8, 2010 - 12:17 pm. (5 messages)

Next thread: [PATCH 08/13] sparc: Preemptible mmu_gather by Peter Zijlstra on Thursday, April 8, 2010 - 12:17 pm. (1 message)
From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 12:17 pm

The powerpc page table freeing relies on the fact that IRQs hold off
an RCU grace period, this is currently true for all existing RCU
implementations but is not an assumption Paul wants to support.

Therefore, also take the RCU read lock along with disabling IRQs to
ensure the RCU grace period does at least cover these lookups.

Requested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/mm/gup.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/arch/powerpc/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/gup.c
+++ linux-2.6/arch/powerpc/mm/gup.c
@@ -142,6 +142,7 @@ int get_user_pages_fast(unsigned long st
 	 * So long as we atomically load page table pointers versus teardown,
 	 * we can follow the address down to the the page and take a ref on it.
 	 */
+	rcu_read_lock();
 	local_irq_disable();
 
 	pgdp = pgd_offset(mm, addr);
@@ -162,6 +163,7 @@ int get_user_pages_fast(unsigned long st
 	} while (pgdp++, addr = next, addr != end);
 
 	local_irq_enable();
+	rcu_read_unlock();
 
 	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
 	return nr;
@@ -171,6 +173,7 @@ int get_user_pages_fast(unsigned long st
 
 slow:
 		local_irq_enable();
+		rcu_read_unlock();
 slow_irqon:
 		pr_devel("  slow path ! nr = %d\n", nr);
 


--

From: Rik van Riel
Date: Thursday, April 8, 2010 - 1:31 pm

Reviewed-by: Rik van Riel <riel@redhat.com>
--

From: Nick Piggin
Date: Thursday, April 8, 2010 - 8:11 pm

It's not wrong per se, but the entire powerpc memory management code
does the IRQ disabling for its pagetable RCU code. So I think it would be
better to do the whole thing in one go.

I don't think Paul will surprise-break powerpc :)

It's up to Ben really, though.


--

From: Benjamin Herrenschmidt
Date: Monday, April 12, 2010 - 6:05 pm

There's a few other places that need a similar fix then. The hash page
code for example. All the C cases should end up calling the
find_linux_pte() helper afaik, so we should be able to stick the lock in
there (and the hugetlbfs variant, find_linux_pte_or_hugepte()).

However, we also have cases of tight asm code walking the page tables,
such as the tlb miss handler on embedded processors. I don't see how I
could do that there. IE. I only have a handful of registers to play
with, no stack, etc...

So we might have to support the interrupt assumption, at least in some
form, with those guys...

Cheers,


--

From: Paul E. McKenney
Date: Monday, April 12, 2010 - 8:43 pm

One way to make the interrupt assumption official is to use
synchronize_sched() rather than synchronize_rcu().

--

From: Peter Zijlstra
Date: Wednesday, April 14, 2010 - 6:51 am

Well, call_rcu_sched() then, because the current usage is to use
call_rcu() to free the page directories.

Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I
right in reading that code that that would not be available for
preemptible RCU?


--

From: Paul E. McKenney
Date: Thursday, April 15, 2010 - 7:28 am

Both call_rcu_sched() and call_rcu() are always there for you.  ;-)

o	If CONFIG_TREE_RCU (or CONFIG_TINY_RCU), they both have the same
	implementation.

o	If CONFIG_TREE_PREEMPT_RCU, call_rcu_sched() is preemptible and
	call_rcu() is not.

Of course, with call_rcu_sched(), the corresponding RCU read-side critical
sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
read-side critical sections must use raw spinlocks.

Can the code in question accommodate these restrictions?

							Thanx, Paul
--

From: Benjamin Herrenschmidt
Date: Thursday, April 15, 2010 - 11:54 pm

What we protect against is always code that hard-disable IRQs (though
there seem to be a bug in the hugepages code there...). Would that
work ?

Cheers,
Ben.

--

From: Paul E. McKenney
Date: Friday, April 16, 2010 - 6:43 am

From the perspective of call_rcu_sched() and synchronize_sched(),
the following things mark RCU-sched read-side critical sections:

1.	rcu_read_lock_sched() and rcu_read_unlock_sched().

2.	preempt_disable() and preempt_enable(), along with anything
	else that disables preemption.

3.	local_bh_disable() and local_bh_enable(), along with anything
	else that disables bottom halves.

4.	local_irq_disable() and local_irq_enable(), along wiht anything
	else that disables hardirqs.

5.	Handlers for NMIs.

So I believe that in this case call_rcu_sched() is your friend.  ;-)

							Thanx, Paul
--

From: Benjamin Herrenschmidt
Date: Friday, April 16, 2010 - 4:25 pm

Looks like it :-)

I'll cook up a patch changing my current call_rcu() to call_rcu_sched().

Cheers,
Ben.


--

From: Peter Zijlstra
Date: Friday, April 16, 2010 - 6:51 am

OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
synchronize_rcu} functions to {*}_preempt and then add a new
CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
{*}_preempt, we've basically got what I've been asking for for a while,

Yes, that should do just fine I think.
--

From: Paul E. McKenney
Date: Friday, April 16, 2010 - 7:17 am

Indeed:  If CONFIG_TREE_PREEMPT_RCU, call_rcu() is preemptible and


Cool!!!

							Thanx, Paul
--

From: Peter Zijlstra
Date: Friday, April 16, 2010 - 7:23 am

Same as for a preempt one, since you'd have to be able to schedule()
while holding it to be able to do things like mutex_lock().
--

From: Paul E. McKenney
Date: Friday, April 16, 2010 - 7:32 am

So what you really want is something like rcu_read_lock_sleep() rather
than rcu_read_lock_preempt(), right?  The point is that you want to do
more than merely preempt, given that it is legal to do general blocking
while holding a mutex, correct?

							Thanx, Paul
--

From: Peter Zijlstra
Date: Friday, April 16, 2010 - 7:56 am

Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
the name to _sleep, but we've been calling it preemptible-rcu for a long
while now.
--

From: Paul E. McKenney
Date: Friday, April 16, 2010 - 8:09 am

It is actually not permitted to do general blocking in a preemptible RCU
read-side critical section.  Otherwise, someone is going to block waiting
for a network packet that never comes, thus OOMing the system.

							Thanx, Paul
--

From: Peter Zijlstra
Date: Friday, April 16, 2010 - 8:14 am

Sure, something that guarantees progress seems like a sensible
restriction for any lock, and in particular RCU :-)


--

From: Paul E. McKenney
Date: Friday, April 16, 2010 - 9:45 am

Excellent point -- much of the issue really does center around
forward-progress guarantees.  In fact the Linux kernel has a number of
locking primitives that require different degrees of forward-progress
guarantee from the code in their respective critical sections:

o	spin_lock_irqsave(): Critical sections must guarantee forward
	progress against everything except NMI handlers.

o	raw_spin_lock(): Critical sections must guarantee forward
	progress against everything except IRQ (including softirq)
	and NMI handlers.

o	spin_lock(): Critical sections must guarantee forward
	progress against everything except IRQ (again including softirq)
	and NMI handlers and (given CONFIG_PREEMPT_RT) higher-priority
	realtime tasks.

o	mutex_lock(): Critical sections need not guarantee
	forward progress, as general blocking is permitted.

The other issue is the scope of the lock.  The Linux kernel has
the following:

o	BKL: global scope.

o	Everything else: scope defined by the use of the underlying
	lock variable.

One of the many reasons that we are trying to get rid of BKL is because
it combines global scope with relatively weak forward-progress guarantees.

So here is how the various RCU flavors stack up:

o	rcu_read_lock_bh(): critical sections must guarantee forward
	progress against everything except NMI handlers and IRQ handlers,
	but not against softirq handlers.  Global in scope, so that
	violating the forward-progress guarantee risks OOMing the system.

o	rcu_read_lock_sched(): critical sections must guarantee
	forward progress against everything except NMI and IRQ handlers,
	including softirq handlers.  Global in scope, so that violating
	the forward-progress guarantee risks OOMing the system.

o	rcu_read_lock(): critical sections must guarantee forward
	progress against everything except NMI handlers, IRQ handlers,
	softirq handlers, and (in CONFIG_PREEMPT_RT) higher-priority
	realtime tasks.  Global in scope, so that violating the
	forward-progress guarantee risks ...
From: Peter Zijlstra
Date: Friday, April 16, 2010 - 12:37 pm

Right, I would argue that they should guarantee fwd progress, but due to
being able to schedule while holding them, its harder to enforce.

Anything that is waiting for uncertainty should do so without any locks

No, I quite like SRCU when implemented as preemptible tree RCU, and I
don't at all mind that last point, all dynamic things need some sort of
init. All locks certainly have.

--

From: Paul E. McKenney
Date: Friday, April 16, 2010 - 1:28 pm

Agreed.  But holding a small-scope mutex for (say) 60 seconds would not be
a problem (at 120 seconds, you might start seeing softlockup messages).
In contrast, holding off an RCU grace period for 60 seconds might well

Very good!!!  I should clarify, though -- by "explicit initialization",
I mean that there needs to be a run-time call to init_srcu_struct().
Unless there is some clever way to initialize an array of pointers to
per-CPU structures at compile time.  And, conversely, a way to initialize
pointers in a per-CPU structure to point to possibly-different rcu_node
structures.

							Thanx, Paul
--

From: James Bottomley
Date: Saturday, April 17, 2010 - 8:06 pm

This isn't quite right.  mutex critical sections must guarantee eventual
forward progress against the class of other potential acquirers of the
mutex otherwise the system will become either deadlocked or livelocked.

James


--

From: Paul E. McKenney
Date: Sunday, April 18, 2010 - 6:55 am

If I understand you correctly, you are saying that it is OK for a given
critical section for a given mutex to fail to make forward progress if
nothing else happens to acquire that mutex during that time.  I would
agree, at least I would if you were to further add that the soft-lockup
checks permit an additional 120 seconds of failure to make forward progress
even if something -is- attempting to acquire that mutex.

By my standards, 120 seconds is a reasonable approximation to infinity,
hence my statement above.

So, would you agree with the following as a more precise statement?

o	mutex_lock(): Critical sections need not guarantee
	forward progress unless some other task is waiting
	on the mutex in question, in which case critical sections
	should complete in 120 seconds.

							Thanx, Paul
--

From: James Bottomley
Date: Sunday, April 18, 2010 - 11:55 am

Yes ... I was thinking of two specific cases: one is wrong programming
of lock acquisition where the system deadlocks; the other is doing silly
things like taking a mutex around an event loop instead of inside it so
incoming events prevent forward progress and the system livelocks, but
there are many other ways of producing deadlocks and livelocks.  I just
couldn't think of a concise way of saying all of that but I didn't want
a statement about mutexes to give the impression that anything goes.

I've got to say that I also dislike seeing any form of sleep within a
critical section because it's just asking for a nasty entangled deadlock
which can be very hard to sort out.  So I also didn't like the statement

Sounds fair.

James


--

From: Benjamin Herrenschmidt
Date: Thursday, April 15, 2010 - 11:51 pm

Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
neither, we use call_rcu and we free the pages from the callback.

Ben.


--

From: Nick Piggin
Date: Friday, April 16, 2010 - 1:18 am

BTW. you currently have an interesting page table freeing path where
you usually free by RCU, but (occasionally) free by IPI. This means
you need to disable both RCU and interrupts to walk page tables.

If you change it to always use RCU, then you wouldn't need to disable
interrupts. Whether this actually matters anywhere in your mm code, I
don't know (it's probably not terribly important for gup_fast). But
rcu disable is always preferable for latency and performance.

--

From: Benjamin Herrenschmidt
Date: Friday, April 16, 2010 - 1:29 am

Well, the point is we use interrupts to synchronize. The fact that RCU
used to do the job was an added benefit. I may need to switch to rcu
_sched variants tho to keep that. The IPI case is a slow path in case we

Well, the main case is the hash miss and that always runs with IRQs off.

Cheers,


--

From: Nick Piggin
Date: Friday, April 16, 2010 - 2:22 am

It is the slowpath but it forces all lookup paths to do irq disable

Probably not a big deal then.

--

Previous thread: [PATCH 06/13] mm: Preemptible mmu_gather by Peter Zijlstra on Thursday, April 8, 2010 - 12:17 pm. (5 messages)

Next thread: [PATCH 08/13] sparc: Preemptible mmu_gather by Peter Zijlstra on Thursday, April 8, 2010 - 12:17 pm. (1 message)