The powerpc page table freeing relies on the fact that IRQs hold off an RCU grace period, this is currently true for all existing RCU implementations but is not an assumption Paul wants to support. Therefore, also take the RCU read lock along with disabling IRQs to ensure the RCU grace period does at least cover these lookups. Requested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <npiggin@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- arch/powerpc/mm/gup.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6/arch/powerpc/mm/gup.c =================================================================== --- linux-2.6.orig/arch/powerpc/mm/gup.c +++ linux-2.6/arch/powerpc/mm/gup.c @@ -142,6 +142,7 @@ int get_user_pages_fast(unsigned long st * So long as we atomically load page table pointers versus teardown, * we can follow the address down to the the page and take a ref on it. */ + rcu_read_lock(); local_irq_disable(); pgdp = pgd_offset(mm, addr); @@ -162,6 +163,7 @@ int get_user_pages_fast(unsigned long st } while (pgdp++, addr = next, addr != end); local_irq_enable(); + rcu_read_unlock(); VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT); return nr; @@ -171,6 +173,7 @@ int get_user_pages_fast(unsigned long st slow: local_irq_enable(); + rcu_read_unlock(); slow_irqon: pr_devel(" slow path ! nr = %d\n", nr); --
Reviewed-by: Rik van Riel <riel@redhat.com> --
It's not wrong per se, but the entire powerpc memory management code does the IRQ disabling for its pagetable RCU code. So I think it would be better to do the whole thing in one go. I don't think Paul will surprise-break powerpc :) It's up to Ben really, though. --
There's a few other places that need a similar fix then. The hash page code for example. All the C cases should end up calling the find_linux_pte() helper afaik, so we should be able to stick the lock in there (and the hugetlbfs variant, find_linux_pte_or_hugepte()). However, we also have cases of tight asm code walking the page tables, such as the tlb miss handler on embedded processors. I don't see how I could do that there. IE. I only have a handful of registers to play with, no stack, etc... So we might have to support the interrupt assumption, at least in some form, with those guys... Cheers, --
One way to make the interrupt assumption official is to use synchronize_sched() rather than synchronize_rcu(). --
Well, call_rcu_sched() then, because the current usage is to use call_rcu() to free the page directories. Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I right in reading that code that that would not be available for preemptible RCU? --
Both call_rcu_sched() and call_rcu() are always there for you. ;-) o If CONFIG_TREE_RCU (or CONFIG_TINY_RCU), they both have the same implementation. o If CONFIG_TREE_PREEMPT_RCU, call_rcu_sched() is preemptible and call_rcu() is not. Of course, with call_rcu_sched(), the corresponding RCU read-side critical sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these read-side critical sections must use raw spinlocks. Can the code in question accommodate these restrictions? Thanx, Paul --
What we protect against is always code that hard-disable IRQs (though there seem to be a bug in the hugepages code there...). Would that work ? Cheers, Ben. --
From the perspective of call_rcu_sched() and synchronize_sched(), the following things mark RCU-sched read-side critical sections: 1. rcu_read_lock_sched() and rcu_read_unlock_sched(). 2. preempt_disable() and preempt_enable(), along with anything else that disables preemption. 3. local_bh_disable() and local_bh_enable(), along with anything else that disables bottom halves. 4. local_irq_disable() and local_irq_enable(), along wiht anything else that disables hardirqs. 5. Handlers for NMIs. So I believe that in this case call_rcu_sched() is your friend. ;-) Thanx, Paul --
Looks like it :-) I'll cook up a patch changing my current call_rcu() to call_rcu_sched(). Cheers, Ben. --
OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
synchronize_rcu} functions to {*}_preempt and then add a new
CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
{*}_preempt, we've basically got what I've been asking for for a while,
Yes, that should do just fine I think.
--
Indeed: If CONFIG_TREE_PREEMPT_RCU, call_rcu() is preemptible and Cool!!! Thanx, Paul --
Same as for a preempt one, since you'd have to be able to schedule() while holding it to be able to do things like mutex_lock(). --
So what you really want is something like rcu_read_lock_sleep() rather than rcu_read_lock_preempt(), right? The point is that you want to do more than merely preempt, given that it is legal to do general blocking while holding a mutex, correct? Thanx, Paul --
Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change the name to _sleep, but we've been calling it preemptible-rcu for a long while now. --
It is actually not permitted to do general blocking in a preemptible RCU read-side critical section. Otherwise, someone is going to block waiting for a network packet that never comes, thus OOMing the system. Thanx, Paul --
Sure, something that guarantees progress seems like a sensible restriction for any lock, and in particular RCU :-) --
Excellent point -- much of the issue really does center around forward-progress guarantees. In fact the Linux kernel has a number of locking primitives that require different degrees of forward-progress guarantee from the code in their respective critical sections: o spin_lock_irqsave(): Critical sections must guarantee forward progress against everything except NMI handlers. o raw_spin_lock(): Critical sections must guarantee forward progress against everything except IRQ (including softirq) and NMI handlers. o spin_lock(): Critical sections must guarantee forward progress against everything except IRQ (again including softirq) and NMI handlers and (given CONFIG_PREEMPT_RT) higher-priority realtime tasks. o mutex_lock(): Critical sections need not guarantee forward progress, as general blocking is permitted. The other issue is the scope of the lock. The Linux kernel has the following: o BKL: global scope. o Everything else: scope defined by the use of the underlying lock variable. One of the many reasons that we are trying to get rid of BKL is because it combines global scope with relatively weak forward-progress guarantees. So here is how the various RCU flavors stack up: o rcu_read_lock_bh(): critical sections must guarantee forward progress against everything except NMI handlers and IRQ handlers, but not against softirq handlers. Global in scope, so that violating the forward-progress guarantee risks OOMing the system. o rcu_read_lock_sched(): critical sections must guarantee forward progress against everything except NMI and IRQ handlers, including softirq handlers. Global in scope, so that violating the forward-progress guarantee risks OOMing the system. o rcu_read_lock(): critical sections must guarantee forward progress against everything except NMI handlers, IRQ handlers, softirq handlers, and (in CONFIG_PREEMPT_RT) higher-priority realtime tasks. Global in scope, so that violating the forward-progress guarantee risks ...
Right, I would argue that they should guarantee fwd progress, but due to being able to schedule while holding them, its harder to enforce. Anything that is waiting for uncertainty should do so without any locks No, I quite like SRCU when implemented as preemptible tree RCU, and I don't at all mind that last point, all dynamic things need some sort of init. All locks certainly have. --
Agreed. But holding a small-scope mutex for (say) 60 seconds would not be a problem (at 120 seconds, you might start seeing softlockup messages). In contrast, holding off an RCU grace period for 60 seconds might well Very good!!! I should clarify, though -- by "explicit initialization", I mean that there needs to be a run-time call to init_srcu_struct(). Unless there is some clever way to initialize an array of pointers to per-CPU structures at compile time. And, conversely, a way to initialize pointers in a per-CPU structure to point to possibly-different rcu_node structures. Thanx, Paul --
This isn't quite right. mutex critical sections must guarantee eventual forward progress against the class of other potential acquirers of the mutex otherwise the system will become either deadlocked or livelocked. James --
If I understand you correctly, you are saying that it is OK for a given critical section for a given mutex to fail to make forward progress if nothing else happens to acquire that mutex during that time. I would agree, at least I would if you were to further add that the soft-lockup checks permit an additional 120 seconds of failure to make forward progress even if something -is- attempting to acquire that mutex. By my standards, 120 seconds is a reasonable approximation to infinity, hence my statement above. So, would you agree with the following as a more precise statement? o mutex_lock(): Critical sections need not guarantee forward progress unless some other task is waiting on the mutex in question, in which case critical sections should complete in 120 seconds. Thanx, Paul --
Yes ... I was thinking of two specific cases: one is wrong programming of lock acquisition where the system deadlocks; the other is doing silly things like taking a mutex around an event loop instead of inside it so incoming events prevent forward progress and the system livelocks, but there are many other ways of producing deadlocks and livelocks. I just couldn't think of a concise way of saying all of that but I didn't want a statement about mutexes to give the impression that anything goes. I've got to say that I also dislike seeing any form of sleep within a critical section because it's just asking for a nasty entangled deadlock which can be very hard to sort out. So I also didn't like the statement Sounds fair. James --
Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use neither, we use call_rcu and we free the pages from the callback. Ben. --
BTW. you currently have an interesting page table freeing path where you usually free by RCU, but (occasionally) free by IPI. This means you need to disable both RCU and interrupts to walk page tables. If you change it to always use RCU, then you wouldn't need to disable interrupts. Whether this actually matters anywhere in your mm code, I don't know (it's probably not terribly important for gup_fast). But rcu disable is always preferable for latency and performance. --
Well, the point is we use interrupts to synchronize. The fact that RCU used to do the job was an added benefit. I may need to switch to rcu _sched variants tho to keep that. The IPI case is a slow path in case we Well, the main case is the hash miss and that always runs with IRQs off. Cheers, --
It is the slowpath but it forces all lookup paths to do irq disable Probably not a big deal then. --
