It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a tasks credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c ...
Fix __task_cred()'s lockdep check by removing the following validation
condition:
lockdep_tasklist_lock_is_held()
as commit_creds() does not take the tasklist_lock, and nor do most of the
functions that call it, so this check is pointless and it can prevent
detection of the RCU lock not being held if the tasklist_lock is held.
Instead, add the following validation condition:
task->exit_state >= 0
to permit the access if the target task is dead and therefore unable to change
its own credentials.
Fix __task_cred()'s comment to:
(1) discard the bit that says that the caller must prevent the target task
from being deleted. That shouldn't need saying.
(2) Add a comment indicating the result of __task_cred() should not be passed
directly to get_cred(), but rather than get_task_cred() should be used
instead.
Also put a note into the documentation to enforce this point there too.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
Documentation/credentials.txt | 3 +++
include/linux/cred.h | 15 ++++++++++-----
include/linux/sched.h | 1 +
3 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/Documentation/credentials.txt b/Documentation/credentials.txt
index a2db352..995baf3 100644
--- a/Documentation/credentials.txt
+++ b/Documentation/credentials.txt
@@ -417,6 +417,9 @@ reference on them using:
This does all the RCU magic inside of it. The caller must call put_cred() on
the credentials so obtained when they're finished with.
+ [*] Note: The result of __task_cred() should not be passed directly to
+ get_cred() as this may race with commit_cred().
+
There are a couple of convenience functions to access bits of another task's
credentials, hiding the RCU magic from the caller:
diff --git a/include/linux/cred.h b/include/linux/cred.h
index ce40cbc..4d2c395 100644
--- a/include/linux/cred.h
+++ ...I got below warning. Is this related to this patch?
[ 140.173556] ===================================================
[ 140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 140.216461] ---------------------------------------------------
[ 140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!
[ 140.218937]
[ 140.218938] other info that might help us debug this:
[ 140.218939]
[ 140.220508]
[ 140.220509] rcu_scheduler_active = 1, debug_locks = 1
[ 140.221991] 1 lock held by init/1:
[ 140.222668] #0: (tasklist_lock){.+.+..}, at: [<c104a0ac>] kill_something_info+0x7c/0x160
[ 140.224709]
[ 140.224711] stack backtrace:
[ 140.225661] Pid: 1, comm: init Not tainted 2.6.35 #1
[ 140.226576] Call Trace:
[ 140.227111] [<c103cca8>] ? printk+0x18/0x20
[ 140.227908] [<c1069884>] lockdep_rcu_dereference+0x94/0xb0
[ 140.228931] [<c104936a>] check_kill_permission+0x15a/0x170
[ 140.229932] [<c104a0ac>] ? kill_something_info+0x7c/0x160
[ 140.230921] [<c1049cca>] group_send_sig_info+0x1a/0x50
[ 140.231866] [<c1049d36>] __kill_pgrp_info+0x36/0x60
[ 140.232780] [<c104a0d0>] kill_something_info+0xa0/0x160
[ 140.233740] [<c10831c5>] ? __call_rcu+0xa5/0x110
[ 140.234596] [<c104b7ee>] sys_kill+0x5e/0x70
[ 140.235387] [<c10d1eee>] ? mntput_no_expire+0x1e/0xa0
[ 140.236329] [<c10bbd10>] ? __fput+0x170/0x220
[ 140.257756] [<c10bbdd9>] ? fput+0x19/0x20
[ 140.258529] [<c137ad94>] ? restore_all_notrace+0x0/0x18
[ 140.259496] [<c11bfb04>] ? trace_hardirqs_on_thunk+0xc/0x10
[ 140.260531] [<c137ad61>] syscall_call+0x7/0xb
[ 144.627841] nfsd: last server has exited, flushing export cache
[ 154.040420] Restarting system.
[ 154.041061] machine restart
--
Yes. The patch has uncovered a case of where we should be holding a lock, but
aren't.
Can you try the attached patch?
David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks
A previous patch:
commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
Author: David Howells <dhowells@redhat.com>
Date: Thu Jul 29 12:45:55 2010 +0100
Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
fixed the lockdep checks on __task_cred(). This has shown up a place in the
signalling code where a lock should be held - namely that
check_kill_permission() requires its callers to hold the RCU lock.
It's may be that it would be better to add RCU read lock calls in
group_send_sig_info() only, around the call to check_kill_permission(). On the
other hand, some of the callers are either holding the RCU read lock already,
or have disabled interrupts, in which case, it's just extra overhead to do it
in g_s_s_i().
Without this patch, the following warning can occur:
[ 140.173556] ===================================================
[ 140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 140.216461] ---------------------------------------------------
[ 140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!
[ 140.218937]
[ 140.218938] other info that might help us debug this:
[ 140.218939]
[ 140.220508]
[ 140.220509] rcu_scheduler_active = 1, debug_locks = 1
[ 140.221991] 1 lock held by init/1:
[ 140.222668] #0: (tasklist_lock){.+.+..}, at: [<c104a0ac>] kill_something_info+0x7c/0x160
[ 140.224709]
[ 140.224711] stack backtrace:
[ 140.225661] Pid: 1, comm: init Not tainted 2.6.35 #1
[ 140.226576] Call Trace:
[ 140.227111] [<c103cca8>] ? printk+0x18/0x20
[ 140.227908] [<c1069884>] lockdep_rcu_dereference+0x94/0xb0
[ 140.228931] [<c104936a>] check_kill_permission+0x15a/0x170
[ 140.229932] [<c104a0ac>] ? ...Added Oleg and Thomas to the participants.
Oleg/Thomas: the whole thread is on lkml, but I'm quoting most of the
relevant parts..
It's not just check_kill_permission(), is it? I thought we could do
the "for_each_process()" loops with just RCU, rather than holding the
whole tasklist_lock? So I _think_ that getting the RCU read-lock would
make it possible to get rid of the tasklist_lock in there too? At
least in kill_something_info().
Ok, makes sense. At the same time, I do wonder if we shouldn't perhaps
do this in the caller. The callers would seem to all want it - look at
kill_something_info(), for example, where it really looks like it
would be nicer to put that _whole_ function under rcu_read_lock() (in
the caller itself). Or in kernel/exit.c, we have two calls
back-to-back, so doing it in the caller would clean that up and do it
just once.
with those fragments, we now end up having rcu_read_lock() for _all_
the three cases in kill_something_info(), but rather than do it once,
we do it explicitly, and we do it in two different ways (two cases do
it explicitly, the middle one does it implicitly when it calls
__kill_pgrp_info().
So I think the patch is correct, but at the same time I do get the
feeling that we should just do it slightly differently.
Comments?
Linus
--
Yes, almost all places in the kernel which hold tasklist_lock read locked are safe with RCU. I started to work on that, but got distracted. Thanks, tglx
I am not sure I understand this patch. __task_cred() checks rcu_read_lock_held() || task_is_dead(), and task_is_dead(task) is ((task)->exit_state != 0). OK, task_is_dead() is valid for, say, wait_task_zombie(). But wait_task_stopped() calls __task_cred(p) without rcu lock and p is alive. As for kill_something_info(), I think yes. I even sent (iirc) the protoptype patch a long ago. We can't just remove tasklist, we should avoid the races fork/exit/exec in the kill(-1, SIG) case. The same for kill_pgrp/__kill_pgrp_info(). We need tasklist to ensure that nobody in this group can escape the signal. This seems solveable I must admit, at first glance changing check_kill_permission() to take Hmm. So, local_irq_disable() "officially" blocks rcu? It does in practice (unless I missed the new version of RCU), but, say, posix_timer_event() takes rcu_read_lock() exactly because I thought we shouldn't assume that irqs_disabled() acts as rcu_read_lock() ? No, this doesn't look right. find_new_reaper() can drop tasklist and sleep. Besides, this patch conflicts with the change in -mm tree. And imho this looks a bit as "action at a distance". Oleg. --
The problem is that we have to tell lockdep this. Just checking in
__task_cred() that siglock is held is insufficient. That doesn't handle, say,
sys_setuid() from changing the credentials, and effectively skips the check in
places where it mustn't.
Similarly, having interrupts disabled on the CPU we're running on doesn't help
either, since it doesn't stop another CPU replacing those credentials.
There are ways of dealing with wait_task_stopped():
(1) Place an rcu_read_lock()'d section around the call to __task_cred().
(2) Make __task_cred()'s lockdep understand about the target task being
stopped whilst we hold its siglock.
(3) Don't use __task_cred(), but rather dereference the pointer directly:
rcu_dereference_protected(p->real_cred,
lock_is_held(&p->sighand->siglock))
I think group_send_sig_info() would be better. The only other caller of
c_k_p() already has to hold the RCU read lock for other reasons.
This CPU can't be preempted if it can't be interrupted, I think.
David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks
A previous patch:
commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
Author: David Howells <dhowells@redhat.com>
Date: Thu Jul 29 12:45:55 2010 +0100
Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
fixed the lockdep checks on __task_cred(). This has shown up a place in the
signalling code where a lock should be held - namely that
check_kill_permission() requires its callers to hold the RCU lock.
Fix group_send_sig_info() to get the RCU read lock around its call to
check_kill_permission().
Without this patch, the following warning can occur:
[ 140.173556] ===================================================
[ 140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 140.216461] ---------------------------------------------------
[ 140.217530] kernel/signal.c:660 invoked ...Sure, this solves the problem. But probably this needs a comment to explain why do we take rcu lock. OTOH, wait_task_continued() does need rcu_read_lock(), the task is running. UNLESS we believe that local_irq_disable() makes rcu_read_lock() unnecessary, May be... but we have so many special cases. Say, fill_psinfo()->__task_cred(). This is called under rcu lock, but it is not needed. The task is either current or it sleeps in exit_mm(). I mean, perhaps it is better to either always require rcu_read_lock() around __task_cred() even if it is not needed, or do not use rcu_dereference_check() at all. In any case, task_is_dead() doesn't help afaics, it is only useful for Yes, please note "It does in practice" above. My question is, should/can we rely on this fact? Or should we assume that nothing except rcu_read_lock() implies rcu_read_lock() ? Oleg. --
Can I take that as an Acked-by? David --
Yes, thanks, please feel free to add Acked-by: Oleg Nesterov <oleg@redhat.com> --
No. When we send a signal to multiple processes it needs to be an atomic operation so that kill -KILL -pgrp won't let processes escape. It is what posix specifies, it is what real programs expect, and it is the useful semantic in userspace. Just using rcu_read_lock() breaks that atomicity of sending a signal to a process group, which means we can have new processes that escape the signal. Even with the tasklist_lock we have to have a special case in fork to ensure we don't add a process to a process group while a signal is being delivered to it. I have scratched my head a few times wondering if there is a way to preserve the atomic guarantee without taking a global lock, but I Likely. At one point read_lock(&tasklist_lock) implied having the rcu_read_lock(). As it no longer does in some cases we have a small pile of places with outdated assumptions. I know I have been guilty a few times of forgetting about the new rule that we have to take rcu_read_lock() With the tasklist_lock the rule in these functions is that the caller will take the lock, so we probably make the rule the caller should take the lock in the same scenarios for the rcu_read_lock(). Aka just say: read_lock(&tasklist_lock); rcu_read_lock(); everywhere, that today we say just: read_lock(&tasklist_lock); It is what was meant when the code was written and the rcu-ized, and it will certainly preserve my human intuition and general practical reality that when you have the tasklist_lock you have the rcu_read_lock. *Shrug* I don't think it matters a whole lot. Eric --
On Thu, Aug 5, 2010 at 12:19 AM, Eric W. Biederman
Ok. However, in that case, it's not really about the whole list
traversal, it's a totally separate thing, and it's really sad that we
end up using the (rather hot) tasklist_lock for something like that.
With the dcache/inode locks basically going away, I think
tasklist_lock ends up being one of the few hot locks left.
Wouldn't it be much nicer to:
- make it clear that all the "real" signal locking can rely on RCU
- use a separate per-pgrp lock that ends up being the one that gives
the signal _semantic_ meaning?
That would automatically document why we get the lock too, which
certainly isn't clear from the code as-is.
The per-pgrp lock might be something as simple as a silly hash that
just spreads out the process groups over some random number of simple
I agree that we probably should have done that originally, in order to
not have these bugs show up later. However, I don't think it makes
sense any more, especially not if tasklist_lock isn't even a "real"
lock from a kernel internal consistency standpoint, but has a totally
secondary meaning.
Linus
--
I still think we can avoid the new lock and rely on RCU in kill_something_info() and kill_pgrp(). I am attaching my old email below. It talks about pgrp, however I think kill_something_info() is almost the same thing. Oleg. Yes, but who can add the new entry? Let's forget about setpgrp/etc for the moment, I think we have "races" with or without tasklist. Say, setpgrp() can add the new process to the already "killed" pgrp. Then, I think the only important case is SIGKILL/SIGSTOP (or other signals which can't be blockes/ignored). We must kill/stop the entire pgrp, we must not race with fork() and miss a child. In this case I _think_ rcu_read_lock() is enough, rcu_read_lock() list_for_each_entry_rcu(task, pid->tasks[PIDTYPE_PGID) group_send_sig_info(sig, task); rcu_read_unlock(); except group_send_sig_info() can race with mt-exec, but this is simple to fix. If we send a signal (not necessary SIGKILL) to a process P, we must see all childs which were forked by P, both send_signal() and copy_process() take the same ->siglock, we must see the result of list_add_tail_rcu(). And, after we sent SIGKILL/SIGSTOP, it can't fork the new child. If list_for_each_entry() does not see the exited process P, this means we see the result of list_del_rcu(). But this also means we must the the result of the previous list_add_rcu(). IOW, fork+exit means list_add_rcu() + wmb() + list_del_rcu(), if we don't see the new entry on list, we must see the new one, right? (I am ignoring the case when list_for_each_entry_rcu() sees a process P but lock_task_sighand(P) fails, I think this is the same as if we we missed P) Now suppose a signal is blocked/ignored or has a handler. In this case we can miss a child, but I think this is OK, we can pretend the new child was forked after kill_pgrp() completes. Say, this child C was forked by some process P. We can miss C only if it was forked after we already sent the signal to P. However. I do not pretend the reasoning ...
It is about the list traversal. In the process group case it is about traversing the pid->tasks[PIDTYPE_PGID] hlist, which is also protected I think it is totally reasonable to add a per pid lock, that would protect the pid->task[...] hlist. That would make things clearer and finer grained without a lot of effort. Just a little more struct pid bloat, and a little extra care in fork, when we add to those lists. Even with the per-pgrp lock we still need a lock on the global process list for the kill -KILL -1 case. Which suggests that tasklist_lock is still needed for part of kill_something_info. Eric --
Hmm. Have you taken a look at Nick Piggin's VFS scalability patches?
They introduce this "RCU-safe hash chain lock", where each hashchain
has a lock-bit in the low bit. I wonder if that would be the right
Well, that -1 case is special anyway. The fact that we might want to
use the tasklist_lock there is not very relevant, I think. That is
_not_ a hotpath, really (at least not under any relevant loads, I'm
sure you could make a silly benchmark of "kill(-1,0)").
Linus
--
Interesting. I haven't looked but a lock bit in the low bit of the hlist head in struct pid would not add any space, and would not add any extra cache line bounces. So that would just be a matter of I expect even signal deliver to process groups is relatively rare. The interesting question is can we kill the tasklist_lock and/or the tasklist. A quick grep shows that we have maybe 100 users of the tasklist in the entire kernel. Poking a little deeper it looks like man of those are connected to scheduling and are uses that today take the tasklist_lock but would be fine with the rcu_read_lock(). Eric --
That patch solved the warning. Thank you. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build with CONFIG_SSB_PCIHOST=n |
| Linux Kernel Mailing List | 9p: block-based virtio client |
| Michael Breuer |
