Re: [PATCH] [RFC] fix missed SIGCONT cases

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Oleg Nesterov <oleg@...>
Cc: Jiri Kosina <jkosina@...>, Davide Libenzi <davidel@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, February 29, 2008 - 9:59 pm

> > I don't think I follow the scenario you have in mind there.  When siglock

So you concur that there is no misbehavior from that lock-drop?
(I'm all for getting rid of it, just don't want to have a wrong
thought in my head about what the precise issues are.)


I see.


I certainly didn't have in mind permitting that.  But if the only thing
that can happen is that if the parent called wait already to reap the
child, it never gets the SIGCHLD for it, then technically that is fine.
(In fact, we are not POSIX-conforming now because pending SIGCHLDs are
supposed to be cleared by a wait call made with SIGCHLD blocked when there
are no more unwaited-for children ready.)  It seemed relevant to share that
nit, but I don't think we want to perturb the behavior here any time soon.


The status quo is that the CLD_CONTINUED signal-sending call is made by an
unrelated task, the one sending the SIGCONT to the child, rather than with
child==current as other SIGCHLD's are.  I don't think that should matter,
because nothing in the do_notify_parent_cldstop code path consults current
for anything.  With your new plan, that path will be entered instead by
some thread in the group that was woken by generating SIGCONT.  If it did
matter, that would presumably only be better.

I guess the issue you meant was that the task_struct argument to
do_notify_parent_cldstop is whichever thread woke up and took the siglock
first, rather than the recipient of the SIGCONT (usually the group leader).
Now I understand your reference to ptrace--it does tsk = tsk->group_leader
unless ptraced.  AFAICT, what this affects is only the si_pid, si_uid,
si_[us]time values in the siginfo_t for the SIGCHLD with si_code =
CLD_CONTINUED.  Not that ptracers really care one way or the other, but I
think the ptrace special case really only sensibly applies to CLD_STOPPED.
There aren't going to be CLD_CONTINUED notifications for each and every
thread anyway (and I don't think they would be all that useful to a ptracer).

So independent of all this, even as things stand now I would do:

	if (why == CLD_STOPPED && (tsk->ptrace & PT_PTRACED))

to clear that up.


I haven't really reviewed that stuff properly (sorry about that, behind on
many things).  Let's not tie up this can of worms with that one for now.


If SIGSTOP was not dequeued yet, then it was pending and was cleared by
generating SIGCONT, so nothing happened: no stop, no continue, just a SIGCONT
is now pending.  (There is also the example where it's SIGTSTP and it was
blocked, so there's a case that does not depend on a race happening.)  If the
group stop has begun, then at least one thread has already begun stopping,
and experienced effects like a syscall returning -EINTR.  Those effects
shouldn't be seen if "nothing happened".  But they are expected if the
process stopped and then continued very quickly.  If that's what happened,
then the CLD_STOPPED SIGCHLD was posted before the continue happened.

So that was my original rationale.  But there are still races.  (This can't
come up in the blocked-SIGTSTP case, where the behavior is reliably testable
from the application perspective.)  An unblocked stop signal might have
caused a signal_wake_up on some thread that then lost the race for the
siglock.  The SIGCONT-sender clears the pending SIGSTOP that was the reason
that thread's syscall got interrupted.  Then that thread will see -EINTR and
such effects, but there will never be any CLD_STOPPED report to justify it.
If that same thread is not the one to immediately run the SIGCONT handler,
there is no other excuse for the side effects.  Since that possibility still
exists (and is probably impossible to avoid), maybe I shouldn't worry about
the group-stop-in-progress case.  Clearly this isn't something anyone else
ever really worries about but me.

Still, when group_stop_count > 0 that means that we know for sure that every
thread has TIF_SIGPENDING and so all those that were in syscalls will for
sure see -EINTR and such side effects.  So I do lean towards having that
situation reported as "stopped and then continued" rather than "nothing
happened".


I certainly think we should clean up all the callers to __group_send_sig_info
from outside signal.c, but I did not want to get into that cleanup in the
middle of this change.  Once we finish hashing out any changes to the locking
requirements for calling into the signals code, we can clean those uses up.


I like it too.  My inclination is still to do the other cleanup first, and
also do enough other cleanups first so that we're confident of using a
signal_struct.flags bit for this right off.


In practice no ptracer ever wanted a CLD_CONTINUED notification or cares
about them now.  So even "breaking" it wouldn't get us in much trouble.
But I don't see any problem at all, especially if we change the
do_notify_parent_cldstop behavior to use tsk = tsk->group_leader as 
I discussed above.


This is only ever needed after we have just been in do_signal_stop and it
actually stopped (returned 1).  That always gets to a goto relock.  So this
could be outside the for loop.  I'd just inline it rather than have another
function that conditionally unlocks, which sparse hates.

So: 

	spin_lock_irq(&current->sighand->siglock);

	/*
	 * long and helpful comment
	 */
	if (unlikely(current->signal->notify_parent)) {
		int why = current->signal->notify_parent;
		current->signal->notify_parent = 0;
		spin_unlock_irq(&current->sighand->siglock);
		read_lock(&tasklist_lock);
		do_notify_parent_cldstop(current, why);
		read_unlock(&tasklist_lock);
		goto relock;
	}

	for (;;) {


Thanks,
Roland
--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
[PATCH] [RFC] fix missed SIGCONT cases, Jiri Kosina, (Wed Feb 27, 11:22 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Roland McGrath, (Wed Feb 27, 5:00 pm)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Jiri Kosina, (Wed Feb 27, 6:04 pm)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Roland McGrath, (Thu Feb 28, 6:26 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Oleg Nesterov, (Thu Feb 28, 11:30 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Roland McGrath, (Thu Feb 28, 10:39 pm)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Oleg Nesterov, (Fri Feb 29, 8:01 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Roland McGrath, (Fri Feb 29, 9:59 pm)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Oleg Nesterov, (Sat Mar 1, 12:41 pm)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Roland McGrath, (Thu Mar 6, 5:05 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Oleg Nesterov, (Mon Mar 3, 9:25 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Roland McGrath, (Thu Mar 6, 6:50 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Oleg Nesterov, (Fri Feb 29, 9:20 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Jiri Kosina, (Thu Feb 28, 11:40 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Oleg Nesterov, (Thu Feb 28, 12:08 pm)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Jiri Kosina, (Thu Feb 28, 12:13 pm)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Jiri Kosina, (Thu Feb 28, 7:42 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Oleg Nesterov, (Thu Feb 28, 11:32 am)
Re: [PATCH] [RFC] fix missed SIGCONT cases, Jiri Kosina, (Thu Feb 28, 11:32 am)