On Sun, Sep 23, 2007 at 09:38:07PM +0400, Oleg Nesterov wrote:
Thank you for the kind words, and most especially for the careful review!!!
And Oleg, I don't think you qualify as a newbie anymore. ;-)
Definitely are two of them in rcupdate.h, good eyes! The ones in
rcuclassic.h and rcupreempt.h, by the time all the patches are
applied, are for rcu_check_callbacks_rt(). However, this, along
with the other _rt() cross-calls from rcuclassic.c to rcupreempt.c,
will go away when the merge with upcoming hotplug patches is complete.
It would seem so! Will remove it.
In the case of rcu_flipctr, this is historical -- the implementation in
-rt has only one rcu_data, rather than one per CPU. I cannot think of
a problem with placing it in rcu_data right off-hand, will look it over
carefully and see.
Good point on rcu_try_flip_state!!!
Looks like it to me, thank you for the tip!
Hmmm... Why not the same for rcu_data? I guess because there is
very little sharing? The only example thus far of sharing would be
if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
critical section were preempted and then started on some other CPU),
though I will need to check more carefully.
Of course, if we start having CPUs access each others' rcu_data structures
to make RCU more power-friendly in dynticks situations, then maybe there
would be more reason to use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data.
Thoughts?
Yeah, I guess my explanation -was- a bit lacking... When I re-read it, it
didn't even help -me- much! ;-)
I am putting together better documentation, but in the meantime, let me
try again...
The problem is not with .completed being incremented, but only
by it (1) being incremented (presumably by some other CPU and
then (2) having this CPU get a scheduling-clock interrupt, which
then causes this CPU to prematurely update the rcu_flip_flag.
This is a problem, because updating rcu_flip_flag is making a
promise that you won't ever again increment the "old" counter set
(at least not until the next counter flip). If the scheduling
clock interrupt were to happen between the time that we pick
up the .completed field and the time that we increment our
counter, we will have broken that promise, and that could cause
someone to prematurely declare the grace period to be finished
(as in before our RCU read-side critical section completes).
The problem occurs if we end up incrementing our counter just
after the grace-period state machine has found the sum of the
now-old counters to be zero. (This probably requires a diagram,
and one is in the works.)
But that is not the only reason for the cli...
The second reason is that cli prohibits preemption. If we were
preempted during this code sequence, we might end up running on
some other CPU, but still having a reference to the original
CPU's counter. If the original CPU happened to be doing another
rcu_read_lock() or rcu_read_unlock() just as we started running,
then someone's increment or decrement might get lost. We could
of course prevent with with preempt_disable() instead of cli.
There might well be another reason or two, but that is what
I can come up with at the moment. ;-)
I suspect that you are correct.
This is indeed one reason. I am adding you to my list of people to send
early versions of the document to.
Yeah, it needs better documentation...
Yep. If a CPU saw the flip flag set to rcu_flipped before it saw
the new value of the counter, it might acknowledge the flip, but
then later use the old value of the counter. It might then end up
incrementing the old counter after some other CPU had concluded
that the sum is zero, which could result in premature detection
of a grace-period.
OK, your point is that I am not taking the locks into account when
determining what memory barriers to use. You are quite correct, I was
being conservative given that this is not a fast path of any sort.
Nonetheless, I will review this -- no point in any unneeded memory
barriers. Though a comment will be required in any case, as someone
will no doubt want to remove the locking at some point...
Could happen in -rt, where softirq is in process context. And can't
softirq be pushed to process context in mainline? Used to be, checking...
Yep, there still is a ksoftirqd().
The stats would come out slightly different if this optimization were
removed, but shouldn't be a problem -- easy to put the RCU_TRACE_RDP()
under an "if".
Will think through whether or not this is worth the extra code, good
eyes!
Can't do the above, because I need the pointer before I can lock it. ;-)
We need to retain the ability to manipulate other CPU's rcu_data structures
in order to make dynticks work better and also to handle impending OOM.
So I kept all the rcu_data manipulations under its lock. However, as you
say, there would be no cross-CPU accesses except in cases of low-power
state and OOM, so lock contention should not be an issue.
Seem reasonable?
Can't think of any, and it might speed up the grace period in some cases.
Though the call in rcu_process_callbacks() needs to stay there as well,
otherwise a CPU that never did call_rcu() might never do the needed
rcu_check_mb().
I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen
transition has to be synchronized to the moving of the callbacks --
either that or we need more GP_STAGES.
OK, harmless but added overhead, then.
Well, the next patch is made unnecessary because of upcoming changes
to hotplug. So, what do I have messed up?
Might not anymore. It used to be that the CPU hotplug path did a
synchronize_sched() holding the hotplug lock, and that sched_setaffinity()
in turn attempted to acquire the hotplug lock. This came to light when
I proposed a similar set of patches for -mm early this year.
And with Gautham Shah's new hotplug patches, this problem goes away.
But these are not in 2.6.22, so I had to pull in the classic RCU
implementation to get a working synchronize_sched().
Thanx, Paul
-