On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
Ah, good point, and good reason to keep rcu_flipctr separate.
;-)
Suppose that A was most recently stored by a CPU that shares a store
buffer with this CPU. Then it is possible that some other CPU sees
the store to B as happening before the store that "A==0" above is
loading from. This other CPU would naturally conclude that the store
to B must have happened before the load from A.
In more detail, suppose that CPU 0 and 1 share a store buffer, and
that CPU 2 and 3 share a second store buffer. This happens naturally
if CPUs 0 and 1 are really just different hardware threads within a
single core.
So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
and that the cacheline for B is initially owned by CPUs 0 and 1.
Then consider the following sequence of events:
o	CPU 0 stores zero to A.  This is a cache miss, so the new value
	for A is placed in CPU 0's and 1's store buffer.

o	CPU 1 executes the above code, first loading A.  It sees
	the value of A==0 in the store buffer, and therefore
	stores zero to B, which hits in the cache.  (I am assuming
	that we left out the mb() above.)

o	CPU 2 loads from B, which misses the cache, and gets the
	value that CPU 1 stored.  Suppose it checks the value,
	and based on this check, loads A.  The old value of A might
	still be in cache, which would lead CPU 2 to conclude that
	the store to B by CPU 1 must have happened before the store
	to A by CPU 0.
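For concreteness, the sequence above can be replayed in a small
deterministic toy model written in C. This is only a sketch under
stated assumptions, not real hardware behavior and not kernel code:
A and B are assumed to start out as 1, the code in question is assumed
to have the shape "if (A == 0) { mb(); B = 0; }", and every name below
(struct core, store(), load(), toy_mb(), replay()) is made up for
illustration. The model also executes CPU 2's two loads strictly in
order, which real hardware need not do.

```c
#include <assert.h>
#include <string.h>

enum var { A, B, NVARS };

/* One "core" = two hardware threads sharing a cache and a store buffer. */
struct core {
	int has_line[NVARS];	/* cache holds a copy of the line */
	int cache[NVARS];	/* value of that cached copy */
	int sb_valid[NVARS];	/* store pending in the shared store buffer */
	int sb_val[NVARS];	/* value of that pending store */
};

static struct core cores[2];	/* cores[0]: CPUs 0-1, cores[1]: CPUs 2-3 */

static void store(int c, enum var v, int val)
{
	struct core *me = &cores[c], *other = &cores[!c];

	if (me->has_line[v] && !other->has_line[v]) {
		me->cache[v] = val;		/* exclusive owner: cache hit */
	} else {
		me->sb_valid[v] = 1;		/* miss: park in store buffer */
		me->sb_val[v] = val;
	}
}

static int load(int c, enum var v)
{
	struct core *me = &cores[c], *other = &cores[!c];

	if (me->sb_valid[v])
		return me->sb_val[v];		/* forwarded from store buffer */
	if (!me->has_line[v]) {
		assert(other->has_line[v]);	/* some cache must hold the line */
		me->cache[v] = other->cache[v];	/* miss: fetch a shared copy */
		me->has_line[v] = 1;
	}
	return me->cache[v];
}

/* Toy stand-in for mb(): drain the store buffer, invalidating other copies. */
static void toy_mb(int c)
{
	struct core *me = &cores[c], *other = &cores[!c];
	int v;

	for (v = 0; v < NVARS; v++) {
		if (!me->sb_valid[v])
			continue;
		other->has_line[v] = 0;
		me->cache[v] = me->sb_val[v];
		me->has_line[v] = 1;
		me->sb_valid[v] = 0;
	}
}

/* Replays the three events and reports what CPU 2 observes. */
void replay(int use_mb, int *a_seen, int *b_seen)
{
	memset(cores, 0, sizeof(cores));
	cores[1].has_line[A] = 1;	/* A owned by CPUs 2-3 ... */
	cores[1].cache[A] = 1;		/* ... with assumed initial value 1 */
	cores[0].has_line[B] = 1;	/* B owned by CPUs 0-1 ... */
	cores[0].cache[B] = 1;		/* ... with assumed initial value 1 */

	store(0, A, 0);			/* CPU 0: A = 0, a cache miss */
	if (load(0, A) == 0) {		/* CPU 1: sees A == 0 in store buffer */
		if (use_mb)
			toy_mb(0);
		store(0, B, 0);		/* CPU 1: B = 0, a cache hit */
	}
	*b_seen = load(1, B);		/* CPU 2: loads B, a cache miss */
	*a_seen = load(1, A);		/* CPU 2: then loads A from its cache */
}
```

In this model, replay(0, &a, &b) leaves b == 0 and a == 1, so CPU 2
sees the new value of B alongside the old value of A. With
replay(1, &a, &b), the drain of the store buffer invalidates the stale
copy of A first, and CPU 2 sees a == 0.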
Memory barriers would prevent this confusion. An intro to store buffers
can be found at http://www.cs.utah.edu/mpv/papers/neiger/fmcad2001.pdf,
FYI.
;-)
Callbacks could then be injected into a grace period after it had
started.
Or are you arguing that as long as interrupts remain disabled between
the two events, no harm done?
Ah -- you are in fact arguing that interrupts remain disabled throughout.
I would still rather that the rcu_flip_seen transition be adjacent
to the callback movement in the code. My fear is that the connection
might be lost otherwise... "Oh, but we can just momentarily enable
interrupts here!"
From a conceptual viewpoint, if this CPU hasn't caught up with the
last grace-period stage, it has no business trying to push forward to
the next stage. So even though this might (or might not) happen to work
with this particular implementation, it needs to stay as is. We need this code
to be robust enough to optimize the grace-period latencies, right?
Ah, good point...
Some people are calling for eliminating synchronize_sched() altogether,
but there are a few uses that would be hard to get rid of.
I need to think about your approach above. It looks like you are
leveraging the migration tasks, but I am concerned about concurrent
hotplug events. But either way, I do like the idea of communicating
with other tasks that actually do the context switches on behalf
of synchronize_sched().
Thanx, Paul