Adding memory barrier to the __pollwait function paired with
receive callbacks. The smp_mb__after_lock define is added,
since {read|write|spin}_lock() on x86 are full memory barriers.The race fires, when following code paths meet, and the tp->rcv_nxt and
__add_wait_queue updates stay in CPU caches.CPU1 CPU2
sys_select receive packet
... ...
__add_wait_queue update tp->rcv_nxt
... ...
tp->rcv_nxt check sock_def_readable
... {
schedule ...
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible(sk->sk_sleep)
...
}If there was no cache the code would work ok, since the wait_queue and
rcv_nxt are opposit to each other.Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
passed the tp->rcv_nxt check and sleeps, or will get the new value for
tp->rcv_nxt and will return with new data mask.
In both cases the process (CPU1) is being added to the wait queue, so the
waitqueue_active (CPU2) call cannot miss and will wake up CPU1.The bad case is when the __add_wait_queue changes done by CPU1 stay in its
cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then
endup calling schedule and sleep forever if there are no more data on the
socket.wbr,
jirkaSigned-off-by: Jiri Olsa <jolsa@redhat.com>
---
arch/x86/include/asm/spinlock.h | 3 +++
fs/select.c | 4 ++++
include/linux/spinlock.h | 5 +++++
include/net/sock.h | 18 ++++++++++++++++++
net/atm/common.c | 4 ++--
net/core/sock.c | 8 ++++----
net/dccp/output.c | 2 +-
net/iucv/af_iucv.c | 2 +-
...
I was wondering did you see that race actually happening in practice?
If yes on which system?At least on x86 I can't see how it happens. mb() is only a compile
time barrier and the compiler doesn't optimize over indirect callbacks
like __pollwait() anyways.It might be still needed on some weaker ordered architectures, but did you
actually see it there?-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
mb() does more than just a compiler barrier, it also issues mfence.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
mfence is not needed for normal C code (not using non temporal
stores) in the Linux memory model on x86 and is a no-op. Only the compile
time barrier matters.-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
In that case this bug needs to be digged deeper regardless of
this patch.Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Agreed.
I suspect the reordering of the wake queue tests might makes a difference, but
in this case to ensure they are always tested in the proper order by
the compiler would need more smp_rmb()s-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
I just looked up the original bug report and it's against RHEL4
which is a 2.6.9 kernel. Jiri, have we reproduced this on a more
recent kernel?Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
no, only RHEL 4..
it is very hard to reproduce. I was succesfull once
on RHEL4 after 8 days, never on upstream though.--
It would be good to know whether this bug occurs on the upstream
kernel because as it stands, the patch is a no-op for the upstream
kernel on x86-64.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
I take that back. Andi, please look at section 8.2.3.4 of the
IA-32 Architectures Software Developer's Manual Volume 3A, "Loads
May Be Reordered with Earlier Stores to Different Locations.
This seems to be exactly the scenario that we have here, and shows
why mfence is required.In our case, we're doing
CPU1 CPU2
Write data ready Add ourselves to wait queue
Read wait queue to notify Check data readyCheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Just a note about this. That used to be true, that GCC didn't optimize
indirect calls. However, see -findirect-inlining in GCC 4.4.I am not saying that it applies here, but it is something to remember.
--
Zan Lynx
zlynx@acm.org"Knowledge is Power. Power Corrupts. Study Hard. Be Evil."
--
... it doesn't apply here because it works only for a few very simple
cases that don't apply for this code. Besides the kernel builds with -O2,
which only inlines functions called once or very small functions, which
also rules this out.-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
yes, we have a customer that has been able to reproduce this problem on
x86_64 CPU model Xeon E5345*2, but they didn't reproduce on XEON MV, for example.they were able to capture a backtrace when the race happened:
https://bugzilla.redhat.com/show_bug.cgi?id=494404#c1--
Hello,
So, the problem is the half barrier semantics of spin_lock on CPU1's
side and lack of any barrier (read for tp->rcv_nxt check can creep
above the queuelist update) in waitqueue_active() check on CPU2's side
(read for waitqueue list can creep above tcv_nxt update). Am I
understanding it right?This is a little bit scary. The interface kind of suggests that they
have strong enough barrier semantics (well, I would assume that). I
wonder whether there are more more places where this kind of raceI'm not entirely sure this is the correct place to do it while the mb
for the other side lives in network code. Wouldn't it be better to
move this memory barrier to network select code? It's strange for an
API to have only single side of a barrier pair and leave the other to
the API user.Also, maybe we need smp_mb__after_unlock() too? Maybe
add_wait_queue_mb() possibly paired with wait_queue_active_mb() is
better? On x86, it wouldn't make any difference tho.One more thing, can you please use fully winged style for multiline
comments?Thanks.
--
tejun
--
Patch seems fine for me, thanks a lot Jiri !
--
Can't really comment this patch, except this all looks reasonable to me.
Add more CCs.--
While this can work, IMO it'd be cleaner to have the smp_mb() moved from
fs/select.c to the ->poll() function.
Having a barrier that matches another one in another susbsystem, because- Davide
--
Yes but barrier is necessary only if add_wait_queue() was actually called, and __pollwait()
does this call.Adding a plain smp_mb() in tcp_poll() for example would slowdown select()/poll() with NULL
timeout.Adding a cond test before smp_mb() in tcp_poll() (and other ->poll() functions)
would be litle bit overkill too...I believe this race was not existent in the past because spin_unlock() had a memory barrier,
and we changed this to a plain memory write at some point...Most add_wait_queue() calls are followed by a call to set_current_state()
--
On Fri, 26 Jun 2009, Eric Dumazet wrote:
> Davide Libenzi a
I wont argue with you David, just try to correct bugs.
fs/ext4/ioctl.c line 182
set_current_state(TASK_INTERRUPTIBLE);
add_wait_queue(&EXT4_SB(sb)->ro_wait_queue, &wait);
if (timer_pending(&EXT4_SB(sb)->turn_ro_timer)) {
schedule();Another example of missing barrier after add_wait_queue()
Because add_wait_queue() misses a barrier, we have to add one after each call.
Maybe it would be safer to add barrier in add_wait_queue() itself, not in _pollwait().
--
Assuming that ->turn_ro_timer does wake_up(->ro_wait_queue) everything
is OK, we do not need a barrier.Oleg.
--
Not all the code that uses add_wait_queue() does need to have the MB,
like code that does the most common pattern:xxx_poll(...) {
poll_wait(...);
lock();
flags = calc_flags(->status);
unlock();
return flags;
}xxx_update(...) {
lock();
->status = ...;
unlock();
if (waitqueue_active())
wake_up();
}It's the code that does the lockless flags calculation in ->poll that
might need it.
I dunno what the amount of changes are, but cross-matching MB across
subsystems does not look nice.
IMHO that's a detail of the subsystem locking, and should be confined
inside the subsystem itself.
No?- Davide
--
And if we remove waitqueue_active() in xxx_update(), then lock/unlock is
not needed too.If xxx_poll() takes q->lock first, it can safely miss the changes in ->status
and schedule(): xxx_update() will take q->lock, notice the sleeper and wake
it up (ok, it will set ->triggered but this doesn't matter).If xxx_update() takes q->lock first, xxx_poll() must see the changes in
status after poll_wait()->unlock(&q->lock) (in fact, after lock, not unlock).Oleg.
--
Sure. The snippet above was just to show what typically the code does, not
a suggestion on how to solve the socket case.
But yeah, the problem in this case is the waitqueue_active() call. Without
that, the wait queue lock/unlock in poll_wait() and the one in wake_up()
guarantees the necessary barriers.
Some might argue the costs of the lock/unlock of q->lock, and wonder if
MBs are a more efficient solution. This is something I'm not going into.
To me, it just looked not right having cross-matching MB in different
subsystems.- Davide
--
Yes, yes. I just meant you are right imho, we shouldn't add mb() into
This is subjective and thus up to maintainers, but personally I think you
are very, very right.Perhaps we can add
void sock_poll_wait(struct file *file, struct sock *sk, poll_table *pt)
{
if (pt) {
poll_wait(file, sk->sk_sleep, pt);
/*
* fat comment
*/
smp_mb(); // or smp_mb__after_unlock();
}
}Oleg.
--
Maybe 'a bit' further?:
static inline void __poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
p->qproc(filp, wait_address, p);
}static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
if (p && wait_address)
__poll_wait(filp, wait_address, p);
}static inline void sock_poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
if (p && wait_address) {
__poll_wait(filp, wait_address, p);
/*
* fat comment
*/
smp_mb(); // or smp_mb__after_unlock();
}
}Jarek P.
--
Hmm... of course:
static inline void sock_poll_wait(struct file * filp, struct sock *sk, poll_table *p)
{
if (p && sk->sk_sleep) {
__poll_wait(filp, sk->sk_sleep, p);
/*
* fat comment
*/
smp_mb(); // or smp_mb__after_unlock();
}
}Jarek P.
--
Perhaps it makes sense to check ->sk_sleep != NULL in sock_poll_wait(), but
I don't think we need __poll_wait(). poll_wait() is inline, I think gcc
will optimize out "if (p && wait_address)" check if poll_wait() is called
from sock_poll_wait().This all is up to Jiri of course. But speaking about cosmetic changes, I
think it is better to make 2 patches. The first one fixes the problem using
smp_mb(), another introduces smp_mb__xxx_lock() to optimize the code.Oleg.
--
Sure, to me it looks a bit more readable, but let Jiri choose.;-)
Cheers,
Jarek P.
--
yes :) I like more Jarek's way.. and I'll send separate patch for the
smp_mb_after_lock change.thanks,
jirka
--
That'd be fine IMHO. Are DaveM and Eric OK?
- Davide
--
From: Davide Libenzi <davidel@xmailserver.org>
No objections from me.
--
Very good :)
Jiri, please respin a patch with this idea from Oleg
(We'll have to check all calls to poll_wait() in net tree)
Thanks everybody
--
thanks a lot! :) I'll send out new patch shortly..
I'll include all the poll_wait calls change I'm kind of sure of,
and list of others I'm not, so we can discuss them.jirka
--
How about poll_wait_mb() and waitqueue_active_mb() (with mb and
additional check for NULL of wait_queue_head)?Jarek P.
--
Hmm... But considering Eric's arguments I see it would be hard
/impossible to do it with the current api, so let's forget.Jarek P.
--
From: Davide Libenzi <davidel@xmailserver.org>
Perhaps every performance argument is moot since spin_unlock() used to
have the barrier :-)
--
| Greg Kroah-Hartman | [PATCH 001/196] Chinese: Add the known_regression URI to the HOWTO |
| Andrew Morton | 2.6.25-mm1 |
| david | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
git: | |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 15/37] dccp: Set per-connection CCIDs via socket options |
| David Miller | [GIT]: Networking |
| Natalie Protasevich | [BUG] New Kernel Bugs |
