On Mon, 18 Jun 2007, Ingo Molnar wrote:I really think this the the wrong approach, although *testing* it makes sense. I think we need to handle loops that take, release, and then immediately re-take differently. Such loops are _usually_ of the form where they really just release the lock for some latency reason, but in this case I think it's actually just a bug. That code does: if (unlikely(p->array || task_running(rq, p))) { to decide if it needs to just unlock and repeat, but then to decide if it need to *yield* it only uses *one* of those tests (namely preempted = !task_running(rq, p); .. if (preempted) yield(); and I think that's just broken. It basically says: - if the task is running, I will busy-loop on getting/releasing the task_rq_lock and that is the _real_ bug here. Trying to make the spinlocks do somethign else than what they do is just papering over the real bug. The real bug is that anybody who just busy-loops getting a lock is wasting resources so much that we should not be at all surprised that some multi-core or NUMA situations will get starvation. Blaming some random Core 2 hardware implementation issue that just makes it show up is wrong. It's a software bug, plain and simple. So how about this diff? The diff looks big, but the *code* is actually simpler and shorter, I just added tons of comments, which is what blows it up. The new *code* looks like this: repeat: /* Unlocked, optimistic looping! */ rq = task_rq(p); while (task_running(rq, p)) cpu_relax(); /* Get the *real* values */ rq = task_rq_lock(p, &flags); running = task_running(rq, p); array = p->array; task_rq_unlock(rq, &flags); /* Check them.. */ if (unlikely(running)) { cpu_relax(); goto repeat; } if (unlikely(array)) { yield(); goto repeat; } and while I haven't tested it, it looks fairly obviously correct, even if it's a bit subtle. Basically, that first "while()" loop is done entirely without any locking at all, and so it's possibly "incorrect", but we don't care. Both the runqueue used, and the "task_running()" check might be the wrong tests, but they won't oops - they just mean that we might get the wrong results due to lack of locking. So then we get the proper (and careful) rq lock, and check the running/runnable state _safely_. And if it turns out that our quick-and-dirty and unsafe loop was wrong after all, we just go back and try it all again. Safe, simple, efficient. And we don't ever hold the lock for long times, and we will never *loop* by taking, releasing and re-taking the lock. Hmm? Untested, I know. Maybe I overlooked something. But even the generated assembly code looks fine (much better than it looked before!) Linus ---- kernel/sched.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++------- 1 files changed, 60 insertions(+), 9 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 13cdab3..66445e1 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1159,21 +1159,72 @@ void wait_task_inactive(struct task_struct *p) { unsigned long flags; struct rq *rq; - int preempted; + struct prio_array *array; + int running; repeat: + /* + * We do the initial early heuristics without holding + * any task-queue locks at all. We'll only try to get + * the runqueue lock when things look like they will + * work out! + */ + rq = task_rq(p); + + /* + * If the task is actively running on another CPU + * still, just relax and busy-wait without holding + * any locks. + * + * NOTE! Since we don't hold any locks, it's not + * even sure that "rq" stays as the right runqueue! + * But we don't care, since "task_running()" will + * return false if the runqueue has changed and p + * is actually now running somewhere else! + */ + while (task_running(rq, p)) + cpu_relax(); + + /* + * Ok, time to look more closely! We need the rq + * lock now, to be *sure*. If we're wrong, we'll + * just go back and repeat. + */ rq = task_rq_lock(p, &flags); - /* Must be off runqueue entirely, not preempted. */ - if (unlikely(p->array || task_running(rq, p))) { - /* If it's preempted, we yield. It could be a while. */ - preempted = !task_running(rq, p); - task_rq_unlock(rq, &flags); + running = task_running(rq, p); + array = p->array; + task_rq_unlock(rq, &flags); + + /* + * Was it really running after all now that we + * checked with the proper locks actually held? + * + * Oops. Go back and try again.. + */ + if (unlikely(running)) { cpu_relax(); - if (preempted) - yield(); goto repeat; } - task_rq_unlock(rq, &flags); + + /* + * It's not enough that it's not actively running, + * it must be off the runqueue _entirely_, and not + * preempted! + * + * So if it wa still runnable (but just not actively + * running right now), it's preempted, and we should + * yield - it could be a while. + */ + if (unlikely(array)) { + yield(); + goto repeat; + } + + /* + * Ahh, all good. It wasn't running, and it wasn't + * runnable, which means that it will never become + * running in the future either. We're all done! + */ } /*** -
| David Miller | Re: [patch 7/8] fdmap v2 - implement sys_socket2 |
| Sean | Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation,pathname matching |
| Andi Kleen | Re: missing madvise functionality |
| Alan Cox | [PATCH 03/57] ali: watchdog locking and style |
git: | |
| Guido Ostkamp | [PATCH] Fix Solaris Workshop Compiler issues |
| David Lang | Re: mingw, windows, crlf/lf, and git |
| Johannes Schindelin | Re: [kernel.org users] [RFD] On deprecating "git-foo" for builtins |
| Johannes Schindelin | Re: [PATCH] Fix off by one error in prep_exclude. |
| Marco Peereboom | Re: Real men don't attack straw men |
| patrick keshishian | SMTP flood + spamdb |
| Marcos Laufer | dmesg IBM x3650 OpenBSD 4.3 |
| Nick Holland | Re: The Atheros story in much fewer words |
| Hans de Goede | Re: cat /proc/net/tcp takes 0.5 seconds on x86_64 |
| Stephen Hemminger | [RFC] TCP illinois max rtt aging |
| Tilman Schmidt | Re: 2.6.25-rc8: FTP transfer errors |
| Evgeniy Polyakov | Re: Network/block layer race. |
| high memory | 15 hours ago | Linux kernel |
| semaphore access speed | 17 hours ago | Applications and Utilities |
| the kernel how to power off the machine | 18 hours ago | Linux kernel |
| Easter Eggs in windows XP | 21 hours ago | Windows |
| Shared swap partition | 22 hours ago | Linux general |
| Root password | 22 hours ago | Linux general |
| Where/when DNOTIFY is used? | 1 day ago | Linux kernel |
| How to convert Linux Kernel built-in module into a loadable module | 1 day ago | Linux kernel |
| Linux 2.6.24 and I/O schedulers | 1 day ago | Linux kernel |
| USB Driver -- Interrupt Polling -- A Little Help Please | 1 day ago | Linux general |
