OpenBSD: Fixing The Idle Loop

Submitted by Jeremy
on May 24, 2005 - 7:20pm

Bob Beck is an OpenBSD developer from Edmonton in Canada. He's one of around 60 OpenBSD developers currently working in an undisclosed hotel somewhere in downtown Calgary at the 2005 OpenBSD hackathon [story]. Bob was involved in setting up the infrastructure [story], and was responsible for the annual barbecue at OpenBSD creator Theo de Raadt [interview]'s house [story]. Following these two days of effort that helped to make the hackathon possible, he finally sat down to work on spamd and catch up on email. One of the emails in his inbox caught his attention, leading to a day's effort about which he notes, "some Days end up far far far from where they start."

In the following article, Bob provides a first-person account of tracking down what began simply as a RAID performance issue, but ultimately turned out to be a problem with the idle loop that when fixed resulted in an impressive performance boost. Bob noted, "the idle loop is where the kernel spins when there is no work to do in userland, because of this, it's also where we catch and service many of our interrupts from drivers that may queue work to the device and then tsleep waiting for an interrupt from the card saying the work is done." Bob went on to explain that prior to today's fix, interrupts were handled appropriately when there was userland work happening, but not when there was nothing happening in userland and the kernel was simply waiting for device input/output. Read on for Bob's full account of the day, leading up to the discovery of the problem and the implementation of the fix, including performance numbers.


Bob Beck writes:
        A day at the OpenBSD Hackathon.. Some Days end up far far far from
where they start.



        Well, it's Monday at the OpenBSD hackathon. I'm tired - I've spent 
two days putting together the world here, and doing a barbecue for everyone 
with Theo. Now it's finally *my time*. I can sit down and work on what I want 
to rip apart, I fire up a couple of emacs windows and start redoing the spamd 
internals, and catch up on my mail. Hmm, interesting stuff is here from marco.

        Marco Pereboom and I have been chasing RAID performance issues for a 
couple of months. Horrible read perfomance, and it's down to a few machines. 
Some are fast, and some really suck. Marco, Art Grabowski, and Toby 
Weingartner have discovered a condition in locore.s in the idle loop where we 
could lose interrupts!. Hmm. I get in and take a look with them. Aha, a race! 
OK Mickey Shalayeff helps us with that one, and Toby and Dale Rahn help us 
dive into the locore.s assembler. Yeah, I'm getting into this, but I have a 
feeling I'm not in userland anymore toto.. but the hell with it, this is big, 
I really want it fixed, and it's fun...

        Ok, so we fix the race, we're up from 14 MB/sec to 50 MB/sec. 
Better... but there are still issues... Now what? Hmm... the APM code appears 
to affect us in the idle loop! if we don't call it as often, performance goes 
up.. Marco and Toby start experimenting with calling APM only every N times, 
with the idea to put in a knob. I hate knobs. I say "Call it only if apmd is 
active!"  Great! new diff time, it has good performance, we're up to 100MB/sec.
But wait! Theo steps in. No, we can't rely on userland being there because some
crappy laptops insist on APM being around to control thermal issues. We can't 
use those semantics... (Grumble, groan..) Back to the drawing board, and Toby,
wisely decides to read the APM spec... When the light went on it nearly blinded 
us, the noise level rises in the corner, and everyone starts getting involved.  
We're calling the APM idle functions in the idle loop, but the APM bios 
can do a "hlt". if we then we receive an interrupt we resume, but our 
idle loop then does *another* hlt instruction after the APM call, instead 
of continuing. (I will remember the look on Theo's face as I was 
telling him this for a long time...)

        Now that we know what's wrong, let's fix it right. Find the right 
semantics. The room gets loud now. deraadt, art, tholo, marco, beck, weingart,
niklas, dale, nate, and a bunch of others start throwing ideas off of each 
other. "make a thread"..  "no too heavyweight".. "do it in softclock".. "no you
can't hlt there, and APM might"... "make a knob"... "no we hate knobs"... "do 
it on ticks"...  "Hey wait"...  (Art types something horrible on my screen).. 
"Use this".. Yes! that's it.  Check the profiling counters and only call the 
bios idling function when we show the CPU_IDLE counters are increasing. (as 
opposed to every time we enter the idle loop, which may be when we are in a 
tsleep driver). Yep, this is the right semantics for us...

        Now, ami gets 105 MB/sec.
        My laptop goes from 12 MB/sec to 25 MB/sec...
        Nate's laptop goes from 12 MB/sec to 40 MB/sec...
        "Holy crap. only machines with busted APM were ever fast!"
        "this affects userland crypto too.."
        "this affects Ethernet drivers"
        "Hey, this makes battery life better too".
        "This affects EVERYTHING on i386"

        We go for beer, wait for the test reports to come in, and sure enough 
this is good. Come back and commit... Wow, we've spent a whole day on the IDLE 
LOOP!  But man was it worth it.

        Well, I've still not started my userland stuff.. but well, strange 
stuff happens at a hackathon. The kind of stuff where you get enough people 
in a room to chase stuff, and really get it right. Sometimes it's kind of fun 
to get sucked off into dimensions you had no idea you'd be in.

Translations:

Great job!

Legion (not verified)
on
May 25, 2005 - 2:52pm

FreeBSD had this issue identified and fixed back in 1994. Better late than never though...

Is there any chance that a pa

:e:yselfAndI (not verified)
on
May 25, 2005 - 6:09pm

Is there any chance that a patch for the 3.6 Patch branch will be released (when tested, of course)? It sounds like the kind of thing that people would enjoy rebooting their system for ..

This is a GOOD thing?

Craig (not verified)
on
May 25, 2005 - 8:02pm

There was a major bug that completely sabotaged performance everywhere and nobody noticed?

I don't understand why I should see this as a good thing.

What versions are affected and will a fix be released for -stable?

Umm...

JA (not verified)
on
May 25, 2005 - 10:52pm

Obviously you've never worked on a kernel before... stuff like this is often not nearly as easy to notice as you might think.

Thank you! I've dabbled in k

quel (not verified)
on
May 26, 2005 - 12:02am

Thank you! I've dabbled in kernel code and am currently taking an os course. All of the theory is trivial, implementation is a total PITA!

I've taught K

Klossner (not verified)
on
May 26, 2005 - 6:04pm

If you think the theory is trivial, you're taking the wrong course. Even a first-year course should discuss deadlock detection and avoidance, disk and processor scheduling, and fault tolerance.

Well i'm 20% through the cour

quel (not verified)
on
May 26, 2005 - 6:53pm

Well i'm 20% through the course. We are doing syncronization and deadlocks this week. To me the processor scheduling (multilevel feedback queue) is in theory quite easy. You are right deadlock detection and avoidance is far from trivial even in theory. However, the approach taken by most systems (windows and linux included) is to pretend that a deadlock never occurs. I think the approach of not worrying about it is trivial to understand ;)

the bug was in the idle loop.

tedu (not verified)
on
May 26, 2005 - 1:23am

the bug was in the idle loop. many times a busy server doesn't hit the idle loop that often, or has other interrupts from network happen, or doesn't support apm.

the problem hits hardest only on machines that have apm, are mostly doing just disk i/o, and have no other runnable processes.

I'm no OpenBSD expert but I'd

Anonymoose (not verified)
on
May 25, 2005 - 10:51pm

I'm no OpenBSD expert but I'd imagine that if you aren't using APM you're not affected. Since nearly all modern machines support ACPI, if one disables APM would this performance issue go away?

if you disable apm (or your b

tedu (not verified)
on
May 26, 2005 - 1:21am

if you disable apm (or your board doesn't support it), the bug doesn't bite nearly as hard.

two issues here (as far as i read)

jf (not verified)
on
May 26, 2005 - 2:23am

there are two issues here, as far as i can tell - the race issue, and the APM issue. What do u mean the bug doesnt bite nearly just as hard? Which bug?

either. both. no apm means

tedu (not verified)
on
May 26, 2005 - 1:15pm

either. both. no apm means you don't hlt twice. no apm means you don't make super slow calls into the bios which make the race window much bigger.

Now, ami gets 105 MB/sec.

Martin Nilsson (not verified)
on
May 26, 2005 - 7:31am

Can you get ami to perform as expected? On good hardvare (MegaRAID 320-2e & 10*15k disks) you should be able to read at least 500MB/s sequential.

I'm having trouble getting FreeBSD over ~150MB/s without increasing MAXPHYS.

ami is peforming as expected;

Marco Peereboom (not verified)
on
May 29, 2005 - 12:55am

ami is peforming as expected; you failed to ask how this test was run. The test is a dd from /dev/rsd1c to /dev/zero with a 1m bs. The disk is a 7 disk raid 5. If you'd run this test on two 7 disks raid 5's it'll scale up to 190MB/s. Now tweak some more, use RAID 0 and it'll probably breach 200MB/s. MAXPHYS would be benificial but the syetm implications are vast.


You can't get 500MB/s out of a 320MB/s bus. That is plain silly. Besides the point that the disks can't even stream that fast.

Carifications

Martin Nilsson (not verified)
on
May 29, 2005 - 3:30pm

MegaRAID 320-2e have TWO U320 busses so total theoretical limit is 640MB/s.

A modern 15K disk like Atlas 15KII can stream 70MB/s, no problem. Five of those should be able to saturate a U320 channel if doing seq transfers.

What controller, motherboard (chipset) and drives were used? MegaRAID cards before 320-2x are bandwith limited because of the slow i960 processor and memory on them.

Measuring RAID5 writes will not give high transfer rates, about 100MB/s sounds right at best.

I (and I beleive you too) are interested in the performance of the driver not the disks/controller!

Test with lots of disks in RAID0 and a modern controller/disks on a box with fast bus PCI-e or PCI-X so we can see how well the driver performs.

I'm no expert, but I wonder w

ShadowEyez (not verified)
on
May 26, 2005 - 1:40pm

I'm no expert, but I wonder why they don't release a new version AFTER the hackathon. Think about it - all the new improvements (like the loop fix and PF work) could be incorporated into the new release, rather than a bunch of -stable's added 1 week after 3.7

Just a thought...

because then the code would n

anymous (not verified)
on
May 26, 2005 - 2:07pm

because then the code would not be tested as thoroughly.

Nor would all projects born d

smeegel (not verified)
on
May 26, 2005 - 3:49pm

Nor would all projects born during the hackathon be finished. Why have a release with a number of untested, half-completed improvements?

No kidding. This isn't Linux

Iron Chef Idaho (not verified)
on
May 28, 2005 - 11:01am

No kidding. This isn't Linux we're talking about here. ;)

*cough* troll *cough*

on
May 28, 2005 - 11:07am

*cough* troll *cough*

Linux also affected?

Andreas Mohr (not verified)
on
May 26, 2005 - 3:47pm

2.6.11-ck8, arch/i386/kernel/apm.c/apm_cpu_idle():

(sorry for the botched formatting, I was unable to get *any* of the offered HTML tags to work right, they all failed in weird ways):

while (!need_resched()) {
if (use_apm_idle) {
unsigned int t;

t = jiffies;
switch (apm_do_idle()) {
case 0: apm_idle_done = 1;
if (t != jiffies) {
if (bucket) {
bucket = IDLE_LEAKY_MAX;
continue;
}
} else if (bucket) {
bucket--;
continue;
}
break;
case 1: apm_idle_done = 1;
break;
default: /* BIOS refused */
break;
}
}
if (original_pm_idle)
original_pm_idle();
else
default_idle();
jiffies_since_last_check = jiffies - last_jiffies;
if (jiffies_since_last_check > idle_period)
goto recalc;
}

This looks suspiciously like Linux also does the HLT mistake, since it does the APM idle call in apm_do_idle() and then *unconditionally* calls default_idle() (unless original_pm_idle is set, which might actually be the case, haven't checked) which then does the HLT instruction in the x86 implementation, *again* (in case the APM call was a HLTing implementation, that is).
Methinks there might be another need_resched() check missing right before calling the other idle functions...
Note that I haven't even bothered checking the non-APM implementations of the idle loop yet - I'm going to do that now...

not quite

chicks (not verified)
on
May 26, 2005 - 6:00pm

not sure what you missed but at least since 2.2 linux has this (or similar, it's from 2.6.12-rc4) in default_idle() among others:

if (!need_resched())
safe_halt();

i.e., the kernel will schedule instead of entering the hlt state, as necessary.

Idle bug?!

*_* (not verified)
on
May 26, 2005 - 11:58pm

Wow!
I never used OpenBSD, but, if you have access to a DOS or Windows 9x machine, you can test this, very noticeable:
Disk I/O speeds up by about 5 times if you move the mouse while the disk is working!
I noticed the same on Linux 2.0 and 2.2, and to a lesser extent (~2x speed difference) on Linux 2.4, 2.6, FreeBSD 5.x and QNX 6.x.
Is this the same bug? Do all of these OSs have this bug in their idle loop?
Or is this normal behaviour, somehow inherent in the hardware?

Mouse movement increasing text selection.

Shane (not verified)
on
May 27, 2005 - 7:50am

Or is this normal behaviour, somehow inherent in the hardware?

I have noticed for years in DOS and then Windows (not sure if all versions), that when selecting text, by click dragging from top to bottom, that moving the mouse side to side with the cursor at the bottom of the screen will increase the text scrolling speed (and thus completing the selection) by many times.

Same deal?

This is just a common error p

on
May 27, 2005 - 3:27pm

This is just a common error people make when dealing with messaging systems.

Every time you move mouse it causes more messages to be sent to the active windows. If something is IO-bound and waiting on a message, more frequent calls mean a more timely action. If however, you're IO-bound and not waiting on a precise message, then you have to switch something else, and in the case of windows, it does it on mouse position refreshes (or something similar).

It's not a very good idea imo. Ever tried running XCOM UFO on a new PC? Here they did the opposite and expected a calculation delay to be a good 'tick count' rather than wait on an interrupt timer delayed message.

Divide by zero.

Shane (not verified)
on
May 28, 2005 - 5:06am

Ever tried running XCOM UFO on a new PC? Here they did the opposite and expected a calculation delay to be a good 'tick count' rather than wait on an interrupt timer delayed message.

No but I remember the old XT version of Spacequest would not run at all on faster computers because it would complete a test calculation in zero time units (not sure if seconds, etc) and then halt at a divide by zero error for using that zero result in a calculation.

I got burned once purchasing a helicopter game which recommended a 33MHz i386. I had a 33MHz i486 and even with the "turbo" off, it was so fast as to be unplayable. I was pretty shocked that the 486 could be so much faster than the 386, but there you go.

If only people who charge for software, would program properly.

Haha!

*_* (not verified)
on
June 6, 2005 - 5:38pm

I was not talking about text scrolling speed though.
But disk I/O.
And I am completely serious ;)
Windows XP/2000 don't exhibit this (or at least it's not significant).
Windows 95/98 do...

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.