Linux: Kernel Preemption, To Enable Or Not To Enable

Submitted by Jeremy
on March 21, 2004 - 8:04pm

A recent bug report on the lkml complained of significant performance degradation after enabling kernel preemption (CONFIG_PREEMPT). 2.6 kernel maintainer Andrew Morton pointed out that such degradation is not normal and most likely means that preemption is triggering a bug. Nevertheless, an interesting conversation on the merits of kernel preemption followed.

Andrea Arcangeli, author of the 2.4 virtual memory subsystem, bluntly suggested, "keep preempt turned off always, it's useless. Preempt just wastes cpu with tons of branches in fast paths that should take one cycle instead." Andrew Morton noted, "preempt is overrated." However, he went on to point out that kernel preemption has been very useful in detecting locking bugs. He added, "it has been demonstrated that preempt improves the average latency. But not worst-case, because those paths tend to be under spinlock."

Robert Love replied directly to Andrea's comments saying, "I think you are really blowing the overhead of kernel preemption out of proportion," going on to agree with Andrew's statements. He further explained:

"I also feel you underestimate the improvements kernel preemption gives. Yes, the absolute worst case latency probably remains because it tends to occur under lock (although, it is now easier to pinpoint that latency and work some magic on the locks). But the variance of the latency goes way down, too. We smooth out the curve. And these are differences that matter."


From: Marinos J. Yannikos [email blocked]
To:  linux-kernel
Subject: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 05:00:01 +0100

Hi,

we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the 
default .config setting was CONFIG_PREEMPT=y. To get straight to the 
point: according to our measurements, this results in severe performance 
degradation with our typical and some artificial workload. By "severe" I 
mean this:

---

1. "production" load(*), kernel-compiles (make -j5, in both cases 
kernels with CONFIG_PREEMPT=n were compiled), dual Xeon 3,06GHz/4GB RAM, 
SMP+HT.

2.6.4 with CONFIG_PREEMPT=y:
real    4m41.741s
user    6m13.631s
sys     0m43.729s

2.6.4 with CONFIG_PREEMPT=n:
real    2m20.424s
user    5m54.498s
sys     0m37.297s

The slowness during the compilation was very noticeable on the console.

2. artificial load(**), kernel-compiles (make -j5), single 2,8GHz P4 
with SMP+HT:

2.6.4 with CONFIG_PREEMPT=y:
real    7m40.933s
user    5m42.454s
sys     0m28.114s

2.6.4 with CONFIG_PREEMPT=n:
real    7m13.735s
user    4m56.266s
sys     0m35.495s

3. no noticeable slowdown with no load at all during the benchmarking 
was observed.

(*) busy webserver doing apache/mod_perl stuff, ~40 static/~10 dynamic 
hits/sec, ~20% CPU usage; typical load has no significant 
jumps/anomalies in this time frame (it's also one of 3 boxes in a 
cluster with a hardware load balancer)
(**) "ab -n 100000 -c2 <some simple CGI script>" from another box 
running at the same time
---

Now, we can go into details about our methodology and what else may have 
been borked about our boxes, but what I'm wondering about is (assuming 
that we didn't mess things up somewhere and other people can confirm 
this!): why in the world would anyone want CONFIG_PREEMPT=y as a default 
setting when it has such an impact on performance in actual production 
environments? Is 2.6 intended to become the "Desktop Linux" code path? 
If not, shouldn't there be a big warning sticker somewhere that says 
"DON'T EVEN THINK ABOUT KEEPING THIS DEFAULT SETTING UNLESS ALL YOU WANT 
TO DO IS LISTEN TO MP3 AUDIO WHILE USING XFREE86!"? I've been skimming 
over older lkml postings about performance problems with 2.6.x and many 
of them, while obviously being about systems with CONFIG_PREEMPT=y, 
don't even mention the fact that the degradation might be because of 
that setting. Noone seems to know/care. Why?

Regards,
  Marinos
-- 
Dipl.-Ing. Marinos Yannikos, CEO
Preisvergleich Internet Services AG
Franzensbrückenstraße 8/2/16, A-1020 Wien
Tel./Fax: (+431) 5811609-52/-55


From: Andrew Morton [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Wed, 17 Mar 2004 21:12:27 -0800

"Marinos J. Yannikos" [email blocked] wrote:
>
> we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> default .config setting was CONFIG_PREEMPT=y. To get straight to the
> point: according to our measurements, this results in severe performance
> degradation with our typical and some artificial workload. By "severe" I
> mean this:

You're the first to report this.

On a little 256MB P4-HT box here, running 2.6.5-rc1:

preempt:
make -j3 vmlinux 272.49s user 16.32s system 196% cpu 2:26.77 total

non-preempt:
make -j3 vmlinux 271.64s user 15.65s system 195% cpu 2:27.25 total

> If not, shouldn't there be a big warning sticker somewhere that says
> "DON'T EVEN THINK ABOUT KEEPING THIS DEFAULT SETTING UNLESS ALL YOU WANT
> TO DO IS LISTEN TO MP3 AUDIO WHILE USING XFREE86!"?

Dude, chill ;)

Something seems to be pretty busted there. Is the machine swapping at all?
Under any sort of memory stress? How does it compare with 2.4 running the
same workloads?

Suggest you run some simple IO benchmarks (straight dd, hdparm -t,
bonnie++, tiobench, etc) and compute-intensive tasks, etc. Try to narrow
down exactly what part of the kernel's operation is being impacted. Is it
context switches, I/O rates, disk fragmentation, etc, etc.

If you can distill to regression down to the most simple test then we can
identify its cause more easily.


From: Andrea Arcangeli [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 07:03:58 +0100

On Thu, Mar 18, 2004 at 05:00:01AM +0100, Marinos J. Yannikos wrote:
> Hi,
>
> we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> default .config setting was CONFIG_PREEMPT=y. To get straight to the
> point: according to our measurements, this results in severe performance
> degradation with our typical and some artificial workload. By "severe" I
> mean this:

this is expected (see the below email, I predicted it on Mar 2000), keep
preempt turned off always, it's useless. Worst of all we're now taking
spinlocks earlier than needed, and the preempt_count stuff isn't
optmized away by PREEMPT=n, once those bits will be fixed too it'll go
even faster.

preempt just wastes cpu with tons of branches in fast paths that should
take one cycle instead.

Takashi Iwai did lots of research on the preempt vs lowlatency and
he found that preempt buys nothing and he confirmed my old theories
(I always advocated against preempt, infact I still advocate for
"enabling" preempt on demand during the copy-user only, so to enable
preempt on demand, not to disable it on demand like it happens now with
the bloats it generates), infact 2.4-aa has a lower max latency than 2.6
stock with preempt enabled.

About my old idea of enabling preempt on demand (i.e. the opposite of
preempt) in the copy-user we've to check for reschedule anyways, so we
can as well enable preempt, and that would still keep the kernel simple
and efficient. This way we would dominate the latency during the bulk
work (especially important with bigger page size or with page
clustering).

These fixes from Takashi Iwai brings 2.6 back in line with 2.4, I
suggested to use EIP dumps from interrupts to get the hotspots, he
promptly used the RTC for that and he could fixup all the spots, great
job he did since now we've a very low worst case sched latency in 2.6
too:

--- linux/fs/mpage.c-dist 2004-03-10 16:26:54.293647478 +0100
+++ linux/fs/mpage.c 2004-03-10 16:27:07.405673634 +0100
@@ -695,6 +695,7 @@ mpage_writepages(struct address_space *m
                                 unlock_page(page);
                         }
                         page_cache_release(page);
+                        cond_resched();
                         spin_lock(&mapping->page_lock);
                 }
                 /*
--- linux/fs/super.c-dist 2004-03-09 19:28:58.482270871 +0100
+++ linux/fs/super.c 2004-03-09 19:29:05.000792950 +0100
@@ -356,6 +356,7 @@ void sync_supers(void)
 {
         struct super_block * sb;
 restart:
+        cond_resched();
         spin_lock(&sb_lock);
         sb = sb_entry(super_blocks.next);
         while (sb != sb_entry(&super_blocks))
--- linux/fs/fs-writeback.c-dist 2004-03-09 19:15:25.237752504 +0100
+++ linux/fs/fs-writeback.c 2004-03-09 19:16:37.630330614 +0100
@@ -360,6 +360,7 @@ writeback_inodes(struct writeback_contro
         }
         spin_unlock(&sb_lock);
         spin_unlock(&inode_lock);
+        cond_resched();
 }

 /*

I'm actually for dropping preempt from the kernel and to try to
implement my old idea of enabling preempt on demand in a few latency
critical spots. So instead of worrying about taking spinlocks too early
and calling preempt_disable, the only thing the kernel will do w.r.t.
preempt is:

        enter_kernel:

                preempt_enable()
                copy_user()
                preempt_disable()

        exit_kernel:

I think it's perfectly acceable to do the above (i.e. the opposite of
preempt). While I think preempt is overkill.

More details on this in the old posts (I recall I even did a quick hack
to try if it worked, I'm surprised how old this email is but it's still
very actual apparently):

        http://www.ussg.iu.edu/hypermail/linux/kernel/0003.1/0998.html

        "With the fact we'll have to bloat the fast path (a fast lock
        like the above one and all the spinlocks will need an additional
        forbid_preempt(smp_processor_id()) the preemtable kernel it's
        not likely to be a win.

        The latency will decrease without drpping throughtput only in
        code that runs for long time with none lock held like the
        copy_user stuff. That stuff will run at the same speed as now
        but with zero scheduler latency. The _lose_ instead will happen
        in _all_ the code that grabs any kind of spinlock because
        spin_lock/spin_unlock will be slower and the latency won't
        decrease for that stuff.

        But now by thinking at that stuff I have an idea! Why instead of
        making the kernel preemtable we take the other way around? So
        why instead of having to forbid scheduling in locked regions, we
        don't simply allow rescheduling in some piece of code that we
        know that will benefit by the preemtable thing?

        The kernel won't be preemtable this way (so we'll keep
        throughtput in the locking fast path) but we could mark special
        section of kernel like the copy user as preemtable.

        It will be quite easy:

        static atomic_t cpu_preemtable[NR_CPUS] = { [0..NR_CPUS] = ATOMIC_INIT(0), };

        #define preemtable_copy_user(...) \
        do { \
                atomic_inc(&cpu_preemtable[smp_processor_id()]); \
                copy_user(...); \
                atomic_dec(&cpu_preemtable[smp_processor_id()]); \
        } while (0)

        [..]
        "

I still think after 4 years that such idea is more appealing then
preempt, and numbers start to prove me right.


From: Andrew Morton [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 01:50:04 -0800

Andrea Arcangeli [email blocked] wrote:
>
> On Thu, Mar 18, 2004 at 05:00:01AM +0100, Marinos J. Yannikos wrote:
> > Hi,
> >
> > we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> > default .config setting was CONFIG_PREEMPT=y. To get straight to the
> > point: according to our measurements, this results in severe performance
> > degradation with our typical and some artificial workload. By "severe" I
> > mean this:
>
> this is expected (see the below email, I predicted it on Mar 2000),

Incorrectly.

> keep preempt turned off always, it's useless.

Preempt is overrated. The infrastructure which it has introduced has been
useful for detecting locking bugs.

It has been demonstrated that preempt improves the average latency. But
not worst-case, because those paths tend to be under spinlock.

> Worst of all we're now taking spinlocks earlier than needed,

Where? CPU scheduler?

> and the preempt_count stuff isn't optmized away by PREEMPT=n,

It should be. If you see somewhere where it isn't, please tell us.

We unconditionally bump the preempt_count in kmap_atomic() so that we can
use atomic kmaps in read() and write(). This is why four concurrent
write(fd, 1, buf) processes on 4-way is 8x faster than on 2.4 kernels.

> preempt just wastes cpu with tons of branches in fast paths that should
> take one cycle instead.

I don't recall anyone demonstrating even a 1% impact from preemption. If
preemption was really causing slowdowns of this magnitude it would of
course have been noticed. Something strange has happened here and more
investigation is needed.

> ...
> I still think after 4 years that such idea is more appealing then
> preempt, and numbers start to prove me right.

The overhead of CONFIG_PREEMPT is quite modest. Measuring that is simple.


From: Andrea Arcangeli [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 15:51:29 +0100

On Thu, Mar 18, 2004 at 01:50:04AM -0800, Andrew Morton wrote:
> Andrea Arcangeli [email blocked] wrote:
> >
> > On Thu, Mar 18, 2004 at 05:00:01AM +0100, Marinos J. Yannikos wrote:
> > > Hi,
> > >
> > > we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> > > default .config setting was CONFIG_PREEMPT=y. To get straight to the
> > > point: according to our measurements, this results in severe performance
> > > degradation with our typical and some artificial workload. By "severe" I
> > > mean this:
> >
> > this is expected (see the below email, I predicted it on Mar 2000),
>
> Incorrectly.
>
> > keep preempt turned off always, it's useless.
>
> Preempt is overrated. The infrastructure which it has introduced has been
> useful for detecting locking bugs.

yes, I agree preempt is useful to debug SMP systems on UP systems, but
it's debugging feature so it should be disabled on production systems.

> It has been demonstrated that preempt improves the average latency. But
> not worst-case, because those paths tend to be under spinlock.

Sure, I agree it improves average latency, the problem is that there are
nearly no application that cares about average, what matters is the
worst case latency only.

> > Worst of all we're now taking spinlocks earlier than needed,
>
> Where? CPU scheduler?

Everywhere, see the kmaps, we spinlock before instead of spinlock after,
the scheduler, lots of places. I mean, people don't call

        preempt_disable()
        kmap_atomic
        spin_lock

they do:

        spin_lock
        kmap_atomic

so they're effectively optimizing for PREEMPT=y and I don't think this
is optimal for the long term. One can aruge the microscalability
slowdown isn't something to worry about, I certainly don't worry about
it too much either, it's more a bad coding habit to spinlock earlier
than needed to avoid preempt_disable.

> > and the preempt_count stuff isn't optmized away by PREEMPT=n,
>
> It should be. If you see somewhere where it isn't, please tell us.

the counter is definitely not optimized away, see:

#define inc_preempt_count() \
do { \
        preempt_count()++; \
} while (0)

#define dec_preempt_count() \
do { \
        preempt_count()--; \
} while (0)

#define preempt_count() (current_thread_info()->preempt_count)

those are running regardless of PREEMPT=n. This is debugging code to
catch some basic preempt issue with in_interrupt() and friends even
with PREEMPT=n, but it wastes 1 cacheline per-cpu during irq handling.

> We unconditionally bump the preempt_count in kmap_atomic() so that we can
> use atomic kmaps in read() and write(). This is why four concurrent
> write(fd, 1, buf) processes on 4-way is 8x faster than on 2.4 kernels.

sorry, why should the atomic kmaps read the preempt_count? Are those ++
-- useful for anything more than debugging PREEMPT=y on a kernel
compiled with PREEMPT=n? I thought it was just debugging code with
PREEMPT=n.

I know why the atomic kmaps speedup write but I don't see how can
preempt_count help there when PREEMPT=n, the atomic kmaps are purerly
per-cpu and one can't schedule anyways while taking those kmaps (no
matter if inc_preempt_count or not).

> > preempt just wastes cpu with tons of branches in fast paths that should
> > take one cycle instead.
>
> I don't recall anyone demonstrating even a 1% impact from preemption. If
> preemption was really causing slowdowns of this magnitude it would of
> course have been noticed. Something strange has happened here and more
> investigation is needed.

I'm also surprised the slowdown is so huge, maybe he tweaked the
CONFIG_SLAB at the same time of PREEMPT? ;) Anyways there is a slowdown,
and the whole point is that preempt doesn't improve the worst case
latency at all.

> > ...
> > I still think after 4 years that such idea is more appealing then
> > preempt, and numbers start to prove me right.
>
> The overhead of CONFIG_PREEMPT is quite modest. Measuring that is simple.

It is quite modest I agree, but there is an overhead and it doesn't
payoff.

BTW, with preempt enabled there is no guarantee that RCU can ever reach
a quiescient point and secondly there is no guarantee that you will ever
be allowed to unplug a CPU hotline since again there's no guarantee to
reach a quiescient point. Think a kernel thread doing for (;;) (i.e.
math computations in background, to avoid starving RCU the kernel thread
will have to add schedule() explicitly no matter if PREEMPT=y or
PREEMPT=n, again invalidating the point of preempt, the rcu tracking for
PREEMT=y is also more expensive).

Note, the work you and the other preempt developers did with preempt was
great, it wouldn't be possible to be certain that it wasn't worthwhile
until we had the thing working and finegrined (i.e. in all in_interrupt
etc..), and now we know it doesn't payoff and in turn I'm going to try
the explicit-preempt that is to explicitly enable preempt in a few
cpu-intensive kernel spots where we don't take locks (i.e. copy-user),
the original suggestion I did 4 years ago, I believe in such places an
explicit-preempt will work best since we've already to check every few
bytes the current->need_resched, so adding a branch there should be very
worthwhile. Doing real preempt like now is overkill instead and should
be avoided IMHO.


From: Robert Love [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 10:34:22 -0500

On Thu, 2004-03-18 at 09:51, Andrea Arcangeli wrote:

> Note, the work you and the other preempt developers did with preempt was
> great, it wouldn't be possible to be certain that it wasn't worthwhile
> until we had the thing working and finegrined (i.e. in all in_interrupt
> etc..), and now we know it doesn't payoff and in turn I'm going to try
> the explicit-preempt that is to explicitly enable preempt in a few
> cpu-intensive kernel spots where we don't take locks (i.e. copy-user),
> the original suggestion I did 4 years ago, I believe in such places an
> explicit-preempt will work best since we've already to check every few
> bytes the current->need_resched, so adding a branch there should be very
> worthwhile. Doing real preempt like now is overkill instead and should
> be avoided IMHO.

I think you are really blowing the overhead of kernel preemption out of
proportion.

The numbers Marinos J. Yannikos are reported are definitely a bug, an
issue, something that will be fixed. The numbers everyone else has
historically shown are in line with Andrew: slight changes and generally
a small improvement to kernel compiles, dbench runs, et cetera.

I also feel you underestimate the improvements kernel preemption gives.
Yes, the absolute worst case latency probably remains because it tends
to occur under lock (although, it is now easier to pinpoint that latency
and work some magic on the locks). But the variance of the latency goes
way down, too. We smooth out the curve. And these are differences that
matter.

And it can be turned off, so if you don't care about that and are not
debugging atomicity (which preempt is a big help with, right?) then turn
it off.

Oh, and if the PREEMPT=n overhead is really an issue, then I agree that
needs to be fixed :)

        Robert Love


From: Andrea Arcangeli [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 17:01:09 +0100

Hi Robert,

On Thu, Mar 18, 2004 at 10:34:22AM -0500, Robert Love wrote:
> I also feel you underestimate the improvements kernel preemption gives.

Takashi benchmarked the worst case latency in very good detail. 2.6
stock with PREEMPT=y has a worst case latency of 2.4-aa. This is a fact.

With Takashi's lowlatency fixes the latency goes below 2.4-aa, w/ or w/o
PREEMPT. PREEMPT=y doesn't and cannot improve the worst case latency.
This is true today like it was true 4 years ago.

> Yes, the absolute worst case latency probably remains because it tends
> to occur under lock (although, it is now easier to pinpoint that latency
> and work some magic on the locks). But the variance of the latency goes
> way down, too. We smooth out the curve. And these are differences that
> matter.

I don't think they can matter when the worst case is below 0.2msec.

> And it can be turned off, so if you don't care about that and are not
> debugging atomicity (which preempt is a big help with, right?) then turn
> it off.

I want to implement my aged idea that is to do the opposite of preempt.
I believe that is a much more efficient way to smooth the curve at lower
overhead and no kernel complexity. Preempt is always enabled as soon as
the cpu enters kernel. And it can be disabled on demand. I want preempt
to be disabled as soon as teh cpu enters kernel, and I want to enable it
on demand _only_ during the copy user, or similar cpu intensive
operations, also guaranteeing that those operations comes to an end to
avoid RCU starvation. Then I would like to ompare the average latency
(the curve) I doubt they'll be any different, and the overhead will be
zero (we've to check need_resched anyways after a copy-user, so we can
as well do preempt_enable preembt_disable around it).

> Oh, and if the PREEMPT=n overhead is really an issue, then I agree that
> needs to be fixed :)

It's not a big issue of course (very low prio thing ;).


From: Robert Love [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 12:48:50 -0500

On Thu, 2004-03-18 at 09:51, Andrea Arcangeli wrote:

> the counter is definitely not optimized away, see:

This is because of work Dave Miller and Ingo did - irq count, softirq
count, and lock count (when PREEMPT=y) are unified into preempt_count.

So it is intended.

The unification makes things cleaner and simpler, using one value in
place of three and one interface and concept in place of many others.
It also gives us a single simple thing to check for an overall notion of
"atomicity", which is what makes debugging so nice.

        Robert Love


From: Andrea Arcangeli [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 19:00:59 +0100

On Thu, Mar 18, 2004 at 12:48:50PM -0500, Robert Love wrote:
> On Thu, 2004-03-18 at 09:51, Andrea Arcangeli wrote:
>
> > the counter is definitely not optimized away, see:
>
> This is because of work Dave Miller and Ingo did - irq count, softirq
> count, and lock count (when PREEMPT=y) are unified into preempt_count.
>
> So it is intended.
>
> The unification makes things cleaner and simpler, using one value in
> place of three and one interface and concept in place of many others.
> It also gives us a single simple thing to check for an overall notion of
> "atomicity", which is what makes debugging so nice.

You're right, I didn't notice the other counters disappeared. Those
counter existed anyways w/o preempt too, so it would been superflous
with preempt=y to do the accounting in two places.

So this is zerocost with preempt=n and I was wrong claiming superflous
preempt leftovers.


From: Takashi Iwai [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 16:28:16 +0100

At Thu, 18 Mar 2004 07:03:58 +0100,
Andrea Arcangeli wrote:
>
> On Thu, Mar 18, 2004 at 05:00:01AM +0100, Marinos J. Yannikos wrote:
> > Hi,
> >
> > we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> > default .config setting was CONFIG_PREEMPT=y. To get straight to the
> > point: according to our measurements, this results in severe performance
> > degradation with our typical and some artificial workload. By "severe" I
> > mean this:
>
> this is expected (see the below email, I predicted it on Mar 2000), keep
> preempt turned off always, it's useless. Worst of all we're now taking
> spinlocks earlier than needed, and the preempt_count stuff isn't
> optmized away by PREEMPT=n, once those bits will be fixed too it'll go
> even faster.
>
> preempt just wastes cpu with tons of branches in fast paths that should
> take one cycle instead.
>
> Takashi Iwai did lots of research on the preempt vs lowlatency and
> he found that preempt buys nothing and he confirmed my old theories

well, i personally am not against the current preempt mechanism from
the viewpoint of the audio-processing purpose :) the implementation
is relatively clean and easy.

but i agree with Andrea, that surely we can achieve the alsmo same
RT-performance even without preemption, i.e. with less perempt
overhead. it's not necessary to be default.

(snip)
> These fixes from Takashi Iwai brings 2.6 back in line with 2.4, I
> suggested to use EIP dumps from interrupts to get the hotspots, he
> promptly used the RTC for that and he could fixup all the spots, great
> job he did since now we've a very low worst case sched latency in 2.6
> too:
>
> --- linux/fs/mpage.c-dist 2004-03-10 16:26:54.293647478 +0100
> +++ linux/fs/mpage.c 2004-03-10 16:27:07.405673634 +0100
> @@ -695,6 +695,7 @@ mpage_writepages(struct address_space *m
>                                  unlock_page(page);
>                          }
>                          page_cache_release(page);
> +                        cond_resched();
>                          spin_lock(&mapping->page_lock);
>                  }
>                  /*

the above one is the major source of RT-latency.
only this oneliner will reduce more than 90% of RT-latencies.
in my case with reiserfs, i got 0.4ms RT-latency with my test suite
(with athlon 2200+).

there is another point to be fixed in the reiserfs journal
transaction. then you'll get 0.1ms RT-latency without preemption.

for ext3, these two spots are relevant.

--- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
+++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
@@ -290,6 +290,9 @@ write_out_data_locked:
                         commit_transaction->t_sync_datalist = jh;
                         break;
                 }
+
+                if (need_resched())
+                        break;
         } while (jh != last_jh);

         if (bufs || need_resched()) {
--- linux-2.6.4-8/fs/ext3/inode.c-dist 2004-03-18 02:33:38.000000000 +0100
+++ linux-2.6.4-8/fs/ext3/inode.c 2004-03-18 02:33:40.000000000 +0100
@@ -1987,6 +1987,7 @@ static void ext3_free_branches(handle_t
         if (is_handle_aborted(handle))
                 return;

+        cond_resched();
         if (depth--) {
                 struct buffer_head *bh;
                 int addr_per_block = EXT3_ADDR_PER_BLOCK(inode->i_sb);

i think the first one is needed for preemptive kernel, too.
with these patches, also 0.1-0.2ms RT-latency is achieved.

BTW, my measurement tool is found at
        http://www.alsa-project.org/~iwai/latencytest-0.5.2.tar.gz

--
Takashi Iwai [email blocked]
ALSA Developer - www.alsa-project.org


From: Robert Love [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 10:40:58 -0500

On Thu, 2004-03-18 at 10:28, Takashi Iwai wrote:

Hi, Takashi Iwai.

> well, i personally am not against the current preempt mechanism from
> the viewpoint of the audio-processing purpose :) the implementation
> is relatively clean and easy.

Agreed.

> i think the first one is needed for preemptive kernel, too.
> with these patches, also 0.1-0.2ms RT-latency is achieved.

Ohh, interesting. I'll give these a spin with PREEMPT=y and see.

Thank you!

        Robert Love


From: Andrea Arcangeli [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 16:42:50 +0100

On Thu, Mar 18, 2004 at 04:28:16PM +0100, Takashi Iwai wrote:
> the above one is the major source of RT-latency.
> only this oneliner will reduce more than 90% of RT-latencies.
> in my case with reiserfs, i got 0.4ms RT-latency with my test suite
> (with athlon 2200+).

cool ;)

> there is another point to be fixed in the reiserfs journal
> transaction. then you'll get 0.1ms RT-latency without preemption.
[..]
> i think the first one is needed for preemptive kernel, too.
> with these patches, also 0.1-0.2ms RT-latency is achieved.

amazing. And without those improvements you did, the worst case for 2.6
mainline is 8msec, right?

thanks for the great work on lowlatency. I'm sure Andrew will pick those
improvements immediatly :).


From: Andrew Morton [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 11:01:59 -0800

Takashi Iwai [email blocked] wrote:
>
> > These fixes from Takashi Iwai brings 2.6 back in line with 2.4, I
> > suggested to use EIP dumps from interrupts to get the hotspots, he
> > promptly used the RTC for that and he could fixup all the spots, great
> > job he did since now we've a very low worst case sched latency in 2.6
> > too:
> >
> > --- linux/fs/mpage.c-dist 2004-03-10 16:26:54.293647478 +0100
> > +++ linux/fs/mpage.c 2004-03-10 16:27:07.405673634 +0100
> > @@ -695,6 +695,7 @@ mpage_writepages(struct address_space *m
> >                                  unlock_page(page);
> >                          }
> >                          page_cache_release(page);
> > +                        cond_resched();
> >                          spin_lock(&mapping->page_lock);
> >                  }
> >                  /*
>
> the above one is the major source of RT-latency.

This one's fine.

> for ext3, these two spots are relevant.
>
> --- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
> +++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
> @@ -290,6 +290,9 @@ write_out_data_locked:
>                          commit_transaction->t_sync_datalist = jh;
>                          break;
>                  }
> +
> +                if (need_resched())
> +                        break;
>          } while (jh != last_jh);
>
>          if (bufs || need_resched()) {

This one I need to think about. Perhaps we can remove the yield point a
few lines above.

One needs to be really careful with the lock-dropping trick - there are
weird situations in which the kernel fails to make any forward progress.
I've been meaning to do another round of latency tuneups for ages, so I'll
check this one out, thanks.

There's also the SMP problem: this CPU could be spinning on a lock with
need_resched() true, but the other CPU is hanging on the lock for ages
because its need_resched() is false. In the 2.4 ll patch I solved that via
the scary hack of broadcasting a reschedule instruction to all CPUs if an
rt-prio task just became runnable. In 2.6-preempt we use
preempt_spin_lock().

But in 2.6 non-preempt we have no solution to this, so worst-case
scheduling latencies on 2.6 SMP CONFIG_PREEMPT=n are high.

Last time I looked the worst-case latency is in fact over in the ext3
checkpoint code. It's under spinlock and tricky to fix.

> --- linux-2.6.4-8/fs/ext3/inode.c-dist 2004-03-18 02:33:38.000000000 +0100
> +++ linux-2.6.4-8/fs/ext3/inode.c 2004-03-18 02:33:40.000000000 +0100
> @@ -1987,6 +1987,7 @@ static void ext3_free_branches(handle_t
>          if (is_handle_aborted(handle))
>                  return;
>
> +        cond_resched();
>          if (depth--) {
>                  struct buffer_head *bh;
>                  int addr_per_block = EXT3_ADDR_PER_BLOCK(inode->i_sb);
>

This one's OK.

btw, several months ago we discussed the idea of adding a sysctl to the
ALSA drivers which would cause a dump_stack() to be triggered if the audio
ISR detected a sound underrun.

This would be a very useful feature, because it increases the number of
low-latency developers from O(2) to O(lots). If some user is complaining
of underruns we can just ask them to turn on the sysctl and we get a trace
pointing at the culprit code.

And believe me, we need the coverage. There are all sorts of weird code
paths which were found during the development of the 2.4 low-latency patch.
i2c drivers, fbdev drivers, all sorts of things which you and I don't
test.

I know it's a matter of

        if (sysctl_is_set)
                dump_stack();

in snd_pcm_update_hw_ptr_post() somewhere, but my brain burst when working
out the ALSA sysctl architecture.

Is this something you could add please?


From: Takashi Iwai [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 20:08:59 +0100

At Thu, 18 Mar 2004 11:01:59 -0800,
Andrew Morton wrote:
>
> btw, several months ago we discussed the idea of adding a sysctl to the
> ALSA drivers which would cause a dump_stack() to be triggered if the audio
> ISR detected a sound underrun.
>
> This would be a very useful feature, because it increases the number of
> low-latency developers from O(2) to O(lots). If some user is complaining
> of underruns we can just ask them to turn on the sysctl and we get a trace
> pointing at the culprit code.
>
> And believe me, we need the coverage. There are all sorts of weird code
> paths which were found during the development of the 2.4 low-latency patch.
> i2c drivers, fbdev drivers, all sorts of things which you and I don't
> test.
>
> I know it's a matter of
>
>         if (sysctl_is_set)
>                 dump_stack();
>
> in snd_pcm_update_hw_ptr_post() somewhere, but my brain burst when working
> out the ALSA sysctl architecture.
>
> Is this something you could add please?

oh, sorry, maybe i forgot to tell you that it has been already there
:)

        # echo 1 > /proc/asound/card0/pcm0p/xrun_debug

this will show the stacktrace when a buffer overrun/underrun is
detected in the irq handler. it's not perfect, though.

we can add stacktracing in other nasty places, e.g. when the
unexpected h/w pointer is returned (this is usually because of sloppy
irq handling).

Takashi


From: Andrew Morton [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 11:18:07 -0800

Takashi Iwai [email blocked] wrote:
>
> oh, sorry, maybe i forgot to tell you that it has been already there
> :)
>
>         # echo 1 > /proc/asound/card0/pcm0p/xrun_debug
>
> this will show the stacktrace when a buffer overrun/underrun is
> detected in the irq handler.  it's not perfect, though.

heh, you just shrunk my todo list by 0.01%.

Have you had any useful reports from this feature?

From: Takashi Iwai [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 20:20:17 +0100

At Thu, 18 Mar 2004 11:18:07 -0800,
Andrew Morton wrote:
>
> Takashi Iwai [email blocked] wrote:
> >
> > oh, sorry, maybe i forgot to tell you that it has been already there
> > :)
> >
> >         # echo 1 > /proc/asound/card0/pcm0p/xrun_debug
> >
> > this will show the stacktrace when a buffer overrun/underrun is
> > detected in the irq handler.  it's not perfect, though.
>
> heh, you just shrunk my todo list by 0.01%.
>
> Have you had any useful reports from this feature?

not much yet.  i should write this feature in documentation now :)

also, the total buffer underrun/overrun problem doesn't happen *so*
often (except for the heavy i/o load, etc).

Takashi

From: Andrea Arcangeli [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 20:43:45 +0100

On Thu, Mar 18, 2004 at 08:20:17PM +0100, Takashi Iwai wrote:
> also, the total buffer underrun/overrun problem doesn't happen *so*
> often (except for the heavy i/o load, etc).

you mean because the disk has not enough bandwidth to read the file,
right? (not scheduler related)

From: Takashi Iwai [email blocked]
Subject: Re: CONFIG_PREEMPT and server workloads
Date: Thu, 18 Mar 2004 20:50:01 +0100

At Thu, 18 Mar 2004 20:43:45 +0100,
Andrea Arcangeli wrote:
>
> On Thu, Mar 18, 2004 at 08:20:17PM +0100, Takashi Iwai wrote:
> > also, the total buffer underrun/overrun problem doesn't happen *so*
> > often (except for the heavy i/o load, etc).
>
> you mean because the disk has not enough bandwidth to read the file,
> right? (not scheduler related)

it's just my general opinion judging from the bug reports from ALSA
users.  the i/o load has more influence than scheduler-sensitive
loads.

i think i can test the "pseudo" interactivity with the latency test
program running as a normal user.  i'll try it tonight.

Takashi
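
The xrun_debug switch Takashi describes can also be flipped from a program instead of a shell. The following is only a sketch: the /proc path is the one quoted in his mail, while the helper name write_setting and the XRUN_DEBUG_PATH macro are made up for illustration. Writing to the file normally requires root, and the path will differ for other cards or PCM devices.

```c
#include <stdio.h>

/* Path taken verbatim from Takashi's mail; adjust for other cards/devices. */
#define XRUN_DEBUG_PATH "/proc/asound/card0/pcm0p/xrun_debug"

/*
 * Hypothetical helper: write `value` to `path`, roughly what
 * `echo value > path` does in a shell.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int write_setting(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;

    ok = (fputs(value, f) >= 0);
    /* fclose() flushes the buffer; a failure here means the write was lost. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling write_setting(XRUN_DEBUG_PATH, "1\n") would then correspond to the echo command in the mail, and a return value of -1 suggests the file is missing or the caller lacks permission.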

Latency

Con Kolivas
on March 21, 2004 - 8:47pm

This in-kernel preemption was much more useful in 2.4 and before, when the latencies in the kernel were much greater. In 2.6 most of the high-latency areas have been addressed and the remaining few are also being looked at. When the scheduling latency due to priorities and timeslices is on the order of 10ms to 10s, it is hard to believe that latency changes of tens of microseconds help. Furthermore, if you have a poorly coded application that is prone to priority inversion, preemption has been shown to make it worse.

so..

Hiryu
on March 21, 2004 - 10:25pm

So Con, does this mean you feel that preemption no longer makes sense in 2.6?

Does this say anything about other OSes with preemptive kernels, such as MacOS X and, I think, FreeBSD 5.*? Is Linux's latency just *really* good or what?

Preemption

Con Kolivas
on March 22, 2004 - 12:32am

The linux kernel is preemptible by default. In-kernel preemption is not adding to normal scheduling policies any more. The latency decrease by preemption is never going to make linux a hard real time OS but it may help if you're using RT tasks and your timing is critical. So, yes, I don't think preempt need be enabled in 2.6. I'd rather see all remaining latencies fixed.

Realtime tasks

Anonymous
on March 21, 2004 - 10:49pm

No, it is realtime tasks that care about maximum latency. The scheduler doesn't have any involvement other than correctly implementing the standards.

Preemption actually becomes more beneficial as lock hold times are reduced...

how does that make sense?

Anonymous
on March 22, 2004 - 5:07am

Preemption actually becomes more beneficial as lock hold times are reduced...

if i'm holding lock a for .3 ms, but you hold lock c for .8 ms... don't you think it'd be more beneficial to preempt lock c?? since it's being held more than double the time of lock a? please explain this.. i'm not getting it.

Preemption vs locking

farnz
on March 22, 2004 - 10:48am

You don't preempt a lock; a lock is used to ensure that you are not interrupted at a critical point. Preemption allows the kernel to suspend a function it's working on and do something else for a bit.

When lock hold times are long, preemption doesn't help much, because although I can interrupt what you're doing, I have to wait until you've released the lock to do anything useful; when lock hold times are short, there are longer periods of time when I can interrupt you and do something.

So, if you run for (say) 1 ms, and hold lock a for 0.3 ms, there's 0.7 ms when I can interrupt you and do things, even if I need lock a. If I run for 1 ms, and hold lock c for 0.8 ms, there's only 0.2 ms of latency saved by preemption if someone needs lock c.
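
The arithmetic in that last paragraph can be written down as a tiny C helper. This is only a sketch of the reasoning, not kernel code: the function name preemptible_window_us is invented for illustration, times are given in microseconds to keep the numbers exact, and the model assumes a single task holding a single lock once per run.

```c
#include <assert.h>

/*
 * Hypothetical model: a task runs for run_us microseconds and holds a
 * lock for lock_hold_us of them. Someone who needs that same lock can
 * only make progress during the remainder, so the remainder is the
 * latency that preemption can save.
 */
unsigned int preemptible_window_us(unsigned int run_us,
                                   unsigned int lock_hold_us)
{
    /* A lock cannot be held for longer than the task runs in this model. */
    assert(lock_hold_us <= run_us);
    return run_us - lock_hold_us;
}
```

With the figures from the comment, preemptible_window_us(1000, 300) gives 700 (0.7 ms) for lock a and preemptible_window_us(1000, 800) gives 200 (0.2 ms) for lock c, which is why shorter lock hold times make preemption more worthwhile.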

maybe someone can help me

Anonymous
on March 22, 2004 - 4:53pm

I've been holding off on 2.6 because of a weird performance issue with a game (Enemy Territory). I've asked in irc, posted on a few sites (here included) and filed a bug. Basically with 2.6, my load times get longer and longer for new maps within the game. I could ramble on but I'll just list the bugzilla link.




I haven't tried 2.6.4 yet, but if anyone could help me with this, or suggest anything else to test that would be great.

Con, re: your scheduler

Anonymous
on March 22, 2004 - 8:40pm

With all due respect, wasn't it not too long ago that you didn't feel comfortable programming in C at all (contest was written in bash)? Did you author this scheduler completely, or did you collaborate with others?

I'm curious, because I'm not sure I'd feel comfortable replacing Ingo's rock-solid (and by all standards, world-class) scheduler for your untested one. Just my $0.02.

Re: Con, re: your scheduler

Anonymous
on March 22, 2004 - 9:47pm

He's not suggesting removing Ingo's scheduler. Also, it's obvious to me, through his work managing his patch set, that he's a more than competent C programmer.

Policy change

Con Kolivas
on March 22, 2004 - 11:13pm

The architecture of Ingo's scheduler remains in mine. I'm not saying I can code as well as Ingo, or ever will be able to. Mine is a scheduling policy rewrite, and let's face it, it's only a first release, so it comes with all the inherent warnings, but it's been solid for me so far. Of course that doesn't mean it's the final form, and I do have some changes already penned in.
