Published on KernelTrap (http://kerneltrap.org)

Linux: High-Res Timers and Tickless Kernel

By Jeremy
Created Jun 23 2006 - 11:11

Thomas Gleixner and Ingo Molnar [interview [1]] posted an update of their high-res timers kernel patches [2] for the 2.6.17 kernel, "upon which we based a tickless kernel (dyntick) implementation and a 'dynamic HZ' feature as well". The patch currently works for x86, with ports to x86_64, PPC and ARM in the works. Thomas explains, "the high-res timers feature (CONFIG_HIGH_RES_TIMERS) enables POSIX timers and nanosleep() to be as accurate as the hardware allows (around 1usec on typical hardware). This feature is transparent - if enabled it just makes these timers much more accurate than the current HZ resolution." He goes on to describe the tickless kernel:

"The tickless kernel feature (CONFIG_NO_HZ) enables 'on-demand' timer interrupts: if there is no timer to be expired for say 1.5 seconds when the system goes idle, then the system will stay totally idle for 1.5 seconds. This should bring cooler CPUs and power savings: on our (x86) testboxes we have measured the effective IRQ rate to go from HZ to 1-2 timer interrupts per second.

"This feature is implemented by driving 'low res timer wheel' processing via special per-CPU high-res timers, which timers are reprogrammed to the next-low-res-timer-expires interval. This tickless-kernel design is SMP-safe in a natural way and has been developed on SMP systems from the beginning."


From: Thomas Gleixner [email blocked]
To: LKML [email blocked]
Subject: [PATCHSET] Announce: High-res timers, tickless/dyntick and dynamic HZ
Date:	Sun, 18 Jun 2006 17:10:26 +0200

We are pleased to announce the 2.6.17 based release of our high-res 
timers kernel feature, upon which we based a tickless kernel (dyntick) 
implementation and a 'dynamic HZ' feature as well:

http://www.tglx.de/projects/hrtimers/2.6.17/ [3]

The easiest way to try these features is to apply the combo patch to 
vanilla 2.6.17. The patching order is:

http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.17.tar.bz2 [4]
http://www.tglx.de/projects/hrtimers/2.6.17/patch-2.6.17-hrt-dyntick1.patch [5]


A broken out patch series is available too:

http://www.tglx.de/projects/hrtimers/2.6.17/patch-2.6.17-hrt-dyntick1.patches.tar.bz2 [6]


The high-res timers feature (CONFIG_HIGH_RES_TIMERS) enables POSIX 
timers and nanosleep() to be as accurate as the hardware allows (around 
1usec on typical hardware). This feature is transparent - if enabled it 
just makes these timers much more accurate than the current HZ 
resolution. It is based on the Generic Time Of Day patchset from John 
Stultz and it in essence finishes what we started with the 
kernel/hrtimers.c code in 2.6.16.
 
The tickless kernel feature (CONFIG_NO_HZ) enables 'on-demand' timer 
interrupts: if there is no timer to be expired for say 1.5 seconds when 
the system goes idle, then the system will stay totally idle for 1.5 
seconds. This should bring cooler CPUs and power savings: on our (x86) 
testboxes we have measured the effective IRQ rate to go from HZ to 1-2 
timer interrupts per second.

This feature is implemented by driving 'low res timer wheel' processing 
via special per-CPU high-res timers, which timers are reprogrammed to 
the next-low-res-timer-expires interval. This tickless-kernel design is 
SMP-safe in a natural way and has been developed on SMP systems from
the beginning.

Note: while our code should be similar in behavior to the existing 
dynticks kernel patch from Con, it is a fundamentally different design 
(being based on the high-res timers support and APIs) and is thus a 
different implementation. We reused one area of dynticks: we integrated 
and improved the 'timer top' profiling tool (CONFIG_TIMER_INFO).

When running the kernel, there is also a 'timeout granularity'
runtime tunable parameter, under:

   /proc/sys/kernel/timeout_granularity

it defaults to 1, meaning that CONFIG_HZ is the granularity of timers. 

For example, if CONFIG_HZ is 1000 and timeout_granularity is set to 10, 
then low-res timers will be expired every 10 jiffies (every 10 msecs), 
so the effective granularity of low-res timers is 100 HZ. In essence,
this feature implements nonintrusive dynamic HZ without touching the
HZ macro itself.

Supported platforms: high-res timers and tickless work on x86 (x86_64,
PPC and ARM ports are in the works). Other platforms should still work
fine with the usual HZ frequency timer tick.

Naturally, we'd like these features to be integrated into the upstream 
kernel as well.

Bugreports and suggestions are welcome,
 
	Thomas, Ingo


From: Roman Zippel <zippel@linux-m68k.org>
Subject: Re: [PATCHSET] Announce: High-res timers, tickless/dyntick and dynamic HZ
Date: Mon, 19 Jun 2006 01:47:22 +0200 (CEST)

Hi,

On Sun, 18 Jun 2006, Thomas Gleixner wrote:

> Bugreports and suggestions are welcome,

Could you please document the patches? I know it sucks compared to
hacking, but it would make a review a lot simpler.

bye, Roman

From: Ingo Molnar [7] [email blocked]
Subject: Re: [PATCHSET] Announce: High-res timers, tickless/dyntick and dynamic HZ
Date: Mon, 19 Jun 2006 14:50:18 +0200

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > Bugreports and suggestions are welcome,
>
> Could you please document the patches? I know it sucks compared to
> hacking, but it would make a review a lot simpler.

yeah, we'll add some description to the patches themselves, but
otherwise i'm afraid it will be like with almost all patch submissions
on lkml: 99% of the details are in the code and people have to ask
specifically if one area or another is unclear :-|

Meanwhile the patch names should provide you with some initial info
(also, we reuse GTOD which is documented in -mm) and the splitup is
pretty clean too - but in any case please feel free to ask pointed
questions! (we happily accept documentation patches as well.)

	Ingo

From: Roman Zippel <zippel@linux-m68k.org>
Subject: Re: [PATCHSET] Announce: High-res timers, tickless/dyntick and dynamic HZ
Date: Mon, 19 Jun 2006 15:47:45 +0200 (CEST)

Hi,

On Mon, 19 Jun 2006, Ingo Molnar wrote:

> > > Bugreports and suggestions are welcome,
> >
> > Could you please document the patches? I know it sucks compared to
> > hacking, but it would make a review a lot simpler.
>
> yeah, we'll add some description to the patches themselves, but

The problem is this is not the first time I mentioned this and some
patches still have no descriptions at all! :-(

> otherwise i'm afraid it will be like with almost all patch submissions
> on lkml: 99% of the details are in the code and people have to ask
> specifically if one area or another is unclear :-|

For a lot of things this is acceptable, but if patches (e.g. clockevents)
add new generic infrastructure which affects all archs, they need
documentation (unless you also provide all the arch specific changes).

> Meanwhile the patch names should provide you with some initial info
> (also, we reuse GTOD which is documented in -mm) and the splitup is
> pretty clean too - but in any case please feel free to ask pointed
> questions! (we happily accept documentation patches as well.)

I can't do this without documentation. Without any information I'm only
wondering why it has to be this complex.
For example clockevents, I think all the special event handlers are
overkill, a simple list would do just fine. This way it may also be
possible to treat a clock as virtual interrupt source and we could share
code with interrupt code and a callback can simply be requested via
request_irq().
More information about what this code actually intends to do and what it
is required to do, would help a great deal to judge alternative
solutions, but only the author of this code can really provide this
information and IMO it's really sad that this information is still
lacking after being requested multiple times.

bye, Roman

From: Con Kolivas [8] [email blocked]
Subject: Re: [PATCHSET] Announce: High-res timers, tickless/dyntick and dynamic HZ
Date: Mon, 19 Jun 2006 15:21:05 +1000

On Monday 19 June 2006 01:10, Thomas Gleixner wrote:
> We are pleased to announce the 2.6.17 based release of our high-res
> timers kernel feature, upon which we based a tickless kernel (dyntick)
> implementation and a 'dynamic HZ' feature as well:
> [...]
> Thomas, Ingo

Nice work Thomas and Ingo.

The approach to previous dynticks that I was working on had some nasty
issues with scalability that were not addressable without a complete
rewrite which is why I abandoned the previous implementation. Your
approach for using the hires timer events is ultimately a better solution
and the code base is cleaner so I'm very pleased to see it.

A couple of comments.

One of the problems we encountered with dynticks was that using the
higher resolution timers such as TSC and HPET to adjust for timer ticks
over longer periods when skipping ticks made the overall clock drift when
run for many days and only the PM Timer was not prone to this happening.
ie the timers were very accurate for short periods but over days it
would drift. It could well have been a design flaw in the dynticks I was
maintaining rather than the timers themselves but have you checked that
this isn't a problem?

The other thing I note is that there is a reasonable amount of
indirection in fairly hot paths. It looks like there is scope for more
local variable storage of these indirect calls. Also if set_next_event is
separated from struct clock_event, the whole struct looks like a suitable
candidate for __read_mostly.

-- 
-ck

From: Ingo Molnar [13] [email blocked]
Subject: Re: [PATCHSET] Announce: High-res timers, tickless/dyntick and dynamic HZ
Date: Mon, 19 Jun 2006 14:26:07 +0200

* Con Kolivas [email blocked] wrote:

> Nice work Thomas and Ingo.
>
> The approach to previous dynticks that I was working on had some nasty
> issues with scalability that were not addressable without a complete
> rewrite which is why I abandoned the previous implementation. Your
> approach for using the hires timer events is ultimately a better
> solution and the code base is cleaner so I'm very pleased to see it.

thanks!

> A couple of comments.
>
> One of the problems we encountered with dynticks was that using the
> higher resolution timers such as TSC and HPET to adjust for timer
> ticks over longer periods when skipping ticks made the overall clock
> drift when run for many days and only the PM Timer was not prone to
> this happening. ie the timers were very accurate for short periods but
> over days it would drift. It could well have been a design flaw in the
> dynticks I was maintaining rather than the timers themselves but have
> you checked that this isn't a problem?

not yet. If it's a real problem we could introduce a 'make clock events
more reliable' framework by doing something like always programming
clock event sources into periodic mode and reading their current time
offset [if possible] when the event is processed (thus compensating for
most of the drift caused by irq processing latency). But if it's not
needed it would be nice to avoid that complexity. I'm also wondering why
the PM timer was the most accurate in that regard - it's almost as slow
to program as the PIT, so i'd have expected it to show the biggest
drift.

(another technique to reduce drift: we could increase the APIC-priority
of the lapic timer, making it less susceptible to drift when there are
lots of other IRQs going on.)

can you think of any other similar 'weird cases' that you saw happen
with dynticks? For example there's the 'APIC stops timer irqs when
entering C3 mode' bug - any similar weirdness we should be careful
about? [right now the patch doesn't handle the C3 mode bug, but it
should be relatively straightforward to blacklist lapic events in that
case]

i'm looking at dynticks-060227.patch right now, and there seem to be a
fair amount of dyntick specific changes to ACPI's processor_idle.c code.
Do you remember what those changes were about and should we pick them up
in one way or another?

> The other thing I note is that there is a reasonable amount of
> indirection in fairly hot paths. It looks like there is scope for more
> local variable storage of these indirect calls. [...]

which function(s) were you looking at when coming to this conclusion?
clockevents_init_next_event() perhaps? [we could certainly put
'sources->nextevent' into a local variable there]

> [...] Also if set_next_event is separated from struct clock_event, the
> whole struct looks like a suitable candidate for __read_mostly.

You mean ->event_handler()? We can make all clockevent instantiations
__read_mostly right now - all of the fields of clock_event are static,
even ->event_handler() will change at most once per bootup [when we
switch from low-res into high-res mode].

	Ingo

From: Con Kolivas [14] [email blocked]
Subject: Re: [PATCHSET] Announce: High-res timers, tickless/dyntick and dynamic HZ
Date: Tue, 20 Jun 2006 00:03:25 +1000

On Monday 19 June 2006 22:26, Ingo Molnar wrote:
> * Con Kolivas [email blocked] wrote:
> > One of the problems we encountered with dynticks was that using the
> > higher resolution timers such as TSC and HPET to adjust for timer
> > ticks over longer periods when skipping ticks made the overall clock
> > drift when run for many days and only the PM Timer was not prone to
> > this happening. ie the timers were very accurate for short periods but
> > over days it would drift. It could well have been a design flaw in the
> > dynticks I was maintaining rather than the timers themselves but have
> > you checked that this isn't a problem?
>
> not yet. If it's a real problem we could introduce a 'make clock events
> more reliable' framework by doing something like always programming
> clock event sources into periodic mode and reading their current time
> offset [if possible] when the event is processed (thus compensating
> for most of the drift caused by irq processing latency). But if it's not
> needed it would be nice to avoid that complexity. I'm also wondering why
> the PM timer was the most accurate in that regard - it's almost as slow
> to program as the PIT, so i'd have expected it to show the biggest
> drift.
>
> (another technique to reduce drift: we could increase the APIC-priority
> of the lapic timer, making it less susceptible to drift when there are
> lots of other IRQs going on.)

Better to wait and see if it was an artefact of my dodgy code for
recovering walltime and if this code doesn't have that issue.

> can you think of any other similar 'weird cases' that you saw happen
> with dynticks? For example there's the 'APIC stops timer irqs when
> entering C3 mode' bug - any similar weirdness we should be careful
> about? [right now the patch doesn't handle the C3 mode bug, but it should
> be relatively straightforward to blacklist lapic events in that case]

The hardware that also did C4 was more troublesome but for the same
reasons since it's a subset of C3. See Dominik's patches mentioned below
which address these high state transitions. There isn't anything else
offhand I can think of that I actually managed to track down :|

> i'm looking at dynticks-060227.patch right now, and there seem to be a
> fair amount of dyntick specific changes to ACPI's processor_idle.c code.
> Do you remember what those changes were about and should we pick them up
> in one way or another?

Dominik donated a lot of code to use the dynticks infrastructure to
actually implement the power savings. Just skipping ticks seemed to make
very little power difference unless we also used the knowledge from the
next timer interrupt to know how long we are going to be idle and chose
C state transitions accordingly. Each patch is documented at length in
the split out

C-States-1_bm_activity_improvements.patch
C-States-2_bm_activity_handling_improvement.patch
C-States-3_accounting_of_sleep_times.patch
C-States-4_dyn-ticks_tweaks.patch

http://ck.kolivas.org/patches/dyn-ticks/split-out/ [15]

> > The other thing I note is that there is a reasonable amount of
> > indirection in fairly hot paths. It looks like there is scope for more
> > local variable storage of these indirect calls. [...]
>
> which function(s) were you looking at when coming to this conclusion?
> clockevents_init_next_event() perhaps? [we could certainly put
> 'sources->nextevent' into a local variable there]

From what I could see hrtimer_restart_sched_tick() could use

struct hrtimer *sched_timer = &cpu_base->sched_timer;

clockevents_init_next_event() and clockevents_set_next_event() could use

struct clock_event *nextevt = sources->nextevt;

> > [...] Also if set_next_event is separated from struct clock_event, the
> > whole struct looks like a suitable candidate for __read_mostly.
>
> You mean ->event_handler()? We can make all clockevent instantiations
> __read_mostly right now - all of the fields of clock_event are static,
> even ->event_handler() will change at most once per bootup [when we
> switch from low-res into high-res mode].

Great, thanks!

-- 
-ck





Source URL:
http://kerneltrap.org/node/6750