Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem

From: David Witbrodt
Date: Saturday, August 9, 2008 - 5:39 am

I'm _way_ over my head in this discussion, but here's some more food
for thought.  Last weekend, when I first tried 2.6.26 and discovered the
freeze, I thought an error of my own in .config was causing it.  Before
I ever sought help, I made about a dozen experiments with different
.config files.

One series of those experiments involved turning off most of the kernel...
including CONFIG_INET.  The kernel still froze, but when entering 
pci_init().  (This info can be read in my original post to the Debian BTS,
which I have provided links for a couple of times in this LKML thread.  I
even went further and removed enough that the freeze was avoided, but so
much of the kernel was missing that my init scripts couldn't mount a hard
disk any more.  Trying to restore enough to allow HD mounting just brought
back the freeze.)

I am completely ignorant about how the kernel works, so any guesses I have
are probably worthless... but I'll throw some out anyway:

1.  Maybe HPET is used (if present) for timing by RCU, so disabling it
forces RCU to work differently.  (Pure guess here:  I know nothing about
RCU, and haven't even tried looking at its code.)

2.  Maybe my hardware is broken.  We did see one initcall return that
reported over 280,000 msecs... when the entire boot->freeze time was about
3 secs.  On the other hand, 2.6.25 (and before) work just fine with HPET
enabled.

3. I was able to find the commit that introduced the freeze
(3def3d6ddf43dbe20c00c3cbc38dfacc8586998f), so there has to be a connection
between that commit and the RCU problem.  Is it possible that a preexisting
error or oversight in the code was merely exposed by that commit?  (And 
only on certain hardware?)  Or does that code itself contain the error?

4. Another bug has been posted on the Debian BTS, which is worked around
by disabling HPET.  The user provided some links to bugzilla.kernel.org
where David Brownell is fighting with some HPET/RTC issues (but no mention
of ...
From: Paul E. McKenney
Date: Saturday, August 9, 2008 - 6:56 am

RCU doesn't use HPET directly.  Most of its time-dependent behavior

For CONFIG_CLASSIC_RCU and !CONFIG_PREEMPT, in-kernel infinite spin loops
will cause synchronize_rcu() to hang.  For other RCU configurations,
spinning with interrupts disabled will result in similar hangs.  Invoking
synchronize_rcu() very early in boot (before rcu_init() has been called)
will of course also hang.

Could you please let me know whether your config has CONFIG_CLASSIC_RCU
or CONFIG_PREEMPT_RCU?

Thank you for finding the commit -- should be quite helpful!!!

A quick look reveals what appears to be reader-writer locking rather
than RCU.  It does run in early boot before rcu_init(), so if it managed
to call synchronize_rcu() somehow you indeed would see a hang.  I do
not see such a call, but then again, I don't know this code much at all.

This is the second time in as many days that motivated RCU's working


If you can answer my CONFIG_CLASSIC_RCU vs. CONFIG_PREEMPT_RCU question
above, I should be able to provide you a diagnostic patch that would say
which CPU RCU was waiting on.  At least assuming that at least one CPU
was still taking the scheduling-clock interrupt, that is.  ;-)

							Thanx, Paul
--

From: Ingo Molnar
Date: Monday, August 11, 2008 - 4:25 am

such freezes frequently occur due to the plain lack of timer interrupts.

As networking's synchronize_rcu() is one of the first calls in the 
kernel that relies on a timer IRQ hitting the CPU, it would be the first 
one that "freezes". It's not a real freeze though: it's the lack of 
timer events breaking RCU completion. (RCU has an implicit and somewhat 
subtle dependency on timer irqs periodically hitting the CPU)

You can probably verify this by adding something like this to 
kernel/timer.c's do_timer() function:

   /* jiffies is an unsigned long, hence %lu */
   if (printk_ratelimit())
	printk("timer irq hit, jiffies: %lu\n", jiffies);
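
For illustration only: the dependency described above (a grace period that
cannot complete unless the timer interrupt keeps hitting every CPU) can be
sketched as a toy user-space model.  This is not kernel code and not the
actual RCU implementation; every name in it (toy_rcu, toy_timer_tick(),
toy_synchronize() and so on) is hypothetical, and it assumes that the tick
handler is the only place where a CPU reports a quiescent state.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_CPUS 8

/* Toy model only: NOT the kernel's RCU implementation. */
struct toy_rcu {
	int  ncpus;
	bool need_qs[TOY_MAX_CPUS];	/* CPU still owes a quiescent state */
};

/* Start a grace period: every CPU must report a quiescent state. */
static void toy_start_grace_period(struct toy_rcu *r, int ncpus)
{
	assert(ncpus > 0 && ncpus <= TOY_MAX_CPUS);
	r->ncpus = ncpus;
	for (int cpu = 0; cpu < ncpus; cpu++)
		r->need_qs[cpu] = true;
}

/*
 * Simulated timer interrupt on one CPU.  In this model (an assumption),
 * the tick is the only place where a quiescent state gets recorded.
 */
static void toy_timer_tick(struct toy_rcu *r, int cpu)
{
	assert(cpu >= 0 && cpu < r->ncpus);
	r->need_qs[cpu] = false;
}

/* The grace period is over once no CPU owes a quiescent state. */
static bool toy_grace_period_done(const struct toy_rcu *r)
{
	for (int cpu = 0; cpu < r->ncpus; cpu++)
		if (r->need_qs[cpu])
			return false;
	return true;
}

/*
 * Simulated waiter.  Each round, every CPU with tick_enabled[cpu] set
 * takes one timer tick.  Returns the number of rounds the grace period
 * needed, or -1 if it never completed within max_rounds (the "freeze").
 */
static int toy_synchronize(int ncpus, const bool *tick_enabled, int max_rounds)
{
	struct toy_rcu r;

	toy_start_grace_period(&r, ncpus);
	for (int round = 1; round <= max_rounds; round++) {
		for (int cpu = 0; cpu < ncpus; cpu++)
			if (tick_enabled[cpu])
				toy_timer_tick(&r, cpu);
		if (toy_grace_period_done(&r))
			return round;
	}
	return -1;
}
```

With two CPUs that both take ticks, toy_synchronize() returns after one
round.  If either CPU never takes a tick, it returns -1 no matter how large
max_rounds is, which is the model's analogue of the waiter never waking up.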

Yinghai, do you have any ideas about this particular problem? One theory 
would be that your e820 changes might have caused a shuffling of 
resources that made the hpet's timer IRQ generation inoperable.

David, it would be nice to check whether tip/master still locks up for 
you:

    http://people.redhat.com/mingo/tip.git/README

just to make sure no pending fix resolves your issue. (the bug is 
probably still present, but might be worth checking nevertheless.)

	Ingo
--

From: Yinghai Lu
Date: Monday, August 11, 2008 - 9:15 am

Is the hpet request_resource() call failing?

YH
--
