Hello,
We have been seeing a condition where wall time gets stuck in a 4-second
loop:
lab-m500-5:~ # date
Thu Mar 13 17:54:48 EDT 2008
lab-m500-5:~ # date
Thu Mar 13 17:54:49 EDT 2008
lab-m500-5:~ # date
Thu Mar 13 17:54:49 EDT 2008
lab-m500-5:~ # date
Thu Mar 13 17:54:45 EDT 2008
lab-m500-5:~ # date
Thu Mar 13 17:54:46 EDT 2008
lab-m500-5:~ # date
Thu Mar 13 17:54:46 EDT 2008
lab-m500-5:~ # date
Thu Mar 13 17:54:47 EDT 2008
Of course when time stops progressing forward at a steady pace all
sorts of Bad Things happen.
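A quick way to confirm the backward step from a capture like the one
above is to parse the `date` lines and flag any sample that is earlier
than the previous one. This is only a sketch: the helper names
(parse_date_output, backward_steps) are made up for illustration, and
it assumes the default `date` output format shown in the transcript.

```python
from datetime import datetime

def parse_date_output(line):
    # `date` prints e.g. "Thu Mar 13 17:54:48 EDT 2008".
    # The weekday and zone name are dropped: strptime's %Z only accepts
    # a few zone names, and all samples here share one zone anyway.
    _weekday, month, day, clock, _zone, year = line.split()
    return datetime.strptime(f"{month} {day} {clock} {year}",
                             "%b %d %H:%M:%S %Y")

def backward_steps(lines):
    # Return (sample index, delta in seconds) for every sample that is
    # earlier than the one before it.
    stamps = [parse_date_output(line) for line in lines]
    steps = []
    for i in range(1, len(stamps)):
        delta = (stamps[i] - stamps[i - 1]).total_seconds()
        if delta < 0:
            steps.append((i, delta))
    return steps

# The seven samples from the transcript above.
samples = [
    "Thu Mar 13 17:54:48 EDT 2008",
    "Thu Mar 13 17:54:49 EDT 2008",
    "Thu Mar 13 17:54:49 EDT 2008",
    "Thu Mar 13 17:54:45 EDT 2008",
    "Thu Mar 13 17:54:46 EDT 2008",
    "Thu Mar 13 17:54:46 EDT 2008",
    "Thu Mar 13 17:54:47 EDT 2008",
]

print(backward_steps(samples))
```

On these samples it reports a single backward step of 4 seconds, at the
fourth reading.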
The clock loop is caused by jiffies not incrementing. It seems that wall
time updates itself and then resyncs with jiffies every ~4s.
We verified that all spinlocks (xtime_lock in particular) were not
being held, and that the kernel had no reason not to update jiffies.
Other interrupts continue to fire, so we're not stuck in the timer
handler.
So, we went on to look at the PIC. We threw together a little kernel
module to get some debug info.
What we found is that when the time loop occurs, the PIC registers look
like this:
... PIC IMR: ff00
... PIC IRR: 0001
... PIC ISR: 0001
... PIC ELCR: 0e00
When running correctly these registers look like:
... PIC IMR: fffa
... PIC IRR: 0000
... PIC ISR: 0000
... PIC ELCR: 0ea0
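To make the two dumps easier to compare, here is a small sketch that
lists which bits are set in each register and which bits differ between
the bad and good states. It assumes that bit n of each 16-bit value
corresponds to IRQ line n (low byte for the master 8259, high byte for
the slave); that mapping is an assumption about how the debug module
printed the registers, not something stated above. The helper name
set_bits is made up for illustration.

```python
def set_bits(value, width=16):
    # List the positions of the bits that are set, lowest first.
    return [bit for bit in range(width) if value & (1 << bit)]

# Values copied from the two register dumps above.
bad = {"IRR": 0x0001, "ISR": 0x0001, "ELCR": 0x0e00}
good = {"IRR": 0x0000, "ISR": 0x0000, "ELCR": 0x0ea0}

for name in bad:
    changed = set_bits(bad[name] ^ good[name])  # XOR keeps differing bits
    print(f"{name}: bad={set_bits(bad[name])} "
          f"good={set_bits(good[name])} changed={changed}")
```

Under that assumption, bit 0 is set in both IRR and ISR in the bad
state, and bits 5 and 7 of ELCR are set in the good state but clear in
the bad one. On a conventional PC, line 0 is the timer interrupt, and a
set ISR bit means the 8259 is still waiting for an EOI on that line.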
The Interrupt Mask Register masks off the corresponding interrupts,
preventing the CPU from getting these. In the bad state, all interrupts
have been enabled. This is wrong, but does not seem to contribute to the
problem directly. I do not believe that INT2, INT4-INT7 are connected to
anything in our system. But the register is getting stomped, and we
have no idea how.
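The change in the mask can be spelled out line by line. The sketch
below assumes the usual 8259 convention that a set IMR bit masks the
line and a clear bit lets it through, and again that bit n maps to IRQ
n, which is an assumption about the debug output. The name
unmasked_irqs is hypothetical.

```python
def unmasked_irqs(imr, lines=16):
    # A clear bit in the IMR means the line is NOT masked.
    return {irq for irq in range(lines) if not imr & (1 << irq)}

GOOD_IMR = 0xfffa  # from the dump taken while running correctly
BAD_IMR = 0xff00   # from the dump taken during the time loop

good_open = unmasked_irqs(GOOD_IMR)
bad_open = unmasked_irqs(BAD_IMR)

print("unmasked when good:", sorted(good_open))
print("unmasked when bad: ", sorted(bad_open))
print("newly unmasked:    ", sorted(bad_open - good_open))
```

With these values, only lines 0 and 2 are open in the good state, while
lines 0 through 7 are all open in the bad state; lines 8 through 15
stay masked in both.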
The Interrupt Request Register shows ...

On Fri, 14 Mar 2008 11:53:29 -0400

Hi,

this is roughly a 3 year old kernel; and the thing is, this code has
been revamped entirely several times since then.

Do you see this behavior with, say the 2.6.24 kernel? That's a twofold
question, since it has with-tickless and without-tickless as option.
(tickless doesn't tend to use the PIT at all so would just avoid the
entire issue; without-tickless still uses the PIT)

It's really better to use your support contract with Novell to get them
to fix it for SLE(S/D)10 if you are not in a position to test or
diagnose with new kernels; they're there to support you for their old
kernel. In general, the folks on this list are working on current
kernels and don't tend to spend time working on such old kernels. The
probability of this bug being fixed since then, or the fix being totally
not applicable to current kernels, is quite high (given the several
rewrites of the timer code) so it'd be a waste of the time of the people
on this list.

Now, if you can see the behavior on 2.6.24 (or better, the latest
2.6.25-rc), then suddenly you'll find a lot more people who're going to
be interested....

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
Hi Arjan,

Thanks for your response. Sorry for the long post, I think the issue
required a lot of detail. I'm sure that it was easy to overlook that
this is still an issue in a 6

It looks like from http://www.kernel.org/pub/linux/kernel/v2.6/ that .16
is the longest-running subversion and is still being maintained. Is
there a better list for 2.6.16 issues?

I am testing 2.6.24.3 now. I am not going tickless. I am not using PIT
with any of the kernels I am testing.

Thanks,
Joel.
--
On Fri, 14 Mar 2008 13:21:22 -0400

I did see that, doesn't mean it's not 3 years old...

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
Does nosmp help?

Does noapic/nolapic help?

You may want to try testing with HZ=4000... to show the problem faster.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
We have not tried. I would expect that it would, but I think that's
changing the system too much, and we cannot do that in production.

As we have seen systems run for a couple of months without demonstrating
the issue, I think that proving the negative is very difficult. I am
focusing

These do not affect the problem.

Thanks!
Joel.
--
