Re: 2.6.21.5 june 30th to july 1st date hang?

Previous thread: [patch 5/5] s390 tlb flush fix. by Martin Schwidefsky on Tuesday, July 3, 2007 - 4:18 am. (3 messages)

Next thread: Re: shouldn't LVM support be selected by default? by Bodo Eggert on Tuesday, July 3, 2007 - 5:54 am. (3 messages)
From: Fortier,Vincent [Montreal]
Date: Tuesday, July 3, 2007 - 5:44 am

Hi all,

All my servers and workstations running a 2.6.21.5 kernel hanged exactly
when the date shift from june 30th to july 1st.

On my monitoring system every single station running a 2.6.21.5 kernel
stoped responding exactly after midnight on the date shift from June
30th to July 1st.  Although, stations still running 2.6.18 to 2.6.20.11
worked flawlessly.

I first tought there had been an electricity outage but two of my
servers (dell PE 2950 dual-quad core) on UPS in our server room also
hanged:
Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
/usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [
"$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; })
Jul  3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.

I could not get anything on any of the 20+ consoles...  All the systems
hanged at around the exact same time... When the date shifted from June
30th to July 1st in UTC ...?

Any clue any one?

- vin
-

From: Florian Attenberger
Date: Tuesday, July 3, 2007 - 6:59 am

there was one 'special' event at that date:
syslog.2.gz:Jul  1 01:59:59 master kernel: Clock: inserting leap second
23:59:60 UTC

-

From: Arne Georg Gleditsch
Date: Tuesday, July 3, 2007 - 7:20 am

As far as I can tell, no leap second was due to be inserted at 1. of
July this year.  Is the year set correctly for this box?

-- 
								Arne.
-

From: Florian Attenberger
Date: Tuesday, July 3, 2007 - 8:02 am

yep, controlled by ntpd.
You're right according to
ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
that event shouldn't have been there.

-

From: Arne Georg Gleditsch
Date: Tuesday, July 3, 2007 - 8:26 am

I'm not all that versed in ntp-ish, but it appears that the leap
second insertion should be propagated through the ntp protocol.
Whether the leap second in question came from a ntp server giving out
wrong data or from a misinterpretation or bug in ntpd is of course
hard to say, but either way turning the clock back is unlikely to
reconstruct the circumstances.  An interesting exercise might be to
code up a small program to call adjtimex with timex.status |= STA_INS,
to see if this can trigger the problem.  (The bogus leap second might
be a red herring entirely, of course...)

-- 
								Arne.
-

From: Fortier,Vincent [Montreal]
Date: Tuesday, July 3, 2007 - 8:36 am

You are probably right, I did tried to reproduce the problem without
success...

Although it is wierd that it happend only on 2.6.21 kernels... It did
not happend on any of my workstations/servers running either 2.6.18 or
2.6.20.  

Could dynticks be involved?

- vin
-

From: Dave Jones
Date: Tuesday, July 3, 2007 - 10:19 am

On Tue, Jul 03, 2007 at 11:36:42AM -0400, Fortier,Vincent [Montreal] wrote:
 > > -----Message d'origine-----
 > > De : linux-kernel-owner@vger.kernel.org 
 > > [mailto:linux-kernel-owner@vger.kernel.org] De la part de 
 > > Arne Georg Gleditsch
 > > 
 > > Florian Attenberger <valdyn@gmail.com> writes:
 > > > yep, controlled by ntpd.
 > > > You're right according to
 > > > ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
 > > > that event shouldn't have been there.
 > > 
 > > I'm not all that versed in ntp-ish, but it appears that the 
 > > leap second insertion should be propagated through the ntp protocol.
 > > Whether the leap second in question came from a ntp server 
 > > giving out wrong data or from a misinterpretation or bug in 
 > > ntpd is of course hard to say, but either way turning the 
 > > clock back is unlikely to reconstruct the circumstances.  An 
 > > interesting exercise might be to code up a small program to 
 > > call adjtimex with timex.status |= STA_INS, to see if this 
 > > can trigger the problem.  (The bogus leap second might be a 
 > > red herring entirely, of course...)
 > 
 > You are probably right, I did tried to reproduce the problem without
 > success...
 > 
 > Although it is wierd that it happend only on 2.6.21 kernels... It did
 > not happend on any of my workstations/servers running either 2.6.18 or
 > 2.6.20.  
 > 
 > Could dynticks be involved?

I saw it on a box that happened to have lockdep enabled.
(I run it everywhere thankfully).  This is what it looked like..
http://www.codemonkey.org.uk/junk/img_0421.jpg

	Dave

-- 
http://www.codemonkey.org.uk
-

From: Chris Friesen
Date: Tuesday, July 3, 2007 - 12:28 pm

Setting the date to just before midnight June 30 UTC and then running the 
following as root triggered the crash on a modified 2.6.10.  Anyone see anything 
wrong with the code below, or is this a valid indication of a bug in the leap 
second code?

Chris


#include <sys/timex.h>
#include <stdio.h>
#include <errno.h>

struct timex buf;
int main(void)
{
	int rc = adjtimex(&buf);
	printf("initial status: 0x%x\n", buf.status);
	buf.status |= STA_INS;
	buf.modes = ADJ_STATUS;
	rc = adjtimex(&buf);
	if (rc == -1) {
		printf("unable to set status: %m\n");
		return -1;
	} else
		printf("rc: %d\n", rc);
	printf("final status: 0x%x\n", buf.status);
	return 0;
}
-

From: Chris Friesen
Date: Tuesday, July 3, 2007 - 2:02 pm

As a further data point, the test app triggers problems on x86 uniprocessor and 
SMP as well as arm uniprocessor.  On ppc64 we see the leap second being added, 
but it doesn't hang, while on ppc we don't even see the leap second being 
added--leading me to wonder if the leap second code even works for ppc32.

The above is all for modified 2.6.10.

Chris
-

From: Fortier,Vincent [Montreal]
Date: Tuesday, July 3, 2007 - 6:06 pm

Thanx a lot!  This was fast! (beat that closed source!)

- vin

-

From: Uli Luckas
Date: Wednesday, July 4, 2007 - 1:56 am

Hi Chris,
does that qualify for inclusion into 2.6.21.6?

regards,
Uli

-- 

------- ROAD ...the handyPC Company - - -  ) ) )

Uli Luckas
Software Development

ROAD GmbH
Bennigsenstr. 14 | 12159 Berlin | Germany
fon: +49 (30) 230069 - 64 | fax: +49 (30) 230069 - 69
url: www.road.de

Amtsgericht Charlottenburg: HRB 96688 B
Managing directors: Hans-Peter Constien, Hubertus von Streit
-

From: Chris Wright
Date: Wednesday, July 4, 2007 - 9:53 am

Yes, it has already been sent to -stable.

thanks,
-chris
-

From: Clemens Koller
Date: Thursday, July 5, 2007 - 7:13 am

Hello, Chris!


Okay, we all survived Y2K and this little glitch. Puh! ;-)
Can you please explain in which configuration this problem got triggered.

Does it make sense to have some testing environments which have the date
set to about one month in the future to catch any crashes like that,
preventing machines in production from failing?!

Best regards,
-- 
Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com
-

From: Chris Friesen
Date: Thursday, July 5, 2007 - 10:48 am

As far as I can tell many kernel versions contained the source code bug. 
  (I'd like some more information on exactly what the problem was if 
anyone cares to share..the proposed patch didn't give much in the way of 
specifics.)

However, in order to trigger the problem you also need to have NTP 
servers that were erroneously broadcasting the addition of a leap second.

So most people didn't see the issue because there wasn't supposed to be 
a leap second added this year...but they would have seen it the next 
time a leap second was added.

Chris
-

From: Clemens Koller
Date: Thursday, July 5, 2007 - 11:34 am

True. It seems like we will have another one next year.

Regards,
-- 
Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com
-

From: Thomas Gleixner
Date: Thursday, July 5, 2007 - 1:10 pm

It only happens with CONFIG_HIGHRES_TIMERS=y otherwise clock_was_set()
is a NOP. So only the 2.6.21 kernel and i386 and ARM are affected.

	tglx


-

From: Chris Friesen
Date: Thursday, July 5, 2007 - 2:02 pm

Are you certain?

Vanilla 2.6.10 shows a clock_was_set() function.  Does it just not call 
the dangerous code or something?

Also, our modified 2.6.10 has the high res timers patch applied, but the 
config option is turned off and we were still affected.

Chris
-

From: Thomas Gleixner
Date: Thursday, July 5, 2007 - 2:17 pm

Ouch, the old posix timer code might be affected as well, but I did not

You mean Anzingers high res patches. No idea about those.

	tglx


-

From: Ernie Petrides
Date: Thursday, July 5, 2007 - 3:28 pm

Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
vulnerable.

Cheers.  -ernie
-

From: Chris Friesen
Date: Thursday, July 5, 2007 - 3:49 pm

As I mentioned in my post to Thomas, we have high res timers disabled 
and were still affected.  Granted, our kernel has been modified so it is 
possible that vanilla would not be affected....I haven't tested it.

Chris
-

From: Ernie Petrides
Date: Thursday, July 5, 2007 - 4:12 pm

That's odd, because Thomas's patch removed two calls to clock_was_set(),
which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
the 2.6.21 source tree).

Also, I personally tested with the reproducer you posted here, initially
on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
what config options were enabled on that kernel).  I did reproduce the
problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.

Cheers.  -ernie
-

From: Chris Friesen
Date: Thursday, July 5, 2007 - 4:45 pm

I'm using a modified 2.6.10 tree...I expect the timer code is different.

Chris
-

From: Thomas Gleixner
Date: Thursday, July 5, 2007 - 10:16 pm

Way different and you have extra patches on top.

	tglx


-

From: Thomas Gleixner
Date: Thursday, July 5, 2007 - 10:17 pm

It needs a running smp_call_function() to be interrupted by the timer
interrupt, which calls clock_was_set(). So it's not that easy to
reproduce.

	tglx


-

From: Chris Friesen
Date: Friday, July 6, 2007 - 8:47 am

On our 2.6.10-based kernel its basically trivial to reproduce, and the 
posted fix doesn't solve the issue.

One of our guys is trying to track it down. As yet we don't know if it's 
the vanilla code or the patches on top that contain the bug.

Chris
-

From: Ernie Petrides
Date: Friday, July 6, 2007 - 1:03 pm

I think it's reproducible at will when CONFIG_BUG is enabled, because the
WARN_ON() on line 546 of arch/i386/kernel/smp.c fires in smp_call_function(),
causing lots of console output.  By the time on_each_cpu() later reenables
interrupts, another clock interrupt is pending, and (I think) causes a
self-deadlock on the xtime_lock in vmi_account_real_cycles().

That's my (unproven) theory, anyway.  :)

At any rate, I reproduced it twice in two tries on stock 2.6.21.

Cheers.  -ernie
-

From: Uli Luckas
Date: Tuesday, July 3, 2007 - 6:56 am

Same thing here on two machines with plain vanilla 2.6.21.(3/4), on debian 
testing & debian unstable.

regards,
Uli

-- 

------- ROAD ...the handyPC Company - - -  ) ) )

Uli Luckas
Software Development

ROAD GmbH
Bennigsenstr. 14 | 12159 Berlin | Germany
fon: +49 (30) 230069 - 64 | fax: +49 (30) 230069 - 69
url: www.road.de

Amtsgericht Charlottenburg: HRB 96688 B
Managing directors: Hans-Peter Constien, Hubertus von Streit
-

From: Clemens Koller
Date: Tuesday, July 3, 2007 - 6:05 am

Hi, Vincent!


No problems over here with plain 2.6.21.5 on x686

You could just reset the date back on one of these machines and
check the transition again... and see if it was really the kernel
who crashed.... and check your cron configuration.

Regards,
-- 
Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com
-

From: Fortier,Vincent [Montreal]
Date: Tuesday, July 3, 2007 - 7:51 am

All my server/workstations are in sync using ntp... And yes, the year is set properly on all of them.

Regards,

- vin
-

From: Chris Friesen
Date: Tuesday, July 3, 2007 - 8:59 am

Interesting, I just sent out an email for a similar issue, but with a pair of 
2.6.10 machines.

I'm wondering if its related to a spurious leap second event...

Chris
-

From: Fortier,Vincent [Montreal]
Date: Tuesday, July 3, 2007 - 9:00 am

Just wondering, what is your distribution?

- vin
-

From: Chris Friesen
Date: Tuesday, July 3, 2007 - 9:03 am

We're based off a WindRiver PNE-LE distribution.

Chris
-

From: Chris Friesen
Date: Tuesday, July 3, 2007 - 10:28 am

Some more information....

I'm trying to get a console on the affected system to query the leap second info 
from the ntp servers.

However, just for kicks I queried the local servers for my desktop following the 
instructions that I found on a thread about spurious leap second notifications. 
  Interestingly, two of the associations show non-zero leap values...

Chris



[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -cas zcars0vr

ind assID status  conf reach auth condition  last_event cnt
===========================================================
   1 62124  b614   yes   yes  none  sys.peer   reachable  1
   2 62125  b4f4   yes   yes  none  candidat   reachable 15
   3 62126  b314   yes   yes  none   outlyer   reachable  1
   4 62127  b314   yes   yes  none   outlyer   reachable  1
   5 62128  8000   yes   yes  none    reject
   6 62129  b434   yes   yes  none  candidat   reachable  3
   7 62130  b424   yes   yes  none  candidat   reachable  2
   8 62131  a0f3   yes   yes  none    reject  lost reach 15


[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62124 leap" zcars0vr
assID=62124 status=b614 reach, conf, sel_sys.peer, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62125 leap" zcars0vr
assID=62125 status=b0f4 reach, conf, 15 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62126 leap" zcars0vr
assID=62126 status=b314 reach, conf, sel_outlyer, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62127 leap" zcars0vr
assID=62127 status=b414 reach, conf, sel_candidat, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62128 leap" zcars0vr
assID=62128 status=8000 unreach, conf, no events,
leap=11
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62129 leap" zcars0vr
assID=62129 status=b434 reach, conf, sel_candidat, 3 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62130 leap" zcars0vr
assID=62130 status=b424 reach, conf, sel_candidat, 2 events, ...
From: Fortier,Vincent [Montreal]
Date: Tuesday, July 3, 2007 - 5:55 am

Forgot to mention:

- All stations that failed where running a 2.6.21 kernel + CFS v18 (I don't have any stations running a plain 2.6.21 kernel so can't tell)
- Config file can be found at: http://linux-dev.qc.ec.gc.ca/kernel/debian/CONFIG-i686-2.6.21-005
-

Previous thread: [patch 5/5] s390 tlb flush fix. by Martin Schwidefsky on Tuesday, July 3, 2007 - 4:18 am. (3 messages)

Next thread: Re: shouldn't LVM support be selected by default? by Bodo Eggert on Tuesday, July 3, 2007 - 5:54 am. (3 messages)