Re: kernel + gcc 4.1 = several problems

Previous thread: Re: [PATCH 2.6.20-rc2] fs/jffs2/scan.c: Fix error-path leak by Andrew Morton on Tuesday, January 2, 2007 - 2:07 pm. (1 message)

Next thread: Re: Oops in 2.6.19.1 by Adrian Bunk on Tuesday, January 2, 2007 - 2:12 pm. (1 message)
From: Adrian Bunk
Date: Tuesday, January 2, 2007 - 2:10 pm

Sorry if my point goes a bit away from your problem:

My point is that we have several reported problems only visible
with gcc 4.1.

Other bug reports are e.g. [2] and [3], but they are only present when
using gcc 4.1 _and_ -Os.

There's simply a bunch of bugs only present with gcc 4.1, and what 
worries me most is that the estimated number of unknown cases is most 
likely very high since most people won't check different compiler 

cu
Adrian

[1] http://bugzilla.kernel.org/show_bug.cgi?id=7176
[2] http://bugzilla.kernel.org/show_bug.cgi?id=7106
[3] https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186852

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

-

From: Alistair John Strachan
Date: Tuesday, January 2, 2007 - 2:56 pm

On Tuesday 02 January 2007 21:10, Adrian Bunk wrote:

I find [2] most compelling, and I can confirm that I do have the same problem 
with or without optimisation for size. I don't use selinux nor has it ever 
been enabled.

At any rate, I have absolute confirmation that it is GCC 4.1.1, because with 
GCC 3.4.6 the same kernel I reported booting three days ago is still 
cheerfully working. I regularly get uptimes of 60+ days on that machine, 
rebooting only for kernel upgrades. 2.6.19 seems to be no worse in this 
regard.

Perhaps fortunately, the configs I've tried have consistently failed to shake 
the crash, so I have a semi-reproducible test case here on C3-2 hardware if 
somebody wants to investigate the problem (though it still takes 6-12 hours).

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
-

From: D. Hazelton
Date: Tuesday, January 2, 2007 - 3:06 pm

The GCC code generator appears to have been rewritten between 3.4.6 and 
4.1.1....

I took a look at the dump he posted and there are some minor and some massive 
differences between the code. In one case some of the code is swapped, in 
another there is code in the 3.4.6 version that isn't in the 4.1.1... Finally 
the 4.1.1 version of the function has what appear to be function calls, and 
these don't appear in the code generated by 3.4.6.

In other words - the code generation for 4.1.1 appears to be broken when it 
comes to generating system code.

DRH
-

From: Adrian Bunk
Date: Tuesday, January 2, 2007 - 4:24 pm

Differences are expected since we disable unit-at-a-time for gcc < 4 

Bug number for an either already open or created by you bug in the gcc 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

-

From: D. Hazelton
Date: Tuesday, January 2, 2007 - 4:41 pm

Okay. Thing is that these noted differences, aside from where 4.1.1 doesn't 
generate an opcode that 3.4.6 does, aren't all that fatal, IMHO. The fact that 
it does generate calls rather than jumps for local pointer moves 
(IIRC - been a while since I looked at the dump of pipe_poll that he 

None. I didn't file a report on this because I didn't find the bug, just noted 
a problem that appears to occur. In this case the calls generated seem to 
wrap loops - something I've never heard of anyone doing. These *might* be 
causing the off-by-one that is causing the function to re-enter in the middle 
of an instruction.

Seeing this, I'd guess that this holds for all system-level code generated by 
4.1.1, and this is exactly what I was reporting. If you'd like, I'll go dig up 
the dumps he posted and post the two related segments side-by-side to give 
you a better example of what I'm referring to.

DRH
-

From: Horst H. von Brand
Date: Tuesday, January 2, 2007 - 7:05 pm

D. Hazelton <dhazelton@enter.net> wrote:




Define "system-level code". What makes it different from, say,
bog-of-the-mill compiler code (yes, gcc compiles itself as part of its

If the related segments show code that is somehow wrong, by all means
report it /with your detailed analysis/ to the compiler people. Just a
warning, gcc is pretty smart in what it does, its code is often surprising
to the unwashed. Also, the C standard is subtle, the error might be in an
unwarranted assumption in the source code.
-

From: Linus Torvalds
Date: Tuesday, January 2, 2007 - 3:13 pm

Historically, some people have actually used horrible hacks like trying to 
figure out which particular C file gets miscompiled by basically having 
both compilers installed, and then trying out different subdirectories 
with different compilers. And once the subdirectory has been pinpointed, 
pinpointing which particular file it is.. etc.

Pretty damn horrible to do, and I'm afraid we don't have any real helpful 
scripts to do any of the work for you. So it's all effectively manual 
(basically boils down to: "compile everything with known-good compiler. 
Then replace the good compiler with the bad one, remove the object files 
from one directory, and recompile the kernel". "Rinse and repeat").
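
The search itself can at least be sketched. What follows is only a rough illustration, not an existing kernel script: it swaps the one-directory-at-a-time loop described above for a binary search, and it assumes that exactly one unit is miscompiled and that the crash reproduces reliably. The names find_miscompiled and crashes_with are hypothetical; in practice crashes_with would rebuild the listed directories with the suspect compiler, boot the result, and wait for the crash.

```python
from typing import Callable, List, Sequence


def find_miscompiled(units: Sequence[str],
                     crashes_with: Callable[[Sequence[str]], bool]) -> str:
    """Binary-search for the one unit that breaks when built by the suspect compiler.

    units        -- directories (or files) to search, all known-good by default
    crashes_with -- hypothetical callback: rebuild only the given units with the
                    suspect compiler and report whether the crash reproduces
    Assumes exactly one culprit and a reproducible failure.
    """
    candidates: List[str] = list(units)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if crashes_with(half):
            candidates = half                     # culprit is in the first half
        else:
            candidates = candidates[len(half):]   # culprit is in the rest
    return candidates[0]


# Stand-in for the real rebuild-and-boot step: pretend "fs" is the culprit.
builds: List[List[str]] = []

def fake_crashes_with(subset: Sequence[str]) -> bool:
    builds.append(list(subset))
    return "fs" in subset

culprit = find_miscompiled(
    ["arch", "fs", "kernel", "mm", "net", "drivers"], fake_crashes_with)
print(culprit, "found after", len(builds), "rebuilds")
```

With N units this needs about log2(N) rebuilds rather than N, which matters when each trial takes hours to trigger. The same function can be rerun on the files inside the directory it returns.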

I don't think anybody has ever done that with something where triggering 
the cause then also takes that long - that just ends up making the whole 
thing even more painful. 

What are the exact crash details? That might narrow things down enough 
that maybe you could try just one or two files that are "suspect".

		Linus
-

From: Alistair John Strachan
Date: Tuesday, January 2, 2007 - 4:18 pm

Linus,

On Tuesday 02 January 2007 22:13, Linus Torvalds wrote:

I'll do a digest of the problem for you and anybody else that's lost track of 
the debugging story so far..

There are no hardware problems evidenced by any testing I have performed 
(memtest, prime95 CPU torture tests, temp monitors). Furthermore, kernels 
compiled with older GCCs have been running without problems for literally 
years on this machine.

Here is an example of an oops. The kernel continued to limp along after this.

BUG: unable to handle kernel NULL pointer dereference at virtual address 
00000009
 printing eip:
c0156f60
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ipt_recent ipt_REJECT xt_tcpudp ipt_MASQUERADE iptable_nat 
xt_state iptable_filter ip_tables x_tables prism54 yenta_socket 
rsrc_nonstatic pcmcia_core snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm 
snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd soundcore ehci_hcd 
usblp eth1394 uhci_hcd usbcore ohci1394 ieee1394 via_agp agpgart vt1211 
hwmon_vid hwmon ip_nat_ftp ip_nat ip_conntrack_ftp ip_conntrack
CPU:    0
EIP:    0060:[<c0156f60>]    Not tainted VLI
EFLAGS: 00010246   (2.6.19.1 #1)
EIP is at pipe_poll+0xa0/0xb0
eax: 00000008   ebx: 00000000   ecx: 00000008   edx: 00000000
esi: f70f3e9c   edi: f7017c00   ebp: f70f3c1c   esp: f70f3c0c
ds: 007b   es: 007b   ss: 0068
Process python (pid: 4178, ti=f70f2000 task=f70c4a90 task.ti=f70f2000)
Stack: 00000000 00000000 f70f3e9c f6e111c0 f70f3fa4 c015d7f3 f70f3c54 f70f3fac
       084c44a0 00000030 084c44d0 00000000 f70f3e94 f70f3e94 00000006 f70f3ecc
       00000000 f70f3e94 c015e580 00000000 00000000 00000006 f6e111c0 00000000
Call Trace:
 [<c015d7f3>] do_sys_poll+0x253/0x480
 [<c015da53>] sys_poll+0x33/0x50
 [<c0102c97>] syscall_call+0x7/0xb
 [<b7f6b402>] 0xb7f6b402
 =======================
Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8 
8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f 
45 ca eb b6 8d b6 ...
-

From: Linus Torvalds
Date: Tuesday, January 2, 2007 - 6:43 pm

It's not an off-by-one either (eg say we're taking an exception and 
screwing up %eip by one somehow).

The code sequence in question is

	mov    %ecx,%edx
	mov    0x6c(%esi),%eax
	or     $0x10,%edx
	cmp    0x168(%edi),%eax		<--
	cmovne %edx,%ecx
	jmp    ...

and it's in the second byte of the "cmp".

And yes, it definitely entered there, because trying other random 
entry-points will have either invalid instructions or instructions that 
would fault due to NULL pointers. HOWEVER, it's also not as simple as 
"took an interrupt, and returned with %eip incremented by one", because 
your %edx is zero, so it won't have done that "or $0x10,%edx" and then some 
interrupt happened and screwed up just %eip.
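
The byte-level claim can be checked against the Code: line of the oops. The snippet below is a minimal sketch, not a disassembler: it hard-codes the Code: bytes and the %eax value from the register dump, and it decodes only the one ModRM byte that matters here. The helper names split_code_line and decode_modrm are made up for the illustration.

```python
import re

# Code: line copied from the oops; <87> marks the byte at the faulting %eip.
CODE_LINE = (
    "58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8 "
    "8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f "
    "45 ca eb b6 8d b6"
)

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_code_line(line):
    """Return (code bytes, index of the byte that %eip pointed at)."""
    data, eip_index = [], None
    for i, tok in enumerate(line.split()):
        m = re.fullmatch(r"<([0-9a-f]{2})>", tok)
        if m:
            eip_index, tok = i, m.group(1)
        data.append(int(tok, 16))
    return bytes(data), eip_index


def decode_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return byte >> 6, (byte >> 3) & 7, byte & 7


code, eip = split_code_line(CODE_LINE)

# Intended instruction: opcode 3b is "cmp r/m32,r32" and starts one byte
# before %eip, so %eip really is in the second byte of the cmp.
starts_inside_cmp = code[eip - 1] == 0x3B

# Decoding from %eip instead: opcode 87 is "xchg r32,r/m32", then a ModRM byte.
mod, reg, rm = decode_modrm(code[eip + 1])
disp8 = code[eip + 2]          # mod == 1: an 8-bit displacement follows
eax = 0x00000008               # from the register dump in the oops
fault_address = eax + disp8

print(f"xchg %{REG32[reg]},{disp8:#x}(%{REG32[rm]}) touches {fault_address:#010x}")
```

If this decoding is right, executing from the reported %eip gives an xchg that writes through %eax plus one. With %eax at 8 that is address 9, which matches the "virtual address 00000009" in the report, and the write matches the write bit in "Oops: 0002".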

So it's literally a random %eip, but since you say it's consistently in 
that function, it's not truly "random". There's something that triggers it 
just _there_.

However, that's a damn simple function. There's _nothing_ there. The 
particular code that is involved right there is literally

	if (!pipe->writers && filp->f_version != pipe->w_counter)
		mask |= POLLHUP;

and that's it.  There's not even anything half-way interesting around it, 
except for the "poll_wait()" call, but even that is about as common as
you can humanly get..

Looking at the register set and the stack, I see:

	Stack:	00000000
		00000000  <- saved %ebx (dunno, seems dead in caller)
		f70f3e9c  <- saved %esi (== pollfd in do_pollfd)
		f6e111c0  <- saved %edi	(== filp)
		f70f3fa4  <- outer EBP (looks reasonable) 
		c015d7f3  <- return address (do_sys_poll+0x253/0x480)

and the strange thing is that when the oops happens, it really looks like 
%esi _still_ contains the value it had originally (and that is saved on 
the stack). But afaik, from your disassembly, it should have been 
overwritten by the initial %eax, which should have had the same value as 
%edi on entry...

IOW, none of it really makes any sense. The stack frames look fine, so we 
_did_ enter at the beginning of the ...
-

From: Linus Torvalds
Date: Tuesday, January 2, 2007 - 3:01 pm

Traditionally, afaik, -Os has tended to show compiler problems that 
_could_ happen with -O2 too, but never do in practice. It may be that 
gcc-4.1 without -Os miscompiles some very unusual code, and then with -Os 
we just hit more cases of that.

That said, I think gcc-4.1.1 is very common - I know it's the Fedora 
compiler. Also, CC_OPTIMIZE_FOR_SIZE defaults to 'y' if you have 
EXPERIMENTAL on, and from all the bug-reports about other features that 
are marked EXPERIMENTAL, I know that a lot of people do seem to select for 
it. So I would expect that gcc-4.1.1 and -Os is actually a fairly common 
combination. I just checked, and it's what I use personally, for example.

Of course, my main machine is an x86-64, and it has more registers. At 
least one historical -Os bug was about bad things happening under 
register pressure, iirc, and so x86-64 would show fewer problems than 
regular 32-bit x86 (which has far fewer registers for the compiler to 
use).

It is a bit worrisome. These things seem to be about 50:50 real kernel 
bugs (just hidden by some common code generation sequence) and real 
honest-to-goodness compiler bugs. But they are hard as hell to find.

		Linus
-

From: David Rientjes
Date: Tuesday, January 2, 2007 - 4:09 pm

gcc optimizations were almost completely rewritten between 3.4.6 and 4.1, 
and one of the subtle changes that may have been introduced is with regard 
to the heuristics used to determine whether to inline an 'inline' function 
or not when using -Os.  This problem can show up in dynamic linking and 
break on certain architectures but should be detectable by using -Winline.

		David
-
