Sorry if my point goes a bit away from your problem:

My point is that we have several reported problems only visible with gcc 4.1.

Other bug reports are e.g. [2] and [3], but they are only present when using gcc 4.1 _and_ -Os.

There's simply a bunch of bugs only present with gcc 4.1, and what worries me most is that the estimated number of unknown cases is most likely very high since most people won't check different compiler

cu
Adrian

[1] http://bugzilla.kernel.org/show_bug.cgi?id=7176
[2] http://bugzilla.kernel.org/show_bug.cgi?id=7106
[3] https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186852

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
-
On Tuesday 02 January 2007 21:10, Adrian Bunk wrote:

I find [2] most compelling, and I can confirm that I do have the same problem with or without optimisation for size. I don't use selinux nor has it ever been enabled.

At any rate, I have absolute confirmation that it is GCC 4.1.1, because with GCC 3.4.6 the same kernel I reported booting three days ago is still cheerfully working. I regularly get uptimes of 60+ days on that machine, rebooting only for kernel upgrades. 2.6.19 seems to be no worse in this regard.

Perhaps fortunately, the configs I've tried have consistently failed to shake the crash, so I have a semi-reproducible test case here on C3-2 hardware if somebody wants to investigate the problem (though it still takes 6-12 hours).

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
-
The GCC code generator appears to have been rewritten between 3.4.6 and 4.1.1....

I took a look at the dump he posted and there are some minor and some massive differences between the code. In one case some of the code is swapped, in another there is code in the 3.4.6 version that isn't in the 4.1.1... Finally the 4.1.1 version of the function has what appears to be function calls and these don't appear in the code generated by 3.4.6.

In other words - the code generation for 4.1.1 appears to be broken when it comes to generating system code.

DRH
-
Differences are expected since we disable unit-at-a-time for gcc < 4
Bug number for an either already open or created by you bug in the gcc
cu
Adrian
-
Okay. Thing is that these noted differences, aside from where 4.1.1 doesn't generate an opcode that 3.4.6 does, aren't all that fatal, IMHO. The fact that it does generate calls rather than jumps for local pointer moves (IIRC - been a while since I looked at the dump of pipe_poll that he

None. I didn't file a report on this because I didn't find the bug, just noted a problem that appears to occur. In this case the calls generated seem to wrap loops - something I've never heard of anyone doing. These *might* be causing the off-by-one that is causing the function to re-enter in the middle of an instruction.

Seeing this I'd guess that this follows for all system-level code generated by 4.1.1 and this is exactly what I was reporting. If you'd like I'll go dig up the dumps he posted and post the two related segments side-by-side to give you a better example of what I'm referring to.

DRH
-
D. Hazelton <dhazelton@enter.net> wrote:

Define "system-level code". What makes it different from, say, bog-of-the-mill compiler code (yes, gcc compiles itself as part of its

If the related segments show code that is somehow wrong, by all means report it /with your detailed analysis/ to the compiler people. Just a warning, gcc is pretty smart in what it does, its code is often surprising to the unwashed. Also, the C standard is subtle, the error might be in an unwarranted assumption in the source code.
-
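As one illustration of the "unwarranted assumption" point (not taken from the kernel code under discussion): an overflow test written as `a + b < a` relies on signed overflow wrapping, which the C standard leaves undefined, so an optimizer is allowed to delete it. The sketch below shows a check that stays within defined behaviour. The function name is hypothetical.

```c
#include <limits.h>

/*
 * Hypothetical helper: report whether a + b would overflow an int,
 * without ever performing the overflowing addition.
 *
 * The tempting version, "return a + b < a;", depends on undefined
 * behaviour for signed ints, so two compilers (or two optimisation
 * levels) may legitimately disagree about its result.
 */
int add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* INT_MAX - b cannot overflow for b > 0 */
    if (b < 0)
        return a < INT_MIN - b;   /* INT_MIN - b cannot overflow for b < 0 */
    return 0;                     /* adding zero never overflows */
}
```

Code like the tempting version can work for years with one compiler and break with the next, without either compiler being wrong.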
Historically, some people have actually used horrible hacks like trying to figure out which particular C file gets miscompiled by basically having both compilers installed, and then trying out different subdirectories with different compilers. And once the subdirectory has been pinpointed, pinpointing which particular file it is.. etc.

Pretty damn horrible to do, and I'm afraid we don't have any real helpful scripts to do any of the work for you. So it's all effectively manual (basically boils down to: "compile everything with known-good compiler. Then replace the good compiler with the bad one, remove the object files from one directory, and recompile the kernel". "Rinse and repeat").

I don't think anybody has ever done that with something where triggering the cause then also takes that long - that just ends up making the whole thing even more painful.

What are the exact crash details? That might narrow things down enough that maybe you could try just one or two files that are "suspect".

		Linus
-
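The manual procedure described above amounts to a search over directories, and it can be halved at each step instead of going one directory at a time. A minimal sketch, assuming exactly one directory is responsible and that the crash test is reliable (neither is guaranteed here, given the 6-12 hour reproduction time). All names are hypothetical; the callback stands in for "rebuild the chosen directories with the suspect compiler, everything else with the known-good one, boot, and wait for the crash".

```c
#include <stddef.h>

/*
 * Hypothetical callback: returns nonzero if a kernel in which only
 * dirs[lo..hi) were built with the suspect compiler still crashes.
 */
typedef int (*crashes_fn)(const char *const *dirs, size_t lo, size_t hi,
                          void *ctx);

/*
 * Binary search for the single directory whose objects trigger the crash.
 * Returns its index in dirs[0..n), or n when the list is empty.
 */
size_t bisect_miscompiled_dir(const char *const *dirs, size_t n,
                              crashes_fn crashes, void *ctx)
{
    size_t lo = 0, hi = n;

    if (n == 0)
        return n;

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (crashes(dirs, lo, mid, ctx))
            hi = mid;   /* culprit is in the lower half */
        else
            lo = mid;   /* culprit must be in the upper half */
    }
    return lo;
}

/*
 * Stand-in for a real build-and-boot cycle, for trying the search out:
 * ctx points at the index of the pretend culprit.
 */
int simulated_crashes(const char *const *dirs, size_t lo, size_t hi,
                      void *ctx)
{
    size_t culprit = *(const size_t *)ctx;

    (void)dirs;
    return lo <= culprit && culprit < hi;
}
```

With N candidate directories this needs about log2(N) build-and-test cycles rather than N, which matters when each cycle costs hours. The same search can then be repeated over the files inside the directory it finds.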
Linus,
On Tuesday 02 January 2007 22:13, Linus Torvalds wrote:
I'll do a digest of the problem for you and anybody else that's lost track of
the debugging story so far..
There are no hardware problems evidenced by any testing I have performed
(memtest, prime95 CPU torture tests, temp monitors). Furthermore, kernels
compiled with older GCCs have been running without problems for literally
years on this machine.
Here is an example of an oops. The kernel continued to limp along after this.
BUG: unable to handle kernel NULL pointer dereference at virtual address
00000009
printing eip:
c0156f60
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ipt_recent ipt_REJECT xt_tcpudp ipt_MASQUERADE iptable_nat
xt_state iptable_filter ip_tables x_tables prism54 yenta_socket
rsrc_nonstatic pcmcia_core snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm
snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd soundcore ehci_hcd
usblp eth1394 uhci_hcd usbcore ohci1394 ieee1394 via_agp agpgart vt1211
hwmon_vid hwmon ip_nat_ftp ip_nat ip_conntrack_ftp ip_conntrack
CPU: 0
EIP: 0060:[<c0156f60>] Not tainted VLI
EFLAGS: 00010246 (2.6.19.1 #1)
EIP is at pipe_poll+0xa0/0xb0
eax: 00000008 ebx: 00000000 ecx: 00000008 edx: 00000000
esi: f70f3e9c edi: f7017c00 ebp: f70f3c1c esp: f70f3c0c
ds: 007b es: 007b ss: 0068
Process python (pid: 4178, ti=f70f2000 task=f70c4a90 task.ti=f70f2000)
Stack: 00000000 00000000 f70f3e9c f6e111c0 f70f3fa4 c015d7f3 f70f3c54 f70f3fac
084c44a0 00000030 084c44d0 00000000 f70f3e94 f70f3e94 00000006 f70f3ecc
00000000 f70f3e94 c015e580 00000000 00000000 00000006 f6e111c0 00000000
Call Trace:
[<c015d7f3>] do_sys_poll+0x253/0x480
[<c015da53>] sys_poll+0x33/0x50
[<c0102c97>] syscall_call+0x7/0xb
[<b7f6b402>] 0xb7f6b402
=======================
Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8
8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f
45 ca eb b6 8d b6
...

It's not an off-by-one either (eg say we're taking an exception and screwing up %eip by one somehow). The code sequence in question is

    mov    %ecx,%edx
    mov    0x6c(%esi),%eax
    or     $0x10,%edx
    cmp    0x168(%edi),%eax    <--
    cmovne %edx,%ecx
    jmp    ...

and it's in the second byte of the "cmp".

And yes, it definitely entered there, because trying other random entry-points will have either invalid instructions or instructions that would fault due to NULL pointers.

HOWEVER, it's also not as simple as "took an interrupt, and returned with %eip incremented by one", because your %edx is zero, so it won't have done that "or $0x10,%edx" and then some interrupt happened and screwed up just %eip.

So it's literally a random %eip, but since you say it's consistently in that function, it's not truly "random". There's something that triggers it just _there_.

However, that's a damn simple function. There's _nothing_ there. The particular code that is involved right there is literally

    if (!pipe->writers && filp->f_version != pipe->w_counter)
        mask |= POLLHUP;

and that's it. There's not even anything half-way interesting around it, except for the "poll_wait()" call, but even that is about as common as you can humanly get..

Looking at the register set and the stack, I see:

    Stack:
        00000000
        00000000 <- saved %ebx (dunno, seems dead in caller)
        f70f3e9c <- saved %esi (== pollfd in do_pollfd)
        f6e111c0 <- saved %edi (== filp)
        f70f3fa4 <- outer EBP (looks reasonable)
        c015d7f3 <- return address (do_sys_poll+0x253/0x480)

and the strange thing is that when the oops happens, it really looks like %esi _still_ contains the value it had originally (and that is saved on the stack). But afaik, from your disassembly, it should have been overwritten by the initial %eax, which should have had the same value as %edi on entry...

IOW, none of it really makes any sense. The stack frames look fine, so we _did_ enter at the beginning of the ...
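The byte-level reading above can be checked by hand; what follows is a minimal sketch of that check, not a disassembler. The Code: line marks the faulting byte as <87>, the second byte of `3b 87 68 01 00 00` (the cmp). Under the standard IA-32 encoding, decoding from that byte gives opcode 0x87 (xchg), ModRM 0x68 and an 8-bit displacement of 1, which is xchg %ebp,0x1(%eax). With eax = 00000008 from the register dump, that touches address 9, the same address the oops reports. "Oops: 0002" decodes as a kernel-mode write to a not-present page, which is at least consistent with an xchg to memory. The helper names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Fields of an IA-32 ModRM byte: mod (2 bits), reg (3 bits), rm (3 bits). */
struct modrm {
    unsigned mod;
    unsigned reg;
    unsigned rm;
};

/* Hypothetical helper: split a ModRM byte into its three fields. */
struct modrm decode_modrm(uint8_t byte)
{
    struct modrm m;

    m.mod = (byte >> 6) & 0x3u;
    m.reg = (byte >> 3) & 0x7u;
    m.rm  = byte & 0x7u;
    return m;
}

/*
 * Effective address for mod == 1 with a plain base register (rm != 4):
 * base register value plus the sign-extended 8-bit displacement.
 */
uint32_t ea_base_disp8(uint32_t base, uint8_t disp8)
{
    return base + (uint32_t)(int32_t)(int8_t)disp8;
}

/* i386 page-fault error code bits, as printed after "Oops:". */
struct fault_code {
    int protection; /* bit 0: 0 = page not present, 1 = protection fault */
    int write;      /* bit 1: 0 = read, 1 = write */
    int user;       /* bit 2: 0 = kernel mode, 1 = user mode */
};

/* Hypothetical helper: split the error code into its three low bits. */
struct fault_code decode_fault_code(unsigned code)
{
    struct fault_code f;

    f.protection = (code >> 0) & 1;
    f.write      = (code >> 1) & 1;
    f.user       = (code >> 2) & 1;
    return f;
}
```

Register numbers in the reg and rm fields follow the usual order (0 = eax, 5 = ebp, 7 = edi). So 0x87 read as the ModRM of the intended cmp gives mod 2, reg eax, rm edi with a 32-bit displacement of 0x168, while 0x68 read as the ModRM of the accidental xchg gives mod 1, reg ebp, rm eax.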
Traditionally, afaik, -Os has tended to show compiler problems that _could_ happen with -O2 too, but never do in practice. It may be that gcc-4.1 without -Os miscompiles some very unusual code, and then with -Os we just hit more cases of that.

That said, I think gcc-4.1.1 is very common - I know it's the Fedora compiler. Also, CC_OPTIMIZE_FOR_SIZE defaults to 'y' if you have EXPERIMENTAL on, and from all the bug-reports about other features that are marked EXPERIMENTAL, I know that a lot of people do seem to select for it.

So I would expect that gcc-4.1.1 and -Os is actually a fairly common combination. I just checked, and it's what I use personally, for example.

Of course, my main machine is an x86-64, and it has more registers. At least some historical -Os bug was about bad things happening under register pressure, iirc, and so x86-64 would show fewer problems than regular 32-bit x86 (which has far fewer registers for the compiler to use).

It is a bit worrisome. These things seem to be about 50:50 real kernel bugs (just hidden by some common code generation sequence) and real honest-to-goodness compiler bugs. But they are hard as hell to find.

		Linus
-
gcc optimizations were almost completely rewritten between 3.4.6 and 4.1, and one of the subtle changes that may have been introduced is with regard to the heuristics used to determine whether to inline an 'inline' function or not when using -Os. This problem can show up in dynamic linking and break on certain architectures but should be detectable by using -Winline.

David
-
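To make the -Winline suggestion concrete, here is a small sketch with invented function names and an arbitrary file name. -Winline is a GCC option that warns when a function declared inline is not inlined. `__attribute__((always_inline))` is a GCC extension that asks for inlining regardless of the size heuristics. Whether either one has any bearing on the crash discussed in this thread is not established here.

```c
/*
 * Build with something like:
 *     gcc -Os -Winline -c inline_demo.c
 * If gcc declines to inline a function declared "inline",
 * -Winline reports it.
 */

/* Ordinary inline: whether it is inlined is left to gcc's heuristics. */
static inline int add_flag(int mask, int flag)
{
    return mask | flag;
}

/* GCC extension: request inlining regardless of the heuristics. */
static inline __attribute__((always_inline))
int add_flag_forced(int mask, int flag)
{
    return mask | flag;
}

/* Caller that uses both variants so each has a call site to inline. */
int build_mask(int mask)
{
    return add_flag_forced(add_flag(mask, 0x10), 0x01);
}
```

Comparing the generated assembly of `build_mask` from the two compiler versions (for example with `gcc -Os -S`) would show whether the inlining decision actually differs between them.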
