I am using kernel 2.6.17.14 and would like to use newer versions for the wireless code, but I get file corruption with any kernel more recent than this. My system uses a SiS645DX chipset (it is a rebadged CLEVO D400E) with 5513 IDE. I get this corruption with reads/writes, i.e. using "cp", "dd", when compiling software, writing to CD, etc.
An example:
cp ../Changelog-2.6.18 .
diff Changelog-2.6.18 ../Changelog-2.6.18
16377c16377
< LD .top_vmlinux1
---
> LD .tmp_vmlinux1
23954c23954
< Date: Mon Jul 1% 04:45:11 2006 -0700
---
> Date: Mon Jul 10 04:45:11 2006 -0700
24955c24955
< This is generally useful, but partacularly helps see if it is the same sector
---
> This is generally useful, but particularly helps see if it is the same sector
31879c31879
< [MMC] sdhci: version bump cdhci
---
> [MMC] sdhci: version bump sdhci
42955c42955
< Replace `he temp makefile hacks with proper CONFIG entries, which are also
---
> Replace the temp makefile hacks with proper CONFIG entries, which are also
49050c49050
< and this task is(already holding:
---
> and this task is already holding:
[output clipped]
I would suspect a memory problem, but memtest86+ gives no errors despite numerous passes, and I get no errors in older kernels. Does anyone have any idea what changed in 2.6.18 to cause such an error?
I have tried some obvious things (see thread on linuxquestions.org: http://www.linuxquestions.org/questions/showthread.php?t=578200), but I don't understand enough about the kernel to get any further.
Neil.
-
Hi Neil,
I don't, but you can try to isolate the changeset introducing the corruption with git-bisect:
http://kernel.org/pub/software/scm/git/docs/v1.3.3/howto/isolate-bugs-with-bisect.txt
So, in your case, you do:
<clone Linux mainline git repository>
# git bisect start
# git bisect bad v2.6.18
# git bisect good v2.6.17
then
<recompile and test>
<git bisect [good|bad] depending on results>
<repeat until you've narrowed down the changeset>
Also, please remember to send your .config when reporting bugs as described in REPORTING-BUGS.
Pekka
-
Thanks for your help on this. I have narrowed it down to commit "c22ce143d15eb288543fe9873e1c5ac1c01b69a1 x86: cache pollution aware __copy_from_user_ll()". This fits with the errors I'm getting, so now I need to find out if I can safely ignore this patch, or does it have to be modified? This is my first Linux bug in many years of simply using it, so I'm a little nervous!
My kernel .config is attached, along with lspci output.
Neil
-
Hi Neil,
Just to make sure, if you disable CONFIG_X86_INTEL_USERCOPY, the corruption goes away?
-
It took some fiddling to disable (edit arch/i386/Kconfig.cpu), but that has fixed it. Many thanks!
Does this need to be reported as a bug? Or should the kernel config scripts be changed so that this option can easily be turned off?
Neil.
-
Hi Neil,
Looks like a bug to me. Can we have your /proc/cpuinfo too?
Pekka
-
I doubt it's a CPU bug. It's more likely a chipset or motherboard bug
around the CPU. The patterns for the original "cp" corruption that Neil
posted seem to be:
    File offset          correct     corrupt
 decimal   hex
========  ========
  642470  0009cda6      'm' 0x6d    'o' 0x6f
  972198  000ed5a6      'i' 0x69    'a' 0x61
 1243686  0012fa26      's' 0x73    'c' 0x63
 1676846  0019962e      't' 0x74    '`' 0x60
 1907974  001d1d06      ' ' 0x20    '(' 0x28
...
and since it's apparently about using the uncached accesses, it's
interesting that the low three bits are identical in all those corruptions
(they also seem to be single-bit errors in the actual byte-value, although
the bit is not the same). If the external bus is 64 bits (?), that would
say that it's one particular byte lane that is dodgy.
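
The byte-lane observation can be checked mechanically. The following is only a rough sketch in Python, not something from the thread: it takes the offsets and characters from the table above, and it assumes that a file offset maps onto the same position within an 8-byte word in memory, which need not hold for every buffer. The names `ROWS` and `analyse` are made up for illustration.

```python
# (file offset, correct char, corrupt char), copied from the table above
ROWS = [
    (642470, "m", "o"),
    (972198, "i", "a"),
    (1243686, "s", "c"),
    (1676846, "t", "`"),
    (1907974, " ", "("),
]


def analyse(rows):
    """Return (byte lane in an 8-byte word, XOR mask, bits flipped) per row."""
    result = []
    for offset, good, bad in rows:
        lane = offset & 0x7            # low three bits of the file offset
        mask = ord(good) ^ ord(bad)    # bits that differ between the two bytes
        result.append((lane, mask, bin(mask).count("1")))
    return result


for (offset, good, bad), (lane, mask, nbits) in zip(ROWS, analyse(ROWS)):
    print(f"{offset:#010x} lane={lane} {good!r}->{bad!r} "
          f"mask={mask:#04x} bits={nbits}")
```

For the rows listed, every offset falls in lane 6, while the XOR mask varies from row to row.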
I would bet that the reason the Intel-optimized memcpy triggers this is
that the non-temporal stores just mean that you go out directly on the
bus, and it probably just shows a weakness in the chipset or bus that
doesn't show with the normal cacheline accesses.
Linus
-
Hi Linus,
But that should show up with memtest too, no?
Pekka
-
Not unless memtest uses non-temporal stores with the same (or similar) access patterns.
The thing is, the CPU cache hides a *lot* of activity from the chipset, and changes the access patterns radically. With normal cached accesses, you'd normally see just the "fill cacheline" and "write out cacheline" pattern. With movnt, you'd see non-cacheline accesses to memory.
If the chipset was tested under mostly normal loads, the movnt cases have been getting a lot less coverage.
Now, I do agree that it certainly *can* be a CPU bug too. I doubt it, though. I'd check the power supply (brownouts cause random corruption, and it might have a "peak power pattern" thing to it), and it's worth re-seating any DIMMs etc.
And it's definitely worth going into the BIOS setup screen and making sure that nothing is even close to debatable (i.e. take RAM timings down to non-aggressive levels, make sure bus frequencies and multipliers are not even close to borderline, etc.).
Linus
-
Hi,
I'm not so sure whether it is a chipset bug or not. movnt stores do have WC (write-combining) semantics and bypass the hardware cache when storing the data.
http://www.intel.com/products/processor/manuals/index.htm
Intel 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture
Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide
Thanks in advance,
Hiro
-
On Wed, 3 Oct 2007 22:35:24 +0300
Not necessarily. The old VIA memory copying bug only showed up with prefetching and MMX store patterns. That was a hardware flaw that took extreme memory utilisation to show up, so it does occur, but that's not to say it is the cause.
Alan
-
This is way out of my depth, but I have re-seated my 2x256MB DDR single-sided DIMMs and checked my BIOS, though the Phoenix BIOS in my rebadged CLEVO D400E doesn't allow any tampering with timings. I recompiled with USERCOPY enabled but still get the bit corruption.
Since I seem to be alone in getting this effect, perhaps my machine is dodgy? I've attached my dmesg output if that's of any help.
Neil.
-
Is there some pattern to the corruption, like maybe it occurs once every N bytes in the file?
-
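
One way to look for such a pattern is to list the offsets at which the two copies differ and examine the spacing between them. The Python below is only an illustrative sketch, not a tool anyone in the thread used; the helper names `mismatches` and `spacing` are hypothetical, and the offsets are the five from the table posted earlier in the thread.

```python
from functools import reduce
from math import gcd


def mismatches(good: bytes, bad: bytes):
    """Return (offset, good_byte, bad_byte) for every differing position.

    Only the common prefix is compared if the lengths differ.
    """
    return [(i, g, b) for i, (g, b) in enumerate(zip(good, bad)) if g != b]


def spacing(offsets):
    """Return the gaps between consecutive offsets and the gcd of those gaps."""
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    return gaps, reduce(gcd, gaps, 0)


# Offsets taken from the table posted earlier in the thread.
offsets = [642470, 972198, 1243686, 1676846, 1907974]
gaps, common = spacing(offsets)
print("gaps:", gaps)
print("gcd of gaps:", common)
```

For those five offsets the gaps are irregular and share no common factor larger than 8, so with this little data nothing suggests a fixed period of N bytes beyond the 8-byte alignment. To run it on the real files, read the original and the copy in binary mode and pass the two byte strings to `mismatches`, then feed the resulting offsets to `spacing`.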
