Re: File corruption when using kernels 2.6.18+

Previous thread: 2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression by Denys on Sunday, September 30, 2007 - 8:22 am. (1 message)

Next thread: Fwd: [patch/backport] CFS scheduler, -v22, for v2.6.23-rc8, v2.6.22.8, v2.6.21.7, v2.6.20.20 by Matthew on Sunday, September 30, 2007 - 9:45 am. (1 message)
From: Neil Romig
Date: Sunday, September 30, 2007 - 8:40 am

I am using kernel 2.6.17.14, and would like to use newer versions for the wireless code but I get file corruption with any kernel 
more recent than this. My system uses a SiS645DX chipset (it is a rebadged CLEVO D400E) with 5513 IDE.

I get this corruption with reads/writes, e.g. using "cp", "dd", when compiling software, writing to CD, etc. An example:

cp ../Changelog-2.6.18 .
diff Changelog-2.6.18 ../Changelog-2.6.18

16377c16377
<       LD      .top_vmlinux1
---
>       LD      .tmp_vmlinux1
23954c23954
< Date:   Mon Jul 1% 04:45:11 2006 -0700
---
> Date:   Mon Jul 10 04:45:11 2006 -0700
24955c24955
<     This is generally useful, but partacularly helps see if it is the same sector
---
>     This is generally useful, but particularly helps see if it is the same sector
31879c31879
<     [MMC] sdhci: version bump cdhci
---
>     [MMC] sdhci: version bump sdhci
42955c42955
<     Replace `he temp makefile hacks with proper CONFIG entries, which are also
---
>     Replace the temp makefile hacks with proper CONFIG entries, which are also
49050c49050
<       and this task is(already holding:
---
>       and this task is already holding:
[output clipped]

I would suspect a memory problem but memtest86+ gives no errors despite numerous passes, and I get no errors in older kernels. Does 
anyone have any idea what changed in 2.6.18 to cause such an error?

I have tried some obvious things (see thread on linuxquestions.org: 
http://www.linuxquestions.org/questions/showthread.php?t=578200), but I don't understand enough about the kernel to get any further.

Neil.
-

From: Pekka Enberg
Date: Sunday, September 30, 2007 - 9:29 am

Hi Neil,


I don't but you can try to isolate the changeset introducing the
corruption with git-bisect:

http://kernel.org/pub/software/scm/git/docs/v1.3.3/howto/isolate-bugs-with-bisect.txt

So, in your case, you do:

<clone Linux mainline git repository>
# git bisect start
# git bisect bad v2.6.18
# git bisect good v2.6.17

then

<recompile and test>
<git bisect [good|bad] depending on results>
<repeat until you've narrowed down the changeset>
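
The reason this converges quickly is that each good/bad verdict halves the range of candidate changesets. A minimal sketch of that idea, assuming a purely linear history (real git history is a graph, and git-bisect handles that itself); the commit list, the function name bisect_first_bad and the "bad from commit 617 onwards" rule are all hypothetical, chosen only for illustration:

```python
def bisect_first_bad(commits, is_bad):
    """Return (index of the first bad commit, number of builds tested).

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is bad too.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1  # one "recompile and test" cycle
        if is_bad(commits[mid]):
            bad = mid   # equivalent of "git bisect bad"
        else:
            good = mid  # equivalent of "git bisect good"
    return bad, tests


# Hypothetical history of 1000 commits where commit 617 introduces the bug.
commits = list(range(1000))
first_bad, tests = bisect_first_bad(commits, lambda c: c >= 617)
print(f"first bad commit: {first_bad}, builds tested: {tests}")
```

With a thousand candidates this needs at most ten test builds, which is why bisecting between two releases is practical even when each step means a full kernel rebuild.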

Also, please remember to send your .config when reporting bugs as
described in REPORTING-BUGS.

                                         Pekka
-

From: Neil Romig
Date: Tuesday, October 2, 2007 - 2:05 pm

Thanks for your help on this. I have narrowed it down to commit "c22ce143d15eb288543fe9873e1c5ac1c01b69a1 x86: cache pollution aware 
__copy_from_user_ll()". This fits with the errors I'm getting, so now I need to find out if I can safely ignore this patch, or does 
it have to be modified? This is my first Linux bug in many years of simply using it, so I'm a little nervous!

My kernel .config is attached, along with lspci output.

Neil
-

From: Pekka Enberg
Date: Tuesday, October 2, 2007 - 10:18 pm

Hi Neil,


Just to make sure, if you disable CONFIG_X86_INTEL_USERCOPY, the
corruption goes away?
-

From: Neil Romig
Date: Wednesday, October 3, 2007 - 11:42 am

It took some fiddling to disable (edit arch/i386/Kconfig.cpu) but that has fixed it. Many thanks!

Does this need to be reported as a bug? Or should the kernel config scripts be changed to enable this option to be easily turned off?

Neil.
-

From: Pekka Enberg
Date: Wednesday, October 3, 2007 - 11:48 am

Hi Neil,




Looks like a bug to me. Can we have your /proc/cpuinfo too?

                              Pekka
-

From: Linus Torvalds
Date: Wednesday, October 3, 2007 - 12:22 pm

I doubt it's a CPU bug. It's more likely a chipset or motherboard bug 
around the CPU. The patterns for the original "cp" corruption that Neil 
posted seem to be:

	   File offset		correct		corrupt
	 decimal      hex
	======== ========
	  642470 0009cda6	'm' 0x6D	'o' 0x6f
	  972198 000ed5a6	'i' 0x69	'a' 0x61
	 1243686 0012fa26	's' 0x73	'c' 0x63
	 1676846 0019962e	't' 0x74	'`' 0x64
	 1907974 001d1d06	' ' 0x20	'(' 0x28
	...

and since it's apparently about using the uncached accesses, it's 
interesting that the low three bits are identical in all those corruptions 
(they also seem to be single-bit errors in the actual byte-value, although 
the bit is not the same). If the external bus is 64 bits (?), that would 
say that it's one particular byte lane that is dodgy.
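
The byte-lane observation can be checked mechanically from the offsets in the table. A small sketch, assuming (as conjectured above) an 8-byte-wide bus and that the low bits of the file offset carry over to the memory address, which holds when the data sits in page-aligned buffers; the helper name byte_lane is made up for this example:

```python
# Decimal file offsets of the corrupted bytes, copied from the table above.
offsets = [642470, 972198, 1243686, 1676846, 1907974]


def byte_lane(offset, bus_bytes=8):
    """Position of a byte within a bus_bytes-wide transfer (assumed 64-bit bus)."""
    return offset % bus_bytes


lanes = [byte_lane(o) for o in offsets]
for o, lane in zip(offsets, lanes):
    print(f"{o:8d}  {o:08x}  lane {lane}")
```

Every offset lands in the same lane, which is what points at one byte lane rather than at random memory errors.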

I would bet that the reason the intel-optimized memcpy triggers this is 
that the non-temporal stores just means that you go out directly on the 
bus, and it probably just shows a weakness in the chipset or bus that 
doesn't show with the normal cacheline accesses.

			Linus
-

From: Pekka Enberg
Date: Wednesday, October 3, 2007 - 12:35 pm

Hi Linus,


But that should show up with memtest too, no?

                             Pekka
-

From: Linus Torvalds
Date: Wednesday, October 3, 2007 - 12:54 pm

Not unless memtest uses non-temporal stores with the same (or similar) 
access patterns.

The thing is, the CPU cache hides a *lot* of activity from the chipset, 
and changes the access patterns radically. 

With normal cached accesses, you'd normally see just the "fill cacheline" 
and "write out cacheline" pattern. With movnt, you'd see non-cacheline 
accesses to memory. If the chipset was tested under mostly normal loads, 
the movnt cases have been getting a lot less coverage.

Now, I do agree that it certainly *can* be a CPU bug too.  I doubt it, 
though. 

I'd check the power supply (brownouts cause random corruption, and it 
might have a "peak power pattern" thing to it), and it's worth re-seating 
any DIMMs etc. And it's definitely worth going into the BIOS setup screen 
and making sure that nothing is even close to debatable (i.e. take RAM 
timings down to non-aggressive levels, make sure bus frequencies and 
multipliers are not even close to borderline, etc.).

			Linus
-

From: Hiro Yoshioka
Date: Wednesday, October 3, 2007 - 6:11 pm

Hi,


I'm not so sure whether it is a chipset bug or not.

The movnt stores do have WC (write-combining) semantics and
bypass the hardware cache when storing the data.

http://www.intel.com/products/processor/manuals/index.htm

Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 1: Basic Architecture

Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide

Thanks in advance,
  Hiro
-

From: Alan Cox
Date: Wednesday, October 3, 2007 - 1:30 pm

On Wed, 3 Oct 2007 22:35:24 +0300

Not necessarily. The old VIA memory copying bug only showed up with
prefetching and MMX store patterns. That was a hardware flaw that took
extreme memory utilisation to show up - so it does occur, but that's not to
say it is the cause.

Alan
-

From: Neil Romig
Date: Thursday, October 4, 2007 - 11:34 am

This is way out of my depth, but I have re-seated my 2x256MB DDR single-sided DIMMs and checked my BIOS, though the Phoenix BIOS in 
my rebadged CLEVO D400E doesn't allow any tampering with timings. I then recompiled with USERCOPY enabled but still get the bit 
corruption.
Since I seem to be alone in getting this effect, perhaps my machine is dodgy? I've attached my dmesg output if that's of any help.

Neil.
-

From: Chuck Ebbert
Date: Tuesday, October 2, 2007 - 3:30 pm

Is there some pattern to the corruption, like maybe it occurs once every
N bytes in the file?
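
One way to look for such a pattern is to compare the good and the corrupted copy byte by byte and tabulate where the differences fall. A minimal sketch, not a tool used in this thread; the function names and the demo buffers are invented for illustration, and the modulus of 8 assumes a 64-bit bus:

```python
from collections import Counter


def byte_diffs(good: bytes, bad: bytes):
    """List (offset, good_byte, bad_byte) wherever the buffers differ.

    Only the common length is compared; a size mismatch is not reported.
    """
    return [(i, g, b) for i, (g, b) in enumerate(zip(good, bad)) if g != b]


def summarize(diffs, modulus=8):
    """Return offset residues, gaps between errors, and flipped bits per byte."""
    residues = Counter(off % modulus for off, _, _ in diffs)
    gaps = [b[0] - a[0] for a, b in zip(diffs, diffs[1:])]
    flipped = [bin(g ^ b).count("1") for _, g, b in diffs]
    return residues, gaps, flipped


def compare_files(path_good, path_bad, modulus=8):
    """Convenience wrapper that reads both files fully into memory."""
    with open(path_good, "rb") as fg, open(path_bad, "rb") as fb:
        return summarize(byte_diffs(fg.read(), fb.read()), modulus)


# Synthetic demo: flip one bit in two bytes of an otherwise identical buffer.
good = bytes(range(256)) * 4
bad = bytearray(good)
bad[6] ^= 0x02
bad[270] ^= 0x08
diffs = byte_diffs(good, bytes(bad))
residues, gaps, flipped = summarize(diffs)
print(diffs, dict(residues), gaps, flipped)
```

On real data, compare_files("../Changelog-2.6.18", "Changelog-2.6.18") would give the same three summaries: a single dominant residue suggests a byte-lane problem, while constant gaps would suggest a periodic one.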

-
