Re: [PATCH RFC] x86: check for and defend against BIOS memory corruption

Previous thread: [PATCH] x86: split e820 reserved entries record to late by Yinghai Lu on Thursday, August 28, 2008 - 12:39 pm. (6 messages)

Next thread: none
From: Jeremy Fitzhardinge
Date: Thursday, August 28, 2008 - 12:52 pm

Some BIOSes have been observed to corrupt memory in the low 64k.  This
patch does two things:
 - Reserves all memory which does not have to be in that area, to
   prevent it from being used as general memory by the kernel.  Things
   like the SMP trampoline are still in the memory, however.
 - Clears the reserved memory so we can observe changes to it.
 - Adds a function check_for_bios_corruption() which checks and reports on
   memory becoming unexpectedly non-zero.  Currently it's called in the
   x86 fault handler, and the powermanagement debug output.

RFC: What other places should we check for corruption in?

[ Alan, Rafał: could you check you see:
   1: corruption messages
   2: no crashes
  Thanks -J
]

Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Cc: Hugh Dickens <hugh@veritas.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Rafał Miłecki <zajec5@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 Documentation/kernel-parameters.txt |    5 ++
 arch/x86/Kconfig                    |    3 +
 arch/x86/kernel/setup.c             |   86 +++++++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                 |    2 
 drivers/base/power/main.c           |    1 
 include/linux/kernel.h              |   12 ++++
 6 files changed, 109 insertions(+)

===================================================================
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -359,6 +359,11 @@
 			BayCom Serial Port AX.25 Modem (Half Duplex Mode)
 			Format: <io>,<irq>,<mode>
 			See header of drivers/net/hamradio/baycom_ser_hdx.c.
+
+	bios_corruption_check=0/1 [X86]
+			Some BIOSes seem to corrupt the first 64k of memory
+			when doing things like suspend/resume.  Setting this
+			option will scan the memory looking for corruption.
 
 	boot_delay=	Milliseconds to delay each printk during boot.
 			Values larger than 10 seconds (10000) are changed ...
From: Yinghai Lu
Date: Thursday, August 28, 2008 - 6:49 pm

can you please not punish systems without this bios problem?

if (!bios_corruption_check)

move earlier and
bios_corruption_check = 0;

BTW: SMI evil damaged that area?

YH
From: Jeremy Fitzhardinge
Date: Thursday, August 28, 2008 - 8:28 pm

Yeah, OK, but I think it should default to ON for now.  The problem is
that we had two very different systems (Sony Vaio and Intel desktop)
exhibit the same problem in two different ways.  These systems worked
fine until we slightly changed the way that pagetable memory was
allocated.  We just don't know how many other systems are doing this
kind of subtle corruption of low memory.

    J
--

From: Alan Cox
Date: Friday, August 29, 2008 - 2:25 am

> >>  - Clears the reserved memory so we can observe changes to it.

You ought to pick different values on different runs 0x00, 0xFF, 0xAA,

So a zillion people should lose a chunk of RAM because of what is
probably an obscure bug in a single version of a piece of SMM code on two
systems total ?

That appears to be totally out of proportion - plus the defaults are
*irrelevant* for mass user coverage. If you want large scale coverage ask
the Fedora, OpenSuSE and Ubuntu people to turn it on for a testing kernel
release and report back.

Alan
--

From: Rafał Miłecki
Date: Friday, August 29, 2008 - 3:13 am

T3V0IG9mIGN1cnJlbnQgZGlzY3Vzc2lvbiBJIHRyaWVkIHVzaW5nIHMycmFtIG9uIHBhdGNoZWQg
a2VybmVsIChJIGRpZApub3QgdHJ5IHMycmFtIGVhcmxpZXIsIG15IHByb2JsZW0gd2FzICh1bilw
bHVnZ2luZyBIRE1JIC0gc29tZSBBQ1BJCmNvZGUgcHJvYmFibHkpLgoKQ29ycnVwdGlvbiBvdXRw
dXQgaXMgcXVpdGUgaHVnZSwgSSBhdHRhY2hlZCBpdCB0byBidWcgcmVwb3J0OgpodHRwOi8vYnVn
emlsbGEua2VybmVsLm9yZy9zaG93X2J1Zy5jZ2k/aWQ9MTEyMzcKaHR0cDovL2J1Z3ppbGxhLmtl
cm5lbC5vcmcvYXR0YWNobWVudC5jZ2k/aWQ9MTc1MjYKCgpJIGRvIG5vdCBrbm93IG11Y2ggYWJv
dXQgY3VycmVudCBkaXNjdXNzaW9uIGFuZCBmYWN0IGlmIGNoZWNraW5nIGZvcgpjb3JydXB0aW9u
IHNob3VsZCBiZSBlbmFibGVkIGJ5IGRlZmF1bHQuCgpIb3dldmVyLCBwbGVhc2Ugbm90ZSB0aGF0
IGVuZC11c2VyIHdpbGwgbm90IGFibGUgdG8gZGlhZ25vc2UgdGhhdApMaW51eCBkb2Vzbid0IHdv
cmsgb24gaGlzIG1hY2hpbmUgYmVjYXVzZSBvZiBsZXNzLW9yLW1vcmUgYnJva2VuIEJJT1MuCkhl
IHdpbGwgbm90IGdldCB0byBrbm93IHRoYXQgaGUgbmVlZHMgdG8gZW5hYmxlCmJpb3NfY29ycnVw
dGlvbl9jaGVjaz0xIGFuZCBlZGl0IGdydWIgdG8gbWFrZSBpdCBkZWZhdWx0LgoKSSB1bmRlcnN0
YW5kIHRoYXQgZGlzYWJsaW5nIGNodW5rIG9mIFJBTSBvbiBldmVyeSwgZXZlbiBub3QgYWZmZWN0
ZWQsCm1hY2hpbmUgaXMgZmFyIGZyb20gcGVyZmVjdCwgYnV0IHdlIGhhdmUgdG8gcmVtZW1iZXIg
YWJvdXQgbm90CmFkdmFuY2VkIGVuZC11c2Vycy4KCi0tIApSYWZhxYIgTWnFgmVja2kK
--

From: Alan Cox
Date: Friday, August 29, 2008 - 3:06 am

Including not advanced end users without the bug ... which is probably
all but one or two people.
--

From: Hugh Dickins
Date: Friday, August 29, 2008 - 3:24 am

Not quite the output we were expecting!  I've not got around to trying
it yet, so beware, but I think Jeremy's patch needs the following on top.
Or you may prefer to wait until one of us reports that it is now working
as intended.

--- a/arch/x86/kernel/setup.c=092008-08-29 11:17:16.000000000 +0100
+++ b/arch/x86/kernel/setup.c=092008-08-29 11:19:24.000000000 +0100
@@ -636,11 +636,12 @@ void check_for_bios_corruption(void)
 =09=09unsigned long *addr =3D __va(scan_areas[i].addr);
 =09=09unsigned long size =3D scan_areas[i].size;
=20
-=09=09for(; size; addr++, size--) {
+=09=09for(; size; addr++, size -=3D sizeof(unsigned long)) {
 =09=09=09if (!*addr)
 =09=09=09=09continue;
 =09=09=09printk(KERN_ERR "Corrupted low memory at %p (%lx phys) =3D %08lx\=
n",
 =09=09=09       addr, __pa(addr), *addr);
+=09=09=09*addr =3D 0;
 =09=09=09corruption =3D 1;
 =09=09}
 =09}
From: Rafał Miłecki
Date: Friday, August 29, 2008 - 4:54 am

I tried your patch anyway (after applying Jeremy's patch of course)
and it doesn't seem to work. The only output is:
scanning 2 areas for BIOS corruption
after using s2ram. I do not get any
Corrupted low memory at*

-- 
Rafał Miłecki
From: Alan Jenkins
Date: Friday, August 29, 2008 - 5:09 am

It seemed to work for me.  Did you remember to plug HDMI to trigger the
corruption before you used s2ram?

[   71.663828] Back to C!
[   71.663828] Corrupted low memory at ffff8800000083e8 (83e8 phys) =
803c85370cfc0000
[   71.663828] Corrupted low memory at ffff8800000083f0 (83f0 phys) =
00003000
[   71.663828] Pid: 7335, comm: pm-suspend Not tainted
2.6.27-rc4-00216-ge1c9c9d-dirty #152
[   71.663828]
[   71.663828] Call Trace:
[   71.663828]  [<ffffffff8020fdd5>] check_for_bios_corruption+0xb5/0xd0
[   71.663828]  [<ffffffff803a3e1a>] pm_dev_dbg+0x2a/0xa0
[   71.663828]  [<ffffffff803a49c2>] dpm_power_up+0x32/0xf0

Alan
--

From: Hugh Dickins
Date: Friday, August 29, 2008 - 6:21 am

I hope that's what got missed.

Here's my version of Jeremy's patch, that I've now tested on my machines,
as x86_32 and as x86_64.  It addresses none of the points Alan Cox made,
and it stays silent for me, even after suspend+resume, unless I actually
introduce corruption myself.  Omits Jeremy's check in fault.c, but does
a check every minute, so should soon detect Rafa=C5=82's HDMI corruption
without any need to suspend+resume.

Hugh

--- 2.6.27-rc5/Documentation/kernel-parameters.txt=092008-08-29 01:02:34.00=
0000000 +0100
+++ linux/Documentation/kernel-parameters.txt=092008-08-29 11:17:16.0000000=
00 +0100
@@ -360,6 +360,11 @@ and is between 256 and 4096 characters.=20
 =09=09=09Format: <io>,<irq>,<mode>
 =09=09=09See header of drivers/net/hamradio/baycom_ser_hdx.c.
=20
+=09bios_corruption_check=3D0/1 [X86]
+=09=09=09Some BIOSes seem to corrupt the first 64k of memory
+=09=09=09when doing things like suspend/resume.  Setting this
+=09=09=09option will scan the memory looking for corruption.
+
 =09boot_delay=3D=09Milliseconds to delay each printk during boot.
 =09=09=09Values larger than 10 seconds (10000) are changed to
 =09=09=09no delay (0).
--- 2.6.27-rc5/arch/x86/Kconfig=092008-08-29 01:02:35.000000000 +0100
+++ linux/arch/x86/Kconfig=092008-08-29 11:17:16.000000000 +0100
@@ -201,6 +201,9 @@ config X86_TRAMPOLINE
 =09depends on X86_SMP || (X86_VOYAGER && SMP) || (64BIT && ACPI_SLEEP)
 =09default y
=20
+config X86_CHECK_BIOS_CORRUPTION
+        def_bool y
+
 config KTIME_SCALAR
 =09def_bool X86_32
 source "init/Kconfig"
--- 2.6.27-rc5/arch/x86/kernel/setup.c=092008-08-29 01:02:35.000000000 +010=
0
+++ linux/arch/x86/kernel/setup.c=092008-08-29 13:50:19.000000000 +0100
@@ -579,6 +579,106 @@ static struct x86_quirks default_x86_qui
 struct x86_quirks *x86_quirks __initdata =3D &default_x86_quirks;
=20
 /*
+ * Some BIOSes seem to corrupt the low 64k of memory during events
+ * like suspend/resume and unplugging an HDMI cable.  Reserve all
+ * remaining free ...
From: Rafał Miłecki
Date: Friday, August 29, 2008 - 9:30 am

Ah, stupid mistake, sorry! Indeed, using HDMI and s2ram later works fine:

Corrupted low memory at ffff88000000be98 (be98 phys) = b02a000400000000
 [<ffffffff8020fc9c>] check_for_bios_corruption+0x94/0xa0

-- 
Rafał Miłecki
From: Rafał Miłecki
Date: Friday, August 29, 2008 - 10:39 am

Your periodic test works fine:

Corrupted low memory at ffff88000000be9c (be9c phys) = b02a0004
 <IRQ>  [<ffffffff8020fc9b>] check_for_bios_corruption+0x93/0x9f
 [<ffffffff8020fca7>] ? periodic_check_for_corruption+0x0/0x25
 [<ffffffff8020fcb0>] periodic_check_for_corruption+0x9/0x25

By the way I confirmed this bug on Sony Vaio FW11M (my one is FW11S).
Probably more machines from FW11* are affected.

-- 
Rafał Miłecki
From: Rafał Miłecki
Date: Thursday, September 4, 2008 - 12:42 pm

If this patch is known to work fine for Sony Vaio FW* and Alan's
machine, could it go mainline somehow?

-- 
Rafał Miłecki
From: Hugh Dickins
Date: Thursday, September 4, 2008 - 1:23 pm

Well.

Thanks for the prod, and I'm certainly remiss for not following
up sooner.  But I'm really not at all keen on such a patch going
into mainline myself.

It's an interesting experiment, and I'd be happy to see such a patch
(adjusted to make sure output goes to kerneloops.org) spending a little
while in Fedora Rawhide (who'd be the right contact for that?).

But so far as mainline goes, I share Alan Cox's opinion that we should
not be chopping pages out of every x86 user's memory, just because a
couple of machines with faulty BIOSes have been observed.

Particularly now it's evident that the 64kB "limit" is no more than a
reflection of where the directmap pagetable changes have caught such
corruption.

If lots more such corruptions are reported, of course I would change
my position; but those bad directmap PMD crashes are themselves quite
recognizable now we know to look out for them.

I would prefer you both to use the minimal memmap=3D solutions for now;
but others may disagree.

Hugh
From: Jeremy Fitzhardinge
Date: Thursday, September 4, 2008 - 4:04 pm

I have my original patch with your changes as a followup patch sitting
in my queue.  I was planning on sending it in the next day or so.  I was

Dave Jones?  He used to do it, at least.  I guess a WARN_ON() would get

I think that's a worthwhile cost for -rc.  We can fix it up (ie, make it
a real config option, defaulting off) for release once we're happy that

We should definitely make the kernel parameter set the banned memory

The fact that we're seeing this problem in two completely different
systems with different BIOSes and everything else makes me worried that
this is quite widespread.  It's only the persistence and diligence of
our bug reporters that we managed to work out that they're the same
problem.  How many other people are getting strange crashes and haven't
managed to correlate it any particular BIOS interaction?  Or just happen
to be corrupting memory we don't care about right now, but is only a
small code change or link order change away from disaster?

    J
--

From: Ingo Molnar
Date: Saturday, September 6, 2008 - 11:09 am

please put this all behind a .config debug option that distros can turn 
on/off. Also, when it's enabled in the .config, there should be another 
.config option that marks it disabled by default but it can be enabled 
via a boot parameter.

Distro debug kernels will most likely enable the .config - even release 
kernels might enable it it, with default off - users can enable the boot 
switch if they suspect something, without having to build a new kernel.

	Ingo
--

From: Jeremy Fitzhardinge
Date: Friday, August 29, 2008 - 7:08 am

From: Jeremy Fitzhardinge
Date: Friday, August 29, 2008 - 7:18 am

Yes, that wouldn't be a bad idea.  The corruption we've seen so far is

Well, we just don't know.  We have two systems which started reporting
problems when I made a small change to how the initial pagetables are
allocated.  Before that they appeared to work fine, though in reality it
was just some other part of the address space being corrupted.  Who
knows how many systems are out there "working fine", except that some
obscure address in the user address space now allows you to write into
kernel memory?

The two systems are *completely different*.  One has a Phoenix BIOS, the
other is AMI.  One is a Sony Vaio, the other is an OEM Intel desktop. 
One corrupts memory on unplugging a HDMI cable, the other corrupts on
suspend/resume.  I don't think this is reassuring, or indicates the

The idea is to leave it on by default for a while - maybe a couple of
-rc releases - and see what turns up.  We can turn it off for a release
kernel if people really object, though I'd hope something like Fedora
Rawhide would leave it on.

    J
--

From: Kasper Sandberg
Date: Friday, August 29, 2008 - 1:31 pm

This is very interresting. I have a gigabyte X48 board, and everything
works perfect, however, in memtest86+ i get thousands of errors in the
range 0-0.9mb. I am certain my ram is fine, i've tried it in other
computers, and reversed. I've run dozens of stress tests within my
booted linux, no trouble, also the userspace memtester (allthough im
aware it wont actually ever get to grab offending addresses).


--

From: Jeremy Fitzhardinge
Date: Friday, August 29, 2008 - 6:15 pm

I would expect you'd get crashes if you had massive corruption in that
area.  By default it's being used for the kernel pagetables, so stomping
on them will cause problems.  In the cases we've been looking at so far,
it's only been a word or two being stomped.

The low 1Mbyte is used as general memory, so corruption over the whole
area will have widespread effects.

But try this patch (Hugh's version) and see what happens.

    J
--

From: Rafał Miłecki
Date: Thursday, August 28, 2008 - 11:20 pm

MjAwOC84LzI4IEplcmVteSBGaXR6aGFyZGluZ2UgPGplcmVteUBnb29wLm9yZz46Cj4gU29tZSBC
SU9TZXMgaGF2ZSBiZWVuIG9ic2VydmVkIHRvIGNvcnJ1cHQgbWVtb3J5IGluIHRoZSBsb3cgNjRr
LiAgVGhpcwo+IHBhdGNoIGRvZXMgdHdvIHRoaW5nczoKPiAgLSBSZXNlcnZlcyBhbGwgbWVtb3J5
IHdoaWNoIGRvZXMgbm90IGhhdmUgdG8gYmUgaW4gdGhhdCBhcmVhLCB0bwo+ICAgcHJldmVudCBp
dCBmcm9tIGJlaW5nIHVzZWQgYXMgZ2VuZXJhbCBtZW1vcnkgYnkgdGhlIGtlcm5lbC4gIFRoaW5n
cwo+ICAgbGlrZSB0aGUgU01QIHRyYW1wb2xpbmUgYXJlIHN0aWxsIGluIHRoZSBtZW1vcnksIGhv
d2V2ZXIuCj4gIC0gQ2xlYXJzIHRoZSByZXNlcnZlZCBtZW1vcnkgc28gd2UgY2FuIG9ic2VydmUg
Y2hhbmdlcyB0byBpdC4KPiAgLSBBZGRzIGEgZnVuY3Rpb24gY2hlY2tfZm9yX2Jpb3NfY29ycnVw
dGlvbigpIHdoaWNoIGNoZWNrcyBhbmQgcmVwb3J0cyBvbgo+ICAgbWVtb3J5IGJlY29taW5nIHVu
ZXhwZWN0ZWRseSBub24temVyby4gIEN1cnJlbnRseSBpdCdzIGNhbGxlZCBpbiB0aGUKPiAgIHg4
NiBmYXVsdCBoYW5kbGVyLCBhbmQgdGhlIHBvd2VybWFuYWdlbWVudCBkZWJ1ZyBvdXRwdXQuCj4K
PiBSRkM6IFdoYXQgb3RoZXIgcGxhY2VzIHNob3VsZCB3ZSBjaGVjayBmb3IgY29ycnVwdGlvbiBp
bj8KPgo+IFsgQWxhbiwgUmFmYcWCOiBjb3VsZCB5b3UgY2hlY2sgeW91IHNlZToKPiAgIDE6IGNv
cnJ1cHRpb24gbWVzc2FnZXMKPiAgIDI6IG5vIGNyYXNoZXMKPiAgVGhhbmtzIC1KCj4gXQoKSSB3
YXMgdHJ5aW5nIG15IGJlc3QgdG8gY3Jhc2ggc3lzdGVtIHdpdGggdGhpcyBwYXRjaCBhcHBsaWVk
IGFuZCBmYWlsZWQgOikKCldvcmtzIGdyZWF0LgoKSnVzdCB3b25kZXIgaWYgSSBzaG91bGQgZXhw
ZWN0IGFueSBwcmludGsgZnJvbQpjaGVja19mb3JfYmlvc19jb3JydXB0aW9uPyBJIGRvIG5vdCBz
ZWUgYW55OgoKemFqZWNAc29ueTp+PiBkbWVzZyB8IGdyZXAgLWkgY29ycgpzY2FubmluZyAyIGFy
ZWFzIGZvciBCSU9TIGNvcnJ1cHRpb24KCi0tIApSYWZhxYIgTWnFgmVja2kK
--

From: Ingo Molnar
Date: Thursday, August 28, 2008 - 11:45 pm

that's _very_ weird.

maybe the BIOS expects _zeroes_ somewhere? Do you suddenly see crashes 
if you change this line in Jeremy's patch:

+               memset(__va(addr), 0, size);

to something like:

+               memset(__va(addr), 0x55, size);

If this does not tickle any messages either, then maybe the problem is 
in the identity of the entities we allocate in the first 64K. Is there a 
list of allocations that go there when Jeremy's patch is not applied?

but ... i think with an earlier patch you saw corruption, right? 
Far-fetched idea: maybe it's some CPU erratum during suspend/resume that 
corrupts pagetables if the pagetables are allocated in the first 64K of 
RAM? In that case we should use a bootmem allocation for pagetables that 
give a minimum address of 64K.

	Ingo
--

From: Jeremy Fitzhardinge
Date: Friday, August 29, 2008 - 12:21 am

No, it's expected.  Rafał only got corruption when plugging his HDMI
cable, and I didn't put any corruption checks on that path (I'm not even
sure what kernel code would get executed in that case).  Hugh's original
patch put a check in the hot path of the fault handler - and so it would
get called regularly - but I put it in the kernel-bug path, which is
fairly pointless given that we expect this patch to prevent the crashes.

It does, however, do the check in the pm state changes, so doing a
suspend should make it print some of the corruption it found.  Alan's
case would be a better test for that though.

It does raise the question of where the good places to put the check
are.  It shouldn't be too hot, given that it's scanning ~64k of memory,
but often enough to actually show something.  I was thinking of putting
some calls in the acpi code itself, but got, erm, discouraged.


Rafał's corruption was definitely non-zero.  I think the corruption is
happening, but it's just not reported.

    J
--

From: Ingo Molnar
Date: Friday, August 29, 2008 - 12:30 am

i thought we did a check after coming back from resume? That would be a 
natural place to do the check. (the corruption is suspend/resume related 
after all, right?)

	Ingo
--

From: Jeremy Fitzhardinge
Date: Friday, August 29, 2008 - 1:02 am

No, in Rafał's case it happens when he unplugs an HDMI cable.

    J
--

From: Jeremy Fitzhardinge
Date: Friday, August 29, 2008 - 12:22 am

I just followed up in detail to Ingo, but I think it's because I didn't
put any checks where you would hit them.  If you suspend/resume it
should print something.

    J
--

From: Hugh Dickins
Date: Friday, August 29, 2008 - 1:14 am

Thanks for taking this on, Jeremy: I had too many doubts to do so.


hpa introduced the 64k idea, and we've all been repeating it;
but I've not heard the reasoning behind it.  Is it a fundamental
addressing limitation within the BIOS memory model?  Or a case
that Windows treats the bottom 64k as scratch, so BIOS testers
won't notice if they corrupt it?

The two instances of corruption we've been studying have indeed
been below 64k (one in page 8 and one in page 11), but that's
because they were both recognizable corruptions of direct map PMDs.

If there is not a very strong justification for that 64k limit,
then I don't think this approach will be very useful, and we should
simply continue to rely on analyzing corruption when it appears, and
recommend memmap=3D as a way of avoiding it once analyzed.  If there

I don't know: the easy answer would be just to do it once every minute.
As the patch stands, we'll only learn more on machines going through
suspend+resume (disk or ram).  If the corruption avoidance works,

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=

It's actually a bottom_of_memory corruption_check: it would pick
up corruption there whether it's caused by the BIOS or not.
The boot parameter description ought to refer to the config
option: ah, the config option is always on for x86, hmm.

If the 64k is in any doubt, maybe the corruption_check boot option
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=

Always on?  For the moment, perhaps.  I'm very pleased to see you've
set it on for x86_32 as well as for x86_64, I'd been wanting to ask
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=


Small point, but since we're doing the same on 32-bit and 64-bit,
I think it would be better to operate on unsigned ints, to get


The purpose of the dump_stack being to insert a recognizable and
irritating spew into the log, prompting people to report these cases:
the stacktrace itself ...
From: Jeremy Fitzhardinge
Date: Friday, August 29, 2008 - 7:48 am

I don't think there's been a strong rationale.  64k is significant
because it's the size of an old 16-bit real-mode segment, and perhaps
the bios code in question is 16-bit code.

But I think it's really a question of limiting the scope of the problem:
yes, the BIOS could corrupt any memory anywhere, but what can we
possibly do about that?  Such a machine is as useless as a system with
intermittent faulty memory hardware.

We have to assume that it is basically possible to use the machine, and
that presumably whatever bug the BIOS has isn't making Windows crash. 
Windows might be avoiding the low 64k for some ancient-history DOS
reason, or perhaps it's just full of boot code which isn't used again?

It would be easy to add another kernel parameter to select how much

Right.  Once a minute sounds good, though it might be a bit too long to


Yes, that sounds good.  Also, it might be worth being able to set the
rate.  Doing it once every 10 seconds wouldn't add too much overhead,
and once a second would be reasonable if you think there's an actual

I couldn't see any reason the problem would be restricted to 64-bit.  It




Well, no, I was thinking about it would give a clue about what went
wrong.  If you see the spew out of the pm code, then it was likely
suspend/resume.  If it's out of the timer function then it won't tell
you much other than "something else".  If we put it elsewhere (even if
temporarily for detecting other suspected corruption symptoms), then it
will be useful for distinguishing those.

An alternative would be to pass a flag to say whether a backtrace would

That would break Xen.  I needed to do the pagetable reuse to make sure

We could add NX.  What's the behaviour of setting NX in a
non-NX-supporting CPU?  I don't think it would trigger a "reserved bit"
exception (the other high pte flags don't).  Or failing that, we could
mask out NX once we've worked out the CPU doesn't support it (at the
same time it relocates the pagetables to the kernel's ...
From: H. Peter Anvin
Date: Friday, August 29, 2008 - 10:20 am

Or just make the boot_corruption_check parameter take a range instead of 
a boolean.

	-hpa
--

From: Hugh Dickins
Date: Monday, September 8, 2008 - 4:35 am

I've no experience of what happens if NX is set to a non-NX-supporting
CPU, so can't advise on that at all.  But I think you're looking at
it the wrong way round - or else I am.

Here's the declaration and comment on level2_ident_pgt in head_64.S:

NEXT_PAGE(level2_ident_pgt)
	/* Since I easily can, map the first 1G.
	 * Don't set NX because code runs from these pages.
	 */
	PMDS(0, __PAGE_KERNEL_LARGE_EXEC, PTRS_PER_PMD)

(The "Since I easily can" comment is there because it used to map
only a subset needed for early kernel startup, not the whole 1G.)

So it's very intentionally leaving NX out there.  I believe the
level2_ident_pgt page appears twice or more in the pagetable layout,
used to map two or more areas of virtual address space - once to
provide the direct map at ffff880000000000 and once to provide
the kernel image virtual mapping at ffffffff80200000.

I think we don't want NX on Linux kernel text ;-?

Before your 2.6.27-rc changes, init_memory_mapping subsequently
replaced the direct map usage by a separately constructed pagetable,
similar to it but with NX set throughout.  After your 2.6.27-rc
changes, level2_ident_pgt is found there already so left untouched -
leaving NX out of that first 1G of the direct map forever (when CPA
splits it up, the smaller pages inherit the lack of NX too).

I noticed when changing /proc/meminfo to show DirectMap in kB,
and tried to fix it, but only gave myself a non-booting system
(I probably shrank my direct map to 0 while implicitly using it).
And at that stage I didn't realize at all that it was a recent
regression arising from your mods.

I'm reluctant to delve in there again at present, and unsure
what the right fix should be: perhaps the code which checks if
an entry is already there, should check if it has the desired
flags?

Hugh
--

From: Jeremy Fitzhardinge
Date: Monday, September 8, 2008 - 10:16 am

Well, if we never want the direct map to be non-executable (which I
think would be OK, since all the code is either core kernel or modules
which are mapped elsewhere), then we can set NX on the level4 for the

Well, the specific reason I made these changes was to make sure that
there was never more than one entry mapping any kernel page, so that you
can update the page permissions on a kernel page with just one update. 
This is pretty much a requirement for Xen, and very convenient at other
times.  Native will use either L2 or L3 mappings for the kernel and
linear space, and Xen uses L1 mappings, so if we break the aliasing at
the L2 level I can still keep the L1s aliased, but it seems simpler to
set NX on the linear mapping's L4.

    J
--

From: Hugh Dickins
Date: Monday, September 8, 2008 - 12:14 pm

That would be much the neatest answer: I hadn't realized
that inheritance (perhaps I'm still living in early-i386 days,
when IIRC there was a bug in inheriting WP from higher levels).

But then I stumbled across static_protections() in pageattr.c
(takes both addr and pfn, latter seems weird), whose BIOS_BEGIN
and BIOS_END seem to echo the ISA_START_ADDR and ISA_END_ADDR
used by is_ISA_range() in ioremap.c.

And peering at the pagetables I've got here for that area of the
direct map in 2.6.26 x86_64, yes, I'm missing NX from 0xc0000 to
0xfffff (presumably nothing tried to ioremap 0xa0000 to 0xbffff).

A simple answer might be to go the way you suggest, but remove
the special casing from pageattr.c and ioremap.c; but I fear
that will slow something down, or introduce further bugs.

Hugh
--

From: Jeremy Fitzhardinge
Date: Monday, September 8, 2008 - 12:45 pm

[Empty message]
From: H. Peter Anvin
Date: Friday, August 29, 2008 - 10:02 am

The 64K number was empirical, of course.  The bottom 64K is somewhat 
special, however, in that it is what you can address in real mode (as 
opposed to big real mode) with your segments parked at zero, so you end 
up with something approaching a flat real mode.  Especially the first 
32K (below 0x7c00) are frequently used by various BIOS items, but I 
believe the corruption observed was at 0x8000, so it's beyond even this 
first barrier.

Obviously, it's extremely hard to predict where BIOS vendors will have 
choosen to scribble, but the observations in this particular case seemed 
to finger this particular area.

	-hpa
--

From: H. Peter Anvin
Date: Friday, August 29, 2008 - 10:03 am

I should point out that I have seen one particular bug quite a few times 
poking around with boot loaders: the BIOS accesses memory at an 
otherwise valid address, but with the segment base set to either zero or 
0x400 instead of whatever it should have been.

	-hpa
--

Previous thread: [PATCH] x86: split e820 reserved entries record to late by Yinghai Lu on Thursday, August 28, 2008 - 12:39 pm. (6 messages)

Next thread: none