Re: Regression in 2.6.23-pre Was: Problems with 2.6.23-rc6 on AMD Geode LX800

Previous thread: [PATCH] Since we have counters in __u64 format we have to print them with %llu macros. by Maxim Uvarov on Wednesday, September 26, 2007 - 7:39 am. (6 messages)

Next thread: [patch/backport] CFS scheduler, -v22, for v2.6.23-rc8, v2.6.22.8, v2.6.21.7, v2.6.20.20 by Ingo Molnar on Wednesday, September 26, 2007 - 4:13 am. (9 messages)
From: Joerg Pommnitz
Date: Wednesday, September 26, 2007 - 3:56 am

Hello all,
this is what git bisect told me about the problem:

jpo@jpo-laptop:~/linux-2.6$ git bisect good
4fd06960f120e02e9abc802a09f9511c400042a5 is first bad commit
commit 4fd06960f120e02e9abc802a09f9511c400042a5
Author: H. Peter Anvin <hpa@zytor.com>
Date:   Wed Jul 11 12:18:56 2007 -0700

    Use the new x86 setup code for i386

    This patch hooks the new x86 setup code into the Makefile machinery.  It
    also adapts boot/tools/build.c to a two-file (as opposed to three-file)
    universe, and simplifies it substantially.

    Signed-off-by: H. Peter Anvin <hpa@zytor.com>

 
--  
Regards
 
       Joerg
 


----- Ursprüngliche Mail ----
Von: Jordan Crouse <jordan.crouse@amd.com>
An: Joerg Pommnitz <pommnitz@yahoo.com>
CC: cebbert@redhat.com; linux-kernel@vger.kernel.org
Gesendet: Dienstag, den 25. September 2007, 17:04:52 Uhr
Betreff: Re: Problems with 2.6.23-rc6 on AMD Geode LX800


You'll have to contact Digital Logic and have them check with the BIOS vendor
to see if the fix was made in that version or not.  I don't have that
particular board, so I can't try out the fixes here.

I'm still trying to track down the particulars of the fix from the BIOS 

Sure - you might as well - just to make sure its the same problem.

Jordan

-- 
Jordan Crouse
Systems Software Development Engineer 
Advanced Micro Devices, Inc.








      __________________________________  
Yahoo! Clever: Sie haben Fragen? Yahoo! Nutzer antworten Ihnen. www.yahoo.de/clever

-

From: H. Peter Anvin
Date: Wednesday, September 26, 2007 - 7:10 am

There is something very fishy.

The only documentation you've given us so far is a screen shot which
contained a message ("BIOS data check successful") which doesn't occur
in the kernel.

The loader string doesn't look all that familiar either; it looks like
an extremely old version of SYSLINUX, but that doesn't contain that
message either.


However, all that tells us is that reserve_bootmem_core() was either
called with a bad address or bdata->node_low_pfn is garbage.  In
particular, without knowing how it got there it's hard to know for sure.

Could you send me the boot messages from a working kernel boot?

	-hpa

-

From: Jordan Crouse
Date: Wednesday, September 26, 2007 - 8:41 am

/me swings a +5 JTAG debugger

Its the latter - max_pfn as read by find_max_pfn() in arch/i386/e820.c
is being set to 9F (640k) in the broken case, this due to the
the e820 map looking something like this:

Address   Size      Type
00000000  0009FC00  1
0009FC00  00000400  2
000E0000  00002000  2

(Yep, thats it - thats the list.  e820.nr_map is indeed 3). 

Long story short, bdata->node_low_pfn gets set to 9F, and When we 
try to allocate the bootmem bitmap (at _pa_symbol(_text), which is 
page 0x100), then the system gets appropriately angry.

As background, I'm using syslinux 3.36 as my loader here - I've used this
exact same version for a very long time, so I don't blame it in the least.
Something is getting confused in the early kernel, and whatever that
something is, a still unknown change in a newer version of the BIOS
fixed it.  The search goes on.

Jordan
-- 
Jordan Crouse
Systems Software Development Engineer 
Advanced Micro Devices, Inc.


-

From: H. Peter Anvin
Date: Wednesday, September 26, 2007 - 9:57 am

OK, we should put printf's in arch/i386/boot/memory.c and see what
actually gets read out from the BIOS.  This could either be a BIOS
problem or a bug in memory.c (or a bug elsewhere in the code that the
change in memory.c triggers, but that seems less likely.)

	-hpa

P.S. Are you guys in the Bay Area by any chance?
-

From: H. Peter Anvin
Date: Wednesday, September 26, 2007 - 12:14 pm

Please try the following debug patch to let us know what is going on.

	-hpa
From: Jordan Crouse
Date: Wednesday, September 26, 2007 - 1:58 pm

Okay, we have clarity.   Here is the output

e820: err 0 id 0x534d4150 next 15476 00000000:0009fc00 1
e820: err 0 id 0x534d4150 next 15496 0009fc00:00000400 2
e820: err 0 id 0x534d4150 next 15516 000e0000:00020000 2
e820: err 0 id 0x0e7b0000 next 11536 00100000:0e6b0000 1

In the last entry,  id is obviously wrong (it should be 'SMAP' or
0x534d4150).  This is the BIOS bug.

Here's the reason why this bothers us now.  In the old assembly code,
if the returned ID wasn't equal to 'SMAP', we jumped straight to the e801
code.  In the new code in memory.c, if id != SMAP, we break out of the
int15 loop, and return boot_params.e820_entries, which in our case is
3.  detect_memory() considers this to be successful, and no attempt to
parse e801 is made.

So thats where the problem is - in the old code with the buggy BIOS, we
punted to reading the e801 information, and that was enough to keep us 
going.   In the new code, we allow a partial table to be used, and we
blow up.

Attached is a patch to fix this - it returns -1 on error, and only sets
boot_params.e820_entries to be non-zero if we have something useful
in it.  This punts the detection to the e801 code, which then is
then successful.

This fixes the problem with the DB800, and so it probably should
with the other Geode platforms affected by this.

Many thanks to hpa for the guiding hand.

Jordan

-- 
Jordan Crouse
Systems Software Development Engineer 
Advanced Micro Devices, Inc.
From: H. Peter Anvin
Date: Wednesday, September 26, 2007 - 2:04 pm

This patch is obviously wrong.  There are a lot of e820 BIOSen out there
that terminate with CF=1, and that is a legitimate termination condition
for e820.  Now, as far as what to do when id != SMAP, it probably is
still the right thing to do; since the BOS vendor couldn't get something
that elementary correct, we shouldn't trust the data.

I'll write up a corrected patch.

	-hpa
-

From: Jordan Crouse
Date: Wednesday, September 26, 2007 - 2:15 pm

Hmm - the old code seems to fail to e801 when CF was set too:

	int     $0x15                           # make the call
	jc      bail820                         # fall to e801 if it fails

	cmpl    $SMAP, %eax                     # check the return is `SMAP'
	jne     bail820                         # fall to e801 if it fails

Thats not to say that the old code was correct, mind you.  Failing on a
bad ID and returning without error on a set CF seems to be good to me.

Jordan

-- 
Jordan Crouse
Systems Software Development Engineer 
Advanced Micro Devices, Inc.


-

From: H. Peter Anvin
Date: Wednesday, September 26, 2007 - 2:20 pm

Testing this patch now:

From: Jordan Crouse
Date: Wednesday, September 26, 2007 - 2:30 pm

Works here with the buggy BIOS.  

Acked-by: Jordan Crouse <jordan.crouse@amd.com>

Thanks.

-- 
Jordan Crouse
Systems Software Development Engineer 
Advanced Micro Devices, Inc.


-

Previous thread: [PATCH] Since we have counters in __u64 format we have to print them with %llu macros. by Maxim Uvarov on Wednesday, September 26, 2007 - 7:39 am. (6 messages)

Next thread: [patch/backport] CFS scheduler, -v22, for v2.6.23-rc8, v2.6.22.8, v2.6.21.7, v2.6.20.20 by Ingo Molnar on Wednesday, September 26, 2007 - 4:13 am. (9 messages)