login
Header Space

 
 

Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

Previous thread: Re: Small System Paging Problem - OOM-killer goes nuts by Ian Kumlien on Monday, November 26, 2007 - 9:47 pm. (1 message)

Next thread: [PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [0/10] introduction by KAMEZAWA Hiroyuki on Monday, November 26, 2007 - 10:55 pm. (20 messages)
To: <kexec@...>
Cc: <ak@...>, <vgoyal@...>, <hbabu@...>, <linux-kernel@...>, <ebiederm@...>, <nhorman@...>
Date: Monday, November 26, 2007 - 9:47 pm

Hey all-
	I've been working on an issue lately involving multi socket x86_64
systems connected via hypertransport bridges.  It appears that some systems,
disable the hypertransport connections during a kdump operation when all but the
crashing processor gets halted in machine_crash_shutdown.  This becomes a
problem when the ioapic attempts to route interrupts to the only remaining
processor.  Even though the active processor is targeted for interrupt
reception, the fact that the hypertransport connections are inactive result in
interrupts not getting delivered.  The effective result is that timer interrupts
are not delivered to the running cpu, and the system hangs on reboot into the
kdump kernel during calibrate_delay.  I've found that I've been able to avoid
this hang, by forcing a transition to the bios defined boot cpu during the
crashing kernel shutdown.  This patch accomplished that.  Tested by myself and
the origional reporter with successful results.

Regards,
Neil

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;


 arch/x86/kernel/crash.c |   46 ++++++++++++++++++++++++++++++++++++++--------
 include/linux/kexec.h   |    3 +++
 init/main.c             |    6 ++++++
 kernel/kexec.c          |    8 ++++++++
 4 files changed, 55 insertions(+), 8 deletions(-)


diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8bb482f..0682e60 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -67,13 +67,36 @@ static int crash_nmi_callback(struct notifier_block *self,
 	}
 #endif
 	crash_save_cpu(regs, cpu);
-	disable_local_APIC();
-	atomic_dec(&amp;waiting_for_crash_ipi);
-	/* Assume hlt works */
-	halt();
-	for (;;)
-		cpu_relax();
-
+	if (smp_processor_id() == kexec_boot_cpu) {
+		/*
+		 * This is the boot cpu.  We need to:
+		 * 1) Wait for the other processors to halt
+		 * 2) clear our nmi interrupt
+		 * 3) launch the new kernel
+		 */
+		unsigned long msecs = 1000;
+		while ((atomic_read(&amp;waiting_for_crash_ipi) &gt; ...
To: Neil Horman <nhorman@...>
Cc: <kexec@...>, <ak@...>, <vgoyal@...>, <hbabu@...>, <linux-kernel@...>
Date: Tuesday, November 27, 2007 - 12:12 am

If you can get to calibrate_delay hypertransport is still routing traffic.
Your diagnosis of the problem is wrong.  Most likely it is just an ioapic
programming error in restoring the system to PIC mode.

I agree that there is a problem.

The reliable fix is to totally skip the PIC interrupt mode and go directly
to apic mode.

To make the code kexec on panic code path reliable we need to remove code
not add it.

Frankly I think switching cpus is one of the least reliable things that
we can do in general.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: Neil Horman <nhorman@...>, <hbabu@...>, <vgoyal@...>, <kexec@...>, <ak@...>, <linux-kernel@...>
Date: Tuesday, November 27, 2007 - 9:13 am

What makes you say this?  I don't see any need for interrupts prior to
I understand the sentiment here, but its not like we're adding additional
functionality with this patch.  We're already sending an IPI to all the
processors to halt them, we're just adding logic here so that we can detect the
boot cpu and use it to jump to the kexec image instead of halting.  I don't
think this is any less reliable that what we have currently.

Regards

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
-
To: Neil Horman <nhorman@...>
Cc: Neil Horman <nhorman@...>, <hbabu@...>, <vgoyal@...>, <kexec@...>, <ak@...>, <linux-kernel@...>
Date: Tuesday, November 27, 2007 - 9:28 am

Yes.  calibrate_delay() is the first place we send interrupts over
hypertransport.  However I/O still works.  Thus hypertransport from
the first cpu is working, and hypertransport itself is working.

This is an interrupt specific problem not some generic hypertransport

And we don't care if they halt.  If they don't get the IPI we timeout.
Making the IPI mandatory is a _singificant_ change.

The only reason that code is on the kexec on panic code path is that

It doesn't make things more reliable, and it adds code to a code path
that already has to much code to be solid reliable (thus your
problem). 

Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always deliver interrupts in apic mode.

If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode let's remove the need for going into
to legacy PIC mode and do what we should be able to do reliably.  The
reward is much higher, as we kill all possibility of restoring PIC
mode wrong because we don't need to bother.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: Neil Horman <nhorman@...>, Neil Horman <nhorman@...>, <hbabu@...>, <vgoyal@...>, <kexec@...>, <linux-kernel@...>
Date: Tuesday, November 27, 2007 - 9:45 am

Probably legacy mode always routes to CPU #0. Makes sense and is
not really a misconfiguration of legacy mode.

But if CPU #0 has interrupts disabled no interrupts get delivered.

So choices are:
- Move to CPU #0
- Do not use legacy mode during shutdown.
- Or do not rely on interrupts after enabling legacy mode
- Or do not disable interrupts on the other CPUs when they're
halted.

First and last option are probably unreliable for the kdump case.
Second or third sound best. 

I suspect the real fix would be to enable IOAPIC mode really
early and never use the timers in legacy mode. Then the kdump
kernel wouldn't care about the legacy mode pointing to the wrong CPU.

IIrc Eric even had  a patch for that a long time ago, but it broke some 
things so it wasn't included. But perhaps it should be revisited.

-Andi

-
To: Andi Kleen <ak@...>
Cc: Neil Horman <nhorman@...>, Neil Horman <nhorman@...>, <hbabu@...>, <vgoyal@...>, <kexec@...>, <linux-kernel@...>
Date: Tuesday, November 27, 2007 - 10:56 am

Possible. So far I have not seen a hardware setup that would force
interrupts to cpu #0 in legacy mode.  But I would not be truly
surprised if it happened that there was hardware that only worked that
    (Do not use legacy mode in the kdump kernel. removing it from shutdown

Exactly.  If we can work out the details that should be a much more reliable

My real problem was the failure case was obscure (a bad interaction
with ACPI on Linus's laptop) and I didn't have the time to track it
down when it showed up.

My patch had two parts.  Some cleanups to enable the code to be enabled
early, and the actually early enable.  I figure if we can get the
cleanups in one major kernel version and then in the next enable
the apic mode before we start getting interrupts we should be in good
shape.

I expect with x86 becoming an embedded platform with multiple cpus we
may start seeing systems that don't actually support legacy PIC mode
for interrupt delivery.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: Andi Kleen <ak@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, <hbabu@...>, <vgoyal@...>
Date: Tuesday, November 27, 2007 - 2:41 pm

The BIOS and kernel writer's guide for Opteron explicitly states that 
the platform will boot on CPU0 kind of by definition. So this seems like 
a fair statement. I can easily see BIOS writers or hardware designers 
interpreting that to mean that they only have to make sure that 
interrupts get to the CPU that the BIOS thinks of as CPU0 when the APIC 
is in legacy mode.

I have a query out to some SuperMicro engineers to find out if it is a 
hardware limitation or if it is APIC misconfiguration. Maybe we can 

I can agree with the fourth option being a very bad one but I really 
haven't seen anything in this discussion which supports the assertion 
that "Move to the CPU that the BIOS originally called CPU#0" is going to 
be unreliable. Admittedly we haven't tried this on every single x86_64 
platform that we have but on the handful that we have tried so far, it 
hasn't been a problem. Why is everybody jumping to the assumption that 


-- 
-ben
-=-
-
To: Ben Woodard <woodard@...>
Cc: Eric W. Biederman <ebiederm@...>, Andi Kleen <ak@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, <hbabu@...>, <vgoyal@...>
Date: Tuesday, November 27, 2007 - 3:42 pm

Ben I tend to agree.  I think re-enabling the APIC early in the boot process
provides a greater degree of reliability in that it more quickly restores the
system to a state where booting on a cpu other than cpu0 will be more likely to
work, but I have to say that overall it seems like booting a secondary kernel on
cpu0, when possible offers the highest degree of reliability.

Perhaps what we need is a 'both solution'.  Re-enabling the apic to full smp
functionality early in the boot process is a good solution for the problems
which we are hypothesizing here, and would be a good thing to do in general, but
it doesn't preclude also attmpting to switch back to cpu0 during a crash.

Perhaps it would be worthwhile to:

1) Investigate the early enablement of the ioapic for x86[_64]
2) implement my prevoiusly proposed patch with the addition of a handshake
element, such that:
	a) when the boot cpu gets the ipi from machine_crash_shutdown it flags
	   the fact that it is going to boot the kexec kernel with a global
	   variable
	b) the crashing cpu loops waiting for either:
		I) a timeout of 1 second
		II) a reduction of the halt count to zero
		III) the setting of the flag mentioned in (a)
	c) the crashing cpu, if it sees that it is not the boot cpu AND
	   that the flag in (III) is set, will halt itself, otherwise it
	   will set the flag and boot the kexec image itself.

With this modification, we can try to relocate to cpu0,  and if we fail, we fall
back to booting on the crashing processor.

I'll work up a patch that implements (2), unless there are strong objections.  I
see no reason why we can't implment this 'both' solution.

Regards
Neil

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
-
To: Neil Horman <nhorman@...>
Cc: Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, November 27, 2007 - 4:00 pm

On Tue, Nov 27, 2007 at 02:42:20PM -0500, Neil Horman wrote:


Hi Neil,

If we implement first solution, we don't have to implement second. Problem
will automatically be solved.

In general adding more code in crash shutdown path is not good. We are
trying to make that code path as small as possible.

OTOH, I think we have not root caused this problem yet. We don't know yet
why interrupts are not coming to non-boot cpu. I think we can go little
deeper to compare the system state in normal boot and kdump boot and see
what has changed. System state would include, LAPIC and IOAPIC entries
etc.

Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
relying on  until and unless there is something board specific.

Thanks
Vivek
-
To: Vivek Goyal <vgoyal@...>
Cc: Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, November 27, 2007 - 6:24 pm

Yes it's probably virtual wire. For real PIC mode we would need really

The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.

-Andi
-
To: Andi Kleen <andi@...>
Cc: Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, November 27, 2007 - 7:24 pm

So if the BIOS originally left the IOAPIC in a state where the timer 
interrupts were only going to CPU0 then by restoring that state we could 
be bringing this problem upon ourselves when we restore that state.

Would anyone have any problems with code that simply verified that the 
state which we are restoring allowed interrupts to get to the processor 
that we are currently crashing on and if not, poked in a reasonable value.

Yes this would add some complexity to the code paths where we were 
crashing but it could prevent the problem that we are seeing. It seems 
like a small fairly safe change rather than a big disruptive change like 


-- 
-ben
-=-
-
To: Ben Woodard <woodard@...>
Cc: Andi Kleen <andi@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Wednesday, November 28, 2007 - 11:36 am

Hi Ben,

Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.

                entry.dest.physical.physical_dest =
                                        GET_APIC_ID(apic_read(APIC_ID));

Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.

I think we need to go deeper and compare the state of system (APICS,
timer etc) during normal boot and kdump boot and see where is the
difference. This is how I solved some of the timer interrupt related
issues in the past.

Thanks
Vivek
-
To: Vivek Goyal <vgoyal@...>
Cc: Ben Woodard <woodard@...>, Andi Kleen <andi@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Wednesday, November 28, 2007 - 12:02 pm

Yes, but according to Bens last debug effort, the APIC printout regarding the
timer setup, indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not
routed through the ioapic.  In those cases, disable_IO_APIC does not take us
through the path you reference above, and does not revert to virtual wire mode.
Instead, it simply disables legacy vector 0, which if I understand this
correctly, simply tells the ioapic to not handle timer interrupts, trusting that
the 8259 in the system will deliver that interrupt where it needs to be.  If the
8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem
that we have, do you?

Regards

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
-
To: Neil Horman <nhorman@...>
Cc: Vivek Goyal <vgoyal@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Wednesday, November 28, 2007 - 1:36 pm

Exactly.

It is still interesting to test to see what happens if you plugin the
normal values into ioapic_i8259 for .pin and .apic (.pin is 0 or 2 and .apic is 0)
and see what happens.

Having a command line parameter that could do that would be a cheap temporary
solution.

But this is the most likely reason why the timer interrupt is not working.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: Neil Horman <nhorman@...>, Vivek Goyal <vgoyal@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Wednesday, November 28, 2007 - 2:16 pm

Ok, thank you for the explination, this all makes a good deal more sense to me
now.  Ben is near the machine, so hopefully we'll hear from him soon with the
results of this test.

Given that, do you think the cpu-switch test that I proposed would be a good
solution now (with the fallback mechanism I described), or would a command line
8259 solution be better?  I tend to think the former would be better since it
would be transparent to the user, but I'd like to have that debate.

Regards

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
-
To: Neil Horman <nhorman@...>
Cc: Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Wednesday, November 28, 2007 - 3:05 pm

Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.

I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is 
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?

Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
 
Thanks
Vivek
-
To: Vivek Goyal <vgoyal@...>
Cc: Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Wednesday, November 28, 2007 - 3:42 pm

However things are implemented completely differently now.  I don't think
the coherent hypertransport domain of AMD processors actually routes
ExtINT interrupts to all cpus but instead one (the default route?) is
picked.

So I think for the kdump case we pretty much need to use an IOAPIC
in virtual wire mode for recent AMD systems.


It's worth a look.

I still think we need to just use apic mode at kernel startup, and
be done with it.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Thursday, November 29, 2007 - 10:16 pm

Neil whipped up a patch to try this and evidently it worked on his test 
boxes but it didn't work very well on our problem tests box. It hung 
after the kernel printed "Ready". i.e. on a normal boot I get:

&lt;snip&gt;
2007-11-29 13:48:29 Loading
vmlinuz-2.6.18-13chaos.ben.test................................
2007-11-29 13:48:29 Loading
initrd-2.6.18-13chaos.ben.test.........................................................
..............................................................................
2007-11-29 13:48:29 Ready.
2007-11-29 13:48:30 Linux version 2.6.18-13chaos.ben.test (ben@wopri) (gcc
version 4.1.2 20070626 (Red Hat 4.1.2-14
)) #10 SMP Thu Nov 29 13:11:49 PST 2007
2007-11-29 13:48:30 Command line: initrd=initrd-2.6.18-13chaos.ben.test
loglevel=8 console=ttyS0,115200n8 crashkernel=128M@16M elevator=deadline 
swiotlb=65536 selinux=0 apic=debug 
BOOT_IMAGE=vmlinuz-2.6.18-13chaos.ben.test BOOTIF=
01-00-30-48-57-91-56

With Neil's patch:
2007-11-29 17:12:55 PXELINUX 2.11 2004-08-16  Copyright (C) 1994-2004 H. 
Peter Anvin
2007-11-29 17:12:55 Boot options [default: 2.6.18-54.el5.bz336371]:
2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-54.el5.bz336371
2007-11-29 17:12:55 linux
2007-11-29 17:12:55 linux-2.6.18-54.el5.bz336371
2007-11-29 17:12:55 linux-2.6.18-52.el5
2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-13chaos.ben.test
2007-11-29 17:12:55 linux-2.6.23-0.214.rc8.git2.fc8
2007-11-29 17:12:55 linux-2.6.18-8.1.14.el5
2007-11-29 17:12:55 linux-2.6.18-7chaos
2007-11-29 17:12:55 boot:
2007-11-29 17:13:02 Loading
vmlinuz-2.6.18-13chaos.ben.test................................
2007-11-29 17:13:02 Loading
initrd-2.6.18-13chaos.ben.test.........................................................
..............................................................................
2007-11-29 17:13:02 Ready.
(END)
That's all she wrote. End of story. Had to reboot to another kernel to 
make get it back.

Neil's patch:

--- linux-2.6.18.noarch/arch/x86_...
To: Ben Woodard <woodard@...>
Cc: Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Thursday, November 29, 2007 - 10:54 pm

Interesting can you please try an early_printk console.


I expect you made it a fair ways and it just didn't show up because you didn't
get as far as the normal serial port setup.

You don't have any output from your linux kernel.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: Ben Woodard <woodard@...>, Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Friday, November 30, 2007 - 4:59 am

there is two mode for mcp55. bios should have one option about virtul
wired to LVT0 of BSP or IOAPIC pin 0.
or the option like hpet route to ioapic pin 2.

for kdump fix, could enable LVT0 of CPU for kdump and disable that for BSP?

ben,
can you send out
lspci -vvxxx -s 00:1.0

YH
-
To: Yinghai Lu <yhlu.kernel@...>
Cc: Eric W. Biederman <ebiederm@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Friday, November 30, 2007 - 10:35 am

That's interesting. So with BIOS options I can force timer 
interrupts to be routed through IOAPIC? That would enable us to get
timer interrupts on any of the cpus in legacy mode. Ben, can you give it

We would not know the crashing cpu in advance hence can't set it. 

Thanks
Vivek
-
To: Eric W. Biederman <ebiederm@...>
Cc: Ben Woodard <woodard@...>, Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Friday, November 30, 2007 - 10:32 am

I've got a system here that now seems to be behaving in a way that is simmilar
to what Ben describes (although I'm not sure its the same problem).
early_printk shows that we're panic-ing inside check_timer, because we fail to
find any way to route the timer interrupt to the cpu.  Specifically, we're
hitting this panic:

panic("IO-APIC + timer doesn't work! Try using the 'noapic' kernel
parameter\n");

This doesn't make much sense to me, as we clearly have managed to get timer
interrupts at this point (since we made it through calibrate_delay)....

Looking at it, I wonder if this isn't a backporting issue.  Ben and I ran this
test on an older kernel (since thats what the production system under test is
based on).  Currently, check_timer is called directly from within setup_IO_APIC,
but in the 2.6.18 kernel that RHEL5 is based on its part of an initcall thats
run from within init near the call site of the origional io apic init code.  I
wonder if this isn't just an 'old kernel' issue, and that I need to move
check_timer to inside setup_IO_APIC.

Ben, is it possible for you to run an upstream kernel on one of these systems so
we can see if the patch works in that case?  In the interim, I'll update my
patch for 2.6.18 so that check_timer gets moved earlier

-
To: Eric W. Biederman <ebiederm@...>
Cc: Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Wednesday, November 28, 2007 - 5:09 pm

http://www.hypertransport.org/docs/tech/HTC20051222-0046-0008-Final-4-21-06.pdf
Table 143 suggest to me that legacy interrupts should be routed to all cpus,
which certainly doesn't seem to be the case in this situation.  Perhaps Nvidia
Certainly, this seems like the best solution long term.

So I'm looking at the implementation for 64 bit system, and it seems a little
cleaner than 32 bit setup.  I'm wondering if we can just call setup_IO_APIC
immediately after init_IRQ in start_kernel?  Could it be that straightforward?


-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
-
To: Neil Horman <nhorman@...>
Cc: Vivek Goyal <vgoyal@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Wednesday, November 28, 2007 - 7:27 pm

Once it hits the coherent hypertransport fabric only AMD has control.
The packet does not have a destination field that could specify which
cpu.  So Nvidia could not have goofed.  It is a magic property of

Pretty much.

The essence of what I did last round was put at the end of
init_IRQ():


if (!disable_apic &amp;&amp; cpu_has_apic) {
	/*
	 * Switch to APIC mode.
	 */
	setup_local_APIC();

	/*
	 * Now start the IO-APICs
	 */
	if (smp_found_config &amp;&amp; !skip_ioapic_setup &amp;&amp; nr_ioapics)
		setup_IO_APIC();
	else
		nr_ioapics = 0;
}

Beyond that it is just dotting i's and crossing t's.

Eric
-
To: Vivek Goyal <vgoyal@...>
Cc: Neil Horman <nhorman@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, November 29, 2007 - 10:12 pm

Here are the ones from a normal bootup.

I was unable to get info from a kdump boot. I haven't figured out why 
yet. With the same patch that I used to capture this, when I tried to 
kdump the kernel, it paused a second or two after the backtrace and then 
dropped to BIOS and came up normally.

Here is a little trick, at the point where we are trying to get the info 
to print out, the kernel command line hasn't been completely parsed yet. 
That tricked me for part of the day. I had apic=debug on the command 
line but the logic in print_local_APIC saw the default value because the 
kernel command line had yet to be parsed.

2007-11-29 17:58:07 ***Here is the info you requested
2007-11-29 17:58:07
2007-11-29 17:58:07 printing local APIC contents on CPU#0/0:
2007-11-29 17:58:07 ... APIC ID:      00000000 (0)
2007-11-29 17:58:07 ... APIC VERSION: 80050010
2007-11-29 17:58:07 ... APIC TASKPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC ARBPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC PROCPRI: 00000000
2007-11-29 17:58:07 ... APIC EOI: 00000000
2007-11-29 17:58:07 ... APIC RRR: 00000002
2007-11-29 17:58:07 ... APIC LDR: 00000000
2007-11-29 17:58:07 ... APIC DFR: ffffffff
2007-11-29 17:58:07 ... APIC SPIV: 0000010f
2007-11-29 17:58:07 ... APIC ISR field:
2007-11-29 17:58:07 ... APIC TMR field:
2007-11-29 17:58:07 ... APIC IRR field:
2007-11-29 17:58:07 ... APIC ESR: 00000000
2007-11-29 17:58:07 ... APIC ICR: 00004630
2007-11-29 17:58:07 ... APIC ICR2: 07000000
2007-11-29 17:58:07 ... APIC LVTT: 00010000
2007-11-29 17:58:07 ... APIC LVTPC: 00010000
2007-11-29 17:58:07 ... APIC LVT0: 00000700
2007-11-29 17:58:07 ... APIC LVT1: 00000400
2007-11-29 17:58:07 ... APIC LVTERR: 0001000f
2007-11-29 17:58:07 ... APIC TMICT: 80000000
2007-11-29 17:58:07 ... APIC TMCCT: 00000000
2007-11-29 17:58:07 ... APIC TDCR: 00000000
2007-11-29 17:58:07
2007-11-29 17:58:07 number of MP IRQ sources: 15.
2007-11-29 17:58:07 number of IO-APIC #8 registers: 0.
2007-11-29 17:58:07 number of IO-APIC #9 ...
To: Ben Woodard <woodard@...>
Cc: Neil Horman <nhorman@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Friday, November 30, 2007 - 10:42 am

Ok so here boot cpu LVT0 has been set to deliver any interrupt on pin
LINT0 as ExtInt. We do the same thing for kdump cpu local apic too.

I am not sure who is the guy in system who encodes the interrupt message
from 8259 to be put on hypertransport and on what basis do they decide
whether it should go to only cpu0 or a broadcast one. Looks like in this
case it is going to cpu0 only. That means we are left with no choice but
to work on patch to initialize IOAPIC early.

Thanks
Vivek
-
To: Vivek Goyal <vgoyal@...>
Cc: Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Friday, November 30, 2007 - 10:51 am

Thats what I'm doing at the moment.  I'm working on a RHEL5 patch at the moment
(since thats whats on the production system thats failing), and will forward
port it once its working

And not to split hairs, but techically thats not our _only_ choice.  We could
force kdump boots on cpu0 as well ;)

Thanks

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
-
To: Neil Horman <nhorman@...>
Cc: Vivek Goyal <vgoyal@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, December 6, 2007 - 5:39 pm

Sorry to have been quiet on this issue for a few days. Interesting news to
report, though.  So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
origionally testing on.  Unfortunately while it worked on 2.6.18 it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay).  Vivek and I were digging into this, when I ran accross the
description of the hypertransport configuration register in the opteron
specification.  It contains a bit that, suprise, configures the ht bus to either
unicast interrupts delivered accross the ht bus to a single cpu, or to broadcast
it to all cpus.  Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scanned for hypertransport controllers, and ensured
that that broadcast bit was set.  Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.

I understand that enabling the apic early in the boot process is a nice general
solution, but given the wide range of apic configurations possible, and the
need for large amounts of testing of such a change, this approach seems like it
might be the better way to go, as it corrects a bad behavior only in systems
that would be affected by this problem.  It introduces no further complexity in
the kdump shutdown path, and creates no additional instability in systems that
would otherwise be unaffected by this bug. 

I think this is the best way for us to go forward.  Attached patch applies
cleanly against 2.6.24-rc3-mm2 and works for me on serveral systems unaffected
by the kdump crash problem I origionally reported and fixes the bug on the
affected system.

Thanks &amp; Regards
Neil

Sig...
To: Neil Horman <nhorman@...>
Cc: Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, December 6, 2007 - 6:11 pm

Hi Neil,

Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?

Thanks
Vivek
--
To: Vivek Goyal <vgoyal@...>
Cc: Neil Horman <nhorman@...>, Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Thursday, December 6, 2007 - 8:33 pm

My feel is that if it is for legacy interrupts only it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.

Eric
--
To: Eric W. Biederman <ebiederm@...>
Cc: Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Friday, December 7, 2007 - 4:50 am

i checked that bit

http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht...

static void enable_apic_ext_id(u8 node)
{
#if ENABLE_APIC_EXT_ID==1
#warning "FIXME Is the right place to enable apic ext id here?"

      u32 val;

        val = pci_read_config32(NODE_HT(node), 0x68);
        val |= (HTTC_APIC_EXT_SPUR | HTTC_APIC_EXT_ID | HTTC_APIC_EXT_BRD_CST);
        pci_write_config32(NODE_HT(node), 0x68, val);
#endif
}

that bit only be should be set when apic id is lifted and cpu apid is
using 8 bits and that mean broadcast is 0xff instead 0x0f.
for example 8 socket dual core system or 4 socket quad core
system,that you should make BSP start from 0x04, so cpus apic id will
be [0x04, 0x13)


So if you want to enable that in early_quirk, you need to
make sure apic id is using 8 bits by check if the bit 16 (HTTC_APIC_ID) is set.

most BIOS already did that. You may ask Supermicro fix their broken
BIOS instead.

YH
--
To: Eric W. Biederman <ebiederm@...>
Cc: Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Friday, December 7, 2007 - 5:22 am

it should be bit 18 (HTTC_APIC_EXT_ID)


YH
--
To: Yinghai Lu <yhlu.kernel@...>
Cc: Eric W. Biederman <ebiederm@...>, Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Friday, December 7, 2007 - 10:21 am

this seems reasonable, I can reroll the patch for this.  As I think about it I'm
also going to update the patch to make this check occur for any pci class 0600
device from vendor AMD, since its possible that more than just nvidia chipsets
can be affected.

I'll repost as soon as I've tested, thanks!
Neil

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
--
To: Neil Horman <nhorman@...>
Cc: Yinghai Lu <yhlu.kernel@...>, Vivek Goyal <vgoyal@...>, Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, Andi Kleen <andi@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>
Date: Friday, December 7, 2007 - 2:36 pm

Thanks.

Neil in your testing please confirm the preconditions for setting 
the Apic Extended Broadcast flag (bit 17) are present.

If that is the case it makes sense to always set that bit on conforming
systems but we will also want to print a message noting that the
BIOS has a bug, and we are working around it.

Thanks,
Eric
--
To: Eric W. Biederman <ebiederm@...>
Cc: Neil Horman <nhorman@...>, Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>, Yinghai Lu <yhlu.kernel@...>, Vivek Goyal <vgoyal@...>
Date: Friday, December 7, 2007 - 2:48 pm

The systems that I have here do _not_ in fact have that precondition, but the
systems from Ben, who originoally reported the problem do have that
I've got two printk's in this patch, one that indicates that Extended APIC ID's
are in use, and a second that indicates that there is a mismatch between the use
of extended APIC ids (bit 18) and the lack of an extended APIC id dest mask for
interrupt packets (bit 17).  Not sure if that meets you're requirements, but I
think its sufficient.  If you disagree, let me know and we can enhance them.

Thanks
Neil

--
To: Neil Horman <nhorman@...>
Cc: Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>, Eric W. Biederman <ebiederm@...>, Vivek Goyal <vgoyal@...>, <nhorman@...>
Date: Friday, December 7, 2007 - 1:58 pm

Ok, New patch attached.  It preforms the same function as previously described,
but is more restricted in its application.  As Yinghai pointed out, the
broadcast mask bit (bit 17 in the htcfg register) should only be enabled, if the
extened apic id bit (bit 18 in the same register) is also set.  So this patch
now check for that bit to be turned on first.  Also, this patch now adds an
independent quirk check for all AMD hypertransport host controllers, since its
possible for this misconfiguration to be present in systems other than nvidias.
The net effect of these changes is, that its now applicable to all AMD systems
containing hypertransport busses, and is only activated if extended apic ids are
in use, meaning that this quirk guarantees that all processors in a system are
elligible to receive interrupts from the ioapic, even if their apicid extends
beyond the nominal 4 bit limitation.  Tested successfully by me.

Thanks &amp; Regards
Neil

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;


 early-quirks.c |   83 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 76 insertions(+), 7 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..d5a7b30 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
 #endif /* CONFIG_X86_IO_APIC */
 #endif /* CONFIG_ACPI */
 
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+	u32 htcfg;
+	/*
+	 *we found a hypertransport bus
+	 *make sure that are broadcasting
+	 *interrupts to all cpus on the ht bus
+	 *if we're using extended apic ids
+	 */
+	htcfg = read_pci_config(num, slot, func, 0x68);
+	if ((htcfg &amp; (1 &lt;&lt; 18)) == 1) {	
+		printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+		if ((htcfg &amp; (1 &lt;&lt; 17)) == 0) {
+			printk(KERN_INFO "Enabling hypertransport e...
To: Neil Horman <nhorman@...>
Cc: Neil Horman <nhorman@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Monday, December 10, 2007 - 9:08 pm

Ok.  This test is broken.  Please remove the == 1.  You are looking

The rest of this quirk looks fine, include the fact it is only intended
to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.


For what is below I don't like the way the infrastructure has been
extended as what you are doing quickly devolves into a big mess.

Please extend struct chipset to be something like:
struct chipset {
	u16 vendor;
	u16 device;
        u32 class, class_mask;
	void (*f)(void);
};

And then the test for matching the chipset can be something like:
	if ((id-&gt;vendor == PCI_ANY_ID || id-&gt;vendor == dev-&gt;vendor) &amp;&amp;
	    (id-&gt;device == PCI_ANY_ID || id-&gt;device == dev-&gt;device) &amp;&amp;
	    !((id-&gt;class ^ dev-&gt;class) &amp; id-&gt;class_mask))

Essentially a subset of pci_match_one_device from drivers/pci/pci.h

That way you don't need to increase the number of tables or the
number of passes through the pci busses, just update the early_qrk
table with a few more bits of information.

The extended form should be much more maintainable in the long
run.  Given that we may want this before we enable the timer
which is very early doing this in the pci early quirks seems
to make sense.

Eric
--
To: Eric W. Biederman <ebiederm@...>
Cc: Neil Horman <nhorman@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Monday, December 10, 2007 - 11:43 pm

New patch attached, with suggestions incorporated.

Thanks &amp; regards
Neil

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;


 early-quirks.c |   82 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 73 insertions(+), 9 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..4b0cee1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
 #endif /* CONFIG_X86_IO_APIC */
 #endif /* CONFIG_ACPI */
 
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+	u32 htcfg;
+	/*
+	 *we found a hypertransport bus
+	 *make sure that are broadcasting
+	 *interrupts to all cpus on the ht bus
+	 *if we're using extended apic ids
+	 */
+	htcfg = read_pci_config(num, slot, func, 0x68);
+	if (htcfg &amp; (1 &lt;&lt; 18)) {	
+		printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+		if ((htcfg &amp; (1 &lt;&lt; 17)) == 0) {
+			printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+			htcfg |= (1 &lt;&lt; 17);
+			write_pci_config(num, slot, func, 0x68, htcfg);
+		}
+	}
+	
+}
+
+static void __init check_hypertransport_config()
+{
+	int num, slot, func;
+	u32 device, vendor;
+	func = 0;
+	for (num = 0; num &lt; 32; num++) {
+		for (slot = 0; slot &lt; 32; slot++) {
+			vendor = read_pci_config(num,slot,func,
+						PCI_VENDOR_ID); 
+			device = read_pci_config(num,slot,func,
+						PCI_DEVICE_ID);
+			vendor &amp;= 0x0000ffff;
+			device &gt;&gt;= 16;
+			if ((vendor == PCI_VENDOR_ID_AMD) &amp;&amp;
+			    (device == PCI_DEVICE_ID_AMD_K8_NB))
+				fix_hypertransport_config(num,slot,func);
+		}
+	}
+
+	return;
+
+}
+
 static void __init nvidia_bugs(void)
 {
 #ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
 #endif
 }
 
+static void __init amd_host_bugs(void)
+...
To: Neil Horman <nhorman@...>
Cc: Neil Horman <nhorman@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Tuesday, December 11, 2007 - 12:48 am

Neil Horman &lt;nhorman@tuxdriver.com&gt; writes:


We should not need check_hypertransport_config as the generic loop

Likewise this function is unneeded and the printk is likely confusing

We don't need to shift device.  Although we can do:
device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID);
device = device_vendor &gt;&gt; 16;
vendor = device_vendor &amp; 0xffff;

--
To: Eric W. Biederman <ebiederm@...>
Cc: Neil Horman <nhorman@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Tuesday, December 11, 2007 - 10:39 am

cool! :)

Copy that. Fixed

I'm not so sure about this.  In my testing, it was clear that I needed to do a
shift on device to make valid comparisons to the defined PCI_DEVICE_* macros.
The origional code had to do the same thing with the class field, which is
simmilarly positioned in the pci config space.


Other than that, new patch attached.  Enables the detection of AMD
hypertransport functions and checks for the proper quirk just as before, and
incoporates your comments above Eric, as well as yours Yinghai.

Thanks &amp; Regards
Neil

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;


 early-quirks.c |   76 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 61 insertions(+), 15 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..e13c999 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,7 +21,7 @@
 #include &lt;asm/gart.h&gt;
 #endif
 
-static void __init via_bugs(void)
+static int __init via_bugs(int  num, int slot, int func)
 {
 #ifdef CONFIG_GART_IOMMU
 	if ((end_pfn &gt; MAX_DMA32_PFN ||  force_iommu) &amp;&amp;
@@ -32,6 +32,7 @@ static void __init via_bugs(void)
 		gart_iommu_aperture_disabled = 1;
 	}
 #endif
+	return 1;
 }
 
 #ifdef CONFIG_ACPI
@@ -44,7 +45,31 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
 #endif /* CONFIG_X86_IO_APIC */
 #endif /* CONFIG_ACPI */
 
-static void __init nvidia_bugs(void)
+static int __init fix_hypertransport_config(int num, int slot, int func)
+{
+	u32 htcfg;
+	/*
+	 *we found a hypertransport bus
+	 *make sure that are broadcasting
+	 *interrupts to all cpus on the ht bus
+	 *if we're using extended apic ids
+	 */
+	htcfg = read_pci_config(num, slot, func, 0x68);
+	if (htcfg &amp; (1 &lt;&lt; 18)) {	
+		printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+		if ((htcfg &amp; (1 &lt;&lt; 17)) == 0) {
+			printk(KERN_INFO "Enabling hy...
To: Neil Horman <nhorman@...>
Cc: Neil Horman <nhorman@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Tuesday, December 11, 2007 - 11:29 am

Ok.  I just looked at read_pci_config.  It doesn't do the right thing for
a non-aligned 32bit access.  (Not that I am convinced there is a right
thing we can do).  Please make this read_pci_config_16 instead
and you won't need the shift.

Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.

The current code doing a shift where none should be needed (because
we ignore the two low order bits in our read) is totally weird

You almost got YH's comment.  You need return 2 for the old functions
so we don't try and apply a per chipset fixup for every device in
the system.

I'm actually inclined to remove the return magic and just do something
like:
	static fix_applied;
	if (fix_applied++)
        	return;
In those functions that should be called only once.


Hmm.  I don't think we want this code positioned in the middle of the
				vendor = read_pci_config_16(num, slot, func,

				device = read_pci_config_16(num, slot, func,
--
To: Eric W. Biederman <ebiederm@...>
Cc: Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, Andi Kleen <ak@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, Yinghai Lu <yhlu.kernel@...>
Date: Tuesday, December 11, 2007 - 2:22 pm

The former seems like a reasonable solution to me.  Corrected in this updated

I like the latter approach better.  It seems less convoluted to me.

New patch attached.

Thanks &amp; Regards
Neil

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;


 early-quirks.c |   90 +++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 69 insertions(+), 21 deletions(-)


diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f307285 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,13 @@
 #include &lt;asm/gart.h&gt;
 #endif
 
-static void __init via_bugs(void)
+static void __init via_bugs(int  num, int slot, int func)
 {
+	static int fix_applied = 0;
+
+	if (fix_applied++)
+		return;
+
 #ifdef CONFIG_GART_IOMMU
 	if ((end_pfn &gt; MAX_DMA32_PFN ||  force_iommu) &amp;&amp;
 	    !gart_iommu_aperture_allowed) {
@@ -44,8 +49,36 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
 #endif /* CONFIG_X86_IO_APIC */
 #endif /* CONFIG_ACPI */
 
-static void __init nvidia_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
 {
+	u32 htcfg;
+	/*
+	 *we found a hypertransport bus
+	 *make sure that are broadcasting
+	 *interrupts to all cpus on the ht bus
+	 *if we're using extended apic ids
+	 */
+	htcfg = read_pci_config(num, slot, func, 0x68);
+	if (htcfg &amp; (1 &lt;&lt; 18)) {	
+		printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+		if ((htcfg &amp; (1 &lt;&lt; 17)) == 0) {
+			printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+			printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+			htcfg |= (1 &lt;&lt; 17);
+			write_pci_config(num, slot, func, 0x68, htcfg);
+		}
+	}
+
+	
+}
+
+static void __init nvidia_bugs(int num, int slot, int func)
+{
+	static int fix_applied = 0;
+
+	if (fix_applied++)
+		return;
+
 #ifde...
To: Neil Horman <nhorman@...>
Cc: Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, Andi Kleen <ak@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, Yinghai Lu <yhlu.kernel@...>
Date: Tuesday, December 11, 2007 - 2:46 pm

Ok. My only remaining nit to pick is that fix_hypertransport_config
is right in the middle of the nvidia quirks, which can be a bit
confusing when reading through the code.  Otherwise I think this
is a version that we can merge.

Let's get a clean description on this thing and send it to the
--
To: Eric W. Biederman <ebiederm@...>
Cc: Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, Andi Kleen <ak@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, Yinghai Lu <yhlu.kernel@...>
Date: Tuesday, December 11, 2007 - 3:24 pm

Clean Summary:

Recently a kdump bug was discovered in which a system would hang inside 
calibrate_delay during the booting of the kdump kernel.  This was caused by the 
fact that the jiffies counter was not being incremented during timer 
calibration.  The root cause of this problem was found to be a bios 
misconfiguration of the hypertransport bus.  On system affected by this hang, 
the bios had assigned APIC ids which used extended apic bits (more than the 
nominal 4 bit ids's), but failed to configure bit 18 of the hypertransport 
transaction config register, which indicated that the mask for the destination 
field of interrupt packets accross the ht bus (see section 3.3.9 of 
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF).  
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will 
not recieve interrupts during the kdump kernel boot, and this hang will be the 
result.  The fix is to add this patch, whcih add an early pci quirk check, to 
forcibly enable this bit in the httcfg register.  This enables all cpus on a 
system to receive interrupts, and allows kdump kernel bootup to procede 

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;


 early-quirks.c |   90 +++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 69 insertions(+), 21 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..c0d0c69 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,36 @@
 #include &lt;asm/gart.h&gt;
 #endif
 
-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
 {
+	u32 htcfg;
+	/*
+	 *we found a hypertransport bus
+	 *make sure that are broadcasting
+	 *interrupts to all cpus on the ht bus
+	 *if we're using extended apic ids
+	 */
+	htcfg = read_pci_config(num, slot, func, 0x68);
+	if (htcfg &amp; (1 &lt;&lt; 18)) {	
+		printk(KERN_INFO "Detect...
To: Neil Horman <nhorman@...>
Cc: Eric W. Biederman <ebiederm@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, Andi Kleen <ak@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>
Date: Tuesday, December 11, 2007 - 3:51 pm

should be bit 17.

YH
--
To: Yinghai Lu <yhlu.kernel@...>
Cc: Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>, Eric W. Biederman <ebiederm@...>, <tglx@...>, <mingo@...>, <hpa@...>
Date: Tuesday, December 11, 2007 - 4:59 pm

Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel.  This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration.  The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus.  On system affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids's), but failed to configure bit 17 of the hypertransport
transaction config register, which indicated that the mask for the destination
field of interrupt packets accross the ht bus (see section 3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not recieve interrupts during the kdump kernel boot, and this hang will be the
result.  The fix is to add this patch, whcih add an early pci quirk check, to
forcibly enable this bit in the httcfg register.  This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to procede
normally.

Regards
Neil


Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;


 early-quirks.c |   90 +++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 69 insertions(+), 21 deletions(-)


diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..c0d0c69 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,36 @@
 #include &lt;asm/gart.h&gt;
 #endif
 
-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
 {
+	u32 htcfg;
+	/*
+	 *we found a hypertransport bus
+	 *make sure that are broadcasting
+	 *interrupts to all cpus on the ht bus
+	 *if we're using extended apic ids
+	 */
+	htcfg = read_pci_config(num, slot, func, 0x68);
+	if (htcfg &amp; (1 &lt;&lt; 18)) {	
+		printk(KERN_INFO "Detected use o...
To: Neil Horman <nhorman@...>
Cc: Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Andi Kleen <andi@...>, Eric W. Biederman <ebiederm@...>, <tglx@...>, <mingo@...>, <hpa@...>
Date: Wednesday, December 12, 2007 - 10:21 am

I'm not convinced the message is correct. e.g. on a system with only a dual core not enabling

This looks like the wrong place to do this. Better add a flag or something
in the structure. Dito others.

Also while not a problem here in general it's bad style to add potential
wrapping bugs like this. Never use ++ for flags.

-Andi
--
To: Andi Kleen <ak@...>
Cc: Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, Eric W. Biederman <ebiederm@...>, <tglx@...>, <mingo@...>, <hpa@...>
Date: Wednesday, December 12, 2007 - 11:55 am

I'm not sure that would be fine.  In the situation you describe, not setting
this bit means the second core won't receive interrupts.  If we crash on that
core and boot the kdump kernel with it, we get exactly the same problem that we
I suppose I can, but I'm not sure what benefit that provides.  Can you
I can fix that up.  I'll hold off though until ben redoes all his testing.  He
mentioned earlier this morning, that some of the results he was getting may have
been caused by a kexec utility bug.  He's re-confirming that this patch solves
the reported problem.  Once he does, I'll repost.

Thanks &amp; Regards

-- 
/***************************************************
 *Neil Horman
 *nhorman@tuxdriver.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
--
To: Neil Horman <nhorman@...>
Cc: Andi Kleen <ak@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, Eric W. Biederman <ebiederm@...>, <tglx@...>, <mingo@...>, <hpa@...>
Date: Wednesday, December 12, 2007 - 12:07 pm

It could enable the extended APIC IDs but not use them? 



The code would be smaller and cleaner.

-Andi

--
To: Andi Kleen <ak@...>
Cc: Neil Horman <nhorman@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, <tglx@...>, <mingo@...>, <hpa@...>
Date: Wednesday, December 12, 2007 - 3:43 pm

In which case complaining is still correct (the BIOS was out of sync),

This has everything to do with how AMD coherent hypertransport works and
little if anything to do with how the NV bridge operated.

Basically the NV bridge seems to be sending a standard hypertransport
x86 legacy interrupt packet (that doesn't have any target information)
and when that packet  hits the coherent hypertransport domain it isn't
being converted into whatever would send it to all cpus. 

.....

The real practical problem is if somehow the BIOS goofs up this way
and it then decides to ask us to boot on one of these cpus with
an extended apic id.  We will hang in calibrate_delay.  So far
this only seems to happen in the kdump case but in theory the BIOS
could be completely crazy.

Eric
--
To: Eric W. Biederman <ebiederm@...>
Cc: Andi Kleen <ak@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, <tglx@...>, <mingo@...>, <hpa@...>
Date: Wednesday, December 12, 2007 - 4:22 pm

I think this just leaves us with deciding on a mechanism for how to do
single-application quirks.  I take Andi's point that adding a flag set to the
quirk data structure is a fine solution, but I'm really ok with static integers
in individual functions.  Do we have consensus on how to handle that?  I'm happy
either way, but I'd rather have agreement on how to handle it before I post
another iteration of this patch.  

Thanks &amp; Regards

-- 
/***************************************************
 *Neil Horman
 *nhorman@tuxdriver.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
--
To: Neil Horman <nhorman@...>
Cc: Andi Kleen <ak@...>, Yinghai Lu <yhlu.kernel@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, <tglx@...>, <mingo@...>, <hpa@...>
Date: Wednesday, December 12, 2007 - 5:32 pm

As long as the solution is simple, small and concise I don't care.

And since what will make Andi happy seems to meet those criteria,
that should be fine.


Eric
--
To: Eric W. Biederman <ebiederm@...>
Cc: Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, Andi Kleen <ak@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, Yinghai Lu <yhlu.kernel@...>, <hpa@...>, <tglx@...>, <mingo@...>
Date: Thursday, December 13, 2007 - 10:39 am

Ok, new patch attached, taking into account Andi's request for a cleaner method
to implement single application quirks.  I've spoken with Ben, who is continuing
to retest, and reports that clean methodical testing results in success with
this patch.


Summary:

Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel.  This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration.  The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus.  On system affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids's), but failed to configure bit 17 of the hypertransport
transaction config register, which indicated that the mask for the destination
field of interrupt packets accross the ht bus (see section 3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not recieve interrupts during the kdump kernel boot, and this hang will be the
result.  The fix is to add this patch, whcih add an early pci quirk check, to
forcibly enable this bit in the httcfg register.  This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to procede
normally.


Regards
Neil

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;


 early-quirks.c |   86 +++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 65 insertions(+), 21 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f4ed3d1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,31 @@
 #include &lt;asm/gart.h&gt;
 #endif
 
-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
 {
+	u32 htcfg;
+	/*
+	 *we fou...
To: Neil Horman <nhorman@...>
Cc: Eric W. Biederman <ebiederm@...>, Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, Andi Kleen <ak@...>, <linux-kernel@...>, <hbabu@...>, Andi Kleen <andi@...>, Yinghai Lu <yhlu.kernel@...>, <hpa@...>, <tglx@...>, <mingo@...>
Date: Thursday, December 13, 2007 - 11:16 am

Sorry for not noticing that earlier, but was there a specific reason this needs
to be an early quirk at all? kexec can only happen after the standard quirks ran.
I think it should be fine as a standard "late" quirk.

-Andi
--
To: Andi Kleen <andi@...>
Cc: Ben Woodard <woodard@...>, Neil Horman <nhorman@...>, <kexec@...>, <linux-kernel@...>, Andi Kleen <ak@...>, <hbabu@...>, Eric W. Biederman <ebiederm@...>, <hpa@...>, <tglx@...>, <mingo@...>, Yinghai Lu <yhlu.kernel@...>
Subject: