Hey all-
I've been working on an issue lately involving multi socket x86_64
systems connected via hypertransport bridges. It appears that some systems,
disable the hypertransport connections during a kdump operation when all but the
crashing processor gets halted in machine_crash_shutdown. This becomes a
problem when the ioapic attempts to route interrupts to the only remaining
processor. Even though the active processor is targeted for interrupt
reception, the fact that the hypertransport connections are inactive result in
interrupts not getting delivered. The effective result is that timer interrupts
are not delivered to the running cpu, and the system hangs on reboot into the
kdump kernel during calibrate_delay. I've found that I've been able to avoid
this hang, by forcing a transition to the bios defined boot cpu during the
crashing kernel shutdown. This patch accomplished that. Tested by myself and
the origional reporter with successful results.
Regards,
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
arch/x86/kernel/crash.c | 46 ++++++++++++++++++++++++++++++++++++++--------
include/linux/kexec.h | 3 +++
init/main.c | 6 ++++++
kernel/kexec.c | 8 ++++++++
4 files changed, 55 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8bb482f..0682e60 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -67,13 +67,36 @@ static int crash_nmi_callback(struct notifier_block *self,
}
#endif
crash_save_cpu(regs, cpu);
- disable_local_APIC();
- atomic_dec(&waiting_for_crash_ipi);
- /* Assume hlt works */
- halt();
- for (;;)
- cpu_relax();
-
+ if (smp_processor_id() == kexec_boot_cpu) {
+ /*
+ * This is the boot cpu. We need to:
+ * 1) Wait for the other processors to halt
+ * 2) clear our nmi interrupt
+ * 3) launch the new kernel
+ */
+ unsigned long msecs = 1000;
+ while ((atomic_read(&waiting_for_crash_ipi) > ...If you can get to calibrate_delay hypertransport is still routing traffic. Your diagnosis of the problem is wrong. Most likely it is just an ioapic programming error in restoring the system to PIC mode. I agree that there is a problem. The reliable fix is to totally skip the PIC interrupt mode and go directly to apic mode. To make the code kexec on panic code path reliable we need to remove code not add it. Frankly I think switching cpus is one of the least reliable things that we can do in general. Eric -
What makes you say this? I don't see any need for interrupts prior to I understand the sentiment here, but its not like we're adding additional functionality with this patch. We're already sending an IPI to all the processors to halt them, we're just adding logic here so that we can detect the boot cpu and use it to jump to the kexec image instead of halting. I don't think this is any less reliable that what we have currently. Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Yes. calibrate_delay() is the first place we send interrupts over hypertransport. However I/O still works. Thus hypertransport from the first cpu is working, and hypertransport itself is working. This is an interrupt specific problem not some generic hypertransport And we don't care if they halt. If they don't get the IPI we timeout. Making the IPI mandatory is a _singificant_ change. The only reason that code is on the kexec on panic code path is that It doesn't make things more reliable, and it adds code to a code path that already has to much code to be solid reliable (thus your problem). Putting the system back in PIC legacy mode on the kexec on panic path was supposed to be a short term hack until we could remove the need by always deliver interrupts in apic mode. If you can't root cause your problem and figure out how the apics are misconfigured for legacy mode let's remove the need for going into to legacy PIC mode and do what we should be able to do reliably. The reward is much higher, as we kill all possibility of restoring PIC mode wrong because we don't need to bother. Eric -
Probably legacy mode always routes to CPU #0. Makes sense and is not really a misconfiguration of legacy mode. But if CPU #0 has interrupts disabled no interrupts get delivered. So choices are: - Move to CPU #0 - Do not use legacy mode during shutdown. - Or do not rely on interrupts after enabling legacy mode - Or do not disable interrupts on the other CPUs when they're halted. First and last option are probably unreliable for the kdump case. Second or third sound best. I suspect the real fix would be to enable IOAPIC mode really early and never use the timers in legacy mode. Then the kdump kernel wouldn't care about the legacy mode pointing to the wrong CPU. IIrc Eric even had a patch for that a long time ago, but it broke some things so it wasn't included. But perhaps it should be revisited. -Andi -
Possible. So far I have not seen a hardware setup that would force
interrupts to cpu #0 in legacy mode. But I would not be truly
surprised if it happened that there was hardware that only worked that
(Do not use legacy mode in the kdump kernel. removing it from shutdown
Exactly. If we can work out the details that should be a much more reliable
My real problem was the failure case was obscure (a bad interaction
with ACPI on Linus's laptop) and I didn't have the time to track it
down when it showed up.
My patch had two parts. Some cleanups to enable the code to be enabled
early, and the actually early enable. I figure if we can get the
cleanups in one major kernel version and then in the next enable
the apic mode before we start getting interrupts we should be in good
shape.
I expect with x86 becoming an embedded platform with multiple cpus we
may start seeing systems that don't actually support legacy PIC mode
for interrupt delivery.
Eric
-The BIOS and kernel writer's guide for Opteron explicitly states that the platform will boot on CPU0 kind of by definition. So this seems like a fair statement. I can easily see BIOS writers or hardware designers interpreting that to mean that they only have to make sure that interrupts get to the CPU that the BIOS thinks of as CPU0 when the APIC is in legacy mode. I have a query out to some SuperMicro engineers to find out if it is a hardware limitation or if it is APIC misconfiguration. Maybe we can I can agree with the fourth option being a very bad one but I really haven't seen anything in this discussion which supports the assertion that "Move to the CPU that the BIOS originally called CPU#0" is going to be unreliable. Admittedly we haven't tried this on every single x86_64 platform that we have but on the handful that we have tried so far, it hasn't been a problem. Why is everybody jumping to the assumption that -- -ben -=- -
Ben I tend to agree. I think re-enabling the APIC early in the boot process provides a greater degree of reliability in that it more quickly restores the system to a state where booting on a cpu other than cpu0 will be more likely to work, but I have to say that overall it seems like booting a secondary kernel on cpu0, when possible offers the highest degree of reliability. Perhaps what we need is a 'both solution'. Re-enabling the apic to full smp functionality early in the boot process is a good solution for the problems which we are hypothesizing here, and would be a good thing to do in general, but it doesn't preclude also attmpting to switch back to cpu0 during a crash. Perhaps it would be worthwhile to: 1) Investigate the early enablement of the ioapic for x86[_64] 2) implement my prevoiusly proposed patch with the addition of a handshake element, such that: a) when the boot cpu gets the ipi from machine_crash_shutdown it flags the fact that it is going to boot the kexec kernel with a global variable b) the crashing cpu loops waiting for either: I) a timeout of 1 second II) a reduction of the halt count to zero III) the setting of the flag mentioned in (a) c) the crashing cpu, if it sees that it is not the boot cpu AND that the flag in (III) is set, will halt itself, otherwise it will set the flag and boot the kexec image itself. With this modification, we can try to relocate to cpu0, and if we fail, we fall back to booting on the crashing processor. I'll work up a patch that implements (2), unless there are strong objections. I see no reason why we can't implment this 'both' solution. Regards Neil -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
On Tue, Nov 27, 2007 at 02:42:20PM -0500, Neil Horman wrote: Hi Neil, If we implement first solution, we don't have to implement second. Problem will automatically be solved. In general adding more code in crash shutdown path is not good. We are trying to make that code path as small as possible. OTOH, I think we have not root caused this problem yet. We don't know yet why interrupts are not coming to non-boot cpu. I think we can go little deeper to compare the system state in normal boot and kdump boot and see what has changed. System state would include, LAPIC and IOAPIC entries etc. Are we putting the system back in PIC mode or virtual wire mode? I have not seen systems which support PIC mode. All latest systems seems to be having virtual wire mode. I think in case of PIC mode, interrupts can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC to deliver interrupt to any of the cpus and that's what we have been relying on until and unless there is something board specific. Thanks Vivek -
Yes it's probably virtual wire. For real PIC mode we would need really The code doesn't try to program anything specific, it just restores the state that was left over originally by the BIOS. -Andi -
So if the BIOS originally left the IOAPIC in a state where the timer interrupts were only going to CPU0 then by restoring that state we could be bringing this problem upon ourselves when we restore that state. Would anyone have any problems with code that simply verified that the state which we are restoring allowed interrupts to get to the processor that we are currently crashing on and if not, poked in a reasonable value. Yes this would add some complexity to the code paths where we were crashing but it could prevent the problem that we are seeing. It seems like a small fairly safe change rather than a big disruptive change like -- -ben -=- -
Hi Ben,
Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.
entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));
Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.
I think we need to go deeper and compare the state of system (APICS,
timer etc) during normal boot and kdump boot and see where is the
difference. This is how I solved some of the timer interrupt related
issues in the past.
Thanks
Vivek
-Yes, but according to Bens last debug effort, the APIC printout regarding the timer setup, indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not routed through the ioapic. In those cases, disable_IO_APIC does not take us through the path you reference above, and does not revert to virtual wire mode. Instead, it simply disables legacy vector 0, which if I understand this correctly, simply tells the ioapic to not handle timer interrupts, trusting that the 8259 in the system will deliver that interrupt where it needs to be. If the 8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem that we have, do you? Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Exactly. It is still interesting to test to see what happens if you plugin the normal values into ioapic_i8259 for .pin and .apic (.pin is 0 or 2 and .apic is 0) and see what happens. Having a command line parameter that could do that would be a cheap temporary solution. But this is the most likely reason why the timer interrupt is not working. Eric -
Ok, thank you for the explination, this all makes a good deal more sense to me now. Ben is near the machine, so hopefully we'll hear from him soon with the results of this test. Given that, do you think the cpu-switch test that I proposed would be a good solution now (with the fallback mechanism I described), or would a command line 8259 solution be better? I tend to think the former would be better since it would be transparent to the user, but I'd like to have that debate. Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Ok. Got it. So in this case we route the interrupts directly through LAPIC and put LVT0 in ExtInt mode and IOAPIC is bypassed. I am looking at Intel Multiprocessor specification v1.4 and as per figure 3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is connected to LINTIN0 pin on all processors. If that is the case, even in this mode, all the CPU should see the timer interrupts (which is coming from 8259)? Can you print the LAPIC registers (print_local_APIC) during normal boot and during kdump boot and paste here? Thanks Vivek -
However things are implemented completely differently now. I don't think the coherent hypertransport domain of AMD processors actually routes ExtINT interrupts to all cpus but instead one (the default route?) is picked. So I think for the kdump case we pretty much need to use an IOAPIC in virtual wire mode for recent AMD systems. It's worth a look. I still think we need to just use apic mode at kernel startup, and be done with it. Eric -
Neil whipped up a patch to try this and evidently it worked on his test boxes but it didn't work very well on our problem tests box. It hung after the kernel printed "Ready". i.e. on a normal boot I get: <snip> 2007-11-29 13:48:29 Loading vmlinuz-2.6.18-13chaos.ben.test................................ 2007-11-29 13:48:29 Loading initrd-2.6.18-13chaos.ben.test......................................................... .............................................................................. 2007-11-29 13:48:29 Ready. 2007-11-29 13:48:30 Linux version 2.6.18-13chaos.ben.test (ben@wopri) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14 )) #10 SMP Thu Nov 29 13:11:49 PST 2007 2007-11-29 13:48:30 Command line: initrd=initrd-2.6.18-13chaos.ben.test loglevel=8 console=ttyS0,115200n8 crashkernel=128M@16M elevator=deadline swiotlb=65536 selinux=0 apic=debug BOOT_IMAGE=vmlinuz-2.6.18-13chaos.ben.test BOOTIF= 01-00-30-48-57-91-56 With Neil's patch: 2007-11-29 17:12:55 PXELINUX 2.11 2004-08-16 Copyright (C) 1994-2004 H. Peter Anvin 2007-11-29 17:12:55 Boot options [default: 2.6.18-54.el5.bz336371]: 2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-54.el5.bz336371 2007-11-29 17:12:55 linux 2007-11-29 17:12:55 linux-2.6.18-54.el5.bz336371 2007-11-29 17:12:55 linux-2.6.18-52.el5 2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-13chaos.ben.test 2007-11-29 17:12:55 linux-2.6.23-0.214.rc8.git2.fc8 2007-11-29 17:12:55 linux-2.6.18-8.1.14.el5 2007-11-29 17:12:55 linux-2.6.18-7chaos 2007-11-29 17:12:55 boot: 2007-11-29 17:13:02 Loading vmlinuz-2.6.18-13chaos.ben.test................................ 2007-11-29 17:13:02 Loading initrd-2.6.18-13chaos.ben.test......................................................... .............................................................................. 2007-11-29 17:13:02 Ready. (END) That's all she wrote. End of story. Had to reboot to another kernel to make get it back. Neil's patch: --- linux-2.6.18.noarch/arch/x86_...
Interesting can you please try an early_printk console. I expect you made it a fair ways and it just didn't show up because you didn't get as far as the normal serial port setup. You don't have any output from your linux kernel. Eric -
there is two mode for mcp55. bios should have one option about virtul wired to LVT0 of BSP or IOAPIC pin 0. or the option like hpet route to ioapic pin 2. for kdump fix, could enable LVT0 of CPU for kdump and disable that for BSP? ben, can you send out lspci -vvxxx -s 00:1.0 YH -
That's interesting. So with BIOS options I can force timer interrupts to be routed through IOAPIC? That would enable us to get timer interrupts on any of the cpus in legacy mode. Ben, can you give it We would not know the crashing cpu in advance hence can't set it. Thanks Vivek -
I've got a system here that now seems to be behaving in a way that is simmilar
to what Ben describes (although I'm not sure its the same problem).
early_printk shows that we're panic-ing inside check_timer, because we fail to
find any way to route the timer interrupt to the cpu. Specifically, we're
hitting this panic:
panic("IO-APIC + timer doesn't work! Try using the 'noapic' kernel
parameter\n");
This doesn't make much sense to me, as we clearly have managed to get timer
interrupts at this point (since we made it through calibrate_delay)....
Looking at it, I wonder if this isn't a backporting issue. Ben and I ran this
test on an older kernel (since thats what the production system under test is
based on). Currently, check_timer is called directly from within setup_IO_APIC,
but in the 2.6.18 kernel that RHEL5 is based on its part of an initcall thats
run from within init near the call site of the origional io apic init code. I
wonder if this isn't just an 'old kernel' issue, and that I need to move
check_timer to inside setup_IO_APIC.
Ben, is it possible for you to run an upstream kernel on one of these systems so
we can see if the patch works in that case? In the interim, I'll update my
patch for 2.6.18 so that check_timer gets moved earlier
-http://www.hypertransport.org/docs/tech/HTC20051222-0046-0008-Final-4-21-06.pdf Table 143 suggest to me that legacy interrupts should be routed to all cpus, which certainly doesn't seem to be the case in this situation. Perhaps Nvidia Certainly, this seems like the best solution long term. So I'm looking at the implementation for 64 bit system, and it seems a little cleaner than 32 bit setup. I'm wondering if we can just call setup_IO_APIC immediately after init_IRQ in start_kernel? Could it be that straightforward? -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Once it hits the coherent hypertransport fabric only AMD has control.
The packet does not have a destination field that could specify which
cpu. So Nvidia could not have goofed. It is a magic property of
Pretty much.
The essence of what I did last round was put at the end of
init_IRQ():
if (!disable_apic && cpu_has_apic) {
/*
* Switch to APIC mode.
*/
setup_local_APIC();
/*
* Now start the IO-APICs
*/
if (smp_found_config && !skip_ioapic_setup && nr_ioapics)
setup_IO_APIC();
else
nr_ioapics = 0;
}
Beyond that it is just dotting i's and crossing t's.
Eric
-Here are the ones from a normal bootup. I was unable to get info from a kdump boot. I haven't figured out why yet. With the same patch that I used to capture this, when I tried to kdump the kernel, it paused a second or two after the backtrace and then dropped to BIOS and came up normally. Here is a little trick, at the point where we are trying to get the info to print out, the kernel command line hasn't been completely parsed yet. That tricked me for part of the day. I had apic=debug on the command line but the logic in print_local_APIC saw the default value because the kernel command line had yet to be parsed. 2007-11-29 17:58:07 ***Here is the info you requested 2007-11-29 17:58:07 2007-11-29 17:58:07 printing local APIC contents on CPU#0/0: 2007-11-29 17:58:07 ... APIC ID: 00000000 (0) 2007-11-29 17:58:07 ... APIC VERSION: 80050010 2007-11-29 17:58:07 ... APIC TASKPRI: 00000000 (00) 2007-11-29 17:58:07 ... APIC ARBPRI: 00000000 (00) 2007-11-29 17:58:07 ... APIC PROCPRI: 00000000 2007-11-29 17:58:07 ... APIC EOI: 00000000 2007-11-29 17:58:07 ... APIC RRR: 00000002 2007-11-29 17:58:07 ... APIC LDR: 00000000 2007-11-29 17:58:07 ... APIC DFR: ffffffff 2007-11-29 17:58:07 ... APIC SPIV: 0000010f 2007-11-29 17:58:07 ... APIC ISR field: 2007-11-29 17:58:07 ... APIC TMR field: 2007-11-29 17:58:07 ... APIC IRR field: 2007-11-29 17:58:07 ... APIC ESR: 00000000 2007-11-29 17:58:07 ... APIC ICR: 00004630 2007-11-29 17:58:07 ... APIC ICR2: 07000000 2007-11-29 17:58:07 ... APIC LVTT: 00010000 2007-11-29 17:58:07 ... APIC LVTPC: 00010000 2007-11-29 17:58:07 ... APIC LVT0: 00000700 2007-11-29 17:58:07 ... APIC LVT1: 00000400 2007-11-29 17:58:07 ... APIC LVTERR: 0001000f 2007-11-29 17:58:07 ... APIC TMICT: 80000000 2007-11-29 17:58:07 ... APIC TMCCT: 00000000 2007-11-29 17:58:07 ... APIC TDCR: 00000000 2007-11-29 17:58:07 2007-11-29 17:58:07 number of MP IRQ sources: 15. 2007-11-29 17:58:07 number of IO-APIC #8 registers: 0. 2007-11-29 17:58:07 number of IO-APIC #9 ...
Ok so here boot cpu LVT0 has been set to deliver any interrupt on pin LINT0 as ExtInt. We do the same thing for kdump cpu local apic too. I am not sure who is the guy in system who encodes the interrupt message from 8259 to be put on hypertransport and on what basis do they decide whether it should go to only cpu0 or a broadcast one. Looks like in this case it is going to cpu0 only. That means we are left with no choice but to work on patch to initialize IOAPIC early. Thanks Vivek -
Thats what I'm doing at the moment. I'm working on a RHEL5 patch at the moment (since thats whats on the production system thats failing), and will forward port it once its working And not to split hairs, but techically thats not our _only_ choice. We could force kdump boots on cpu0 as well ;) Thanks -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Sorry to have been quiet on this issue for a few days. Interesting news to report, though. So I was working on a patch to do early apic enabling on x86_64, and had something working for the old 2.6.18 kernel that we were origionally testing on. Unfortunately while it worked on 2.6.18 it failed miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the timer interrupt wasn't getting received (even though we could successfully run calibrate_delay). Vivek and I were digging into this, when I ran accross the description of the hypertransport configuration register in the opteron specification. It contains a bit that, suprise, configures the ht bus to either unicast interrupts delivered accross the ht bus to a single cpu, or to broadcast it to all cpus. Since it seemed more likely that the 8259 in the nvidia southbridge was transporting legacy mode interrupts over the ht bus than directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk for nvidia chipsets, which scanned for hypertransport controllers, and ensured that that broadcast bit was set. Test results indicate that this solves the problem, and kdump kernels boot just fine on the affected system. I understand that enabling the apic early in the boot process is a nice general solution, but given the wide range of apic configurations possible, and the need for large amounts of testing of such a change, this approach seems like it might be the better way to go, as it corrects a bad behavior only in systems that would be affected by this problem. It introduces no further complexity in the kdump shutdown path, and creates no additional instability in systems that would otherwise be unaffected by this bug. I think this is the best way for us to go forward. Attached patch applies cleanly against 2.6.24-rc3-mm2 and works for me on serveral systems unaffected by the kdump crash problem I origionally reported and fixes the bug on the affected system. Thanks & Regards Neil Sig...
Hi Neil, Should we disable this broadcasting feature once we are through? Otherwise in normal systems it might mean extra traffic on hypertransport. There is no need for every interrupt to be broadcasted in normal systems? Thanks Vivek --
My feel is that if it is for legacy interrupts only it should not be a problem. Let's investigate and see if we can unconditionally enable this quirk for all opteron systems. Eric --
i checked that bit http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht... static void enable_apic_ext_id(u8 node) { #if ENABLE_APIC_EXT_ID==1 #warning "FIXME Is the right place to enable apic ext id here?" u32 val; val = pci_read_config32(NODE_HT(node), 0x68); val |= (HTTC_APIC_EXT_SPUR | HTTC_APIC_EXT_ID | HTTC_APIC_EXT_BRD_CST); pci_write_config32(NODE_HT(node), 0x68, val); #endif } that bit only be should be set when apic id is lifted and cpu apid is using 8 bits and that mean broadcast is 0xff instead 0x0f. for example 8 socket dual core system or 4 socket quad core system,that you should make BSP start from 0x04, so cpus apic id will be [0x04, 0x13) So if you want to enable that in early_quirk, you need to make sure apic id is using 8 bits by check if the bit 16 (HTTC_APIC_ID) is set. most BIOS already did that. You may ask Supermicro fix their broken BIOS instead. YH --
it should be bit 18 (HTTC_APIC_EXT_ID) YH --
this seems reasonable, I can reroll the patch for this. As I think about it I'm also going to update the patch to make this check occur for any pci class 0600 device from vendor AMD, since its possible that more than just nvidia chipsets can be affected. I'll repost as soon as I've tested, thanks! Neil -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
Thanks. Neil in your testing please confirm the preconditions for setting the Apic Extended Broadcast flag (bit 17) are present. If that is the case it makes sense to always set that bit on conforming systems but we will also want to print a message noting that the BIOS has a bug, and we are working around it. Thanks, Eric --
The systems that I have here do _not_ in fact have that precondition, but the systems from Ben, who originoally reported the problem do have that I've got two printk's in this patch, one that indicates that Extended APIC ID's are in use, and a second that indicates that there is a mismatch between the use of extended APIC ids (bit 18) and the lack of an extended APIC id dest mask for interrupt packets (bit 17). Not sure if that meets you're requirements, but I think its sufficient. If you disagree, let me know and we can enhance them. Thanks Neil --
Ok, New patch attached. It preforms the same function as previously described,
but is more restricted in its application. As Yinghai pointed out, the
broadcast mask bit (bit 17 in the htcfg register) should only be enabled, if the
extened apic id bit (bit 18 in the same register) is also set. So this patch
now check for that bit to be turned on first. Also, this patch now adds an
independent quirk check for all AMD hypertransport host controllers, since its
possible for this misconfiguration to be present in systems other than nvidias.
The net effect of these changes is, that its now applicable to all AMD systems
containing hypertransport busses, and is only activated if extended apic ids are
in use, meaning that this quirk guarantees that all processors in a system are
elligible to receive interrupts from the ioapic, even if their apicid extends
beyond the nominal 4 bit limitation. Tested successfully by me.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 76 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..d5a7b30 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if ((htcfg & (1 << 18)) == 1) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport e...Ok. This test is broken. Please remove the == 1. You are looking
The rest of this quirk looks fine, include the fact it is only intended
to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.
For what is below I don't like the way the infrastructure has been
extended as what you are doing quickly devolves into a big mess.
Please extend struct chipset to be something like:
struct chipset {
u16 vendor;
u16 device;
u32 class, class_mask;
void (*f)(void);
};
And then the test for matching the chipset can be something like:
if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
(id->device == PCI_ANY_ID || id->device == dev->device) &&
!((id->class ^ dev->class) & id->class_mask))
Essentially a subset of pci_match_one_device from drivers/pci/pci.h
That way you don't need to increase the number of tables or the
number of passes through the pci busses, just update the early_qrk
table with a few more bits of information.
The extended form should be much more maintainable in the long
run. Given that we may want this before we enable the timer
which is very early doing this in the pci early quirks seems
to make sense.
Eric
--New patch attached, with suggestions incorporated.
Thanks & regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 73 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..4b0cee1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
+
+static void __init check_hypertransport_config()
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+...Neil Horman <nhorman@tuxdriver.com> writes: We should not need check_hypertransport_config as the generic loop Likewise this function is unneeded and the printk is likely confusing We don't need to shift device. Although we can do: device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID); device = device_vendor >> 16; vendor = device_vendor & 0xffff; --
cool! :)
Copy that. Fixed
I'm not so sure about this. In my testing, it was clear that I needed to do a
shift on device to make valid comparisons to the defined PCI_DEVICE_* macros.
The origional code had to do the same thing with the class field, which is
simmilarly positioned in the pci config space.
Other than that, new patch attached. Enables the detection of AMD
hypertransport functions and checks for the proper quirk just as before, and
incoporates your comments above Eric, as well as yours Yinghai.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 76 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 61 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..e13c999 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,7 +21,7 @@
#include <asm/gart.h>
#endif
-static void __init via_bugs(void)
+static int __init via_bugs(int num, int slot, int func)
{
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
@@ -32,6 +32,7 @@ static void __init via_bugs(void)
gart_iommu_aperture_disabled = 1;
}
#endif
+ return 1;
}
#ifdef CONFIG_ACPI
@@ -44,7 +45,31 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static int __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hy...Ok. I just looked at read_pci_config. It doesn't do the right thing for
a non-aligned 32bit access. (Not that I am convinced there is a right
thing we can do). Please make this read_pci_config_16 instead
and you won't need the shift.
Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.
The current code doing a shift where none should be needed (because
we ignore the two low order bits in our read) is totally weird
You almost got YH's comment. You need return 2 for the old functions
so we don't try and apply a per chipset fixup for every device in
the system.
I'm actually inclined to remove the return magic and just do something
like:
static fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.
Hmm. I don't think we want this code positioned in the middle of the
vendor = read_pci_config_16(num, slot, func,
device = read_pci_config_16(num, slot, func,
--The former seems like a reasonable solution to me. Corrected in this updated
I like the latter approach better. It seems less convoluted to me.
New patch attached.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f307285 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,13 @@
#include <asm/gart.h>
#endif
-static void __init via_bugs(void)
+static void __init via_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
@@ -44,8 +49,36 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+
+}
+
+static void __init nvidia_bugs(int num, int slot, int func)
+{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifde...Ok. My only remaining nit to pick is that fix_hypertransport_config is right in the middle of the nvidia quirks, which can be a bit confusing when reading through the code. Otherwise I think this is a version that we can merge. Let's get a clean description on this thing and send it to the --
Clean Summary: Recently a kdump bug was discovered in which a system would hang inside calibrate_delay during the booting of the kdump kernel. This was caused by the fact that the jiffies counter was not being incremented during timer calibration. The root cause of this problem was found to be a bios misconfiguration of the hypertransport bus. On system affected by this hang, the bios had assigned APIC ids which used extended apic bits (more than the nominal 4 bit ids's), but failed to configure bit 18 of the hypertransport transaction config register, which indicated that the mask for the destination field of interrupt packets accross the ht bus (see section 3.3.9 of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF). If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will not recieve interrupts during the kdump kernel boot, and this hang will be the result. The fix is to add this patch, whcih add an early pci quirk check, to forcibly enable this bit in the httcfg register. This enables all cpus on a system to receive interrupts, and allows kdump kernel bootup to procede Signed-off-by: Neil Horman <nhorman@tuxdriver.com> early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 69 insertions(+), 21 deletions(-) diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c index 88bb83e..c0d0c69 100644 --- a/arch/x86/kernel/early-quirks.c +++ b/arch/x86/kernel/early-quirks.c @@ -21,8 +21,36 @@ #include <asm/gart.h> #endif -static void __init via_bugs(void) +static void __init fix_hypertransport_config(int num, int slot, int func) { + u32 htcfg; + /* + *we found a hypertransport bus + *make sure that are broadcasting + *interrupts to all cpus on the ht bus + *if we're using extended apic ids + */ + htcfg = read_pci_config(num, slot, func, 0x68); + if (htcfg & (1 << 18)) { + printk(KERN_INFO "Detect...
should be bit 17. YH --
Recently a kdump bug was discovered in which a system would hang inside calibrate_delay during the booting of the kdump kernel. This was caused by the fact that the jiffies counter was not being incremented during timer calibration. The root cause of this problem was found to be a bios misconfiguration of the hypertransport bus. On system affected by this hang, the bios had assigned APIC ids which used extended apic bits (more than the nominal 4 bit ids's), but failed to configure bit 17 of the hypertransport transaction config register, which indicated that the mask for the destination field of interrupt packets accross the ht bus (see section 3.3.9 of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF). If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will not recieve interrupts during the kdump kernel boot, and this hang will be the result. The fix is to add this patch, whcih add an early pci quirk check, to forcibly enable this bit in the httcfg register. This enables all cpus on a system to receive interrupts, and allows kdump kernel bootup to procede normally. Regards Neil Signed-off-by: Neil Horman <nhorman@tuxdriver.com> early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 69 insertions(+), 21 deletions(-) diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c index 88bb83e..c0d0c69 100644 --- a/arch/x86/kernel/early-quirks.c +++ b/arch/x86/kernel/early-quirks.c @@ -21,8 +21,36 @@ #include <asm/gart.h> #endif -static void __init via_bugs(void) +static void __init fix_hypertransport_config(int num, int slot, int func) { + u32 htcfg; + /* + *we found a hypertransport bus + *make sure that are broadcasting + *interrupts to all cpus on the ht bus + *if we're using extended apic ids + */ + htcfg = read_pci_config(num, slot, func, 0x68); + if (htcfg & (1 << 18)) { + printk(KERN_INFO "Detected use o...
I'm not convinced the message is correct. e.g. on a system with only a dual core not enabling This looks like the wrong place to do this. Better add a flag or something in the structure. Dito others. Also while not a problem here in general it's bad style to add potential wrapping bugs like this. Never use ++ for flags. -Andi --
I'm not sure that would be fine. In the situation you describe, not setting this bit means the second core won't receive interrupts. If we crash on that core and boot the kdump kernel with it, we get exactly the same problem that we I suppose I can, but I'm not sure what benefit that provides. Can you I can fix that up. I'll hold off though until ben redoes all his testing. He mentioned earlier this morning, that some of the results he was getting may have been caused by a kexec utility bug. He's re-confirming that this patch solves the reported problem. Once he does, I'll repost. Thanks & Regards -- /*************************************************** *Neil Horman *nhorman@tuxdriver.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
It could enable the extended APIC IDs but not use them? The code would be smaller and cleaner. -Andi --
In which case complaining is still correct (the BIOS was out of sync), This has everything to do with how AMD coherent hypertransport works and little if anything to do with how the NV bridge operated. Basically the NV bridge seems to be sending a standard hypertransport x86 legacy interrupt packet (that doesn't have any target information) and when that packet hits the coherent hypertransport domain it isn't being converted into whatever would send it to all cpus. ..... The real practical problem is if somehow the BIOS goofs up this way and it then decides to ask us to boot on one of these cpus with an extended apic id. We will hang in calibrate_delay. So far this only seems to happen in the kdump case but in theory the BIOS could be completely crazy. Eric --
I think this just leaves us with deciding on a mechanism for how to do single-application quirks. I take Andi's point that adding a flag set to the quirk data structure is a fine solution, but I'm really ok with static integers in individual functions. Do we have consensus on how to handle that? I'm happy either way, but I'd rather have agreement on how to handle it before I post another iteration of this patch. Thanks & Regards -- /*************************************************** *Neil Horman *nhorman@tuxdriver.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
As long as the solution is simple, small and concise I don't care. And since what will make Andi happy seems to meet those criteria, that should be fine. Eric --
Ok, new patch attached, taking into account Andi's request for a cleaner method to implement single application quirks. I've spoken with Ben, who is continuing to retest, and reports that clean methodical testing results in success with this patch. Summary: Recently a kdump bug was discovered in which a system would hang inside calibrate_delay during the booting of the kdump kernel. This was caused by the fact that the jiffies counter was not being incremented during timer calibration. The root cause of this problem was found to be a bios misconfiguration of the hypertransport bus. On system affected by this hang, the bios had assigned APIC ids which used extended apic bits (more than the nominal 4 bit ids's), but failed to configure bit 17 of the hypertransport transaction config register, which indicated that the mask for the destination field of interrupt packets accross the ht bus (see section 3.3.9 of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF). If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will not recieve interrupts during the kdump kernel boot, and this hang will be the result. The fix is to add this patch, whcih add an early pci quirk check, to forcibly enable this bit in the httcfg register. This enables all cpus on a system to receive interrupts, and allows kdump kernel bootup to procede normally. Regards Neil Signed-off-by: Neil Horman <nhorman@tuxdriver.com> early-quirks.c | 86 +++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 65 insertions(+), 21 deletions(-) diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c index 88bb83e..f4ed3d1 100644 --- a/arch/x86/kernel/early-quirks.c +++ b/arch/x86/kernel/early-quirks.c @@ -21,8 +21,31 @@ #include <asm/gart.h> #endif -static void __init via_bugs(void) +static void __init fix_hypertransport_config(int num, int slot, int func) { + u32 htcfg; + /* + *we fou...
Sorry for not noticing that earlier, but was there a specific reason this needs to be an early quirk at all? kexec can only happen after the standard quirks ran. I think it should be fine as a standard "late" quirk. -Andi --
