Hey all-
I've been working on an issue lately involving multi socket x86_64
systems connected via hypertransport bridges. It appears that some systems,
disable the hypertransport connections during a kdump operation when all but the
crashing processor gets halted in machine_crash_shutdown. This becomes a
problem when the ioapic attempts to route interrupts to the only remaining
processor. Even though the active processor is targeted for interrupt
reception, the fact that the hypertransport connections are inactive result in
interrupts not getting delivered. The effective result is that timer interrupts
are not delivered to the running cpu, and the system hangs on reboot into the
kdump kernel during calibrate_delay. I've found that I've been able to avoid
this hang, by forcing a transition to the bios defined boot cpu during the
crashing kernel shutdown. This patch accomplished that. Tested by myself and
the origional reporter with successful results.Regards,
NeilSigned-off-by: Neil Horman <nhorman@tuxdriver.com>
arch/x86/kernel/crash.c | 46 ++++++++++++++++++++++++++++++++++++++--------
include/linux/kexec.h | 3 +++
init/main.c | 6 ++++++
kernel/kexec.c | 8 ++++++++
4 files changed, 55 insertions(+), 8 deletions(-)diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8bb482f..0682e60 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -67,13 +67,36 @@ static int crash_nmi_callback(struct notifier_block *self,
}
#endif
crash_save_cpu(regs, cpu);
- disable_local_APIC();
- atomic_dec(&waiting_for_crash_ipi);
- /* Assume hlt works */
- halt();
- for (;;)
- cpu_relax();
-
+ if (smp_processor_id() == kexec_boot_cpu) {
+ /*
+ * This is the boot cpu. We need to:
+ * 1) Wait for the other processors to halt
+ * 2) clear our nmi interrupt
+ * 3) launch the new kernel
+ */
+ unsigned long msecs = 1000;
+ while ((atomic_read(&waiting_for_crash_ipi) > ...
If you can get to calibrate_delay hypertransport is still routing traffic.
Your diagnosis of the problem is wrong. Most likely it is just an ioapic
programming error in restoring the system to PIC mode.I agree that there is a problem.
The reliable fix is to totally skip the PIC interrupt mode and go directly
to apic mode.To make the code kexec on panic code path reliable we need to remove code
not add it.Frankly I think switching cpus is one of the least reliable things that
we can do in general.Eric
-
What makes you say this? I don't see any need for interrupts prior to
I understand the sentiment here, but its not like we're adding additional
functionality with this patch. We're already sending an IPI to all the
processors to halt them, we're just adding logic here so that we can detect the
boot cpu and use it to jump to the kexec image instead of halting. I don't
think this is any less reliable that what we have currently.Regards
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
Yes. calibrate_delay() is the first place we send interrupts over
hypertransport. However I/O still works. Thus hypertransport from
the first cpu is working, and hypertransport itself is working.This is an interrupt specific problem not some generic hypertransport
And we don't care if they halt. If they don't get the IPI we timeout.
Making the IPI mandatory is a _singificant_ change.The only reason that code is on the kexec on panic code path is that
It doesn't make things more reliable, and it adds code to a code path
that already has to much code to be solid reliable (thus your
problem).Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always deliver interrupts in apic mode.If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode let's remove the need for going into
to legacy PIC mode and do what we should be able to do reliably. The
reward is much higher, as we kill all possibility of restoring PIC
mode wrong because we don't need to bother.Eric
-
Probably legacy mode always routes to CPU #0. Makes sense and is
not really a misconfiguration of legacy mode.But if CPU #0 has interrupts disabled no interrupts get delivered.
So choices are:
- Move to CPU #0
- Do not use legacy mode during shutdown.
- Or do not rely on interrupts after enabling legacy mode
- Or do not disable interrupts on the other CPUs when they're
halted.First and last option are probably unreliable for the kdump case.
Second or third sound best.I suspect the real fix would be to enable IOAPIC mode really
early and never use the timers in legacy mode. Then the kdump
kernel wouldn't care about the legacy mode pointing to the wrong CPU.IIrc Eric even had a patch for that a long time ago, but it broke some
things so it wasn't included. But perhaps it should be revisited.-Andi
-
Possible. So far I have not seen a hardware setup that would force
interrupts to cpu #0 in legacy mode. But I would not be truly
surprised if it happened that there was hardware that only worked that
(Do not use legacy mode in the kdump kernel. removing it from shutdownExactly. If we can work out the details that should be a much more reliable
My real problem was the failure case was obscure (a bad interaction
with ACPI on Linus's laptop) and I didn't have the time to track it
down when it showed up.My patch had two parts. Some cleanups to enable the code to be enabled
early, and the actually early enable. I figure if we can get the
cleanups in one major kernel version and then in the next enable
the apic mode before we start getting interrupts we should be in good
shape.I expect with x86 becoming an embedded platform with multiple cpus we
may start seeing systems that don't actually support legacy PIC mode
for interrupt delivery.Eric
-
The BIOS and kernel writer's guide for Opteron explicitly states that
the platform will boot on CPU0 kind of by definition. So this seems like
a fair statement. I can easily see BIOS writers or hardware designers
interpreting that to mean that they only have to make sure that
interrupts get to the CPU that the BIOS thinks of as CPU0 when the APIC
is in legacy mode.I have a query out to some SuperMicro engineers to find out if it is a
hardware limitation or if it is APIC misconfiguration. Maybe we canI can agree with the fourth option being a very bad one but I really
haven't seen anything in this discussion which supports the assertion
that "Move to the CPU that the BIOS originally called CPU#0" is going to
be unreliable. Admittedly we haven't tried this on every single x86_64
platform that we have but on the handful that we have tried so far, it
hasn't been a problem. Why is everybody jumping to the assumption that--
-ben
-=-
-
Ben I tend to agree. I think re-enabling the APIC early in the boot process
provides a greater degree of reliability in that it more quickly restores the
system to a state where booting on a cpu other than cpu0 will be more likely to
work, but I have to say that overall it seems like booting a secondary kernel on
cpu0, when possible offers the highest degree of reliability.Perhaps what we need is a 'both solution'. Re-enabling the apic to full smp
functionality early in the boot process is a good solution for the problems
which we are hypothesizing here, and would be a good thing to do in general, but
it doesn't preclude also attmpting to switch back to cpu0 during a crash.Perhaps it would be worthwhile to:
1) Investigate the early enablement of the ioapic for x86[_64]
2) implement my prevoiusly proposed patch with the addition of a handshake
element, such that:
a) when the boot cpu gets the ipi from machine_crash_shutdown it flags
the fact that it is going to boot the kexec kernel with a global
variable
b) the crashing cpu loops waiting for either:
I) a timeout of 1 second
II) a reduction of the halt count to zero
III) the setting of the flag mentioned in (a)
c) the crashing cpu, if it sees that it is not the boot cpu AND
that the flag in (III) is set, will halt itself, otherwise it
will set the flag and boot the kexec image itself.With this modification, we can try to relocate to cpu0, and if we fail, we fall
back to booting on the crashing processor.I'll work up a patch that implements (2), unless there are strong objections. I
see no reason why we can't implment this 'both' solution.Regards
Neil--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
On Tue, Nov 27, 2007 at 02:42:20PM -0500, Neil Horman wrote:
Hi Neil,
If we implement first solution, we don't have to implement second. Problem
will automatically be solved.In general adding more code in crash shutdown path is not good. We are
trying to make that code path as small as possible.OTOH, I think we have not root caused this problem yet. We don't know yet
why interrupts are not coming to non-boot cpu. I think we can go little
deeper to compare the system state in normal boot and kdump boot and see
what has changed. System state would include, LAPIC and IOAPIC entries
etc.Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
relying on until and unless there is something board specific.Thanks
Vivek
-
Yes it's probably virtual wire. For real PIC mode we would need really
The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.-Andi
-
So if the BIOS originally left the IOAPIC in a state where the timer
interrupts were only going to CPU0 then by restoring that state we could
be bringing this problem upon ourselves when we restore that state.Would anyone have any problems with code that simply verified that the
state which we are restoring allowed interrupts to get to the processor
that we are currently crashing on and if not, poked in a reasonable value.Yes this would add some complexity to the code paths where we were
crashing but it could prevent the problem that we are seeing. It seems
like a small fairly safe change rather than a big disruptive change like--
-ben
-=-
-
Hi Ben,
Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.I think we need to go deeper and compare the state of system (APICS,
timer etc) during normal boot and kdump boot and see where is the
difference. This is how I solved some of the timer interrupt related
issues in the past.Thanks
Vivek
-
Yes, but according to Bens last debug effort, the APIC printout regarding the
timer setup, indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not
routed through the ioapic. In those cases, disable_IO_APIC does not take us
through the path you reference above, and does not revert to virtual wire mode.
Instead, it simply disables legacy vector 0, which if I understand this
correctly, simply tells the ioapic to not handle timer interrupts, trusting that
the 8259 in the system will deliver that interrupt where it needs to be. If the
8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem
that we have, do you?Regards
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
Exactly.
It is still interesting to test to see what happens if you plugin the
normal values into ioapic_i8259 for .pin and .apic (.pin is 0 or 2 and .apic is 0)
and see what happens.Having a command line parameter that could do that would be a cheap temporary
solution.But this is the most likely reason why the timer interrupt is not working.
Eric
-
Ok, thank you for the explination, this all makes a good deal more sense to me
now. Ben is near the machine, so hopefully we'll hear from him soon with the
results of this test.Given that, do you think the cpu-switch test that I proposed would be a good
solution now (with the fallback mechanism I described), or would a command line
8259 solution be better? I tend to think the former would be better since it
would be transparent to the user, but I'd like to have that debate.Regards
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?Thanks
Vivek
-
However things are implemented completely differently now. I don't think
the coherent hypertransport domain of AMD processors actually routes
ExtINT interrupts to all cpus but instead one (the default route?) is
picked.So I think for the kdump case we pretty much need to use an IOAPIC
in virtual wire mode for recent AMD systems.It's worth a look.
I still think we need to just use apic mode at kernel startup, and
be done with it.Eric
-
Neil whipped up a patch to try this and evidently it worked on his test
boxes but it didn't work very well on our problem tests box. It hung
after the kernel printed "Ready". i.e. on a normal boot I get:<snip>
2007-11-29 13:48:29 Loading
vmlinuz-2.6.18-13chaos.ben.test................................
2007-11-29 13:48:29 Loading
initrd-2.6.18-13chaos.ben.test.........................................................
..............................................................................
2007-11-29 13:48:29 Ready.
2007-11-29 13:48:30 Linux version 2.6.18-13chaos.ben.test (ben@wopri) (gcc
version 4.1.2 20070626 (Red Hat 4.1.2-14
)) #10 SMP Thu Nov 29 13:11:49 PST 2007
2007-11-29 13:48:30 Command line: initrd=initrd-2.6.18-13chaos.ben.test
loglevel=8 console=ttyS0,115200n8 crashkernel=128M@16M elevator=deadline
swiotlb=65536 selinux=0 apic=debug
BOOT_IMAGE=vmlinuz-2.6.18-13chaos.ben.test BOOTIF=
01-00-30-48-57-91-56With Neil's patch:
2007-11-29 17:12:55 PXELINUX 2.11 2004-08-16 Copyright (C) 1994-2004 H.
Peter Anvin
2007-11-29 17:12:55 Boot options [default: 2.6.18-54.el5.bz336371]:
2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-54.el5.bz336371
2007-11-29 17:12:55 linux
2007-11-29 17:12:55 linux-2.6.18-54.el5.bz336371
2007-11-29 17:12:55 linux-2.6.18-52.el5
2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-13chaos.ben.test
2007-11-29 17:12:55 linux-2.6.23-0.214.rc8.git2.fc8
2007-11-29 17:12:55 linux-2.6.18-8.1.14.el5
2007-11-29 17:12:55 linux-2.6.18-7chaos
2007-11-29 17:12:55 boot:
2007-11-29 17:13:02 Loading
vmlinuz-2.6.18-13chaos.ben.test................................
2007-11-29 17:13:02 Loading
initrd-2.6.18-13chaos.ben.test.........................................................
..............................................................................
2007-11-29 17:13:02 Ready.
(END)
That's all she wrote. End of story. Had to reboot to another kernel to
make get it back.Neil's patch:
--- linux-2.6.18.noarch/arch/x86_...
Interesting can you please try an early_printk console.
I expect you made it a fair ways and it just didn't show up because you didn't
get as far as the normal serial port setup.You don't have any output from your linux kernel.
Eric
-
there is two mode for mcp55. bios should have one option about virtul
wired to LVT0 of BSP or IOAPIC pin 0.
or the option like hpet route to ioapic pin 2.for kdump fix, could enable LVT0 of CPU for kdump and disable that for BSP?
ben,
can you send out
lspci -vvxxx -s 00:1.0YH
-
That's interesting. So with BIOS options I can force timer
interrupts to be routed through IOAPIC? That would enable us to get
timer interrupts on any of the cpus in legacy mode. Ben, can you give itWe would not know the crashing cpu in advance hence can't set it.
Thanks
Vivek
-
I've got a system here that now seems to be behaving in a way that is simmilar
to what Ben describes (although I'm not sure its the same problem).
early_printk shows that we're panic-ing inside check_timer, because we fail to
find any way to route the timer interrupt to the cpu. Specifically, we're
hitting this panic:panic("IO-APIC + timer doesn't work! Try using the 'noapic' kernel
parameter\n");This doesn't make much sense to me, as we clearly have managed to get timer
interrupts at this point (since we made it through calibrate_delay)....Looking at it, I wonder if this isn't a backporting issue. Ben and I ran this
test on an older kernel (since thats what the production system under test is
based on). Currently, check_timer is called directly from within setup_IO_APIC,
but in the 2.6.18 kernel that RHEL5 is based on its part of an initcall thats
run from within init near the call site of the origional io apic init code. I
wonder if this isn't just an 'old kernel' issue, and that I need to move
check_timer to inside setup_IO_APIC.Ben, is it possible for you to run an upstream kernel on one of these systems so
we can see if the patch works in that case? In the interim, I'll update my
patch for 2.6.18 so that check_timer gets moved earlier-
http://www.hypertransport.org/docs/tech/HTC20051222-0046-0008-Final-4-21...
Table 143 suggest to me that legacy interrupts should be routed to all cpus,
which certainly doesn't seem to be the case in this situation. Perhaps Nvidia
Certainly, this seems like the best solution long term.So I'm looking at the implementation for 64 bit system, and it seems a little
cleaner than 32 bit setup. I'm wondering if we can just call setup_IO_APIC
immediately after init_IRQ in start_kernel? Could it be that straightforward?--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
Once it hits the coherent hypertransport fabric only AMD has control.
The packet does not have a destination field that could specify which
cpu. So Nvidia could not have goofed. It is a magic property ofPretty much.
The essence of what I did last round was put at the end of
init_IRQ():if (!disable_apic && cpu_has_apic) {
/*
* Switch to APIC mode.
*/
setup_local_APIC();/*
* Now start the IO-APICs
*/
if (smp_found_config && !skip_ioapic_setup && nr_ioapics)
setup_IO_APIC();
else
nr_ioapics = 0;
}Beyond that it is just dotting i's and crossing t's.
Eric
-
Here are the ones from a normal bootup.
I was unable to get info from a kdump boot. I haven't figured out why
yet. With the same patch that I used to capture this, when I tried to
kdump the kernel, it paused a second or two after the backtrace and then
dropped to BIOS and came up normally.Here is a little trick, at the point where we are trying to get the info
to print out, the kernel command line hasn't been completely parsed yet.
That tricked me for part of the day. I had apic=debug on the command
line but the logic in print_local_APIC saw the default value because the
kernel command line had yet to be parsed.2007-11-29 17:58:07 ***Here is the info you requested
2007-11-29 17:58:07
2007-11-29 17:58:07 printing local APIC contents on CPU#0/0:
2007-11-29 17:58:07 ... APIC ID: 00000000 (0)
2007-11-29 17:58:07 ... APIC VERSION: 80050010
2007-11-29 17:58:07 ... APIC TASKPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC ARBPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC PROCPRI: 00000000
2007-11-29 17:58:07 ... APIC EOI: 00000000
2007-11-29 17:58:07 ... APIC RRR: 00000002
2007-11-29 17:58:07 ... APIC LDR: 00000000
2007-11-29 17:58:07 ... APIC DFR: ffffffff
2007-11-29 17:58:07 ... APIC SPIV: 0000010f
2007-11-29 17:58:07 ... APIC ISR field:
2007-11-29 17:58:07 ... APIC TMR field:
2007-11-29 17:58:07 ... APIC IRR field:
2007-11-29 17:58:07 ... APIC ESR: 00000000
2007-11-29 17:58:07 ... APIC ICR: 00004630
2007-11-29 17:58:07 ... APIC ICR2: 07000000
2007-11-29 17:58:07 ... APIC LVTT: 00010000
2007-11-29 17:58:07 ... APIC LVTPC: 00010000
2007-11-29 17:58:07 ... APIC LVT0: 00000700
2007-11-29 17:58:07 ... APIC LVT1: 00000400
2007-11-29 17:58:07 ... APIC LVTERR: 0001000f
2007-11-29 17:58:07 ... APIC TMICT: 80000000
2007-11-29 17:58:07 ... APIC TMCCT: 00000000
2007-11-29 17:58:07 ... APIC TDCR: 00000000
2007-11-29 17:58:07
2007-11-29 17:58:07 number of MP IRQ sources: 15.
2007-11-29 17:58:07 number of IO-APIC #8 registers: 0.
2007-11-29 17:58:07 number of IO-APIC #9 ...
Ok so here boot cpu LVT0 has been set to deliver any interrupt on pin
LINT0 as ExtInt. We do the same thing for kdump cpu local apic too.I am not sure who is the guy in system who encodes the interrupt message
from 8259 to be put on hypertransport and on what basis do they decide
whether it should go to only cpu0 or a broadcast one. Looks like in this
case it is going to cpu0 only. That means we are left with no choice but
to work on patch to initialize IOAPIC early.Thanks
Vivek
-
Thats what I'm doing at the moment. I'm working on a RHEL5 patch at the moment
(since thats whats on the production system thats failing), and will forward
port it once its workingAnd not to split hairs, but techically thats not our _only_ choice. We could
force kdump boots on cpu0 as well ;)Thanks
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
origionally testing on. Unfortunately while it worked on 2.6.18 it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this, when I ran accross the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, suprise, configures the ht bus to either
unicast interrupts delivered accross the ht bus to a single cpu, or to broadcast
it to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scanned for hypertransport controllers, and ensured
that that broadcast bit was set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.I understand that enabling the apic early in the boot process is a nice general
solution, but given the wide range of apic configurations possible, and the
need for large amounts of testing of such a change, this approach seems like it
might be the better way to go, as it corrects a bad behavior only in systems
that would be affected by this problem. It introduces no further complexity in
the kdump shutdown path, and creates no additional instability in systems that
would otherwise be unaffected by this bug.I think this is the best way for us to go forward. Attached patch applies
cleanly against 2.6.24-rc3-mm2 and works for me on serveral systems unaffected
by the kdump crash problem I origionally reported and fixes the bug on the
affected system.Thanks & Regards
NeilSig...
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?Thanks
Vivek
--
My feel is that if it is for legacy interrupts only it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.Eric
--
i checked that bit
http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amd...
static void enable_apic_ext_id(u8 node)
{
#if ENABLE_APIC_EXT_ID==1
#warning "FIXME Is the right place to enable apic ext id here?"u32 val;
val = pci_read_config32(NODE_HT(node), 0x68);
val |= (HTTC_APIC_EXT_SPUR | HTTC_APIC_EXT_ID | HTTC_APIC_EXT_BRD_CST);
pci_write_config32(NODE_HT(node), 0x68, val);
#endif
}that bit only be should be set when apic id is lifted and cpu apid is
using 8 bits and that mean broadcast is 0xff instead 0x0f.
for example 8 socket dual core system or 4 socket quad core
system,that you should make BSP start from 0x04, so cpus apic id will
be [0x04, 0x13)So if you want to enable that in early_quirk, you need to
make sure apic id is using 8 bits by check if the bit 16 (HTTC_APIC_ID) is set.most BIOS already did that. You may ask Supermicro fix their broken
BIOS instead.YH
--
it should be bit 18 (HTTC_APIC_EXT_ID)
YH
--
this seems reasonable, I can reroll the patch for this. As I think about it I'm
also going to update the patch to make this check occur for any pci class 0600
device from vendor AMD, since its possible that more than just nvidia chipsets
can be affected.I'll repost as soon as I've tested, thanks!
Neil--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
--
Thanks.
Neil in your testing please confirm the preconditions for setting
the Apic Extended Broadcast flag (bit 17) are present.If that is the case it makes sense to always set that bit on conforming
systems but we will also want to print a message noting that the
BIOS has a bug, and we are working around it.Thanks,
Eric
--
The systems that I have here do _not_ in fact have that precondition, but the
systems from Ben, who originoally reported the problem do have that
I've got two printk's in this patch, one that indicates that Extended APIC ID's
are in use, and a second that indicates that there is a mismatch between the use
of extended APIC ids (bit 18) and the lack of an extended APIC id dest mask for
interrupt packets (bit 17). Not sure if that meets you're requirements, but I
think its sufficient. If you disagree, let me know and we can enhance them.Thanks
Neil--
Ok, New patch attached. It preforms the same function as previously described,
but is more restricted in its application. As Yinghai pointed out, the
broadcast mask bit (bit 17 in the htcfg register) should only be enabled, if the
extened apic id bit (bit 18 in the same register) is also set. So this patch
now check for that bit to be turned on first. Also, this patch now adds an
independent quirk check for all AMD hypertransport host controllers, since its
possible for this misconfiguration to be present in systems other than nvidias.
The net effect of these changes is, that its now applicable to all AMD systems
containing hypertransport busses, and is only activated if extended apic ids are
in use, meaning that this quirk guarantees that all processors in a system are
elligible to receive interrupts from the ioapic, even if their apicid extends
beyond the nominal 4 bit limitation. Tested successfully by me.Thanks & Regards
NeilSigned-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 76 insertions(+), 7 deletions(-)diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..d5a7b30 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if ((htcfg & (1 << 18)) == 1) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport e...
Ok. This test is broken. Please remove the == 1. You are looking
The rest of this quirk looks fine, include the fact it is only intended
to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.For what is below I don't like the way the infrastructure has been
extended as what you are doing quickly devolves into a big mess.Please extend struct chipset to be something like:
struct chipset {
u16 vendor;
u16 device;
u32 class, class_mask;
void (*f)(void);
};And then the test for matching the chipset can be something like:
if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
(id->device == PCI_ANY_ID || id->device == dev->device) &&
!((id->class ^ dev->class) & id->class_mask))Essentially a subset of pci_match_one_device from drivers/pci/pci.h
That way you don't need to increase the number of tables or the
number of passes through the pci busses, just update the early_qrk
table with a few more bits of information.The extended form should be much more maintainable in the long
run. Given that we may want this before we enable the timer
which is very early doing this in the pci early quirks seems
to make sense.Eric
--
New patch attached, with suggestions incorporated.
Thanks & regards
NeilSigned-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 73 insertions(+), 9 deletions(-)diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..4b0cee1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
+
+static void __init check_hypertransport_config()
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}+static void __init amd_host_bugs(void)
+...
Neil Horman <nhorman@tuxdriver.com> writes:
We should not need check_hypertransport_config as the generic loop
Likewise this function is unneeded and the printk is likely confusing
We don't need to shift device. Although we can do:
device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID);
device = device_vendor >> 16;
vendor = device_vendor & 0xffff;--
cool! :)
Copy that. Fixed
I'm not so sure about this. In my testing, it was clear that I needed to do a
shift on device to make valid comparisons to the defined PCI_DEVICE_* macros.
The origional code had to do the same thing with the class field, which is
simmilarly positioned in the pci config space.Other than that, new patch attached. Enables the detection of AMD
hypertransport functions and checks for the proper quirk just as before, and
incoporates your comments above Eric, as well as yours Yinghai.Thanks & Regards
NeilSigned-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 76 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 61 insertions(+), 15 deletions(-)diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..e13c999 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,7 +21,7 @@
#include <asm/gart.h>
#endif-static void __init via_bugs(void)
+static int __init via_bugs(int num, int slot, int func)
{
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
@@ -32,6 +32,7 @@ static void __init via_bugs(void)
gart_iommu_aperture_disabled = 1;
}
#endif
+ return 1;
}#ifdef CONFIG_ACPI
@@ -44,7 +45,31 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */-static void __init nvidia_bugs(void)
+static int __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hy...
Ok. I just looked at read_pci_config. It doesn't do the right thing for
a non-aligned 32bit access. (Not that I am convinced there is a right
thing we can do). Please make this read_pci_config_16 instead
and you won't need the shift.Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.The current code doing a shift where none should be needed (because
we ignore the two low order bits in our read) is totally weirdYou almost got YH's comment. You need return 2 for the old functions
so we don't try and apply a per chipset fixup for every device in
the system.I'm actually inclined to remove the return magic and just do something
like:
static fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.Hmm. I don't think we want this code positioned in the middle of the
vendor = read_pci_config_16(num, slot, func,device = read_pci_config_16(num, slot, func,
--
The former seems like a reasonable solution to me. Corrected in this updated
I like the latter approach better. It seems less convoluted to me.
New patch attached.
Thanks & Regards
NeilSigned-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f307285 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,13 @@
#include <asm/gart.h>
#endif-static void __init via_bugs(void)
+static void __init via_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
@@ -44,8 +49,36 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */-static void __init nvidia_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+
+}
+
+static void __init nvidia_bugs(int num, int slot, int func)
+{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifde...
Ok. My only remaining nit to pick is that fix_hypertransport_config
is right in the middle of the nvidia quirks, which can be a bit
confusing when reading through the code. Otherwise I think this
is a version that we can merge.Let's get a clean description on this thing and send it to the
--
Clean Summary:
Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel. This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration. The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus. On system affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids's), but failed to configure bit 18 of the hypertransport
transaction config register, which indicated that the mask for the destination
field of interrupt packets accross the ht bus (see section 3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/...).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not recieve interrupts during the kdump kernel boot, and this hang will be the
result. The fix is to add this patch, whcih add an early pci quirk check, to
forcibly enable this bit in the httcfg register. This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to procedeSigned-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..c0d0c69 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,36 @@
#include <asm/gart.h>
#endif-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detect...
should be bit 17.
YH
--
Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel. This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration. The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus. On system affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids's), but failed to configure bit 17 of the hypertransport
transaction config register, which indicated that the mask for the destination
field of interrupt packets accross the ht bus (see section 3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/...).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not recieve interrupts during the kdump kernel boot, and this hang will be the
result. The fix is to add this patch, whcih add an early pci quirk check, to
forcibly enable this bit in the httcfg register. This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to procede
normally.Regards
NeilSigned-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..c0d0c69 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,36 @@
#include <asm/gart.h>
#endif-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use o...
I'm not convinced the message is correct. e.g. on a system with only a dual core not enabling
This looks like the wrong place to do this. Better add a flag or something
in the structure. Dito others.Also while not a problem here in general it's bad style to add potential
wrapping bugs like this. Never use ++ for flags.-Andi
--
I'm not sure that would be fine. In the situation you describe, not setting
this bit means the second core won't receive interrupts. If we crash on that
core and boot the kdump kernel with it, we get exactly the same problem that we
I suppose I can, but I'm not sure what benefit that provides. Can you
I can fix that up. I'll hold off though until ben redoes all his testing. He
mentioned earlier this morning, that some of the results he was getting may have
been caused by a kexec utility bug. He's re-confirming that this patch solves
the reported problem. Once he does, I'll repost.Thanks & Regards
--
/***************************************************
*Neil Horman
*nhorman@tuxdriver.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
--
It could enable the extended APIC IDs but not use them?
The code would be smaller and cleaner.
-Andi
--
In which case complaining is still correct (the BIOS was out of sync),
This has everything to do with how AMD coherent hypertransport works and
little if anything to do with how the NV bridge operated.Basically the NV bridge seems to be sending a standard hypertransport
x86 legacy interrupt packet (that doesn't have any target information)
and when that packet hits the coherent hypertransport domain it isn't
being converted into whatever would send it to all cpus......
The real practical problem is if somehow the BIOS goofs up this way
and it then decides to ask us to boot on one of these cpus with
an extended apic id. We will hang in calibrate_delay. So far
this only seems to happen in the kdump case but in theory the BIOS
could be completely crazy.Eric
--
I think this just leaves us with deciding on a mechanism for how to do
single-application quirks. I take Andi's point that adding a flag set to the
quirk data structure is a fine solution, but I'm really ok with static integers
in individual functions. Do we have consensus on how to handle that? I'm happy
either way, but I'd rather have agreement on how to handle it before I post
another iteration of this patch.Thanks & Regards
--
/***************************************************
*Neil Horman
*nhorman@tuxdriver.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
--
As long as the solution is simple, small and concise I don't care.
And since what will make Andi happy seems to meet those criteria,
that should be fine.Eric
--
Ok, new patch attached, taking into account Andi's request for a cleaner method
to implement single application quirks. I've spoken with Ben, who is continuing
to retest, and reports that clean methodical testing results in success with
this patch.Summary:
Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel. This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration. The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus. On system affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids's), but failed to configure bit 17 of the hypertransport
transaction config register, which indicated that the mask for the destination
field of interrupt packets accross the ht bus (see section 3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/...).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not recieve interrupts during the kdump kernel boot, and this hang will be the
result. The fix is to add this patch, whcih add an early pci quirk check, to
forcibly enable this bit in the httcfg register. This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to procede
normally.Regards
NeilSigned-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 86 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 65 insertions(+), 21 deletions(-)diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f4ed3d1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,31 @@
#include <asm/gart.h>
#endif-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we fou...
Sorry for not noticing that earlier, but was there a specific reason this needs
to be an early quirk at all? kexec can only happen after the standard quirks ran.
I think it should be fine as a standard "late" quirk.-Andi
--
Early quirk seemed like the right thing to do to me. Starting from boot up,
this (mis)configuration by the bios can mean that come cpus just don't get
interrupts. I could imagine situations like serial console not working if the
serial port interrupt was routed to a cpu that used extended APIC id. I've
never actually observed it happening, but making sure that all cpus were
eligible to get interrupts early in the boot process made sense to me.--
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric
--
We may need to go back and do some additional work on this. It doesn't
seem to be quite as cut and dried as we initially thought.This quirk doesn't appear to work on virtually the same motherboard with
the barcelona processors in it. It also may be sensitive to the firmware
version. More extensive testing on a larger number of pre-production is
not showing it to be as effective as it appeared to be initially on the
testbed.I'm doing some retesting to figure out what exact situations and
collection of patches were able to make it work before.-ben
--
-ben
-=-
--
Ben, please lets be clear about this. You say this patch doesn't help on a new
system. Even thought its almost the exact same system, its not the same system.
Does this patch work consistently on the system you initially reported the
problem on? I've done enough work on this at this point that I'm invested in
not abandoning this fix. If this solves the problem on dual core system, but
not quad core, I'd much rather move forward with this fix and address your quad
core problem as a separate issue.--
==>
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB,
PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
+ { PCI_VENDOR_ID_AMD, 0x1200 , PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID,
fix_hypertransport_config },I still think good way is that you ask Supermicro to update their BIOS
to use newer code from AMD.YH
--
it seems we need to have two tables. one for northbridge (sweep all
the NB_K8) and another for SB ( like Nvidia, ati..., one touch and
leave)YH
--
I like Erics idea better I think. My origional patch had two tables, and it
seems that it made the early quirk detection logic that much more convoluted.
This way each quirk can determine if it needs to be applied to more than one pci
device.--
nvidia or ati chip will come first, and then amd NB ( K8). So you need
to make sure "fix_applied return" is not going to skip your fix to
K8_NB.YH
--
Agreed.
struct chipset {
u16 vendor;
u16 device;
u32 class;
+ u32 class_mask;
void (*f)(void); ============> int (*f) (int num, int slot, int func);
===>
int status;
status = early_qrk[i].f(num, slot, bus);
if (status == 1)
break;
else if (status == 2)
return;YH
--
Sorry to reply to myself, but do we have consensus on this patch? I'd like to
figure out its disposition if possible.Thanks & Regards
Neil--
What the patch tries to do looks like the right thing. So if we can get
a version that is clean and actually works we should merge it.Eric
--
I agree with the approach taken. Somebody needs to review the changes done
for applying early_quirks. I am not well versed with it.Thanks
Vivek
--
1. k8 northbridge always on bus 00, and dev0x18~dev0x1f, some time
later it could be on bus 0xff
2. need to check for first node (BSP) bit 18 and bit 17 is already
set, then don't need to go wild to check and set all over the bus.YH
--
Not really sure what you're saying here. I understand that the k8 northbridge
will for the forseeable future be on bus 00, but why not look for all ht hosts?
If we find an ht host, and bit 18 is set while bit 17 is not, we can safely
correct it.--
Copy that. Thus far, I've tested it on a pure AMD engineering sample, an intel
x86_64 box, and the affected system, a quad socket dual core AMD system with an
nvidia chipset. That last system is currently the only system that this patch
will check/enable the broadcast flag on. I did try a variant of the patch on
the AMD engineering sample where it enabled the bit unconditionally.
Interestingly enough, the bit was already turned on on that system. I'm
wondering if most systems don't already have this bit turned on. You should be
able to universally enable this bit, by moving the call to
check_hypertransport_config to the top of early_quirks()Regards
Neil--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
--
No, I don't think thats necessecary. Once the apics are enabled, interrupts
shouldn't travel accross the hypertransport bus anyway, opting instead to use
the dedicated apic bus (at least thats my understanding). The only systems what
you are suggesting would help with are systems that have no apic at all, which I
can only imagine on 64 bit systems is rare, to say the least. The affected
domain is further reduced by the fact that this quirk is only currently being
applied to systems with nvidia PCI bridges, since those are the only systems
that this problem has manifested on. That seems like a rather small subset, if
it exists at all. I suppose we could only optionally enable the quirk if we
are booting a kdump kernel (implying that we would need to do something like
detect the reset_devices command line option), but I think given the limited
affect this patch, its not really needed.Regards
Neil--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
--
I think all interrupt message travel on hypertransport. Even after APICS
have been enabled.Look at the following document.
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/...
Have a look at figure 1, figure 2 and section 3.4.2.2 and 3.4.2.3
That's a different thing that once IOAPIC has formed the vectored message,
Hypertransport might not touch the destination field.Having said that, I am wondering what will happen if a system continues
to operate the timer through IOAPIC in ExtInt mode. Will hypertransport
keep on broadcasting that interrupt to every cpu? And every cpu will
process that interrupt.Hence, I feel it is safe to restore the broadcast bit back to BIOS value once
we are through calibrate_delay().Thanks
Vivek
--
I don't think so. IIRC once the other cpus are started they all disable the
timer interrupt, except for one cpu, opting instead to get the timer tick via
ipi, So while they all might see the interrupt packet on the ht bus, only one
I disagree. Looking at what Yinghai said, the default setting for the broadcast
bit isn't actually to unicast the interrupt, its just to set the broadcast mask
to 0xF, or to 0xFF. Its use is actually to allow cpus with an extended 8 bit
apic id see interrupts. So its not so much to direct interrupts to cpu0, but
rather to the first 16 cpus rather than to all 255 available cpus. From what
I've seen in my testing, systems that 'work' already have this bit set by bios,
and my quirk patch above does nothing to them. Disabling this bit after
calibrate_dealy is going to introduce more uncertainty in systems that have been
proven to work. We should leave well enough alone, and just enable the bit if
its off, and we see that we are using extended apic ids via bit 18 of the same
register, as Yinghai pointed out. By enabling the quirk that way, all we are
really doing is bringing into alignment two bits that should arguably be
set/cleared in unison anyway.Regards
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
--
Does LAPIC allow to disable a specific vector and not accept interrupts? I
don't think so. If a timer interrupt is broadcasted to every cpu I think
everybody will accept it (like broadcast IPI). That's why intelligence
is built into IOAPIC and direct interrupts to a cpu or group of cpu.I am just trying to understand the functionality better. Can somebody help me
understand how do we make sure that same timer interrupt is not processed byAgain for my understanding, I got few questions.
- Why does nvidia choose not to broadcast the interrupts and still works
fine? Does that mean nvidia chipse will not work the extended cpu apic
ids?- IOW, why do other chipsets choose to broadcast the interrupts and
nvidia chooses not to and still works well.- Why do I need to broadcast the interrupts and not target specific cpus?
- If I am broadcasting interrupts, how do I make sure only one cpu
picks it up.Thanks
Vivek
--
I understand your desire, but clearly, something prevents it. Note our earlier
conversation, this bit doesn't actually force a unicast of an interrupt packet,
but simply masks the destination field. When set to zero, it simply means that
the ht interrupt packet destination field is restricted to 4 bits rather than 8.
So its not like when its set to zero we are guaranteed that it is forced to a
single processor anyway. All setting this bit does is ensure that if any apics
out on a system are addresed using an extended apic id, that interrupts can
reach them. Thats why it was suggested that this bit only be forcibly set if
It doesn't! When booting normally getting interrupts to apics that use 4 bit
apic ids is sufficient since cpu0 is in that set, but if we crash on a cpu with
It doesnt! It hangs in the kdump kernel. It works well normally because
Its not a forced broadcast, its a mask on the apic id. The IOAPIC still
The IOAPIC handles that.Regards
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
--
Yes. The local allows us to choose to accept ExtInt interrupts or not.
We can't do it per vector but we can do by interrupt delivery path.
External Interrupt.
External NMI.We should only have a single cpus local apic configured to accept the
interrupt. Further it would not surprise me if there was some first
come first served logic with accepting the interrupt in thereYes. This sounds like a BIOS bug at present. A chipset feature
This is a broadcast for legacy mode interrupts which don't have a cpu
destination field. Once we are running the timer through the ioapicWell the local apics. If you are in ioapic mode and give the
cpus a choice of which cpu to deliver to it is a hardware anycast irq
transmission. That is the hardware magically picks one cpu of the
set of allow cpus to deliver the irq to. How that happens is
implementation specific.Eric
--
But longer (or not so long) term moving the IOAPIC earlier is the better option,
simply because the short use of PIC mode traditionally was a source of problems
on a lot of boxes.And it does not really make sense to keep this source of trouble just for a short
time during boot when we could as well go directly into IO-APIC mode. This would
probably also match what other OS are doing better and that is always a good idea
for stable operation.-Andi
-
Agreed, assuming:
1) The problems we have been hypothesising are accurate. As you note below, Ben
and I have dug deep to find the problem, but we could try to go deeper.2) There are no other issues with (re)booting a system on the non-boot cpu. It
seems to me that if its possible to reboot on cpu0, we should try. I understand
that we're trying to keep that code small for obvious reasons, but if we have aThis is actually a very interesting question. Looking at disable_IO_APIC in the
latest git tree, which is used to revert the APIC to a legacy mode, we do one of
two things. If the on board 8259 is routed through the IOAPIC, then we
configure the APIC to be in virtual wire mode, so that the interrupt is
delivered via the APIC to whatever processor you want to configure. If however,
the 8259 bypasses the IOAPIC, then we simply disable the LAPICS LVT0 interrupt,
so that any legacy timer interrupts from the apic are ignored, ostesibly because
the 8259 will assert the interrupt pin on the processor it is wired to directly.
I wonder if most (almost all) modern systems use the former configuration, while
the supermicro board in question in a rare exception, uses the latter. If the
8259 was routed directly to cpu0, that would explain this hang.Regards
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
do you have a pointer to the old patch set? I'd like to try it out on the failing system here.
Regards
-
Not sure if this is applicable, but I assume not relying on interrupts in legacy
mode would be equivalent to specifying irqpoll on the kdump kernel command line?
If so, there seems to be a problem with that solution, as doing so still results
in the same hang on the system in question.As for solution 2, that brings me to my previous question. Is that really as
simple as just not moving the apic to legacy mode? It would seem some
additional programming would be in order to route the interrupt in question to
the proper cpu.Regards
-
The Linux kernel right now relies on being in legacy mode at bootup.
So obviously kexec has to switch back to that.Not relying on legacy mode would require moving APIC setup much earlier which
is difficult because that's quite fragile. Longer term it might be a good
idea though anyways -- at least the timer code was always fragile
and eliminating one failure case and only ever running it in true
APIC mode would be probably a good thing.-Andi
-
So, it sounds to me then, like unless I'm willing to really re-write the APIC
setup code (which I don't feel qualified to do quite yet), that the immediate
solution would be to not rely on interrupts in legacy mode, which was according
to my understanding, what the use of the irqpoll command line option was
supposed to enable. Any thoughts as to why that might not be working in this
case, or suggested tests to determine a cause there?Thanks & Regards
-
Hmm. It looks like irqpoll expects to have at least one irq working.
I wonder if we can fix that.Looking at it from the other direction what does this line
from check_timer() look like when you boot a normal kernel?apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
cfg->vector, apic1, pin1, apic2, pin2);I'm curious what values we see at boot for the legacy mode the first
time it is setup, and possibly what chipset you are on.
I know we have had a few times where we have failed to reprogram
the ioapic properly at shutdown. So it can't hurt to look at
that possibility one last time.For whatever it is worth my original attempt at using only the
apic mode was commit: f2b36db692b7ff6972320ad9839ae656a3b0ee3e
Looks like I didn't have an x86_64 version.It looks like about half the cleanups I needed was to decouple
smp cpu startup and apic initialization. I think the hotplug cpu
case has done a more thorough version of that now.There was a bit of reshuffling needed to get the everything initialized
in the right order when we started apic mode sooner.Anyway I might just take another look at this if I can find enough
moments to string together.Eric
-
Here is what I get on a normal boot of the problem motherboard:
--
-ben
-=-
-
Ok, I think from what I understand of what we're reading here, the apic2 = -1
and pin2 = -1 indicate that the 8259 has no direct connection to any cpu, which
means that on shutdown disable_IO_APIC should take us into virtual wire mode.
As such enabling the APIC early in boot should fix us, but more consisely,
rewriting the entry in the IOAPIC to deliver int0 to the only running cpu should
accomplish the same goal for this problem. Does that sound reasonable (at least
as a test to ensure we understand the problem) to everyone?--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
Close. There are two options with virtual wire mode.
- Either the local apic is in virtual wire mode, and somehow the
legacy interrupts make it to the local cpu.
- Or an ioapic is in virtual wire mode and the legacy interrupt
controller is connected to it.So I guess fundamentally for any SMP system that only supports the
cpu being in local apic mode and only routes interrupts to the boot
strap processor we could be in trouble. That is what our current
information about your system suggests.However most systems actually connect the i8254 PIC interrupt
controller to the ioapic in virtual wire mode. As I recall the
standard mapping is to ioapic 0, pin 0. With ioapic 0, pin 2 being
the timer interrupt (Possibly it is the other way around).So as a test we could feed those values into ioapic_8259 and see
if the kdump case works. I believe we prefer putting the ioapic
into virtual wire mode over putting the cpu into virtual wire
mode. We can only control which cpu receives the legacy interrupts if
we are putting the ioapic in virtual wire mode.It may also be an interesting test to just enable the timer for the
ioapic in early boot, as you have suggested. I don't have a clue what
that will do.Eric
-
nVidia MCP55 pro
It is the original version of
http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfmi.e. not the -2. However, they don't seem to advertise the original
version. Supermicro assures me that they are practically the same but I--
-ben
-=-
-
That is enough for an initial approximation. Unless something has
changed radically the Nvidia chipsets can put the ioapic instead of
the local apic in virtual wire mode so that is worth testing.Eric
-
I assume this is the case if the ioapic is also in virtual wire mode and the
destination field for the appropriate interrupt(s) (the timer interrupt in this
case) is set to either physical mode with a destination id of the lapic for the
running cpu, or if it is set to logical mode and the destination id has the
I thought we only had one ioapic in this system (Ben correct me if I'm wrong on
that please). I thought the above printk told us that, because apic2 and pin2
are both -1, that means that the 8259 isn't physically connected to any cpu, and
If that were the case, then we would need to support moving kexec boot to cpu0,
at least in some limited cases. I've got a patch together that enables the
handshaking I was brainstorming earlier, which should allow an attempted jump to
cpu0 on a crash, with a fallback to booting on the crashing processor. If we
Sorry, to split hairs here, but you mean the 8259 right? Just want to be sure
I'm sorry, I can't find ioapic_8259 defined anywhere. Where is that supposed to
Unfortunately nothing. We've tried using the local apic timer in a previous
test, and it resulted in no change, as did transitioning the cpu to the apic
timer via a call to switch_ipi_to_APIC_timer. Its possible I did something
wrong however.Currently I'm writing a patch that calls setup_ioapic_dest after we call
disable_IO_APIC. Looking at the implementation, it appears that calling this
function should rewrite the irq routing table in the ioapic to deliver
interrupts to the set of online cpus, as defined by the TARGET_CPUS macro. I
asusme that if the ioapic is in virtual wire mode from the call to
disablie_IO_APIC, then calling setup_ioapic_dest will force interrupts to be
delivered to the crashing cpu, as it should be the only bit set in the online
cpu mask. Please feel free to poke holes in this idea.Thanks & Regards
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg ke...
Grr. Yes. The i8259. Got the timer and the PIC numbers confused Oops.
The legacy configuration is the timer to the PIC to either the localGrr. That should have been ioapic_i8259. The second value we print out
Well if you didn't have the local apic enabled beyond virtual wire
mode that could have cause problems.Sure. I suspect the probably is that you were still in a legacy irq
Just try and make certain: ioapic_i8259.pin != -1
Which should cause disable_IO_APIC to put the ioapic and not the local
apic in virtual wire mode.Anything else is likely to do strange things.
Eric
-
I'm sorry to reply to myself, but I think I actually had this backwards. The
fact that the 8259 does not get reported as being routed through the ioapic,
means that in disable_IO_APIC during a crash shutdown, we do _not_ go into
virtual wire mode. Instead we simply disable the LVT0 on the ioapic (which as I
understand it from above is the timer interrupt), and we trust the 8259 to
route the interrupt to the appropriate cpu. If the 8259 is hardwired to deliver
interrupts to cpu0 only, then we do in fact need to switch processors during
shutdown. Perhaps what we need to do is detect if the 8259 is not routed
through the ioapic on shutdown, and if it is, only them implement my patch to
jump to the bsp.Thoughts?
--
/***************************************************
*Neil Horman
*nhorman@tuxdriver.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
-
-
irqpoll won't fix the timer interrupt.
-Andi
-
Is it possible that the hypertansport bus can be in a state where I/O would
work, but not interrupt routing? I confess my knoweldge of this system bus is
But how likely is a kdump kernel to work properly if an errant cpu is running
unhalted while we try to boot? I understand your point regarding the
significance of the need for reliable IPI's, but in fairness, I think that we
I understand your suggestion, but to do that don't we need to do more than just
not move the apic to legacy pic mode? It was my understanding that the ioapic
delivered timer interrupts to one cpu, who's interrupt handler then distributed
it to the other cpu's via IPI. That suggests to me that we will need to
re-write the apic config so that the crashing processor is the target of the
ioapic interrupt delivery.And if this is truly the case, I would really like to furhter understand why
this isn't working on this specific system before I implment anything. Any
suggestions for how to further root cause this problem?Regards
-
That would be hard to believe. Linux cannot shut down CPUs
completely (put them back to SINIT state) because there is no generic
interface for this and it might be impossible. They just get put
into a HLT loop.And even if they were in SINIT they would still need the HT connections
otherwise it would be impossible to ever wake them up.The way HT setup is normally done is that the BIOS sets it all up
and then it never changes (except for some error conditions thatWhile that may help I doubt your analysis of the source of the problem
is correct. Most likely something is broken with the IO-APIC, but not
related to HyperTransport.It would be better to properly root cause before applying.
-Andi
-
If forget off the top of my head. But my memory is that if you send
the proper IPI cpus go back into SINIT. However linux doesn't do thisAgreed.
Eric
-
I understand that this is how its supposed to work, but my analysis nevertheless
in my mind points to an inability to route irqs to any processor other than the
boot cpu. I conducted a test whereby I forced a crash on cpu0, and then again
on cpu3. I've found that on all the systems I have available, I'm able to boot
to a kexec kernel without issue. However, on this system:
http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfmCrashing on any cpu other than the boot cpu leads to a hang in calibrate_delay
on reboot. I certainly make room for the notion that this could be an ioapic
programming error, but I don't see that this system has a different ioapic that
my other test systems. I've also simply tried booting the system that is
failing with noapic, to force the system to not use the ioapic, to no avail. If
you could suggest another test to point to another root cause, I would happily
run it, but the evidence that I have at the moment suggests to me that the
ioapic, while normally getting succefully programmed to deliver interrupts to
the appropriate cpu, is unable to on this system, and removing the traversal of
the system bus on the affected system restores functionality. That suggests a
I think thats part of the issue. Somehow (and in fairness I don't know how this
occurs), the crash of the kernel affects the functionality of the hypertransport
bus in such a way that it can no longer deliver interrupts. I'm not sur how,
beyond the testing that I describe above, that I can further prove or disproveIf you could suggest a test or observation to make so that I could further
diagnose this, I would appreciate it. Currently the code seems to configure the
ioapic properly on all systems available to me, except the supermicro board
above. Any thoughts welcome here.Thanks & Regards
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*...
| Natalie Protasevich | [BUG] New Kernel Bugs |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
| Andi Kleen | [PATCH x86] [0/16] Various i386/x86-64 changes |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Linus Torvalds | Re: [GIT]: Networking |
| Jeff Kirsher | [net-next PATCH 1/7] e1000e: enable CRC stripping by default |
| Jukka Andberg | ata/wdc vs gcc3 on amiga |
| YAMAMOTO Takashi | Re: wd.c patch to reduce kernel stack usage |
| Jason Thorpe | Re: ksyms patches. |
| rick | NFS transport |
