Hello,
I have several IBM xSeries 336 servers and attempted to upgrade them
today. My usual method is to use a build server, which produces a release for
my servers. The upgrade went well on that server (the only one that is not
an IBM x336, that will teach me...), so I decided to deploy the new
build to the IBM servers.
After I applied it and issued a reboot, the server locked up at this
line and then rebooted:
"Intel E7520 Error Reporting" rev 0x0c at pci0 dev 0 function 1 not configured
ppb0 at pci0 dev 2 function 0 "Intel E7520 PCIE" rev 0x0c
At this stage, the server reboots and its BIOS reports the following:
re-booting due to unexpected NMI at 0000:0000
Now, I have tested both my build and the official 4.6 ISO, and they show
exactly the same behavior. Thinking it might be a problem with that one
machine, I tried 3 other servers, which ALL reported the same NMI issue. That
leads me to believe that my systems do not have a hardware issue (as
the NMI message would imply).
So, it looks like something in the 4.6 kernel code triggers that
behavior. I can test many things and provide output; please let me
know where I can start.
# dmesg
OpenBSD 4.5-stable (GENERIC) #0: Tue Aug 18 09:09:22 IST 2009
root@puffy:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Intel(R) Xeon(TM) CPU 3.20GHz ("GenuineIntel" 686-class) 3.21 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,SSE3,MWAIT,DS-CPL,CNXT-ID,CX16,xTPR
real mem = 3623231488 (3455MB)
avail mem = 3517079552 (3354MB)
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 02/15/07, BIOS32 rev. 0 @
0xfd6f1, SMBIOS rev. 2.3 @ 0xf5f9e (52 entries)
bios0: vendor IBM version "-[APE137AUS-1.14]-" date 02/15/2007
bios0: IBM eserver xSeries 336 -[883722Y]-
acpi0 at bios0: rev 2
acpi0: tables DSDT FACP APIC MCFG
acpi0: wakeup devices PCI0(S5)
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: ...[cut]
I've seen the same on an IBM x346: the install goes fine, but after the reboot it does not want to play nice. I also got the "unexpected NMI at 0000:0000" message. (This was a clean install, not an upgrade, so I don't know whether 4.5 works on this box or not.) Thanks.
Installs and boots of all previous versions, up to but not including 4.6, work. I rolled the server back to my home-built 4.5 release and it is back up and running.
Hello,
I have captured a verbose trace through a com port on the server while
booting the official 4.6 i386 install CD.
boot> boot -c
booting cd0a:/4.6/i386/bsd.rd: 5651156+913072 [52+211008+196339]=0x6a6260
entry point at 0x200120
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.
Copyright (c) 1995-2009 OpenBSD. All rights reserved. http://www.OpenBSD.org
OpenBSD 4.6 (RAMDISK_CD) #53: Thu Jul 9 21:41:35 MDT 2009
deraadt@i386.openbsd.org:/usr/src/sys/arch/i386/compile/RAMDISK_CD
cpu0: Intel(R) Xeon(TM) CPU 3.20GHz ("GenuineIntel" 686-class) 3.21 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CR
real mem = 3623231488 (3455MB)
avail mem = 3518279680 (3355MB)
User Kernel Config
UKC> verbose
autoconf verbose enabled
UKC> quit
bios0 at mainbus0: AT/286+ BIOS, date 02/15/07, BIOS32 rev. 0 @ 0xfd6f1, SMBIOS)
bios0: vendor IBM version "-[APE137AUS-1.14]-" date 02/15/2007
acpi0 at bios0: rev 2
cpu0 at mainbus0: apid 0 (boot processor)
acpiprt3 at acpi0: bus 0 (PCI0)
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
pci1 at ppb0 bus 2
Hope this helps...
Alright, disabling ACPI allows me to install the system, but on the
reboot after the install, the system restarts even with ACPI disabled:
boot> boot -c
booting hd0a:/bsd: 6563548+1052072 [52+345584+327881]=0x7e7ce8
entry point at 0x200120
[ using 673892 bytes of bsd ELF symbol table ]
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.
Copyright (c) 1995-2009 OpenBSD. All rights reserved. http://www.OpenBSD.org
OpenBSD 4.6 (GENERIC) #58: Thu Jul 9 21:24:42 MDT 2009
deraadt@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Intel(R) Xeon(TM) CPU 3.20GHz ("GenuineIntel" 686-class) 3.21 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CR
real mem = 3623231488 (3455MB)
avail mem = 3516555264 (3353MB)
User Kernel Config
UKC> disable acpi
466 acpi0 disabled
UKC> quit
Continuing...
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 02/15/07, BIOS32 rev. 0 @ 0xfd6f1, SMBIOS)
bios0: vendor IBM version "-[APE137AUS-1.14]-" date 02/15/2007
bios0: IBM eserver xSeries 336 -[883722Y]-
acpi at bios0 function 0x0 not configured
mpbios0 at bios0: Intel MP Specification 1.4
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: apic clock running at 200MHz
cpu at mainbus0: not configured
mpbios0: bus 0 is type PCI
mpbios0: bus 1 is type PCI
mpbios0: bus 2 is type PCI
mpbios0: bus 3 is type PCI
mpbios0: bus 4 is type PCI
mpbios0: bus 5 is type PCI
mpbios0: bus 6 is type PCI
mpbios0: bus 7 is type PCI
mpbios0: bus 8 is type ISA
ioapic0 at mainbus0: apid 14 pa 0xfec00000, version 20, 24 pins
ioapic1 at mainbus0: apid 13 pa 0xfec82000, version 20, 24 pins
ioapic2 at mainbus0: apid 12 pa 0xfec82400, version 20, 24 pins
pcibios0 at bios0: rev 2.1 @ 0xf0000/0xffff
pcibios0: PCI BIOS has 11 Interrupt Routing table entries
pcibios0: PCI Exclusive IRQs: 9 10 11 15
pcibios0: PCI Interrupt Router at 000:31:0 ("Intel 82801EB/ER LPC" rev 0x00)
pcibios0: PCI bus #7 is the last bus
bios0: ROM ...
My $0.02: try disabling the intagp, agp, inteldrm, and drm devices.
--
Best wishes,
Vadim Zhukov
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
Thanks, I tried disabling a fair few of them, based on the successful boot log, but it still fails after wsdisplay0. Also, as pointed out, before 4.6 it just works. Cheers, Steph
What else can I provide to help fix this? I am no developer, but I really would love to see this issue fixed :) Cheers, Steph
Does it have Broadcom NICs? If so, disable those and try again.
It does. I'll try that tomorrow. On a related matter, can anyone tell me which devices are disabled during an OpenBSD install (using the official ISO)? That would help me narrow the problem down, since I was able to install 4.6 from the official CD without hassle. Cheers, Steph
Hello, still the same problem. Out of curiosity, I tried to boot off the amd64 CD, but it fails the same way. Suggestions? As I asked, can anyone tell me which devices are disabled during the install? (Disabling acpi during the install was enough to get the system installed, but then it won't boot...) Cheers, Steph
You can just diff /usr/src/sys/arch/`uname -m`/conf/GENERIC
and /usr/src/sys/arch/`uname -m`/conf/RAMDISK.
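A minimal sketch of that comparison, assuming a POSIX shell and that the
install CD kernel is built from RAMDISK_CD (as the 4.6 install CD boot banner
earlier in this thread suggests) rather than RAMDISK. The helper names
normalize_conf and conf_only_in_first are made up for illustration, and the
sketch only compares the two files as given; it does not follow any
include lines.

```shell
#!/bin/sh
# Hypothetical helpers, not part of OpenBSD.

# normalize_conf FILE
# Strip comments, collapse runs of whitespace, drop empty lines, sort.
normalize_conf() {
	sed -e 's/#.*$//' \
	    -e 's/[[:space:]][[:space:]]*/ /g' \
	    -e 's/^ //' -e 's/ $//' "$1" |
	    grep -v '^$' | sort -u
}

# conf_only_in_first FIRST SECOND
# Print the config lines present in FIRST but absent from SECOND.
conf_only_in_first() {
	a=$(mktemp) && b=$(mktemp) || return 1
	normalize_conf "$1" > "$a"
	normalize_conf "$2" > "$b"
	comm -23 "$a" "$b"
	rm -f "$a" "$b"
}

# Example (paths assume the i386 source tree is in /usr/src):
#   conf_only_in_first /usr/src/sys/arch/i386/conf/GENERIC \
#       /usr/src/sys/arch/i386/conf/RAMDISK_CD
```

The output is a list of candidates to try disabling at the UKC> prompt, not
a diagnosis.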
--
Best wishes,
Vadim Zhukov
Just to check the obvious: did you disable acpi when booting after the
install? (And did you try both bsd and bsd.mp? The latter is less like
the install kernel than the former.)
Otherwise, you could look at
/usr/src/sys/arch/<foo>/conf/{GENERIC,RAMDISK_CD}. But that's likely a
bit daunting.
Joachim
On Wed, Oct 28, 2009 at 4:13 PM, Joachim Schipper wrote:
Hello, the problem is related to the network cards alright. Disabling ppb* allows it to boot. My problem is that even if I disable a card in the BIOS, I cannot boot the system. I tried to disable ppb2, but UKC doesn't seem to accept it. What am I missing? Cheers, Steph
I'm not really sure what you are asking. Is your question answered by pointing you at the -u option of config(8) (i.e. showing you how to get the 'disable ppb*' to stick)? If not, you'll have to rephrase it or hope someone else understands it... Joachim
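For reference, a rough sketch of how boot-time UKC changes can be written
back to a kernel image with config(8), assuming the machine was booted with
boot -c and the disable commands were entered there. The function name
persist_ukc_changes and the /bsd.new output path are illustrative choices,
not something from this thread.

```shell
#!/bin/sh
# persist_ukc_changes [OUTFILE]
# Hypothetical helper: copy the changes made at the boot-time UKC> prompt
# into a new kernel image so that they survive the next reboot.
persist_ukc_changes() {
	out=${1:-/bsd.new}
	# config -e/-u is OpenBSD-specific; refuse to run anywhere else.
	if [ "$(uname -s)" != "OpenBSD" ]; then
		echo "persist_ukc_changes: this only makes sense on OpenBSD" >&2
		return 1
	fi
	# -e edits a kernel image, -u picks up the changes made with boot -c,
	# -o writes the result to a new file instead of modifying /bsd.
	# config then shows an interactive ukc> prompt; type "quit" to save.
	config -e -u -o "$out" /bsd
}
```

Writing to a separate file first leaves the original /bsd untouched until
the new image has been shown to boot.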
On Thu, Oct 29, 2009 at 7:11 PM, Joachim Schipper wrote:
Sorry, let me rephrase this. I have established that the problem lies with the PCI Express bus, as disabling ppb* allows the server to boot. Unfortunately, in that state you have no network, which is slightly annoying...
When doing a boot -c, I try to specify "disable ppb2" but it is not accepted; only "disable ppb*" reports "ppb* disabled". Is there a way to disable only part of it?
Another test I did was to disable both network cards in the BIOS, but that still doesn't work. I have however noticed that a huge number of devices share the same IRQ. Unfortunately, the IBM BIOS does not allow you to disable one device at a time; you can only select another IRQ.
If anyone has insight into what else I can do to get workable systems, I'd be grateful. Installing an alternate PCI network card is not an option, as I have about 10 more servers in production awaiting 4.6. Cheers, Steph
Run a -current system. This could be a pci resource allocation issue.
Hello again, It surely is: even with acpi left enabled, the only way to get the machine to boot is to disable ppb*. It affects all the IBM xSeries 336 machines that I have. I just tried the 28/10/2009 snapshot, which has the same symptoms: only disabling ppb* allows it to boot. It seems that my problem is ppb2, but I cannot disable that one only, can I? Steph
No. kettenis needs to see a pcidump -v -xx of this machine. Send in the acpidump -o output as well. I can't volunteer his time, so he'll look at it whenever he gets to it.
# pcidump -v -xx
Domain /dev/pci0:
0:0:0: Intel E7520 Host
0x0000: Vendor ID: 8086 Product ID: 3590
0x0004: Command: 0146 Status ID: 0090
0x0008: Class: 06 Subclass: 00 Interface: 00 Revision: 0c
0x000c: BIST: 00 Header Type: 80 Latency Timer: 00 Cache Line Size: 00
0x0010: BAR empty (00000000)
0x0014: BAR empty (00000000)
0x0018: BAR empty (00000000)
0x001c: BAR empty (00000000)
0x0020: BAR empty (00000000)
0x0024: BAR empty (00000000)
0x0028: Cardbus CIS: 00000000
0x002c: Subsystem Vendor ID: 1014 Product ID: 02dc
0x0030: Expansion ROM Base Address: 00000000
0x0038: 00000000
0x003c: Interrupt Pin: 00 Line: 00 Min Gnt: 00 Max Lat: 00
0x0040: Capability 0x09: Vendor Specific
0x0000: 35908086 00900146 0600000c 00800000
0x0010: 00000000 00000000 00000000 00000000
0x0020: 00000000 00000000 00000000 02dc1014
0x0030: 00000000 00000040 00000000 00000000
0x0040: 41050009 00000010 00000000 00000000
0x0050: 000a200c 00000000 01111000 11110000
0x0060: 10100808 20201818 00000000 00000000
0x0070: 0e0e0e0e 00000000 555e1144 2c20021e
0x0080: 00411248 00000000 f0000180 00000000
0x0090: 00000000 39092a00 301caaaa 070208d5
0x00a0: 00000001 00000000 00000001 00000000
0x00b0: 77bbddee 00000000 00000000 00000000
0x00c0: 3350c044 0040d800 000a0049 e0000020
0x00d0: 0e002802 00000007 b5930000 01040000
0x00e0: 00000000 00000000 00004036 00000000
0x00f0: 00000000 00420132 000c0f80 00000000
0:0:1: Intel E7520 Error Reporting
0x0000: Vendor ID: 8086 Product ID: 3591
0x0004: Command: 0100 Status ID: 0000
0x0008: Class: ff Subclass: 00 Interface: 00 Revision: 0c
0x000c: BIST: 00 Header Type: 00 Latency Timer: 00 Cache Line Size: 00
0x0010: BAR empty (00000000)
0x0014: BAR empty (00000000)
...
Here is the acpidump from 4.5 running on the same server:
# acpidump
/*
RSD PTR: Checksum=85, OEMID=IBM, RsdtAddress=0xd7fcff80
*/
/*
RSDT: Length=48, Revision=1, Checksum=61,
OEMID=IBM, OEM Table ID=SERONYXP, OEM Revision=0x1001,
Creator ID=IBM, Creator Revision=0x45444f43
*/
/*
Entries={ 0xd7fcfe40, 0xd7fcfd80, 0xd7fcfd40 }
*/
/*
DSDT=0xd7fccf00
INT_MODEL=APIC
SCI_INT=9
SMI_CMD=0xb2, ACPI_ENABLE=0xf0, ACPI_DISABLE=0xf1, S4BIOS_REQ=0x0
PM1a_EVT_BLK=0x580-0x583
PM1a_CNT_BLK=0x584-0x585
PM2_TMR_BLK=0x588-0x58b
PM2_GPE0_BLK=0x5a8-0x5af
P_LVL2_LAT=101ms, P_LVL3_LAT=1001ms
FLUSH_SIZE=0, FLUSH_STRIDE=0
DUTY_OFFSET=1, DUTY_WIDTH=3
DAY_ALRM=68, MON_ALRM=69, CENTURY=0
Flags={WBINVD,PROC_C1,SLP_BUTTON}
*/
/*
DSDT: Length=8990, Revision=2, Checksum=246,
OEMID=IBM, OEM Table ID=SERTURQU, OEM Revision=0x1000,
Creator ID=INTL, Creator Revision=0x20041203
*/
DefinitionBlock (
"acpi_dsdt.aml", //Output filename
"DSDT", //Signature
0x2, //DSDT Revision
"IBM", //OEMID
"SERTURQU", //TABLE ID
0x1000 //OEM Revision
)
{
Scope(\) {
Method(CWRT, 3) {
Name(TMPB, Buffer(0x10) {0x88, 0xd, 0x0, 0x0, 0xc, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 })
Store(Arg2, Index(TMPB, 0x3))
If(LEqual(Arg2, 0x0)) {
Store(0x1, Index(TMPB, 0x5))
}
Store(And(Arg0, 0xff), Index(TMPB, 0x8))
Store(And(ShiftRight(Arg0, 0x8), 0xff), Index(TMPB, 0x9))
Store(And(Arg1, 0xff), Index(TMPB, 0xa))
Store(And(ShiftRight(Arg1, 0x8), 0xff), Index(TMPB, 0xb))
Store(Add(Subtract(Arg1, Arg0), 0x1), Local7)
Store(And(Local7, 0xff), Index(TMPB, 0xe))
Store(And(ShiftRight(Local7, 0x8), 0xff), Index(TMPB, 0xf))
Return(TMPB)
}
Method(CDRT, 3) {
...
*bump*
Hello, I ran into the same issue with an IBM x336 while trying to boot 4.6 after installation. I checked the 4.6 ISO and -current; neither of them booted successfully. My x336s are fairly standard machines (4GB of RAM, 2x3.2GHz Xeon, 2x73GB SCSI); however, I have an Intel em(4) adapter installed. I equipped my test machine with a management card, so I am happy to provide more information in addition to what has already been sent by Steph.
I can also confirm another problem with the IBM x336: while loading the kernel it freezes for several minutes, just after the "entry point at ....." message. The freeze can be skipped by pressing any key. The same behaviour was observed with 4.5.
Finally, does anyone successfully use ipmi with the x336? I was hoping to use the watchdog, but it was very unstable and led to a kernel panic. Many thanks, Marcin
I can confirm that the latest snapshot does not boot on my x336 servers either. Marcin, can you run the following on your server:
pcidump -v -xx
acpidump
and then paste the output in a reply? Developers here might find something in there explaining why this happens to these servers. Right now I have 7 servers in production stuck on 4.5 until this can be fixed. I hope someone can look into it; that would be really appreciated :) Cheers, Steph
Sure, I hope it is of some use. Please let me know if anything more is required.
pcidump -v
###########################################
Domain /dev/pci0:
0:0:0: Intel E7520 Host
0x0000: Vendor ID: 8086 Product ID: 3590
0x0004: Command: 0146 Status ID: 0090
0x0008: Class: 06 Subclass: 00 Interface: 00 Revision: 0c
0x000c: BIST: 00 Header Type: 80 Latency Timer: 00 Cache Line Size: 00
0x0010: BAR empty (00000000)
0x0014: BAR mem 32bit addr: 0xff000000
0x0018: BAR empty (00000000)
0x001c: BAR empty (00000000)
0x0020: BAR empty (00000000)
0x0024: BAR empty (00000000)
0x0028: Cardbus CIS: 00000000
0x002c: Subsystem Vendor ID: 1014 Product ID: 02dc
0x0030: Expansion ROM Base Address: 00000000
0x0038: 00000000
0x003c: Interrupt Pin: 00 Line: 00 Min Gnt: 00 Max Lat: 00
0x0040: Capability 0x09: Vendor Specific
0:0:1: Intel E7520 Error Reporting
0x0000: Vendor ID: 8086 Product ID: 3591
0x0004: Command: 0100 Status ID: 0000
0x0008: Class: ff Subclass: 00 Interface: 00 Revision: 0c
0x000c: BIST: 00 Header Type: 00 Latency Timer: 00 Cache Line Size: 00
0x0010: BAR empty (00000000)
0x0014: BAR empty (00000000)
0x0018: BAR empty (00000000)
0x001c: BAR empty (00000000)
0x0020: BAR empty (00000000)
0x0024: BAR empty (00000000)
0x0028: Cardbus CIS: 00000000
0x002c: Subsystem Vendor ID: 1014 Product ID: 02dc
0x0030: Expansion ROM Base Address: 00000000
0x0038: 00000000
0x003c: Interrupt Pin: 00 Line: 00 Min Gnt: 00 Max Lat: 00
0:2:0: Intel E7520 PCIE
0x0000: Vendor ID: 8086 Product ID: 3595
0x0004: Command: 0147 Status ID: 0010
0x0008: Class: 06 Subclass: 04 Interface: 00 Revision: 0c
0x000c: BIST: 00 Header Type: 01 Latency Timer: 00 Cache Line Size: 10
0x0010: 00000000
0x0014: 00000000
0x0018: Primary Bus: 0 Secondary Bus: 2 Subordinate Bus: 2 Secondary Latency Timer: 00
0x001c: I/O Base: 40 I/O Limit: 30 Secondary Status: 0000
0x0020: Memory Base: df00 Memory Limit: def0
0x0024: Prefetch Memory Base: df01 Prefetch Memory ...
I just joined this thread today, but I had a similar issue with an IBM 305 machine. On 4.5, it would randomly just shut down. No reason. Nothing in any logs; it was as if the power had been pulled. I have 2 identical IBM 305 machines and it was happening on both, which effectively ruled out a specific hardware failure. What I did notice (in the BIOS events) was that the IBM reported the loss of fans #1, 2 and 3. Something seemed to disrupt the reporting of fan speed to the BIOS, and I suspect the machine powered down because it thought it was overheating. It could run for a day or for 2 weeks. Very random.
4.6 hasn't done this (yet) and uptime has been over a month. However, even though both IBMs are the same in every way, 4.6-REL will boot on machine #2 but I have no networking. If I use a 4.6-CUR snapshot, it comes up fine. That makes NO sense, yet another user reported exactly the same thing.
-- J.D. Bronson
Please try -current as of today (Jan 13, 2010, Melbourne time); a number of significant fixes were committed in the last couple of days.
.... Ken
I would try -current, but the 4.6-STABLE I have in use on machine #1 has been running fine, and I am not seeing the reboots or unexpected shutdowns that the OP has been experiencing. Machine #2 will only run -current, and I can't figure out why when the two are identical. I suspect 4.7 will run fine on both machines.
-- J.D. Bronson
Hi, I tried -current. The good news is that the freeze at startup is gone; the kernel boots immediately. However, it hangs later on, just after printing the following lines:
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
mem address conflict 0xff000000/0x1000
pchb0 at pci0 dev 0 function 0 "Intel E7520 Host" rev 0x0c
"Intel E7520 Error Reporting" rev 0x0c at pci0 dev 0 function 1 not configured
ppb0 at pci0 dev 2 function 0 "Intel E7520 PCIE" rev 0x0c
Thanks, Marcin
Yup, same error here, precisely at that line. Just to confirm that we have the same issue, can you try disabling ppb* at boot -c and then see whether it gets to the login prompt? Cheers, Steph
It was even worse for me here: although disabling ppb* makes the kernel go slightly further, it has the nasty side effect of disabling the SCSI controller. However, I have just checked out and compiled -current and can confirm the issue is gone. The machine booted and all network interfaces are accessible. Many thanks to everyone involved in fixing that! Regards, Marcin
One more time, for the record. If the kernel hangs after printing out a line, that's NOT the device that caused trouble. The lines printed out mean "I did this", not "I'm about to do this." This should be obvious if you think about it: network cards print out their MAC addresses. How could the kernel do that if the device wasn't attached yet? [There are some more details, but that's the high level.] The reason disabling the device on the last line /sometimes/ works is that, if it's a bus, you then prevent probing of all the attached devices.
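One way to act on that, sketched under the assumption that a dmesg from a release that boots (for example 4.5) has been saved to a file: look up the last line the hung kernel printed and show what the working kernel printed next. The function name next_after_hang is made up for this example, and since device order and wording can differ between releases, the result is only a hint about where to look.

```shell
#!/bin/sh
# next_after_hang GOOD_DMESG LAST_LINE
# Hypothetical helper: print the line that follows the first exact match of
# LAST_LINE in GOOD_DMESG, i.e. the first thing the hung kernel never
# got around to reporting.
next_after_hang() {
	# Pass the line through the environment so awk does not interpret
	# backslashes in it.
	LAST_LINE=$2 awk '
		found { print; exit }
		$0 == ENVIRON["LAST_LINE"] { found = 1 }
	' "$1"
}

# Example (the file name is illustrative):
#   next_after_hang dmesg-4.5.txt \
#       'ppb0 at pci0 dev 2 function 0 "Intel E7520 PCIE" rev 0x0c'
```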
Sorry, my wording was not the most accurate. I meant that, in my case, disabling the whole of ppb* is the only way to get the server to boot to the login prompt on 4.6, and as a side effect of disabling the devices on that bus, I then have no network card available. Cheers, Steph
Hey guys, I sent an acpidump with dmesg info to this list a couple of months ago, hoping the developers might be able to fix this. Just letting you know that the 4.7 snapshot still reboots the box unless you disable ppb*. Any way I can help? Cheers, Steph
The issue has already been investigated and kettenis@ committed a fix during
n2k10. However, the fix that allowed these servers to work happened to break
other systems that were previously working, hence the change was backed out.
I believe that an alternate fix is being worked on, however if you want to
use this hardware in the meantime you can revert dev/pci/pci.c to r1.72.
--
"Stop assuming that systems are secure unless demonstrated insecure;
start assuming that systems are insecure unless designed securely."
- Bruce Schneier
Hey guys, just to let you know, the issue is still present on stock 4.7 CDs. Is there any hope that I might be able to use a -current containing a fix, or do you consider that changing this would break too many other types of servers? Thanks for your replies, Steph
