I'm not sure whether this is a hardware issue or not. I recently purchased an AMD64 3200+, a K8N Neo4 Platinum motherboard, and a GeForce NX6600 with PCI Express. I've installed two distributions today, Arch Linux and Gentoo, and in each one, when all the services are being initialized, I receive this:
CPU 0: Machine Check Exception: 4 Bank 4: b20000000000070F0F
TSC 1945bcca3c
Kernel panic - not syncing: CPU context corrupt
Or something to that extent (on both distros).
To temporarily fix this I added ide=nodma to the kernel line in GRUB. My BIOS says IDE DMA Access [ENABLED], so DMA is turned on there.
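If you try the same workaround, it is worth confirming that the running kernel really received the parameter. A minimal sketch, assuming a Linux system that exposes the boot parameters in /proc/cmdline; the function name kernel_has_param is made up for this example:

```python
from pathlib import Path


def kernel_has_param(param, cmdline_path="/proc/cmdline"):
    """Return True if `param` is one of the whitespace-separated boot parameters."""
    # /proc/cmdline holds the parameters the running kernel was booted with.
    cmdline = Path(cmdline_path).read_text()
    return param in cmdline.split()


if __name__ == "__main__":
    try:
        print("ide=nodma active:", kernel_has_param("ide=nodma"))
    except OSError as exc:  # e.g. not running on Linux
        print("could not read the kernel command line:", exc)
```

If this prints False after a reboot, the GRUB entry that was edited is probably not the one being booted.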
My fear is that I've installed the processor incorrectly or something (though it works fine on Windows XP Pro with no issues).
The kernel used on Arch was 2.6.11.3-ARCH, and on Gentoo it was whichever is the latest (2.6.11-r6?). Nonetheless, the kernel panics every time and I'm worried.
Does anyone know of any fixes?
A common nForce4 problem?
Hmm, you're not alone; it seems nForce4 boards are a bit buggy when DMA is enabled. There's already a thread on the Linux kernel mailing list (lkml.org), so post your MCE there (or you'll have to wait, and wait, and wait until someone hopefully fixes it).
I have pretty much exactly the same problem. My machine is:
AMD64 Athlon 3200+, 939Dual-SATA2 motherboard, GeForce 6600 GT PCI-E, and a SATA-connected 200 GB HD.
The computer had been working for a few days as a dual boot of Windows XP and Debian with kernel 2.6.14.2, and then it gave me this error:
CPU 0: Machine check exception: 4 bank 4 b20000000000070f0f
TSC 585602ff47
Kernel panic - not syncing: Machine check
I did a full reinstall once, but after a few days it gave me the same error!
I have the same setup as above, except with a 3500+ processor.
My errors are:
CPU 0: Machine Check Exception: 0000000000000004
Bank 4: b200000000070f0f
Kernel panic - not syncing: CPU context corrupt
Kernel panic
Hey.
AMD64 Sempron Processor 2600+
I get the same problem after running the system for a couple of days.
CPU 0: Machine Check Exception: 0000000000000004
Bank 4: b200000000070f0f
Kernel panic - not syncing: CPU context corrupt
I got it after trying to check the HDD temperature on a SATA disk; it looks like Debian loses the SATA disk from time to time. The mobo finds everything on bootup.
So when it comes to mounting the LVM, it fails because of this.
Johannes
Kernel panic - not syncing. Any fix known?
Hi,
I'm running an AMD64 3200+ with Ubuntu 6.10.
When using IDE (cp /media/cdrom/* ~) the machine hangs within seconds with
"CPU 0: Machine Check Exception: 4 Bank 4: b20000000000070F0F
TSC 1945bcca3c
Kernel panic - not syncing:"
I tried the workaround
"ide=nodma" in GRUB, with no better result.
Are any fixes known?
soenke
http://en.wikipedia.org/wiki/Machine_Check_Exception
This is a hardware problem. Check for bad caps on your motherboard, bad RAM, broken fans, etc. Don't overclock.
Since it happens when you use the CD-ROM, which uses a lot of power, perhaps your PSU isn't powerful enough.
... power supply.
"perhaps your PSU isn't powerful enough."
-> it was not.
Thanks a lot for helping me out!
I'm getting the same error on Fedora:
"CPU 0: Machine Check Exception: 4 Bank 4: b20000000000070F0F"
It used to work, even with the same kernel. Windows still works fine, no blue screens. If I disable SATA in the BIOS, it doesn't panic anymore, which makes me think it has something to do with that.
I reinstalled the kernel, and tried the previous version, but that didn't help.
I'm getting the same error (((
uname -a
Linux *** 2.6.21-ARCH #1 SMP PREEMPT Sun May 6 22:27:01 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3700+ AuthenticAMD GNU/Linux
Hardware Error
Machine Check Exception: 4 Bank 4: b200000000070f0f
TSC 6dbb576aa9
Kernel Panic
Gentoo 2006.1 64-bit also crashed (when I tried to install it).
Yeah, I have 2 SATA hard drives (Seagate Barracuda 250 GB SATA-I). I had trouble with them on Windows earlier (sometimes it needed 4-7 tries to boot!!!! [WinXP]). I think it's because of them... (((
Maybe I really need a more powerful power supply? (450 W installed)
Do you all have SATA too? :(
My experience
Not that I'm happy to see I'm not alone, but it's nearly that... Curious that I found this page just by wandering, having suffered this problem since I upgraded to an Athlon X2 4200+ two weeks ago.
I'm still investigating what is going wrong (the new CPU is broken / the PSU is not sufficient for the new CPU / the "old no-name but dual channel" memory chips that were sufficient for the previous CPU are showing their limits / the BIOS is buggy and does not configure the hardware properly).
mcelog tells me this is a Northbridge error. As the chipset is a ULi with integrated Northbridge and Southbridge, I'm in trouble. However, I managed to suppress the problem by reducing BIOS parameters on all fronts: CPU core speed at 195MHz instead of 200MHz, RAM at 2*133MHz instead of 2*166MHz, CPU voltage at max 1.4V, and memory timings at max. Now I'm trying to restore some default values; I know that the CPU core speed, voltage and RAM speed matter for now (RAM timings seem to be irrelevant). RAM speed at 2*133MHz alone is not sufficient either. Of course, Memtest has not seen anything, and the temperature is indeed low (curiously, the machine crashes at relatively low loads, not at high ones!).
For the record, and as a quick and dirty FAQ for these errors, I give my experience below (I got this twice on two different machines, with different causes). So let's go:
A. General misbehaviour
If your machine randomly freezes (nothing alive in X11) or reboots, either with or **without** any special activity, and if you have a recent processor (AMD or Intel), then you should consider yourself a victim of the MCE (Machine Check Exception).
This is an exception raised by the processor to signal a hardware problem. Some are recoverable; others are not and crash the computer.
B. Diagnostic
To be sure:
1. Switch to a text mode console (no framebuffer);
2. Reproduce the problem several times;
3. Look at the console output or at the output of the dmesg command.
If you get something like this:
HARDWARE ERROR
CPU 0: Machine Check Exception: 4 Bank 4: b200001000010C0F
TSC 18af428adee
Kernel panic - not syncing: Machine check
then you have a (serious) problem.
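If you have saved the console output or dmesg to a file, the MCE lines can be pulled out automatically. This is only a sketch: the regular expression is my own and covers just the one-line and two-line layouts quoted in this thread, and the name find_mces is made up for the example.

```python
import re

# The one-line ("... Exception: 4 Bank 4: b2...") and two-line layouts both
# appear in reports; \s+ also matches the line break between the two halves.
MCE_RE = re.compile(
    r"CPU\s+(?P<cpu>\d+):\s+Machine Check Exception:\s+(?P<mcg>[0-9a-f]+)"
    r"\s+Bank\s+(?P<bank>\d+):?\s+(?P<status>[0-9a-f]+)",
    re.IGNORECASE,
)


def find_mces(text):
    """Return one dict per MCE report found in `text`."""
    events = []
    for m in MCE_RE.finditer(text):
        events.append({
            "cpu": int(m.group("cpu")),
            "mcgstatus": int(m.group("mcg"), 16),  # the "4" after "Exception:"
            "bank": int(m.group("bank")),
            "status": int(m.group("status"), 16),  # the long hex word
        })
    return events


if __name__ == "__main__":
    sample = (
        "HARDWARE ERROR\n"
        "CPU 0: Machine Check Exception: 4 Bank 4: b200001000010C0F\n"
        "TSC 18af428adee\n"
    )
    for event in find_mces(sample):
        print(event["cpu"], event["bank"], hex(event["status"]))
```

Seeing the same bank and status word across several crashes suggests a single underlying fault rather than random corruption.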
Technically, as the log says, this is indeed a hardware error, not a software one. But it can be triggered by any kind of usage of the computer. Let's be clear: you'll have a hard time finding what is wrong, and you may eventually have to replace the whole hardware!
C. Getting details
There is a tool, mcelog, that can decode the exception code and give some clues. However, this is not always very useful. For instance:
mcelog --ascii < crashlog.txt
Gives me:
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 18af428adee
Northbridge CRC error
link number = 1
bit57 = processor context corrupt
bit61 = error uncorrected
bus error 'local node observed, request didn't time out
generic error mem transaction
generic access, level generic'
STATUS b200001000010c0f MCGSTATUS 4
Kernel panic - not syncing: Machine check
This is better than nothing, but finding the cause of the problem is still difficult. In my case, the Northbridge is integrated with the Southbridge on my ULi chipset, so this can be virtually anything... Here it seems to be a memory error. But is it really the RAM, or the chipset?
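To see where mcelog's "bit57", "bit61" and "bus error" lines come from, the status word can be decoded by hand. The sketch below assumes the architectural machine-check layout shared by AMD and Intel (flag bits 57 to 63, and the bus error code format 0000 1PPT RRRR IILL in the low 16 bits). It deliberately ignores the K8-specific fields such as the link number, and the function and table names are made up for this example.

```python
PARTICIPATION = ("local node originated", "local node responded",
                 "local node observed", "generic")
REQUEST = ("generic error", "generic read", "generic write", "data read",
           "data write", "instruction fetch", "prefetch", "evict", "snoop")
ACCESS = ("memory", "reserved", "I/O", "generic")
LEVEL = ("level 0", "level 1", "level 2", "level generic")


def decode_mcg_status(mcg):
    """Decode the three architectural MCG_STATUS bits (the '4' in the panic line)."""
    return {
        "ripv": bool(mcg & 1),         # restart instruction pointer valid
        "eipv": bool((mcg >> 1) & 1),  # error instruction pointer valid
        "mcip": bool((mcg >> 2) & 1),  # machine check in progress
    }


def decode_mci_status(status):
    """Decode the flag bits and, for bus errors, the fields of the error code."""
    flags = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
    }
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code, "bus": None}
    if (code & 0xF800) == 0x0800:  # 0000 1PPT RRRR IILL: bus/interconnect error
        rrrr = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 3],
            "timeout": bool((code >> 8) & 1),
            "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "reserved",
            "access": ACCESS[(code >> 2) & 3],
            "level": LEVEL[code & 3],
        }
    return result


if __name__ == "__main__":
    print(decode_mcg_status(0x4))
    print(decode_mci_status(0xB200001000010C0F))
```

Run on my status word, this reports uncorrected and context_corrupt set, with "local node observed" and no timeout, which agrees with the mcelog output above. MCGSTATUS 4 means a machine check is in progress with no valid restart pointer, which is why the kernel panics instead of continuing.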
D. Probable causes
Here is my top cause list. In general, this is not the CPU. It is:
1. Most probably, overclocking or high temperature in the case or near the RAM, CPU or chipset.
2. Very probably, the RAM chips or settings (memory testing tools may report no error, but this is not sufficient; real Linux workloads can stress memory more than any tool).
3. Probably, an insufficient PSU (especially when the error occurs when accessing disks, after adding hardware, or when plugging in a USB device).
4. Sometimes, cheap hardware on the chipset/mobo.
5. Rarely, the BIOS configuring something wrongly (my case, since I managed to stabilize the box manually!).
E. How to fix it?
Iterate through tries and tests!
1. Check the temperature of the CPU, chipset, RAM, and the whole case. Do not overclock.
2. Check your memory with memtest. Try lowering the memory speed and deactivating dual channel mode. Test with different RAM.
3. Remove some disks/PCI cards, unplug USB devices, and use self-powered USB devices or a powered hub. Test with a different graphics card (not a gaming one). Test with a different (more powerful) power supply unit.
4. Reduce the speed of core components in the BIOS, increase voltage, increase delays of RAM/PCI components (I'm here ;-).
5. Test with a different mobo or a different CPU. If you arrive here, you are close to replacing the whole PC, and you should consider doing so to save time and money and get newer, better hardware. I'm not here :-o
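For step 1, temperatures can be logged from Linux itself while you try to reproduce the crash. A rough sketch, assuming a kernel with a sensor driver loaded that exposes readings under /sys/class/hwmon as temp*_input files in millidegrees Celsius (older kernels may lay these files out differently); read_temperatures is a made-up name:

```python
from pathlib import Path


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {"<chip>/<sensor file>": degrees Celsius} for every readable sensor."""
    readings = {}
    for sensor in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor present but not readable right now
        name_file = sensor.parent / "name"
        chip = name_file.read_text().strip() if name_file.exists() else sensor.parent.name
        readings[f"{chip}/{sensor.name}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for label, celsius in read_temperatures().items():
        print(f"{label}: {celsius:.1f} C")
```

Logging these values once a minute until the next MCE shows whether the crash follows a temperature rise or, as in my case, happens while the machine is cool.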
Hope this is useful for anybody wondering what this mess means...
Best regards
Leaking caps on the motherboard can also cause these errors, and leaking caps are pretty common. (Do a Google image search for "leaking caps".)
Finally
I've found the main cause of the MCEs on my machines.
As I said in my previous posts, I had reduced everything in the BIOS settings. But that wasn't the source of the MCEs.
Instead, I found that the problem was the use of PowerNow (the ondemand cpufreq governor) with dual core. If I switch off the second core, there are no more crashes. If I fix the frequency, there is no crash (even if I fix it at the maximum speed).
My test was to encode TV from an old TV card. I therefore suspect that switching mencoder from one CPU to the other while changing the CPU speed dynamically (thousands of times, because the power needed seems to be exactly at the limit of CPU usage) leads to bus corruption. So I'm no longer sure it's a hardware error, and maybe the bttv driver is simply not as SMP-aware as we could expect.
Note that dual core may be involved only because the second core helps the first one and brings CPU usage to the limit of the cpufreq governor. Deactivating the second core may have kept the first one away from cpufreq frequency changes. I could also conceive that CPU frequency scaling is not well integrated with other hardware on my computer, such as the bttv card.
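For anyone who wants to repeat the fixed-frequency test, the governor can be inspected and changed through sysfs. A minimal sketch, assuming the cpufreq driver is loaded and exposes scaling_governor under /sys/devices/system/cpu (writing requires root); the helper names are made up, and "performance" is the stock governor that pins the CPU at its maximum frequency:

```python
from pathlib import Path

CPU_ROOT = "/sys/devices/system/cpu"


def governor_files(cpu_root=CPU_ROOT):
    """List the scaling_governor file of every CPU that has cpufreq enabled."""
    return sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"))


def read_governors(cpu_root=CPU_ROOT):
    """Return {"cpu0": "ondemand", ...} for all CPUs with a governor file."""
    return {f.parent.parent.name: f.read_text().strip()
            for f in governor_files(cpu_root)}


def set_governor(name, cpu_root=CPU_ROOT):
    """Write the same governor name to every CPU (needs root on a real system)."""
    for f in governor_files(cpu_root):
        f.write_text(name)


if __name__ == "__main__":
    print(read_governors())
    # To pin the frequency for a test run, as root:
    # set_governor("performance")
```

If the MCEs stop with the frequency pinned and come back under ondemand, frequency switching is at least part of the trigger on that machine.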
I resolved it by upgrading the BIOS of the mainboard. Ciao
MCE...
This problem just appeared on my wife's PC (Gentoo). I have some old RAM (nice in its day), 1 stick of which has been suspect for about a year. Recently I opened up the case to clean everything out, and my thought is that I may have put the modules back in a different order than they were in initially. Regardless, memtest86+ is currently running and quite a few errors are being reported. This memory is not currently overclocked, but it had been in the past. When the test is done I will swap the RAM modules and hope for the best, but I think I will be buying some new RAM. Hopefully, if I play my cards right, I can smooth-talk the wife into an upgrade. Cheers!
MCE - LVM related issue?
I ran into this installing FC8 (2.6.23 kernel) on an MSI Neo4 Platinum (MS-7125 v1.0 board) this past weekend. This was a fresh install, and while installing and wiping the hard-drive (almost finished) the machine locked up.
I attempted to reboot into an older release, and enough of the kernel was there to show the MCE.
Subsequent attempts to install had the installation failing much earlier - just after picking language and keyboard type.
Playing around with BIOS parameters didn't help (DMA, PCI window size, PowerNow, etc.), nor did upgrading to the latest BIOS for the board (v1.D).
Finally, since the original failure happened during drive access, I went into rescue mode and re-partitioned the drive with a 200MB boot partition, 4GB swap, and the rest as '/', with ext3 file systems.
I was then able to get past the language pick, set up the drive as ext3 only (no LVM) and do the installation flawlessly.
The only thing that makes sense out of this is that somehow the LVM information went odd on the drive and the kernel (or LVM software) reading this information managed to go off to la-la land.
re:
This error has recently happened to me as well (the error with the kernel trying to use init before / has mounted), using Nvidia's 7050 pv chipset on kernel 2.6.22-15. I tend to agree with the Northbridge theory, and have also noticed after many hours searching that this tends to happen on a lot of Nvidia's chipsets. I was able to get certain distros working on my box, but some never would after install.