Hi there,
Users of different Linux distros (including Ubuntu, Mandriva, ArchLinux and
possibly Fedora) are reporting that kernel 2.6.26.2 is OOPSing in the
Virtual Box emulator[1].
It is not clear if this is a kernel or Virtual Box bug, but as the kernel
is also OOPSing in QEMU (although with different behavior) I have decided
to post my debug results here in case someone is interested in debugging
the kernel part further.
I have done a bisection by hand among kernel versions and found that
the commit which triggers the oops in _Virtual Box_ was introduced in
2.6.26-rc1 and the problem also happens with latest Linus tree.
By using git bisect I found that the commit is this:
"""
commit e587cadd8f47e202a30712e2906a65a0606d5865
Author: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Thu Mar 6 08:48:49 2008 -0500
x86: enhance DEBUG_RODATA support - alternatives
[...]
"""
By reverting this commit I don't get the OOPS anymore. I have
tested with 2.6.26-rc1 and latest Linus tree (2.6.27-rc3).
What puzzles me though is that a similar problem happens with
QEMU, but there it also OOPSes with kernels before 2.6.26-rc1,
reverting the patch above makes no difference, and it works with
the current Linus tree.
Does this look like a kernel bug?
All my tests have been done with vanilla kernels, but I have
built .iso installation images with them and I'm not sure of
what the build script does.
[1] http://en.wikipedia.org/wiki/Virtual_box
Thanks for reading this.
--
Luiz Fernando N. Capitulino
--
No, it looks like a very common virtualizer bug. Does the attached patch work for you? -hpa
Also, in addition to this, please try tip:master. There is a patch in tip:master which I hope should fix this problem, but the details are important. -hpa --
access coordinates would be at: http://people.redhat.com/mingo/tip.git/README Ingo --
On Fri, 22 Aug 2008 08:50:12 +0200, Ingo Molnar <mingo@elte.hu> wrote:

| * H. Peter Anvin <hpa@zytor.com> wrote:
|
| > H. Peter Anvin wrote:
| >>>
| >>> Does this look like a kernel bug?
| >>>
| >>
| >> No, it looks like a very common virtualizer bug. Does the attached
| >> patch work for you?
| >>
| >
| > Also, in addition to this, please try tip:master. There is a patch in
| > tip:master which I hope should fix this problem, but the details are
| > important.
|
| access coordinates would be at:
|
| http://people.redhat.com/mingo/tip.git/README

As I already have Linus tree downloaded I have cloned it in
the usual way.

Got the same results: OOPS in virtualbox but it works on QEMU.

The OOPS's output follows and I have attached the .config I'm using
to reproduce the problem.

"""
BUG: unable to handle kernel NULL pointer dereference at 00000246
IP: [<c01310f1>] vprintk+0x181/0x440
*pde = 00000000
Oops: 0002 [#1] SMP
Modules linked in:

Pid: 1, comm: swapper Not tainted (2.6.27-rc4-test24-tip #3)
EIP: 0060:[<c01310f1>] EFLAGS: 00010246 CPU: 0
EIP is at vprintk+0x181/0x440
EAX: 00000246 EBX: 00000000 ECX: c0130ca9 EDX: 0000dedd
ESI: c0474ae3 EDI: c04cf6bc EBP: c7435f24 ESP: c7435eb0
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process swapper (pid: 1, ti=c7434000 task=c7438000 task.ti=c7434000)
Stack: 0000dedd c0130ca9 c7435f40 00000000 a026104f a026106c c7434000 c7435ee6
       00000006 00000246 00000000 a0260cf3 0000001c c7434000 00000282 00000046
       c11a85a0 c7435efc c0135c6f c7435f14 c0115fcb a0296e91 c0104c2c 00000000
Call Trace:
 [<c0130ca9>] ? release_console_sem+0x199/0x1e0
 [<c0135c6f>] ? irq_exit+0x3f/0x90
 [<c0115fcb>] ? smp_apic_timer_interrupt+0x5b/0x90
 [<c0104c2c>] ? apic_timer_interrupt+0x28/0x30
 [<c0474ae3>] ? net_ns_init+0x0/0x1ad
 [<c0474ae3>] ? net_ns_init+0x0/0x1ad
 [<c0346ed9>] ? printk+0x18/0x1f
 [<c0474b00>] ? net_ns_init+0x1d/0x1ad
 [<c0474ae3>] ? net_ns_init+0x0/0x1ad
 [<c0101116>] ? ...
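
For readers decoding the trace above: the number printed after "Oops:" is the x86 page-fault error code. The following is only a minimal sketch of decoding its low three bits; struct pf_error and decode_pf_error() are illustrative names made up for this note, not kernel interfaces.

```c
#include <stdbool.h>

/* Low three bits of the x86 page-fault error code, as printed after "Oops:". */
struct pf_error {
    bool protection; /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;      /* bit 1: 1 = write access, 0 = read access */
    bool user_mode;  /* bit 2: 1 = fault raised in user mode, 0 = kernel mode */
};

/* Hypothetical helper for illustration only; not a kernel function. */
static struct pf_error decode_pf_error(unsigned long code)
{
    struct pf_error e;

    e.protection = (code & 0x1) != 0;
    e.write      = (code & 0x2) != 0;
    e.user_mode  = (code & 0x4) != 0;
    return e;
}
```

Under this decoding, the code 0002 in the trace would mean a write to a not-present page from kernel mode, which is consistent with the "unable to handle kernel NULL pointer dereference" line.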
Can you try booting with the kernel argument :
debug_alternative

The dmesg of the kernel bootup up to the oops would be helpful.

My guess is that there may be something wrong with irq disabling which
protects text_poke_early in apply_alternatives().

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
On Fri, 22 Aug 2008 11:34:52 -0400, Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

| Can you try booting with the kernel argument :
| debug_alternative
|
| The dmesg of the kernel bootup up to the oops would be helpful.
|
| My guess is that there may be something wrong with irq disabling which
| protects text_poke_early in apply_alternatives().

I have attached two files:

- normal.txt: normal boot with no debug options
- debug-alternative.txt: ignore_loglevel and debug-alternative boot
  options

I had to pass ignore_loglevel otherwise it wouldn't print
anything.

--
Luiz Fernando N. Capitulino
Ok, now can you try booting with either of those args :

noreplace-paravirt
noreplace-smp

And see which one(s) works ?

Thanks,

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
On Fri, 22 Aug 2008 12:35:20 -0400, Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

| * Luiz Fernando N. Capitulino (lcapitulino@mandriva.com.br) wrote:
| > [...]
| >
| > I have attached two files:
| >
| > - normal.txt: normal boot with no debug options
| > - debug-alternative.txt ignore_loglevel and debug-alternative boot
| > options
| >
| > I had to pass ignore_loglevel otherwise it wouldn't print
| > anything.
| >
|
| Ok, ...
Hi Luiz, two more tests: 1. a small program to run in userspace and tell us what you get; 2. a patch against -linus for testing. -hpa
On Fri, 22 Aug 2008 11:11:25 -0700, "H. Peter Anvin" <hpa@zytor.com> wrote:

| Hi Luiz, two more tests:
|
| 1. a small program to run in userspace and tell us what you get;

88776655:44332211

It is the same output in the virtualized system and the host system.

| 2. a patch against -linus for testing.

I have tried this patch with Linus tree earlier today, should I try it
with Ingo's tree too?

--
Luiz Fernando N. Capitulino
--
It doesn't apply to tip. This did not fix the problem? -hpa --
On Fri, 22 Aug 2008 13:31:49 -0700, "H. Peter Anvin" <hpa@zytor.com> wrote:

| Luiz Fernando N. Capitulino wrote:
| >
| > | 2. a patch against -linus for testing.
| >
| > I have tried this patch with Linus tree early today, should I try
| > it with Ingo's tree too?
| >
|
| It doesn't apply to tip. This did not fix the problem?

No, it did not. :(

--
Luiz Fernando N. Capitulino
--
On Fri, 22 Aug 2008 14:20:54 -0300, "Luiz Fernando N. Capitulino" <lcapitulino@mandriva.com.br> wrote:

| On Fri, 22 Aug 2008 12:35:20 -0400, Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
|
| | * Luiz Fernando N. Capitulino (lcapitulino@mandriva.com.br) wrote:
| | > [...]
| | >
| | > I have attached two files:
| | >
| | > - normal.txt: normal boot with ...
Yes, the big issue is exactly what VirtualBox screws up in this matter,
how to detect it, and how to work around it.

It's pretty clear it's a VirtualBox f*ckup at this point, but the failure
mechanism isn't at all obvious and so far the workaround is elusive.

I strongly suspect this is a VirtualBox tcache management failure, but
that doesn't help the situation without knowing how it happens.

-hpa
--
On Archlinux we have the same problem. We have a bug report here:
http://bugs.archlinux.org/task/11141

I tested it myself with a LiveCD/Install-ISO which has 2.6.26 as install
kernel. We have the guest oops on virtualbox-ose and virtualbox-sun, on both
i686 and x86_64 hosts.

Some things I noticed:
- The system always boots when I either enable VT-x in the guest settings or
  disable ACPI and run the guest with acpi=off.
- The oops always occurs on (disk) IO, no matter which file system I
  use.
- Once the oops has occurred and the guest has had to be closed and
  restarted, then, if I don't use VT-x or acpi=off, I always get an oops
  directly when initrd/kernel is starting. The last screen message before
  the oops is then "Freeing SMP alternatives".

Here is also an archive with guest dmesg and messages.log from such an
oops when heavy disk IO leads to the oops:
http://bugs.archlinux.org/task/11141?getfile=2445

Gerhard
--
Standards are a great thing. I think everyone should have one.
--
Hrm, can you try this ?
1 - Make sure your kernel is not CONFIG_DEBUG_RODATA
2 - Change the whole text_poke implementation in
arch/x86/kernel/alternative.c to this :
void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
{
return text_poke_early(addr, opcode, len);
}
If this works, I suspect that the problem comes from a vmap/vunmap
problem. If it still fails, the problem would likely come from a race
with interrupt disabling probably due to missing data/instruction cache
flush.
Then, after having tested (2), try this on top of it :
In arch/x86/kernel/alternative.c, alternatives_smp_switch()
Add unsigned long flags;
Change
spin_lock -> spin_lock_irqsave(&smp_alt, flags);
spin_unlock(&smp_alt); -> spin_unlock_irqrestore(&smp_alt, flags);
This will help testing if there is a problem with interrupts coming
shortly after the modification. If it fixes the problem, my guess is
that we should flush the instruction cache (and maybe the data cache ?)
in text_poke and text_poke_early when interrupts are off.
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
On Tue, 26 Aug 2008 10:53:38 -0400, Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

| * Gerhard Brauer (gerhard.brauer@web.de) wrote:
| > On Fri, Aug 22, 2008 at 02:08:13PM -0700, H. Peter Anvin wrote:
| > > Luiz Fernando N. Capitulino wrote:
| > >>
| > >> I have asked Mandriva and Ubuntu users to test this and all of
| > >> them so far are saying that noreplace-paravirt works.
| > >>
| > >> It makes the system slower, but it works.
| > >>
| > > [...]
| > [...]
| > http://bugs.archlinux.org/task/11141?getfile=2445
| >
|
| Hrm, can you try this ?
|
| 1 - Make sure your kernel is not ...
On Tue, 26 Aug 2008 10:53:38 -0400, Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

| Then, after having tested (2), try this on top of it :
|
| In arch/x86/kernel/alternative.c, alternatives_smp_switch()
|
| Add unsigned long flags;
| Change
| spin_lock -> spin_lock_irqsave(&smp_alt, flags);
| spin_unlock(&smp_alt); -> spin_unlock_irqrestore(&smp_alt, flags);

Hmm, I can't find spin_lock functions in alternatives_smp_switch();
it looks like the current implementation is now using mutexes.

What tree are you referring to?

--
Luiz Fernando N. Capitulino
--
Sorry, I was looking directly at the commit which caused the problem.
Yes, these modifications should go on top of the text_poke -> text_poke_early
change.

So in current mainline, change, in alternatives_smp_switch() :

mutex_lock(&smp_alt);
...

mutex_unlock(&smp_alt);

to

mutex_lock(&smp_alt);
local_irq_save(flags);
...

local_irq_restore(flags);
mutex_unlock(&smp_alt);

Thanks,

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
I have been unable to replicate this on my own hardware mostly because
my testing machine decided to blow its DVD drive in some very strange
way, but I did pick apart the data from Luiz, and found it very
interesting:

The code sequence before patching looks like:

c012fc69: 51                   push %ecx
c012fc6a: 52                   push %edx
c012fc6b: ff 15 40 b9 41 c0    call *0xc041b940
c012fc71: 5a                   pop %edx
c012fc72: 59                   pop %ecx

After patching:

50 9d 0f 1f 84 00 00 00 <00> 00 ...

which disassembles to (in Intel notation):

C012FC69 50                push eax
C012FC6A 9D                popfd
C012FC6B 0F1F840000000000  nop dword [eax+eax+0x0]

We do, indeed have a return point that falls in the *middle* of a
patched instruction, and if the patching happens in the middle of the
instruction call, then, well, bad things happen.

Furthermore, why on Earth is %ecx/%edx pushed and popped in-line here?
Surely it should be the responsibility of the PV call to present a
no-clobber interface (using an assembly wrapper if necessary[*]), rather
than bloating every callsite like this?

-hpa

[*] One can compile gcc code with -fcall-saved-* to use nonstandard
register conventions. Unfortunately stock gcc only lets you do this with
a file parameter, and doesn't support doing this with attributes.
--
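
The boundary argument in the message above can be checked mechanically. The sketch below is a userspace illustration only: is_insn_boundary() and the two length tables are names invented for this note, and the instruction lengths are read off the two listings quoted above (1, 1, 6, 1, 1 bytes before patching; 1, 1, 8 bytes after).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true if addr is the first byte of an instruction in a sequence
 * that starts at base and whose instruction lengths (in bytes) are given
 * in lens[0..n-1]. Hypothetical helper, not kernel code.
 */
static bool is_insn_boundary(uint32_t base, const uint8_t *lens, size_t n,
                             uint32_t addr)
{
    uint32_t cur = base;

    for (size_t i = 0; i < n; i++) {
        if (cur == addr)
            return true;
        cur += lens[i];
    }
    return false;
}

/* push %ecx, push %edx, call *0xc041b940, pop %edx, pop %ecx */
static const uint8_t before_patch[] = { 1, 1, 6, 1, 1 };
/* push eax, popfd, 8-byte nop */
static const uint8_t after_patch[] = { 1, 1, 8 };

/* Start of the patch site and the address the 6-byte call returns to. */
static const uint32_t site_base = 0xc012fc69u;
static const uint32_t ret_addr  = 0xc012fc71u; /* 0xc012fc6b + 6 */
```

Calling is_insn_boundary(site_base, before_patch, 5, ret_addr) reports a boundary, while the same call with after_patch and 3 does not: 0xc012fc71 lands six bytes into the eight-byte nop that starts at 0xc012fc6b.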
On Tue, 26 Aug 2008 13:18:22 -0400, Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

| Sorry, I was looking directly at the commit which caused the problem.
| Yes, these modifications should go on top of the text_poke -> text_poke_early
| change.
|
| So in current mainline, change, in alternatives_smp_switch() :
|
| mutex_lock(&smp_alt);
| ...
|
| mutex_unlock(&smp_alt);
|
| to
|
| mutex_lock(&smp_alt);
| local_irq_save(flags);
| ...
|
| local_irq_restore(flags);
| mutex_unlock(&smp_alt);

Did not help, same oops here.

--
Luiz Fernando N. Capitulino
--
Ok, it might still be caused by paravirt and alternatives instruction
patching. What if you also do :
alternative_instructions()
+ unsigned long flags;
/* The patching is not fully atomic, so try to avoid local interruptions
that might execute the to be patched code.
Other CPUs are not running. */
stop_nmi();
#ifdef CONFIG_X86_MCE
stop_mce();
#endif
+ local_irq_save(flags);
...
+ local_irq_restore(flags);
restart_nmi();
#ifdef CONFIG_X86_MCE
restart_mce();
#endif
?
Hrm,
Since those local_irq_save/restore occur _before_ the paravirt patching
is done, I wonder if there would be a race in the way cli/sti traps are
handled by Virtualbox wrt incoming interrupt ?
Thanks,
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
One thing that I think really needs to be considered is that the
current PV stubs are (a) large, and (b) non-atomic. In the case at hand
we have:

c012fc69: 51                   push %ecx
c012fc6a: 52                   push %edx
c012fc6b: ff 15 40 b9 41 c0    call *0xc041b940
c012fc71: 5a                   pop %edx
c012fc72: 59                   pop %ecx

Ten bytes replacing a two-byte native sequence.

If this was done as a call to an out-of-line stub, it would be only five
bytes, which would reduce native icache overhead from 400% to 150%, but
perhaps more importantly, it would not be subject to returns inside the
sequence itself (since the out-of-line stub would still exist.)

As an optional bonus, at least on 32 bits the indirect call could be
replaced with a direct call in the out-of-line stub.

-hpa
--
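
The 400% and 150% figures above follow from the byte counts in the message. As a small worked check, assuming "overhead" means the extra bytes relative to the two-byte native sequence (overhead_percent() is a name made up for this note):

```c
/*
 * Extra code bytes of a call-site stub relative to the native sequence,
 * expressed as a percentage of the native size. Hypothetical helper for
 * illustration; assumes stub_bytes >= native_bytes and native_bytes > 0.
 */
static unsigned overhead_percent(unsigned stub_bytes, unsigned native_bytes)
{
    return (stub_bytes - native_bytes) * 100u / native_bytes;
}
```

With the inline stub (1 + 1 + 6 + 1 + 1 = 10 bytes) this gives (10 - 2) / 2 = 400%, and with a five-byte call to an out-of-line stub it gives (5 - 2) / 2 = 150%.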
Hi! These last changes (in addition to the others you mentioned) seem
to be a good shot. I could reboot the guest 8 times, compile several
packages (something which always leads to the oops) and currently I am
building two big packages simultaneously. So this is heavy IO.
I will try more heavy build tests tomorrow (to regain the good feeling
about the vbox+guest kernel like it was with 2.6.25), but I think your
changes go in the right direction.
Here is the diff of what I've changed following your hints:
,----[ arch/x86/kernel/alternative.c ]
| --- alternative.c.org 2008-07-13 23:51:29.000000000 +0200
| +++ alternative.c 2008-08-26 21:35:20.000000000 +0200
| @@ -343,6 +343,7 @@
| void alternatives_smp_switch(int smp)
| {
| struct smp_alt_module *mod;
| + unsigned long flags;
|
| #ifdef CONFIG_LOCKDEP
| /*
| @@ -359,7 +360,7 @@
| return;
| BUG_ON(!smp && (num_online_cpus() > 1));
|
| - spin_lock(&smp_alt);
| + spin_lock_irqsave(&smp_alt, flags);
|
| /*
| * Avoid unnecessary switches because it forces JIT based VMs to
| @@ -383,7 +384,7 @@
| mod->text, mod->text_end);
| }
| smp_mode = smp;
| - spin_unlock(&smp_alt);
| + spin_unlock_irqrestore(&smp_alt, flags);
| }
|
| #endif
| @@ -420,6 +421,7 @@
|
| void __init alternative_instructions(void)
| {
| + unsigned long flags;
| /* The patching is not fully atomic, so try to avoid local interruptions
| that might execute the to be patched code.
| Other CPUs are not running. */
| @@ -427,6 +429,7 @@
| #ifdef CONFIG_X86_MCE
| stop_mce();
| #endif
| + local_irq_save(flags);
|
| apply_alternatives(__alt_instructions, __alt_instructions_end);
|
| @@ -465,6 +468,7 @@
| (unsigned long)__smp_locks,
| (unsigned long)__smp_locks_end);
|
| + local_irq_restore(flags);
| restart_nmi();
| #ifdef CONFIG_X86_MCE
| restart_mce();
| @@ -508,33 +512,5 @@
| */
| void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
| {
| - unsigned ...

OK, so we have a problem with interrupts coming while we are doing the
alternatives patching.

First thing, I wonder if Virtualbox expects the OS to patch all its
paravirt instructions in one go ?

Also, could you then try to :
- revert all those changes
- Do this to text_poke_early and text_poke :
  - put the sync_core() within the irq off critical section (test)
  - add a wbinvd(); just after the sync_core() in both functions (test).

Thanks,

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Could you please explain more what to change? I don't see where to put Thank you Gerhard --
Sure,

First patch to test :

x86 alternative text_poke move sync_core

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 arch/x86/kernel/alternative.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/alternative.c 2008-08-26 17:26:41.000000000 -0400
+++ linux-2.6-lttng/arch/x86/kernel/alternative.c 2008-08-26 17:26:58.000000000 -0400
@@ -488,8 +488,8 @@ void *text_poke_early(void *addr, const
 	unsigned long flags;
 	local_irq_save(flags);
 	memcpy(addr, opcode, len);
-	local_irq_restore(flags);
 	sync_core();
+	local_irq_restore(flags);
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
 	return addr;
@@ -529,9 +529,9 @@ void *__kprobes text_poke(void *addr, co
 	BUG_ON(!vaddr);
 	local_irq_save(flags);
 	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
+	sync_core();
 	local_irq_restore(flags);
 	vunmap(vaddr);
-	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */

Second patch to apply on top of the first one :

x86 alternative text_poke add wbinvd

Add a cache flush instruction before reenabling interrupts in text_poke.
If this works, we could use clflush() (which is sadly buggy on some
archs) which is faster since it only clear a cacheline instead of the
entire cache.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 arch/x86/kernel/alternative.c | 2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6-lttng/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/alternative.c 2008-08-26 17:27:33.000000000 -0400
+++ linux-2.6-lttng/arch/x86/kernel/alternative.c 2008-08-26 17:27:53.000000000 -0400
@@ -489,6 ...
Well, in this case it's VirtualBox we're talking about, a virtual architecture. It's hard to know what it will do under *any* circumstances. -hpa --
With this I got the oops again when compiling in the guest. Reboot afterwards

With the second patch I get the early oops after "freeing smp". There seems
to be no way to get the guest booted normally (only with noreplace-paravirt).

So the changes from the other mail had more effect IMHO.

Gerhard
--
On Tue, 26 Aug 2008 22:34:49 +0200, Gerhard Brauer <gerhard.brauer@web.de> wrote:

| On Tue, Aug 26, 2008 at 02:15:58PM -0400, Mathieu Desnoyers wrote:
| > [...]
|
| Hi! These last changes (in addition to the others you mentioned) seem
| to be a good shot. I could reboot the guest 8 times, compile several
| packages (something which always leads to the oops) and currently I am
| building two big packages simultaneously. So this is heavy IO.

Yeah, it works for me too and it's good to know that you are doing
additional tests. I'm doing only boot tests... I was testing lots of
kernels and doing additional tests would take a lot of time.

Now, what does this mean? Is VirtualBox issuing interrupts when it
shouldn't or should this section of the code be better protected?

--
Luiz Fernando N. Capitulino
--
Since this problem appears while we are using a simple memcpy (the
text_poke_early version), but disappears when we disable interrupts for
a longer period, I suspect a problem with irq disabling in
Virtualbox.

We could try to add some nsleep() or msleep() calls within text_poke and
text_poke_early before and after the code modification to see if the
problem disappears. If it does, then that would somewhat confirm the
racy irq disable thesis.

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
On Wed, 27 Aug 2008 19:33:28 -0400, Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

| * Luiz Fernando N. Capitulino (lcapitulino@mandriva.com.br) wrote:
| > [...]
| >
| > Yeah, it works for me too and it's good to know that you are doing
| > additional tests. I'm doing only boot tests... I was testing lots of
| > kernels and doing additional tests would take a lot of time.
| >
| > Now, what does this mean? Is VirtualBox issuing interrupts when it
| > shouldn't or should this section of the code be better protected?
| >
|
| Since this problem appears while we are using a simple memcpy (the
| text_poke_early version), but disappears when we disable interrupts for
| a longer period, I suspect a problem with irq disabling in
| Virtualbox.
|
| We ...
Ok, some news from the archlinux side:
Our distribution kernel was upgraded from 2.6.26.2 to 2.6.26.3. With
this upgrade to patchlevel .3 the "early oops" (freeing smp...) has gone.
My virtual machines always boot fine with this, and I have one
confirmation from a user about this.

The kernel upgrade does not solve the kernel panic while working with the
VM when there is heavy disk IO. I tested and could reproduce this by
untarring 2 big files in separate dirs: bsdtar -x -f VirtualBox-1.6.2-OSE.tar.bz2.
Doing this simultaneously always crashed the VM.
Screenshot:
http://users.archlinux.de/~gerbra/tmp/2008-08-31-110449_724x456_scrot.png

This heavy IO oops does not occur under 2.6.26.2 when using the
"3-changes-patch" against alternative.c, which we have tested in the other
mails. There must be something IRQ related which this 3-changes-patch
fixes and which was not fixed in 2.6.26.3.

On the other hand: I had never stressed a VM like this before researching
this problem. So it could also be that the heavy-IO problem was a totally
separate problem from the one we're talking about here. Doing my "normal"
work in the VM now (it's my devel VM for compiling and testing), so far I
haven't had this IO oops.

We use a mostly unpatched kernel as distribution kernel.

So a short summary from my side:
a) With the "3-changes-patch" I got a rock solid VM
b) 2.6.26.2 has the early oops on boot, and the IO oops when it sometimes
   does boot.
c) 2.6.26.3 has only the heavy-IO oops

I'll try a fresh VM, where I will test:
a) Using SATA controller emulation as bus (now I have ide(piix3))
b) Using different filesystems (with 2.6.26.2 the early oops and heavy-IO
   oops could be reproduced with any filesystem).

Regards
Gerhard
--
Hi

On Sunday, 31 August 2008, Gerhard Brauer wrote:

Sorry, I can't confirm this here on Debian unstable (with virtualbox-ose
1.6.2 or 1.6.4); are you sure that other configuration options didn't
change between the different kernel versions? Preemption and paravirt can
seriously influence the probability of the early boot panic, without really
avoiding it altogether.

Actually I still get the same issues with implanting
ftp://ftp5.gwdg.de/pub/linux/archlinux/core/os/i686/kernel26-2.6.26.3-1-i686.pkg.tar.gz

Regards
Stefan Lippers-Hollmann
--
The only changes between our 2.6.26.2-1 and 2.6.26.3-1 are some minor
framebuffer changes in the config. If I look at the different patchsets
between the two versions I don't see anything which could be

Hmm, one user also reports that he has no problem when using a vanilla
2.6.26 as guest kernel. But there must be some reason why different
distributions notice a major problem between 2.6.25 and 2.6.26 with their
stock kernels. Although I don't even know if our few reports here

Gerhard
--
Today is the tomorrow you were afraid of yesterday...
--
On Sun, 31 Aug 2008 11:29:23 +0200, Gerhard Brauer <gerhard.brauer@web.de> wrote:

| On Thu, Aug 28, 2008 at 10:30:13AM -0300, Luiz Fernando N. Capitulino wrote:
| > On Wed, 27 Aug 2008 19:33:28 -0400, Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
| > | [...]
| >
| > Well, a Ubuntu kernel guy has reported in the virtualbox's ticket[1]
| > that the oops doesn't happen if he puts a printk() in the crash site.
| >
| > The funny thing is that someone (who might be a virtualbox developer)
| > used the same race argument to say that this is a bug in the kernel.
| >
| > What concerns me though is how virtualbox can be worth using
| > in the Linux community if it's probably not working for various distros
| > (currently Fedora, Ubuntu, Mandriva and ArchLinux).
| >
| > Thanks for the effort, guys.
| >
| > [1] http://www.virtualbox.org/ticket/1875
|
| Ok, some news from archlinux side:
| [...]
| ...
On Sunday, 31.08.2008, 11:09 -0300, Luiz Fernando N. Capitulino wrote:
I was away for the last few days, but I notice that the VirtualBox update to 2.0.2 solves all the problems I mentioned. We also have the same responses from other Arch users.
I never saw the "early boot oops" again (this had already gone away with our kernel update, mystical....). But the "heavy IO oops" has gone as well. So it was a VirtualBox problem; they fixed it in:
http://www.virtualbox.org/ticket/1875
So from my side this is solved.
Gerhard
--
great - the VirtualBox recompiler didn't notice the paravirt code modification sequence. Any Linux kernel side NOP issue (which caused that early oops) should be solved in 2.6.27-rc7 as well.
so the combo of VirtualBox 2.0.2 and later plus v2.6.27-rc7 and later should have no known bugs.
Ingo
--
On Sun, 21 Sep 2008 15:41:39 +0200 Gerhard Brauer <gerhard.brauer@web.de> wrote:
| On Sunday, 31.08.2008, 11:09 -0300, Luiz Fernando N.
| Capitulino wrote:
|
| > Mandriva kernel was 2.6.26.3 based at the time I started testing
| > this and all my last tests have been done on 2.6.27-rc4. I think it's
| > very unusual to have a change in a -stable kernel not present in the
| > latest -rc.
|
| I was away for the last few days, but I notice that the VirtualBox update to
| 2.0.2 solves all the problems I mentioned. We also have the same responses from
| other Arch users.
| I never saw the "early boot oops" again (this had already gone away with our
| kernel update, mystical....). But the "heavy IO oops" has gone as well. So
| it was a VirtualBox problem; they fixed it in:
| http://www.virtualbox.org/ticket/1875
|
| So from my side this is solved.
Yeah, we have run some tests here as well and it is solved.
Thanks a lot to the people involved in debugging this problem.
--
Luiz Fernando N. Capitulino
--
nsleep isn't known here as a function; the only references I found are maybe in posix-timers.c. msleep() is known, but each time I add e.g. msleep(100); anywhere in text_poke and/or text_poke_early I get a kernel panic on boot. Here's a screenshot:
http://users.archlinux.de/~gerbra/tmp/2008-08-28-132337_724x456_scrot.png
I also tried to work with the isolated changes we made last, but it seems that only the 3 changes together work. I also tried to go back to older versions of alternatives.c referenced in:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.26.y.git;a=history;f=arch/x...
But with my limited knowledge I ran into too many errors.
So, do you have any further ideas, or code that I/we could test? Or - I'm naive - are the "3 changes" we made ready to go in the kernel
Gerhard
--
Sorry for the delay, but I needed to build a complete distribution kernel and my machine is not the fastest.
My host: archlinux 2.6.26 P4 2Ghz
VirtualBox: Sun xVM 1.6.4 gcc 4.4.1-3
My guest: archlinux 2.6.26
My "tests":
I could sometimes boot the guest with the "tricks" (VT-x enabled, acpi off,...). But I always get an oops if I compile something bigger on this guest (e.g. virtualbox-modules, where the tarball must be unpacked with bsdtar -> disk IO).
If this happens, the next reboot always leads to the early oops (Freeing smp....). Every reboot does this. Then I close the VirtualBox application, unload/reload vboxdrv on the host and start vbox again. Then I can usually boot the guest again. But the next heavy disk IO leads to the oops again.
If I could boot without oops, and reboot or halt the guest, then the
With our distribution kernel I could change these spin_lock/unlock calls in alternatives.c. My first thought was that the behavior was slightly better (the first boot goes through, I could compile something), but with the next package I build the oops (heavy IO oops) comes again. And then, after a reboot, also the early oops (freeing smp...).
Here is a screenshot of the oops when building something:
http://users.archlinux.de/~gerbra/tmp/2008-08-26-210724_724x456_scrot.png
Sometimes (could not be reproduced) the VirtualBox app also traps with an error dialog (Guru message), which offers a log from the VM and a screenshot. Maybe this could be helpful. Log and screenshot can be found here:
Regards
Gerhard
--
On Tue, 26 Aug 2008 16:18:51 +0200 Gerhard Brauer <gerhard.brauer@web.de> wrote:
| On Fri, Aug 22, 2008 at 02:08:13PM -0700, H. Peter Anvin wrote:
| > Luiz Fernando N. Capitulino wrote:
| >>
| >> I have asked Mandriva and Ubuntu users to test this and all of
| >> them so far are saying that noreplace-paravirt works.
| >>
| >> It makes the system slower, but it works.
| >>
| >
| > Yes, the big issue is exactly what VirtualBox screws up in this matter,
| > how to detect it, and how to work around it.
| >
| > It's pretty clear it's a VirtualBox f*ckup at this point, but the failure
| > mechanism isn't at all obvious and so far the workaround is elusive.
| >
| > I strongly suspect this is a VirtualBox tcache management failure, but
| > that doesn't help the situation without knowing how it happens.
|
| On Archlinux we have the same problem. We have a bug report here:
| http://bugs.archlinux.org/task/11141
|
| I tested it myself with a LiveCD/Install-ISO which has 2.6.26 as install
| kernel. We have the guest oops on virtualbox-ose, virtualbox-sun and on both
| i686 and x86_64 hosts.
|
| Some things I noticed:
| - The system always boots when I either enable VT-x in guest settings or
| disable acpi and run the guest with acpi=off.
Yes, lots of Ubuntu users have reported the same, but another "lots" of them have reported that the trick didn't work.
Thanks for joining!
--
Luiz Fernando N. Capitulino
--
I must qualify the above: I have two test environments. One is our LiveCD/Install-ISO with 2.6.26, which we made specially for a Linux conference last weekend (our official ISO still comes with 2.6.25). With this ISO the "trick" with VT-x or noacpi works. But on an installed archlinux (with distribution kernel 2.6.26) this doesn't work. Sometimes it works when I restart the VirtualBox application, but mostly not. So on this installed guest system the only working solution seems to be adding noreplace-paravirt as a kernel parameter. But this makes the system terribly slow (mostly on udev things).
I am currently trying Mathieu's hints by building a new distribution kernel with the changes. But I think the biggest problem in solving this, from the kernel devs' point of view, is that we all have different "test" environments (vbox versions, architectures, distribution kernels,...) where the oops (I think) does not appear in the same place for everyone. On the other hand, the more we try such patches in different environments, the better the chance of getting a real fix - from kernel dev
Gerhard
--
www.archlinux.de
--
Was looking at the code stream, and noticed this:
Code: c0 0f 84 0b 01 00 00 b8 d0 bf 41 c0 c7 05 6c c0 41 c0 ff ff ff ff
e8 7f 82 21 00 e8 1a 03 02 00 8b 45 b0 50 9d 0f 1f 84 00 00 00 <00> 00
8b 45 bc 83 c4 60 5b 5e 5f 5d c3 66 90 a1 6c c0 41 c0 e8
The EIP is in the *MIDDLE* of a NOPL instruction:
C012FC46 C00F84 ror byte [edi],0x84
C012FC49 0B01 or eax,[ecx]
C012FC4B 0000 add [eax],al
C012FC4D B8D0BF41C0 mov eax,0xc041bfd0
C012FC52 C7056CC041C0FFFF mov dword [dword 0xc041c06c],0xffffffff
-FFFF
C012FC5C E87F822100 call dword 0xc0347ee0
C012FC61 E81A030200 call dword 0xc014ff80
C012FC66 8B45B0 mov eax,[ebp-0x50]
C012FC69 50 push eax
C012FC6A 9D popfd
C012FC6B 0F1F840000000000 nop dword [eax+eax+0x0]
C012FC73 8B45BC mov eax,[ebp-0x44]
C012FC76 83C460 add esp,byte +0x60
C012FC79 5B pop ebx
C012FC7A 5E pop esi
C012FC7B 5F pop edi
C012FC7C 5D pop ebp
C012FC7D C3 ret
C012FC7E 6690 xchg ax,ax
C012FC80 A16CC041C0 mov eax,[0xc041c06c]
There are two possibilities: VirtualBox mis-executes (not merely traps,
which is what tip:master looks for) the NOPL instruction, or something
is jumping into the middle of the sequence that is then replaced by the
NOPL.
So, Luiz: the DEBUG_INFO version of vmlinux would be helpful. It would
also help to know the exact version of VirtualBox you're running, what
source you got it from, and what your host system looks like.
-hpa
--
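The observation in the message above can be checked mechanically. What follows is a minimal Python sketch, not part of the original thread: it parses the quoted Code: line, finds the byte marked <..> (the byte at EIP) and reports where it falls inside the 8-byte NOPL sequence 0f 1f 84 00 00 00 00 00. The function names are made up for illustration; the byte values and the base address 0xC012FC46 are taken from the dump and the disassembly listing quoted above.

```python
# The "Code:" bytes quoted in the message above; <..> marks the byte at EIP.
CODE = (
    "c0 0f 84 0b 01 00 00 b8 d0 bf 41 c0 c7 05 6c c0 41 c0 ff ff ff ff "
    "e8 7f 82 21 00 e8 1a 03 02 00 8b 45 b0 50 9d 0f 1f 84 00 00 00 <00> 00 "
    "8b 45 bc 83 c4 60 5b 5e 5f 5d c3 66 90 a1 6c c0 41 c0 e8"
)

# 8-byte NOPL encoding as it appears in the dump
# (shown in the listing as: nop dword [eax+eax+0x0]).
NOPL8 = bytes.fromhex("0f1f840000000000")

# Address of the first dumped byte, taken from the disassembly listing.
DUMP_BASE = 0xC012FC46


def parse_code_line(text):
    """Return (raw bytes, index of the byte marked with <..>)."""
    data = bytearray()
    eip_index = None
    for token in text.split():
        if token.startswith("<") and token.endswith(">"):
            eip_index = len(data)
            token = token[1:-1]
        data.append(int(token, 16))
    if eip_index is None:
        raise ValueError("no <..> marker found in code line")
    return bytes(data), eip_index


def eip_offset_in_pattern(data, eip_index, pattern):
    """Return the offset of EIP inside an occurrence of pattern, or None."""
    start = data.find(pattern)
    while start != -1:
        if start <= eip_index < start + len(pattern):
            return eip_index - start
        start = data.find(pattern, start + 1)
    return None


if __name__ == "__main__":
    data, eip_index = parse_code_line(CODE)
    offset = eip_offset_in_pattern(data, eip_index, NOPL8)
    print("EIP address: %#x" % (DUMP_BASE + eip_index))
    print("offset inside NOPL:", offset)
```

For this dump the marked byte sits 43 bytes into the listing, at offset 6 of the NOPL, which corresponds to address 0xC012FC71 between the instruction boundaries C012FC6B and C012FC73.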
The patch which turns on this bug makes this important change to the apply paravirt code: it disables interrupts _near_ the code patching, _within_ the loop. Before, interrupts were disabled outside of the loop. It needs to disable interrupts within the loop to be able to use vmap in text_poke().
So I bet VirtualBox has a race in the way it handles interrupt disabling.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
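To make the structural difference described in the message above concrete, here is a toy Python model. It is not kernel code and not taken from the thread: the function names and the event trace are purely illustrative assumptions, and it only shows how the number of interrupts-off windows changes when the disable/enable pair moves from around the loop to inside it. Whether that change is what actually trips VirtualBox is the open question in this thread.

```python
def patch_irqs_off_outside(sites):
    """Old shape: one interrupts-off window around the whole patching loop."""
    trace = ["irq_off"]
    for site in sites:
        trace.append("patch:%s" % site)
    trace.append("irq_on")
    return trace


def patch_irqs_off_inside(sites):
    """New shape: a short interrupts-off window around each patch site."""
    trace = []
    for site in sites:
        trace += ["irq_off", "patch:%s" % site, "irq_on"]
    return trace


def irq_windows(trace):
    """Count how many times interrupts are re-enabled in a trace."""
    return trace.count("irq_on")
```

With N patch sites, the first shape re-enables interrupts once and the second N times, so an emulator that mishandles the transition gets many more chances to go wrong in the second case.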
That seems a bit far-fetched. The fault is in an initcall, and there are no interrupts involved. Perhaps VirtualBox doesn't manage its tcache correctly, but I don't see this as being interrupt-related. -hpa --
On Fri, 22 Aug 2008 10:16:07 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote:
| So, Luiz: the DEBUG_INFO version of vmlinux would be helpful. It would
| also help to know the exact version of VirtualBox you're running, what
| source you got it from, and what your host system looks like.
You will find vmlinux with DEBUG_INFO enabled at:
http://users.mandriva.com.br/~lcapitulino/virtualbox-oops/
I'm running Mandriva's VirtualBox 1.6.4 OSE, my host kernel is 2.6.26-3mnb (patched).
I could try with upstream's VirtualBox just to be sure it's not something else, but I don't think it is since there are reports for ArchLinux and Ubuntu as well:
https://bugs.launchpad.net/ubuntu/intrepid/+source/linux/+bug/246067
--
Luiz Fernando N. Capitulino
--
Not necessary, but I wanted to get the information so I can try to reproduce locally. -hpa --
What is your host *system* like -- CPU especially, and is your host kernel 32 or 64 bits? -hpa --
On Fri, 22 Aug 2008 12:18:21 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote:
| Luiz Fernando N. Capitulino wrote:
| >
| > I'm running Mandriva's VirtualBox 1.6.4 OSE, my host kernel is 2.6.26-3mnb
| > (patched).
| >
|
| What is your host *system* like -- CPU especially, and is your host
| kernel 32 or 64 bits?
32 bits, /proc/cpuinfo output:
"""
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping : 1
cpu MHz : 2410.462
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pebs bts pni monitor ds_cpl cid xtpr
bogomips : 4825.33
clflush size : 64
power management:
"""
I have 1G of RAM and a VIA mobo.
--
Luiz Fernando N. Capitulino
--
On Thu, 21 Aug 2008 14:34:07 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote:
| >
| > Does this look like a kernel bug?
| >
|
| No, it looks like a very common virtualizer bug. Does the attached
| patch work for you?
Unfortunately it does not.
--
Luiz Fernando N. Capitulino
--
