Cannot boot xen DomU > 2.6.23.1

To: <linux-kernel@...>
Date: Thursday, January 17, 2008 - 12:13 pm

Hi,

I finally found the piece of code that prevents me from booting a Xen DomU
with vanilla kernels > 2.6.23.1.

The problem is that every kernel > 2.6.23.1 (including the 2.6.24 RCs)
just hangs with "too much" console activity. Sometimes (well, most
of the time) the boot messages alone are too much. When I can boot into
the kernel, generating a lot of console output makes it hang: no oops,
no more console or network. Generating the same output over ssh does not
hang the domU.

When I revert the following patch, things work as before; I tried this
with 2.6.23.14 and 2.6.24-rc8. But I don't have the knowledge to
understand the reason behind this.

BTW, I am not subscribed.

--- a/include/xen/interface/vcpu.h
+++ b/include/xen/interface/vcpu.h
@@ -160,8 +160,9 @@ struct vcpu_set_singleshot_timer {
  */
 #define VCPUOP_register_vcpu_info 10 /* arg == struct vcpu_info */
 struct vcpu_register_vcpu_info {
-    uint32_t mfn;    /* mfn of page to place vcpu_info */
-    uint32_t offset; /* offset within page */
+    uint64_t mfn;    /* mfn of page to place vcpu_info */
+    uint32_t offset; /* offset within page */
+    uint32_t rsvd;   /* unused */
 };

 #endif /* __XEN_PUBLIC_VCPU_H__ */

I am running Xen 3.1.2 PAE

# uname -a
Linux builder 2.6.24-rc8 #2 SMP Thu Jan 17 16:37:19 CET 2008 i686
AMD Athlon(tm) X2 Dual Core Processor BE-2300 AuthenticAMD GNU/Linux

# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) X2 Dual Core Processor BE-2300
stepping : 1
cpu MHz : 1899.930
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu tsc msr pae mce cx8 mca cmov pat pse36 clflush
mmx fxsr sse sse2 ht nx mm...

To: xming <xmingske@...>
Cc: <linux-kernel@...>
Date: Thursday, January 17, 2008 - 1:15 pm

Uh, that's very strange. That patch fixes an outright bug, unless
you're using one specific changeset out of the xen-unstable mercurial
tree. What version of Xen are you using? Is it one you've built from
xenbits, or distributed by someone? Are you using a 32 or 64 bit xen/dom0?

I don't understand where the "too much console activity" message comes
from. Could you post an actual cut'n'paste of the message, or a
screenshot? Are you using a serial console on your dom0?

Thanks,

--

To: <linux-kernel@...>
Cc: Jeremy Fitzhardinge <jeremy@...>
Date: Thursday, January 17, 2008 - 3:13 pm

I am running Xen 3.1.2 from Gentoo (I also tried 3.1.1), 32-bit PAE, as I wrote in my first message.

I never had any problem (at least THIS problem) with any PV DomU (2.6.18,
2.6.20, 2.6.21, 2.6.23-RCs and 2.6.23.1). The problem started with 2.6.23.3
and today I finally found time to track it down.

This only affects PV domU, so I don't understand your question about the
serial console of Dom0.

The symptom is (with a lot of subjective judgment) that when there is a lot
of (or too quick) output on the console of the domU (hvc0 connected with
either "xm crea file.cfg -c" or "xm cons id") the whole PV domU hangs. It
hangs at random places, sometimes right after init and sometimes
after I have logged in and just generate some output (on hvc0) like "find /".
IIRC I have never seen a hang before init.

When it hangs there is no more output on the console and the network to
that domU is dead too; dom0 is not affected. "xm list" still reports the
domU as "r", there is nothing special in the logs (the Xen logs in dom0),
and no oops or panic is reported in the domU. It seems to be running in an
infinite loop.

So, I can make screenshots, but they won't tell you anything; there is no
message, it's just dead.

Gentoo's xen is just a src tarball made from the mercurial repo, w/o
any patches (AFAIK).

cheers

Ming-Wei Shih
--

To: xming <xmingske@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Friday, January 18, 2008 - 3:32 am

OK, I misunderstood your original report to mean that something was
complaining about "too much" output. You're saying that lots of console
output seems to lock the domain.

I've had a report that heavy disk IO seems to cause lockups as well. Perhaps
they're both related to high event rates. Do you think you could try an
IO-intensive workload to see if you can get a similar lockup?

When the domain is locked up, what does /usr/lib/xen/bin/xenctx say?

Hm. Rather than backing out the structure-change patch, could you try
this workaround:

diff -r be3ca4e0e19e arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c Thu Jan 17 14:25:07 2008 -0800
+++ b/arch/x86/xen/enlighten.c Thu Jan 17 16:37:42 2008 -0800
@@ -95,7 +95,7 @@ struct shared_info *HYPERVISOR_shared_in
  *
  * 0: not available, 1: available
  */
-static int have_vcpu_info_placement = 1;
+static int have_vcpu_info_placement = 0;

 static void __init xen_vcpu_setup(int cpu)
 {

Reverting the structure shape could cause crashes or random data
corruption, but it has the side-effect of disabling the vcpu_info
structure placement mechanism. This patch disables it cleanly.

Thanks,
J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Friday, January 18, 2008 - 8:38 am

First of all, this patch solves the lock-ups; it works as advertised :) The DomU
works as before. Just for the record, for people trying to apply this to 2.6.23.x:
you need to change /x86/ to /i386/ in the paths, since the unified x86 tree only
exists since 2.6.24.

I tried to create 2 tests, one is IO intensive and the other is console
output intensive:

test1. bonnie++ -s 1024 -u nobody
test2. for i in `seq 1 50000`; do echo 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ; done

In all cases where it crashed (hung) there was no oops/panic.

scenario 1 (booted 2.6.23.14 as is)
--------------------------------------------------
(but with init=/bin/bash, otherwise I couldn't get a prompt)

test1: crashed

# /usr/lib/xen/bin/xenctx 108
eip: c037c0c7
esp: c0343f90
eax: 00000000 ebx: 00000001 ecx: 00000000 edx: c0342000
esi: c0373004 edi: c1210df4 ebp: 00001b7d
cs: 00000061 ds: 0000007b fs: 000000d8 gs: 00000000

Stack:
c0100add c0378980 c0101962 c0104821 c120a000 c0378df4 c0348cff 00000025
c0348430 00000004 00009000 00006df4 00ea1000 c0363be0 c0343fe8 c03dd007
00000000 c0343fec c0349868 c0343fe0 178bc1f1 00002001 01020800 00060fb1
00000000 c03dd000 00000000 00000000

Code:
cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 06 00 00 00 cd 82 <c3> cc
cc cc cc cc cc cc cc cc cc

Call Trace:
[<c037c0c7>] <--
[<c0100add>]
[<c0378980>]
[<c0101962>]
[<c0104821>]
[<c120a000>]
[<c0378df4>]
[<c0348cff>]
[<c0348430>]
[<c0363be0>]
[<c0343fe8>]
[<c03dd007>]
[<c0343fec>]
[<c0349868>]
[<c0343fe0>]
[<178bc1f1>]
[<c03dd000>]

test2: crashed after many many retries and sometimes with strange output

00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAA
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAA...

To: xming <xmingske@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Friday, January 18, 2008 - 12:19 pm

OK, good. I guess events are getting lost somewhere with vcpu_info

Would it be possible to map the eip and some top parts of the stack back
to kernel symbols? Seems to be the same place in both traces, which is

Hm, I guess some of the output is getting dropped. Does this happen
with 2.6.18-xen?

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Friday, January 18, 2008 - 12:56 pm

yes it does

00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
0000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ

# uname -a
Linux builder 2.6.18-xen-r8 #3 SMP Thu Dec 20 15:07:20 CET 2007 i686
AMD Athlon(tm) X2 Dual Core Processor BE-2300 AuthenticAMD GNU/Linux

cheers

xming
--

To: xming <xmingske@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Friday, January 18, 2008 - 1:26 pm

Do "nm -n vmlinux" on the kernel to get an address-sorted list of
symbols, and then look to see what's near the eip (c037c0c7) and near
the top of the stack (c0100add, c0378980, c0101962, ...). Some of these
may be in data, or other strange places, but the ones which correspond

OK, good. I Didn't Break It (tm) ;)

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Sunday, January 20, 2008 - 9:08 am

ok I have done some of them, but I still don't know what I should be looking
at. Do you mean code related to xen or code related to have_vcpu_info_placement?
Please be patient with me :)

I just paste some of the result (around those addresses) here:

c037b000 B empty_zero_page
c037c000 B hypercall_page
c037d000 B system_state

c0100a00 t xen_cpuid
c0100a80 t xen_set_debugreg
c0100a90 t xen_get_debugreg
c0100aa0 t xen_save_fl
c0100ac0 t xen_irq_disable
c0100ad0 t xen_safe_halt
c0100af0 t xen_halt
c0100b20 t xen_store_tr
c0100b30 t cvt_gate_to_trap
c0100bb0 t xen_io_delay

c0378980 D per_cpu__irq_stat
c03789c0 d per_cpu__runqueues
c0378df4 D __per_cpu_end

c01018b0 t xen_flush_tlb_single
c0101940 t xen_idle
c0101980 T xen_setup_features
c01019c0 T xen_mc_flush
c0101aa0 T xen_mc_callback

c0104710 T kernel_thread
c01047c0 T cpu_idle
c0104840 T cpu_idle_wait
c0104940 T exit_thread

c0103fe4 T xen_irq_enable_direct
c0103ff1 T xen_irq_enable_direct_reloc
c0103ff5 T xen_irq_enable_direct_end
c0103ff8 T xen_irq_disable_direct
c0104000 T xen_irq_disable_direct_end
c0104004 T xen_save_fl_direct
c0104011 T xen_save_fl_direct_end
c0104014 T xen_restore_fl_direct
c010402b T xen_restore_fl_direct_reloc

c03483f0 t maxcpus
c0348430 t unknown_bootoption

So no fix from you? :)

Thanks
--

To: xming <xmingske@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Sunday, January 20, 2008 - 2:37 pm

Thanks, that answers that particular question; the vcpu is blocked
waiting for something to happen, which probably means it missed the
event which was supposed to wake it up. Why is another question. At
least there's a workaround, and that workaround gives me some clue where
to look.

Maybe when I have nothing else to do.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Sunday, January 20, 2008 - 3:29 pm

It doesn't matter, I tried vcpu=1 and vcpu=2, unless you want me to try

I'll wait, or should I poke xen-devel?
--

To: xming <xmingske@...>
Cc: <linux-kernel@...>, Xen-devel <xen-devel@...>
Date: Sunday, January 20, 2008 - 7:52 pm

I'll probably look at this when my current batch of work is under
control. In the meantime, I'll submit the workaround patch to keep

It would be an interesting datapoint, but I don't think it will make a

Poke xen-devel.

J
--
