Hi, while running migration tests, we hit the panic below:

<1>BUG: unable to handle kernel paging request at 0000000b819fdb98
<1>IP: [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
<4>PGD 94b10067 PUD 0
<0>Oops: 0000 [#1] SMP
<0>last sysfs file: /sys/class/misc/autofs/dev
<4>CPU 3
<4>Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U) auth_rpcgss(U) rfcomm(U) l2cap(U) bluetooth(U) rfkill(U) lockd(U) sunrpc(U) nf_conntrack_netbios_ns(U) ipt_REJECT(U) nf_conntrack_ipv4(U) nf_defrag_ipv4(U) xt_state(U) nf_conntrack(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) ipv6(U) parport_pc(U) lp(U) parport(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U) snd_page_alloc(U) joydev(U) xen_netfront(U) pcspkr(U) xen_blkfront(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 18, comm: events/3 Not tainted 2.6.32
RIP: e030:[<ffffffff812a588f>] [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
RSP: e02b:ffff8800e7bf7bd0 EFLAGS: 00010202
RAX: ffff8800e61c8000 RBX: ffff8800e62f82c0 RCX: 0000000000000000
RDX: 00000000000001e3 RSI: ffff8800e7bf7c68 RDI: 0000000bfffffff4
RBP: ffff8800e7bf7be0 R08: 00000000000001e2 R09: ffff8800e62f82c0
R10: 0000000000000001 R11: ffff8800e6386110 R12: 0000000000000000
R13: 0000000000000007 R14: ffff8800e62f82e0 R15: 0000000000000240
FS: 00007f409d3906e0(0000) GS:ffff8800028b8000(0000) GS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000b819fdb98 CR3: 000000003ee3b000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process events/3 (pid: 18, threadinfo ffff8800e7bf6000, task f8800e7bf4540)
Stack:
 0000000000000200 ffff8800e61c8000 ffff8800e7bf7c00 ffffffff812712c9
<0> ffffffff8100ea5f ffffffff81438d80 ffff8800e7bf7cd0 ffffffff812714ee
<0> ...
Joe, the patch looks good. However, I am unclear from your description whether the patch fixes the problem (I would presume so). Or does it take a long time --
Yes. Over more than 100 migrations, we hit this issue about 3 times.
I dumped a vmcore when the guest crashed. In the vmcore everything
looked good: fb_info, xenfb_info and so on.
From the call trace, I suspect that while the guest is resuming, the kevent
process gets scheduled and refreshes xenfb. It looks like it would be better
to confirm that the irq is valid before calling notify_remote_via_irq()?
Please review the new patch.
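To make the question concrete, here is a minimal userspace sketch of such a guard. It is not the kernel code: struct mock_fb_info, mock_notify() and mock_send_event() are hypothetical stand-ins, and it assumes that a negative irq value marks "no event channel bound".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; only the fields discussed here. */
struct mock_fb_info {
    int irq;            /* assumption: negative while no event channel is bound */
    int notifications;  /* counts calls that reached the mock notifier */
};

/* Mock notifier. A real notifier would look up per-irq state, so an
 * invalid irq must never get this far; the assert models that rule. */
static void mock_notify(struct mock_fb_info *info)
{
    assert(info->irq >= 0);
    info->notifications++;
}

/* Guarded send: skip the notification while the irq is not valid.
 * Returns true if the backend was notified, false if the event was skipped. */
static bool mock_send_event(struct mock_fb_info *info)
{
    if (info->irq < 0)
        return false;
    mock_notify(info);
    return true;
}
```

With irq set to -1, mock_send_event() returns false and never reaches the notifier. Once a valid irq is stored, the notification goes through as usual.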
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
video/xen-fbfront.c | 19 +++++++++++--------
xen/events.c | 2 ++
2 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
index dc72563..367fb1c 100644
--- a/drivers/video/xen-fbfront.c
+++ b/drivers/video/xen-fbfront.c
@@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
static int xenfb_connect_backend(struct xenbus_device *dev,
struct xenfb_info *info)
{
- int ret, evtchn;
+ int ret, evtchn, irq;
struct xenbus_transaction xbt;
ret = xenbus_alloc_evtchn(dev, &evtchn);
if (ret)
return ret;
- ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
+ irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
0, dev->devicetype, info);
- if (ret < 0) {
+ if (irq < 0) {
xenbus_free_evtchn(dev, evtchn);
xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
- return ret;
+ return irq;
}
- info->irq = ret;
-
again:
ret = xenbus_transaction_start(&xbt);
if (ret) {
xenbus_dev_fatal(dev, ret, "starting transaction");
- return ret;
+ goto unbind_irq;
}
ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
virt_to_mfn(info->page));
@@ -602,15 +600,20 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
if (ret == -EAGAIN)
goto again;
xenbus_dev_fatal(dev, ret, ...

OK, so you are still trying to find the culprit. Did you look at this patch from Ian: https://patchwork.kernel.org/patch/403192/

And are the event channels correct? You could insert a WARN_ON here to see if you get this during your migration process. --
I also don't see how the patch relates to the stack trace. Is the issue that xenfb_send_event is called between xenfb_resume (which tears down the state, including the evtchn->irq binding) and the probe/connect of the new fb?

I suspect xenfb_resume should set info->update_wanted to 0. This will defer updates until we have successfully reconnected.

Otherwise, since xenfb_resume also calls xenfb_connect, I presume the call of xenfb_send_event is asynchronous, which would suggest that some other locking is missing. --
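The update_wanted suggestion can be sketched as a small userspace model. Everything here is hypothetical (struct mock_fb, mock_resume_teardown(), mock_reconnected(), mock_refresh()), and it assumes the refresh path checks update_wanted before sending an event. It is single-threaded, so it only illustrates the ordering and says nothing about the locking question.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state touched during resume. */
struct mock_fb {
    int irq;             /* assumption: negative while unbound */
    bool update_wanted;  /* refreshes are sent only while this is set */
    int events_sent;
};

/* Resume teardown: stop refreshes first, then drop the irq binding. */
static void mock_resume_teardown(struct mock_fb *fb)
{
    fb->update_wanted = false;
    fb->irq = -1;
}

/* Called once the new connection is fully set up. */
static void mock_reconnected(struct mock_fb *fb, int new_irq)
{
    fb->irq = new_irq;
    fb->update_wanted = true;
}

/* Refresh path: deferred while update_wanted is clear.
 * Returns true if an event was sent. */
static bool mock_refresh(struct mock_fb *fb)
{
    if (!fb->update_wanted)
        return false;
    assert(fb->irq >= 0);  /* would be the invalid lookup otherwise */
    fb->events_sent++;
    return true;
}
```

In this model a refresh that arrives between teardown and reconnect is dropped and never sees the stale irq. In the real driver the refresh runs from a workqueue, so the flag alone may not close the window without additional synchronization.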
