Re: [kvm-devel] guest crash on 2.6.20-rc4

Previous thread: [PATCH] Driver core: fix refcounting bug by Alan Stern on Monday, January 8, 2007 - 9:06 am. (5 messages)

Next thread: [PATCH] Broadcom 4400 resume small fix by Dmitriy Monakhov on Monday, January 8, 2007 - 9:26 am. (2 messages)
From: Roland Dreier
Date: Monday, January 8, 2007 - 9:18 am

I'm running a 64-bit Fedora 6 install as a guest on a host running
2.6.20-rc4 with the kvm-10 userspace release.  The CPU is a Xeon 5160
and I have 6 GB of RAM.  The guest is given 512 MB of memory.  I left
the guest idle overnight, and the makewhatis cron job seems to have
triggered this:

    Unable to handle kernel paging request at ffff81000ba04000 RIP:
     [<ffffffff8025f402>] clear_page+0x16/0x44
    PGD 8063 PUD 9063 PMD 800000000ba001e3 PTE aad8a7d881d984d9
    Oops: 0003 [1] SMP
    last sysfs file: /block/hda/removable
    CPU 0
    Modules linked in: autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables dm_multipath video sbs i2c_ec i2c_core button battery asus_acpi ac ipv6 parport_pc lp parport floppy pcspkr ne2k_pci 8390 serio_raw ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
    Pid: 4687, comm: makewhatis Not tainted 2.6.18-1.2869.fc6 #1
    RIP: 0010:[<ffffffff8025f402>]  [<ffffffff8025f402>] clear_page+0x16/0x44
    RSP: 0018:ffff810003e85c40  EFLAGS: 00010216
    RAX: 0000000000000000 RBX: ffff8100012e9140 RCX: 000000000000003f
    RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff81000ba04000
    RBP: 0000000000000001 R08: ffff81001fdc4d8e R09: 00000000000021e6
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff8100012e9100
    R13: ffff81000000b500 R14: ffff81000000c400 R15: 0000000000000001
    FS:  00002aaaaaac6db0(0000) GS:ffffffff805e4000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff81000ba04000 CR3: 0000000003c05000 CR4: 00000000000006e0
    Process makewhatis (pid: 4687, threadinfo ffff810003e84000, task ffff81001f3a77f0)
    Stack:  ffffffff8020a632 ffff81000000c400 0000004400000000 ffff81000000c400
     000284d000000000 0000000000000001 0000000000000001 00000000000084d0
     ffff81000000c400 ...
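Two numbers in this oops carry most of the information: the error code after `Oops:` and the PMD entry. The sketch below decodes them using the x86-64 page-fault error-code bits and page-directory-entry flag bits. It assumes that the value printed after `Oops:` is the raw hardware error code, and the struct and function names are made up for illustration. Under that reading, 0003 is a kernel-mode write to a page that was present, so this is a protection fault and not a not-present fault. PMD 800000000ba001e3 is a present, writable, dirty, global, supervisor-only 2 MB page (PS and NX set) with base 0x0ba00000. That base covers the faulting address's offset 0x0ba04000 from the direct-map base ffff810000000000. In other words, the guest's own page tables permit the write that faulted. Because the mapping ends at the PMD, the printed PTE value is probably not a real page-table entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (assumed to be what follows "Oops:"). */
#define PF_PROT  (1u << 0) /* 1 = protection violation on a present page */
#define PF_WRITE (1u << 1) /* 1 = write access, 0 = read */
#define PF_USER  (1u << 2) /* 1 = user mode, 0 = kernel mode */
#define PF_RSVD  (1u << 3) /* 1 = reserved bit set in a paging entry */
#define PF_INSTR (1u << 4) /* 1 = instruction fetch */

/* x86-64 page-directory entry (Linux "PMD") flag bits. */
#define PDE_PRESENT (1ull << 0)
#define PDE_RW      (1ull << 1)
#define PDE_USER    (1ull << 2)
#define PDE_DIRTY   (1ull << 6)
#define PDE_PS      (1ull << 7)  /* entry maps a 2 MB page directly */
#define PDE_GLOBAL  (1ull << 8)
#define PDE_NX      (1ull << 63)
#define PDE_2M_ADDR 0x000FFFFFFFE00000ull /* frame address bits 51..21 */

struct pf_error { bool present, write, user, rsvd, fetch; };
struct pde_info { bool present, writable, user, dirty, large, global, nx; uint64_t base; };

static struct pf_error decode_pf_error(unsigned code) {
    struct pf_error e = {
        .present = code & PF_PROT,
        .write = code & PF_WRITE,
        .user = code & PF_USER,
        .rsvd = code & PF_RSVD,
        .fetch = code & PF_INSTR,
    };
    return e;
}

static struct pde_info decode_pde(uint64_t pde) {
    struct pde_info d = {
        .present = pde & PDE_PRESENT,
        .writable = pde & PDE_RW,
        .user = pde & PDE_USER,
        .dirty = pde & PDE_DIRTY,
        .large = pde & PDE_PS,
        .global = pde & PDE_GLOBAL,
        .nx = pde & PDE_NX,
        .base = 0,
    };
    /* Only a large-page entry holds a 2 MB frame address in these bits. */
    if (d.large)
        d.base = pde & PDE_2M_ADDR;
    return d;
}

/* Does the 2 MB page described by d cover physical address phys? */
static bool pde_covers(struct pde_info d, uint64_t phys) {
    assert(d.large);
    return phys >= d.base && phys - d.base < (2ull << 20);
}
```

For example, `decode_pf_error(0x0003)` yields `present` and `write` set with `user` clear. `decode_pde(0x800000000ba001e3)` yields a large, writable, supervisor-only entry whose `base` is 0x0ba00000.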
From: Avi Kivity
Date: Tuesday, January 9, 2007 - 1:53 am

A way to reproduce this would be nice, though I realize that's asking a lot.

-- 
error compiling committee.c: too many arguments to function


From: Avi Kivity
Date: Tuesday, January 9, 2007 - 2:26 pm

I've managed to reproduce a bug with similar characteristics: a write 
fault into a present, writable kernel page.  The attached patch should 
fix it.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Roland Dreier
Date: Wednesday, January 10, 2007 - 11:31 am

 > I've managed to reproduce a bug with similar characteristics: a write
 > fault into a present, writable kernel page.  The attached patch should
 > fix it.

Sorry for the delay in continuing this thread.  Anyway, the oops seems
to be pretty reproducible by running the makewhatis and locate db
update scripts in a loop.  I've applied your patch and kicked off a
test run;  I'll let you know if I can still get the bug to happen.
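The reproduction recipe here is to run the makewhatis and locate database update jobs back to back in a loop. A minimal POSIX C harness for that loop follows. The function names are hypothetical. The thread does not give the exact command lines for the two cron jobs, so the caller supplies them.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (looked up in PATH) with arguments argv and wait for it.
 * Returns 1 if it exited normally with status 0, else 0. */
static int run_once(char *const argv[]) {
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 0;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        perror("execvp"); /* only reached if exec failed */
        _exit(127);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return 0;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Alternate two commands for `rounds` rounds; return the number of
 * runs that succeeded (at most 2 * rounds). */
static int stress_loop(char *const cmd_a[], char *const cmd_b[], int rounds) {
    int ok = 0;
    for (int i = 0; i < rounds; i++) {
        ok += run_once(cmd_a);
        ok += run_once(cmd_b);
    }
    return ok;
}
```

To use it, pass the two job command lines as `cmd_a` and `cmd_b` with a large `rounds` value and watch the guest console for a new oops. The success count is only a rough signal. An oops in the guest kernel may kill the job or take down the whole guest.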

Thanks

From: Roland Dreier
Date: Wednesday, January 10, 2007 - 2:33 pm

 >  	if (is_writeble_pte(*shadow_ent))
 > -		return 0;
 > +		return 1;

With this patch, it looks like my guest is surviving the load that
triggered the oops before.  So I think this fixes the issue I saw as well.
I assume you'll send this in for 2.6.20?

 - R.

From: Avi Kivity
Date: Thursday, January 11, 2007 - 1:06 am

The patch actually replaces one bug (guest pagefaults on writable dirty 
ptes, under certain conditions) with another, rarer one (spinning on a 
user-mode pagefault on writable dirty kernel ptes).  I'll do it right 
and re-test, then send for .20 along with a few friends.
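A toy model makes the trade-off between the two bugs concrete. This is not KVM code. It models only write faults whose shadow PTE is already writable, which is the `is_writeble_pte()` branch quoted earlier. It assumes that a zero return from that branch means the fault is injected into the guest. It also assumes that a nonzero return means KVM treats the fault as handled and re-executes the guest instruction. The privilege-aware handler is purely hypothetical. It shows one way to avoid both failure modes and is not necessarily the fix that was eventually sent.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only, not KVM source. Each call models a guest write fault
 * on a writable, dirty guest PTE whose shadow PTE is already writable. */
enum action { INJECT_INTO_GUEST, RETRY_INSTRUCTION };
enum verdict { OK_AFTER_RETRY, CORRECT_GUEST_FAULT, BOGUS_GUEST_FAULT, SPINS };

typedef enum action (*handler_fn)(bool pte_user, bool user_access);

/* Before the patch: "return 0", assumed to mean "inject into the guest". */
static enum action original_handler(bool pte_user, bool user_access) {
    (void)pte_user;
    (void)user_access;
    return INJECT_INTO_GUEST;
}

/* With the patch: "return 1", assumed to mean "handled, retry". */
static enum action patched_handler(bool pte_user, bool user_access) {
    (void)pte_user;
    (void)user_access;
    return RETRY_INSTRUCTION;
}

/* Hypothetical: retry only when the guest mapping permits the access,
 * i.e. a kernel-mode write, or a user-mode write to a user page. */
static enum action privilege_aware_handler(bool pte_user, bool user_access) {
    return (!user_access || pte_user) ? RETRY_INSTRUCTION : INJECT_INTO_GUEST;
}

/* Replay one fault up to `budget` times and classify what the guest sees. */
static enum verdict replay(handler_fn h, bool pte_user, bool user_access, int budget) {
    assert(budget > 0);
    bool permitted = !user_access || pte_user; /* guest PTE is writable */
    for (int i = 0; i < budget; i++) {
        if (h(pte_user, user_access) == INJECT_INTO_GUEST)
            return permitted ? BOGUS_GUEST_FAULT : CORRECT_GUEST_FAULT;
        if (permitted)
            return OK_AFTER_RETRY; /* assumed: the retried write succeeds */
        /* A forbidden access faults again with identical state. */
    }
    return SPINS;
}
```

Under these assumptions, `replay(original_handler, false, false, 10)` returns `BOGUS_GUEST_FAULT`. This matches the clear_page oops: a kernel write to a writable kernel page is reported to the guest. `replay(patched_handler, false, true, 10)` returns `SPINS`, because a user-mode write to a writable kernel page is retried forever and never reported.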


-- 
error compiling committee.c: too many arguments to function
