Re: kvm causing memory corruption? now 2.6.26-rc4

Previous thread: [PATCH 2/2] [PATCH] move apic declarations to mach_apic.h by Glauber Costa on Tuesday, March 25, 2008 - 2:10 pm. (1 message)

Next thread: V3 Patch - Driver for Freescale 8610 and 5121 DIU by York Sun on Tuesday, March 25, 2008 - 2:27 pm. (5 messages)
From: Dave Hansen
Date: Tuesday, March 25, 2008 - 2:12 pm

I was getting some kvm userspace crashes trying to run a Windows guest.
So, I decided to try a recent kernel (2.6.25-rc6-00333-ga4083c9)  with
the kvm kernel code that shipped with that kernel.

I've had some lockups doing similar things over the last month or two,
but figured it was something really stupid I was doing, and never really
connected the dots.  Now, I've hooked up a serial console and reproduced
it with a fresh boot and not much else going on at all on the machine.

Machine is a Thinkpad T61.  .config is here:

http://sr71.net/~dave/linux/config-2.6.25-rc6-00333-ga4083c9

To trigger it, I first run kvm and see an error (-no-kvm works fine,
btw):

$ ~/src/kvm-userspace/qemu/x86_64-softmmu/qemu-system-x86_64 -hda ~/projects/qemu/windows-xp-base-runme.img 
kvm_run: Cannot allocate memory
kvm_run returned -12

Then, run it again.  I usually get an oops.  But, the weird part is that
the oops isn't *in* kvm.  It's in some other part of the kernel and in
some *OTHER* process.  One in bash is below.  That's what leads me to
believe it is memory corruption.  The machine also becomes increasingly
unstable after the original oops so there's definitely collateral
damage.
        
        $ addr2line -e vmlinux c01795e4
        /home/dave/kernels/linux-2.6.git/mm/filemap.c:1327
        
        int filemap_fault(struct vm_area_struct *vma, struct vm_fault
        *vmf)
        {
                int error;
                struct file *file = vma->vm_file;
                struct address_space *mapping = file->f_mapping;
                struct file_ra_state *ra = &file->f_ra;
        HERE--->struct inode *inode = mapping->host;

Which is a line of code that literally hasn't touched since the
beginning of time (in git terms :).  Full oops is below:

[  435.057922] BUG: unable to handle kernel NULL pointer dereference at 00000048
[  435.067275] IP: [<c01795e4>] filemap_fault+0x34/0x310
[  435.072815] *pdpt = 000000002a4a7001 *pde = 0000000000000000 
[  435.081272] Oops: ...
From: Avi Kivity
Date: Wednesday, March 26, 2008 - 2:50 am

This is fixed in 2.6.25-rc7.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dave Hansen
Date: Wednesday, March 26, 2008 - 9:52 am

I just updated to -rc7 and re-tested.  Same symptoms:

[  751.033545] BUG: unable to handle kernel paging request at 0096b848
[  751.040082] IP: [<c01a0636>] d_instantiate+0x26/0x50
[  751.048065] Oops: 0002 [#1] SMP
[  751.052057] Modules linked in: kvm_intel kvm nls_iso8859_1 vfat fat rfcomm l2cap tun ppdev acpi_cpufreq cpufreq_ondemand cpe
[  751.052057]
[  751.052057] Pid: 8743, comm: evolution Not tainted (2.6.25-rc7 #146)
[  751.052057] EIP: 0060:[<c01a0636>] EFLAGS: 00210286 CPU: 0
[  751.052057] EIP is at d_instantiate+0x26/0x50
[  751.052057] EAX: 0096b844 EBX: e65d7d48 ECX: 00000000 EDX: e65d7d60
[  751.052057] ESI: e67a7d00 EDI: e67a7cc0 EBP: e802ce48 ESP: e802ce3c
[  751.052057]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  751.052057] Process evolution (pid: 8743, ti=e802c000 task=f3c8ce00 task.ti=e802c000)
[  751.052057] Stack: e65d7d48 00000000 f4c191f8 e802ce60 c01e2fa4 e67a7cc0 f4c191f8 e65d7d48
[  751.052057]        e660c280 e802ce80 c01e30c3 00008180 e67a7cc0 00000000 c03b4a00 e660c280
[  751.052057]        00008180 e802cea0 c0197964 e802cf24 c03b4a00 e67a7cc0 e660c280 e802cf24
[  751.052057] Call Trace:
[  751.052057]  [<c01e2fa4>] ? ext3_add_nondir+0x34/0x60
[  751.052057]  [<c01e30c3>] ? ext3_create+0xf3/0x100
[  751.052057]  [<c0197964>] ? vfs_create+0x74/0x100
[  751.052057]  [<c0197c8f>] ? open_namei_create+0x4f/0xa0
[  751.052057]  [<c01981f3>] ? open_namei+0x513/0x560
[  751.052057]  [<c018db2c>] ? do_filp_open+0x2c/0x60
[  751.052057]  [<c018dd29>] ? get_unused_fd_flags+0x39/0xd0
[  751.052057]  [<c018dec4>] ? do_sys_open+0x54/0xe0
[  751.052057]  [<c018df6c>] ? sys_open+0x1c/0x20
[  751.052057]  [<c0104e2c>] ? sysenter_past_esp+0x6d/0xa5
[  751.052057]  [<c0390000>] ? quirk_vt8235_acpi+0x90/0xa0
[  751.052057]  =======================
[  751.052057] Code: 27 00 00 00 00 55 89 e5 57 89 c7 56 8d 70 40 53
89 d3 39 70 40 75 37 b8 40 15 4e c0 e8 14 d1 1f 00 85 db
[  751.052057] EIP: [<c01a0636>] d_instantiate+0x26/0x50 SS:ESP 0068:e802ce3c
[  ...
From: Avi Kivity
Date: Wednesday, March 26, 2008 - 9:58 am

Bad.  Which kvm userspace are you running?

-- 
error compiling committee.c: too many arguments to function

--

From: Dave Hansen
Date: Wednesday, March 26, 2008 - 10:16 am

~/src/kvm-userspace$ git describe
kvm-63-118-g52be1a1

-- Dave

--

From: Avi Kivity
Date: Thursday, March 27, 2008 - 2:36 am

I dug out my i386 install and tried it.  Doesn't reproduce for me on 
either kvm.git or -rc7.

Do you have a working setup that we can bisect?


-- 
error compiling committee.c: too many arguments to function

--

From: Avi Kivity
Date: Thursday, March 27, 2008 - 3:10 am

[...]


btw, is this with >= 4GB RAM on the host?

-- 
error compiling committee.c: too many arguments to function

--

From: Dave Hansen
Date: Thursday, March 27, 2008 - 7:55 am

Well, are you asking whether I have PAE on or not? :)

The host has 4GB of RAM exactly, but there are some serious BIOS holes,
so I needed PAE because about a gig of it is mapped >4GB.

[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009d800 (usable)
[    0.000000]  BIOS-e820: 000000000009d800 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000bf6b0000 (usable)
[    0.000000]  BIOS-e820: 00000000bf6b0000 - 00000000bf6cc000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000bf6cc000 - 00000000bf700000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000bf700000 - 00000000c0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
[    0.000000]  BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
[    0.000000]  BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 000000013c000000 (usable)


-- Dave

--

From: Avi Kivity
Date: Thursday, March 27, 2008 - 7:59 am

No, I'm asking whether there is a possibility of address truncation :)

PAE by itself doesn't affect kvm much, as it always runs the guest in 
pae mode.

Can you try running with mem=2000M or something?

-- 
error compiling committee.c: too many arguments to function

--

From: Dave Hansen
Date: Thursday, March 27, 2008 - 9:30 am

Oh, sure.  I'll give that a shot.

-- Dave

--

From: Dave Hansen
Date: Monday, June 2, 2008 - 3:30 pm

I have a few more data points on this.  Sorry for the massive delay from
the last report -- I'm being a crappy bug reporter.  But, this is on my
one and only laptop which makes it a serious pain to diagnose.  I also
didn't have a hardware serial console on it before, which I do now.
This is all on 2.6.26-rc4-01549-g1beee8d.

Adding the mem= does not help at all.  But, it is all a bit more
diagnosable now than a month or two ago.  I turned on all of the kernel
debugging that I could get my grubby little hands on.  It now oopses
quite consistently when kvm runs instead of after.  Here's a collection
of oopses that I captured after setting up a serial line:

	http://sr71.net/~dave/kvm-oops1.txt

After collecting all those, I turned on CONFIG_DEBUG_HIGHMEM and the
oopses miraculously stopped.  But, the guest hung (for at least 5
minutes or so) during windows bootup, pegging my host CPU.  Most of the
CPU was going to klogd, so I checked dmesg.

I was seeing messages like this

[  428.918108] kvm_handle_exit: unexpected, valid vectoring info and exit reason is 0x9

And quite a few of them, like 100,000/sec.  That's why klogd was pegging
the CPU.  Any idea on a next debugging step?

-- Dave

--

From: Dave Hansen
Date: Monday, June 2, 2008 - 5:59 pm

I followed these steps, and can now boot a vm.  But, causing the host
crashes is still a pretty bad bug.  I would imagine turning ACPI back on
will let me reproduce if necessary.

http://kvm.qumranet.com/kvmwiki/Windows_ACPI_Workaround

-- Dave

--

From: Avi Kivity
Date: Wednesday, June 4, 2008 - 6:42 am

Can you check with mem=900 (and CONFIG_HIGHMEM_DEBUG=n)?  That will 
confirm that the problems are highmem related, but not physical address 

That's a task switch.  Newer kvms handle them.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dave Hansen
Date: Friday, June 6, 2008 - 12:41 am

Do you mean 800M? ;) Highmem begins at 896MB if I remember correctly.


Newer userspace?  I'm running current kvm-git userspace as of a day or
two ago.

-- Dave

--

From: Avi Kivity
Date: Thursday, June 12, 2008 - 6:10 am

No, it's kernel code.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Dave Hansen
Date: Tuesday, July 15, 2008 - 11:57 am

So, just a continuation of what we were talking about before...

I just had a bug triggered on my system because I'm running sparsemem
(it was in show_mem()).  I wonder if sparsemem is contributing to the
bug.  Does kvm ever do any arithmetic that you can think of with 'struct
page's?

To summarize my config: 
* 4GB of RAM
* hole from 3-4GB (so ram ranges from 0-5GB)
* PAE enabled
* SPARSEMEM=y

http://sr71.net/~dave/linux/config-2.6.26-rc8-00089-ge1441b9

I'm compiling a kernel right now to try without sparsemem.

-- Dave

--

From: Dave Hansen
Date: Wednesday, July 16, 2008 - 2:44 pm

On a suggestion of Anthony's, I tried a defconfig kernel.

It is now bombing out on an assertion in the lapic code:

	http://sr71.net/~dave/linux/2.6.26-oops1.txt



-- Dave

--

From: Dave Hansen
Date: Wednesday, July 16, 2008 - 3:48 pm

I think I found it!!!

$ (objdump -d kvm.ko ; objdump -d kvm-intel.ko ) | egrep 'sub.*0x...,.*esp|>:'  | egrep sub -B1
00001a90 <kvm_vcpu_ioctl>:
    1a9a:	81 ec 60 06 00 00    	sub    $0x660,%esp
--
00004e90 <kvm_arch_vcpu_ioctl>:
    4e9d:	81 ec 6c 08 00 00    	sub    $0x86c,%esp
--
00005900 <kvm_arch_vm_ioctl>:
    5903:	81 ec 34 05 00 00    	sub    $0x534,%esp
--
0000d4f0 <paging64_prefetch_page>:
    d4f8:	81 ec 1c 01 00 00    	sub    $0x11c,%esp
--
0000dfd0 <paging32_prefetch_page>:
    dfd8:	81 ec 1c 01 00 00    	sub    $0x11c,%esp
--
0000f390 <kvm_pv_mmu_op>:
    f3a1:	81 ec 28 02 00 00    	sub    $0x228,%esp

We're simply overflowing the stack.  I changed all of the large on-stack
allocations to 'static', and it actually boots now.  I know 'static'
isn't safe, but it was good for a quick test.

A 'make stackcheck' confirms this:

dave@nimitz:~/kernels/linux-2.6.git$ make checkstack
objdump -d vmlinux $(find . -name '*.ko') | \
	perl /home/dave/kernels/linux-2.6.git-t61/scripts/checkstack.pl i386
0x000042d3 kvm_arch_vcpu_ioctl [kvm]:			2148
0x000012e3 kvm_vcpu_ioctl [kvm]:			1620
0x00004a83 kvm_arch_vm_ioctl [kvm]:			1332
0x00009a26 airo_get_aplist [airo]:			1140
0x00009b76 airo_get_aplist [airo]:			1140
0x00009c82 airo_get_aplist [airo]:			1140
...

In other words, kvm has the top 3 stack users in my kernel.  As you can
see from my trace above, these things also get called with super-long
stacks already.  Man.  That sucked to find.

Avi, how would you like this fixed?  I'd be happy to prepare some
patches.  Do you have a particular approach that you think we should
use?  Just make the big objects dynamically allocated?

-- Dave

--

From: Avi Kivity
Date: Wednesday, July 16, 2008 - 10:52 pm

Yes, things like kvm_lapic_state are way too big to be on the stack.  
There's an additional problem here, that apparently your gcc (which 
version?) doesn't fold objects in a switch statement into the same stack 
slot:

switch (...) {
    case x: {
         struct medium a;
         ...
    }
    case y:
          struct medium b;
          ...
    }
};

These could be solved either by stack allocation, or by moving into 
functions marked noinline.  Whichever is easier.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Roland Dreier
Date: Wednesday, July 16, 2008 - 11:08 pm

> Yes, things like kvm_lapic_state are way too big to be on the stack.

I had a quick look at the code, and my worry about dynamic allocation
would be that handling allocation failure seems like it might get
tricky.  Eg for handling struct kvm_pv_mmu_op_buffer (which is 528 bytes
on the stack in kvm_pv_mmu_op()) can you deal with an mmu op failing?
(maybe in that case you can easily by just setting *ret to 0?)

 > There's an additional problem here, that apparently your gcc (which
 > version?) doesn't fold objects in a switch statement into the same
 > stack slot:
 > 
 > switch (...) {
 >    case x: {
 >         struct medium a;
 >         ...
 >    }
 >    case y:
 >          struct medium b;
 >          ...
 >    }
 > };

A trick for this is to do:

	union {
		struct medium1 a;
		struct medium2 b;
	} u;

	switch (...) {
	case x:
		use u.a;
		...

	case y:
		use u.b;
		...
	}
--

From: Dave Hansen
Date: Thursday, July 17, 2008 - 7:06 am

Well, you *better* be able to deal with it.  :)

This code is also doing a *ton* of copy_to/from_user().  If userspace
had one of its input buffers swapped out (or one of its output buffers
not faulted in yet) and we're out of memory enough to be failing
kmallocs() then we're sure as heck also going to failing the user
copies.

I think it's a non-issue.

-- Dave

--

From: Avi Kivity
Date: Thursday, July 17, 2008 - 7:15 am

Yes, it's designed to be restartable.  Returning 0 should be fine.

We can reduce the buffer size to 256 though.  I wouldn't want an 
allocation in this hot path.

-- 
error compiling committee.c: too many arguments to function

--

From: Dave Hansen
Date: Wednesday, July 16, 2008 - 11:14 pm

$ gcc -v

Did you mean dynamic allocation? :)

-- Dave

--

From: Avi Kivity
Date: Wednesday, July 16, 2008 - 11:42 pm

Yeah.  Too early in the morning for me.

Roland's trick might be easier than either, though.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Dave Hansen
Date: Wednesday, July 16, 2008 - 11:45 pm

That also reminds me.  kvm somehow has an outdated copy of
anon_inodes.c.  It needs to be updated for the r/o bind mount patches to
add a proper mnt_want/drop_write().  Otherwise, you'll run into warnings
about imbalanced mount writer counts.  Something like this will do, but
it would be best to just stay up to date with mainline for this kind of
stuff.

diff --git a/kernel/anon_inodes.c b/kernel/anon_inodes.c
index 74630ea..a47bc55 100644
--- a/kernel/anon_inodes.c
+++ b/kernel/anon_inodes.c
@@ -90,9 +90,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 
 	if (IS_ERR(anon_inode_inode))
 		return -ENODEV;
-	file = get_empty_filp();
-	if (!file)
-		return -ENFILE;
 
 	inode = igrab(anon_inode_inode);
 	if (IS_ERR(inode)) {
@@ -121,14 +118,15 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(dentry, inode);
 
-	file->f_vfsmnt = mntget(anon_inode_mnt);
-	file->f_dentry = dentry;
-	file->f_mapping = inode->i_mapping;
+	file = alloc_file(anon_inode_mnt, dentry, FMODE_READ | FMODE_WRITE,
+			(struct file_operations *)fops);
+	if (!file) {
+		error = -ENFILE;
+		goto err_dput;
+	}
 
 	file->f_pos = 0;
 	file->f_flags = O_RDWR;
-	file->f_op = (struct file_operations *)fops;
-	file->f_mode = FMODE_READ | FMODE_WRITE;
 	file->f_version = 0;
 	file->private_data = priv;
 
@@ -136,12 +134,13 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 
 	return fd;
 
+err_dput:
+	dput(dentry);
 err_put_unused_fd:
 	put_unused_fd(fd);
 err_iput:
 	iput(inode);
 err_put_filp:
-	fput(file);
 	return error;
 }
 


-- Dave

--

From: Avi Kivity
Date: Saturday, July 19, 2008 - 12:33 am

Instead I changed the code to use the host's anon_inodes when 
available.  I also introduced a couple of bugs but the list fixed these 
quickly.  Thanks list!

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Dave Hansen
Date: Wednesday, July 16, 2008 - 11:27 pm

A newer gcc (4.2) makes this a wee bit better, but probably still
worrisome.

dave@nimitz:~/src/kvm-userspace-virgin/kernel$ objdump -d *.ko | perl /home/dave/kernels/linux-2.6.git-t61/scripts/checkstack.pl i386
0x00007b33 kvm_arch_vm_ioctl [kvm]:			1164
0x000072e8 kvm_arch_vcpu_ioctl [kvm]:			1060
0x00007497 kvm_arch_vcpu_ioctl [kvm]:			1060
0x0000adad kvm_pv_mmu_op [kvm]:				564
0x0000ae6b kvm_pv_mmu_op [kvm]:				564
0x00002553 kvm_vcpu_ioctl [kvm]:			468
0x0000d7e8 paging64_prefetch_page [kvm]:		304
0x0000d926 paging64_prefetch_page [kvm]:		304
0x0000d936 paging64_prefetch_page [kvm]:		304
0x0000c118 paging32_prefetch_page [kvm]:		300
0x0000c146 paging32_prefetch_page [kvm]:		300
0x0000c286 paging32_prefetch_page [kvm]:		300
0x0000f1eb x86_emulate_insn [kvm]:			192
0x0000f372 x86_emulate_insn [kvm]:			192
0x0000f5f2 x86_emulate_insn [kvm]:			192
0x00005373 kvm_task_switch [kvm]:			172
0x00001466 __kvm_set_memory_region [kvm]:		164
0x000014da __kvm_set_memory_region [kvm]:		164
0x0000dae8 paging64_page_fault [kvm]:			156
0x0000db06 paging64_page_fault [kvm]:			156
0x0000db67 paging64_page_fault [kvm]:			156
0x0000dcd5 paging64_page_fault [kvm]:			156
0x0000dcf1 paging64_page_fault [kvm]:			156
0x0000c356 kvm_mmu_pte_write [kvm]:			152
0x0000c50a kvm_mmu_pte_write [kvm]:			152
0x0000df28 paging32_page_fault [kvm]:			144
0x0000df46 paging32_page_fault [kvm]:			144
0x0000dfa7 paging32_page_fault [kvm]:			144
0x0000e115 paging32_page_fault [kvm]:			144
0x0000e131 paging32_page_fault [kvm]:			144
0x00014ed3 apic_mmio_read [kvm]:			108

-- Dave

--

From: Avi Kivity
Date: Wednesday, July 16, 2008 - 10:38 pm

Well that assert is plain wrong:

static int apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
               int short_hand, int dest, int dest_mode)
{
    int result = 0;
    struct kvm_lapic *target = vcpu->arch.apic;

    apic_debug("target %p, source %p, dest 0x%x, "
           "dest_mode 0x%x, short_hand 0x%x",
           target, source, dest, dest_mode, short_hand);

    ASSERT(!target);


It should be ASSERT(target), if anything.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Dave Hansen
Date: Thursday, March 27, 2008 - 8:50 am

I don't really have a working revision to bisect against.  I'm not sure
that it ever worked.  It's also on my actual laptop, so it's a bit of a
pain to get any other work done while I'm bisecting. :)

I'll move the Windows image over to another machine today and see if I
can reproduce elsewhere.  I'll also check some older versions of KVM to
see if any of those work.  If I do that, should I keep the kvm
userspace, modules and BIOSes all synchronized from each version that I
test?

-- Dave

--

From: Avi Kivity
Date: Thursday, March 27, 2008 - 8:53 am

I'm fairly sure Windows works on kvm...


You can keep the userspace (qemu + bios) fixed and change the kernel, or 
vice versa.

-- 
error compiling committee.c: too many arguments to function

--

From: Dave Hansen
Date: Thursday, March 27, 2008 - 9:30 am

Oh, I didn't mean to imply that Windows doesn't work, just that the
particular perverted way in which I'm poking it may have never

-- Dave

--

Previous thread: [PATCH 2/2] [PATCH] move apic declarations to mach_apic.h by Glauber Costa on Tuesday, March 25, 2008 - 2:10 pm. (1 message)

Next thread: V3 Patch - Driver for Freescale 8610 and 5121 DIU by York Sun on Tuesday, March 25, 2008 - 2:27 pm. (5 messages)