Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains

Previous thread: Re: kernel decompressor interface by H. Peter Anvin on Tuesday, March 30, 2010 - 10:45 am. (1 message)

Next thread: [PATCH] FS-Cache: Order the debugfs stats correctly by David Howells on Tuesday, March 30, 2010 - 11:02 am. (1 message)
From: Linus Torvalds
Date: Tuesday, March 30, 2010 - 10:50 am

Ok, so -rc2 was messy, no question about it. I'm too much of a softie to 
hold back some peoples work, so my hard-line -rc1 didn't work out the way 
I wanted. But _next_ time! For sure this time.

Anyway, from a messy -rc2 we now have a -rc3 that should be in much better 
shape. Regressions fixed, and the ShortLog is short enough to be worth 
posting to lkml (-rc1 never is, and -rc2 seldom is. It's not like -rc2's 
are generally wondeful, this time around wasn't _that_ much different).

One regression fix that is worth pointing out is the EXT3_STATE_NEW 
handling in ext3, because the regression that one fixed was potentially 
quite nasty. It wouldn't cause data corruption, but it _could_ cause 
at least corrupt security labels.

So if you have SELinux enabled (either in permissive or enforcing mode) 
_and_ you ran 2.6.32-rc[12] _and_ your filesystem is ext3, you should not 
just update, you should make sure your extended attributes are fixed. The 
easiest way to fix them is likely to just check the "Relabel on next boot" 
checkmark in the SELinux config GUI ("system-config-selinux" if you don't 
do that whole admin menu thing), and reboot into 2.6.34-rc3.

[ And you can use something like 'restorecon -rv /' or whatever after 
  booting into the fixed kernel. See your nearest SELinux manual for real 
  details. ]

You might even want to do the whole "touch /forcefsck" before rebooting to 
make sure fsck runs (I don't think it matters, but it won't hurt - the 
relabeling will be so slow that whatever time your fsck takes is totally 
irrelevant, so do them both and get it over with).

Of course, I suspect most people who run experimental kernels are also the 
kinds of people who have turned off SELinux in annoyance long ago, or tend 
to be the kinds of people who long since upgraded to ext4 (which didn't 
have the problem), but hey, what do I know.

In short - if you started seeing odd security messages after running early 
2.6.34 -rc kernels, now you know what was going ...
From: Rafael J. Wysocki
Date: Tuesday, March 30, 2010 - 2:16 pm

On Tuesday 30 March 2010, Linus Torvalds wrote:


This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
which happens to have a RS780.

The symptom is that every operation involving the GPU is _very_ slow, so the
window manager eventually disables compositing.  Reverting this commit makes
things work flawlessly again.

So, please revert.

BTW, I don't think it's a -stable material.

Thanks,
Rafael
--


Ok, I'll go drop it.

thanks,

greg k-h
--

From: Rafael J. Wysocki
Date: Wednesday, March 31, 2010 - 6:13 pm

OK, I've verified that partial revert (below) is sufficient.

Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: DRM / radeon: Really do not try to enable MSI on RS780 and RS880

Commit a5ee4eb75413c145334c30e43f1af9875dad6fd7
(PCI quirk: RS780/RS880: work around missing MSI initialization)
removed a quirk to disable MSI on RS780 and RS880, which still is
necessary on my Acer Ferrari One, because pci_enable_msi() attempts
to enable the MSI and apparently succeeds despite the PCI quirk
added by that commit.  Add the removed radeon quirk again.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/gpu/drm/radeon/radeon_irq_kms.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -116,7 +116,13 @@ int radeon_irq_kms_init(struct radeon_de
 	}
 	/* enable msi */
 	rdev->msi_enabled = 0;
-	if (rdev->family >= CHIP_RV380) {
+       /* MSIs don't seem to work on my rs780;
+        * not sure about rs880 or other rs780s.
+        * Needs more investigation.
+        */
+       if ((rdev->family >= CHIP_RV380) &&
+           (rdev->family != CHIP_RS780) &&
+           (rdev->family != CHIP_RS880)) {
 		int ret = pci_enable_msi(rdev->pdev);
 		if (!ret) {
 			rdev->msi_enabled = 1;
--

From: Alex Deucher
Date: Wednesday, March 31, 2010 - 7:19 pm

I also have the attached patch queued in via Dave's tree to disable
MSI on all IGP chips for the time being.

Alex
From: Clemens Ladisch
Date: Wednesday, March 31, 2010 - 11:36 pm

So it's better to disable MSI unconditionally.

Rafael, can you check if MSI works for the HDMI audio device?

This disables MSI only for the graphics device.  I'd prefer to have
the quirk on its bridge so that MSI gets disabled for the HDMI audio
device too, to avoid having to duplicate this quirk in the snd-hda-intel
driver.

==========

PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not seem
to be the only problem that prevents MSI, so that quirk is not
sufficient to enable MSI on all machines.  To be safe, unconditionally
disable MSI for the internal graphics and HDMI audio on these chipsets.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2123,6 +2123,8 @@ static void __devinit quirk_disable_msi(struct pci_dev *dev)
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2497,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4374,
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
-
-	if (!(nb_cntl & BIT(10))) {
-		dev_warn(&int_gfx_bridge->dev,
-			 FW_WARN "RS780: MSI for ...
From: Alex Deucher
Date: Thursday, April 1, 2010 - 8:01 am

Works fine here.

--

From: Rafael J. Wysocki
Date: Thursday, April 1, 2010 - 1:28 pm

Unfortunately it doesn't work for me without the

if ((rdev->family >= CHIP_RV380) &&
            (!(rdev->flags & RADEON_IS_IGP)))

radeon quirk.

Thanks,
Rafael
--

From: Alex Deucher
Date: Thursday, April 1, 2010 - 1:39 pm

what are your pci ids?

--

From: Rafael J. Wysocki
Date: Thursday, April 1, 2010 - 1:48 pm

1022:960b

I guess 1022 is AMD.

OK, I'll try to add that.

Rafael
--

From: Alex Deucher
Date: Thursday, April 1, 2010 - 2:00 pm

0x960b won't affect the internal gfx.  That bridge is for the pcie x16 gfx slot.

0x9600                       Host bridge
0x9602                       Internal GFX PCI-PCI bridge ID
0x9603                       External GFX - port 0
0x960B                       External GFX - port 1
0x9604                       PCI-PCI bridge - Port 0
0x9605                       PCI-PCI bridge - Port 1
0x9606                       PCI-PCI bridge - Port 2
0x9607                       PCI-PCI bridge - Port 3
0x9608                       PCI-PCI bridge - Port 4
0x9609                       PCI-PCI bridge - Port 5
0x960A                       PCI-PCI bridge (SB)
0x960F                       HD Audio controller
0x791A                       HDMI Audio codec

--

From: Alex Deucher
Date: Thursday, April 1, 2010 - 2:01 pm

It's possible your oem has the wrong vendor id for the 0x9602 bridge.

--

From: Rafael J. Wysocki
Date: Thursday, April 1, 2010 - 2:08 pm

Yes, the patch below works.

Thanks,
Rafael


---
 drivers/gpu/drm/radeon/radeon_irq_kms.c |    3 --
 drivers/pci/quirks.c                    |   36 ++------------------------------
 2 files changed, 4 insertions(+), 35 deletions(-)

Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
-
-	if (!(nb_cntl & BIT(10))) {
-		dev_warn(&int_gfx_bridge->dev,
-			 FW_WARN "RS780: MSI for internal graphics disabled\n");
-		int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
-	}
-}
-
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX	0x9602
-
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD ...
From: Alex Deucher
Date: Thursday, April 1, 2010 - 2:13 pm

Let's skip this second chunk for now as there are other non-RS780 IGP
chips that could be problematic, so I'd rather just leave MSIs
disabled for now.

Alex
--

From: Rafael J. Wysocki
Date: Thursday, April 1, 2010 - 2:46 pm

Works for me.

So do you want me to resubmit?

Rafael
--

From: Alex Deucher
Date: Thursday, April 1, 2010 - 3:07 pm

Please.

Thanks,

--

From: Rafael J. Wysocki
Date: Thursday, April 1, 2010 - 4:20 pm

Appended, with sign-offs and changelog.

Thanks,
Rafael

---
Subject: PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not
seem to be the only problem that prevents MSI, so that quirk is not
sufficient to enable MSI on all machines.  To be safe, disable MSI
unconditionally for the internal graphics and HDMI audio on these
chipsets.

[rjw: Added the PCI_VENDOR_ID_AI quirk.]

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/pci/quirks.c |   36 +++---------------------------------
 1 file changed, 3 insertions(+), 33 deletions(-)

Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, ...
From: Linus Torvalds
Date: Thursday, April 1, 2010 - 5:23 pm

Hmm. Isn't this missing a

	From: Clemens Ladisch <clemens@ladisch.de>

too? Or was the original patch yours?

		Linus
--

From: Rafael J. Wysocki
Date: Friday, April 2, 2010 - 9:46 am

Ouch, yes it is, sorry.

This one should be complete.

---
From: Clemens Ladisch <clemens@ladisch.de>
Subject: PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not
seem to be the only problem that prevents MSI, so that quirk is not
sufficient to enable MSI on all machines.  To be safe, disable MSI
unconditionally for the internal graphics and HDMI audio on these
chipsets.

[rjw: Added the PCI_VENDOR_ID_AI quirk.]

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/pci/quirks.c |   36 +++---------------------------------
 1 file changed, 3 insertions(+), 33 deletions(-)

Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, ...
From: Clemens Ladisch
Date: Saturday, April 3, 2010 - 11:08 am

I fear I have to NACK this.  The fact that two OEMs have changed the vendor
ID makes it likely that this is a bug in AMD's template BIOS code, and that
we will see the same problem on other systems using other vendor IDs.

So we should not use the vendor ID of device 0x9602 to declare the quirk, but
use some other device with an ID that is known to be correct.  We already
access the configuration space of the host bridge, so we should use that.

Furthermore, the quirk in my first patch was never run at all on the ALi
system, so it is probable that the nb_cntl.strap_msi_enable detection
would actually work.  Rafael, please test this patch; if it doesn't work
on your system, we can still remove the check for the strap_msi_enable bit.

==========

Subject: PCI quirk: RS780/RS880: work around wrong vendor IDs of RS780 bridge

On many RS780 systems, the vendor ID of the PCI/PCI bridge for the
internal graphics is set to that of the mainboard vendor, so the quirk
would not match and failed to notice the disabled MSI.

Since we do not know in advance all possible vendor IDs, we have to
declare the quirk on another device with an ID that is known to be
correct, and use that as a stepping stone to find the PCI/PCI bridge,
if present.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Cc: <stable@kernel.org>

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2483,34 +2483,38 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
  * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
  * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
  */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
+static void __init rs780_int_gfx_disable_msi(struct pci_dev *host_bridge)
 {
+	struct pci_dev *int_gfx_bridge;
 	u32 nb_cntl;
 
-	if (!int_gfx_bridge->subordinate)
+	/*
+	 * Many OEMs change the vendor ID of the internal graphics PCI/PCI
+	 * bridge, so we use the possible vendor/device IDs of the host bridge
+	 ...
From: Rafael J. Wysocki
Date: Saturday, April 3, 2010 - 12:33 pm

Yes, this works (after reverting commit
5193d7a7f500cfbbfc0de221e808208199723521 and removing the
(rdev->flags & RADEON_IS_IGP) test from radeon_irq_kms_init()).

Thanks,
Rafael
--

From: Linus Torvalds
Date: Thursday, April 1, 2010 - 9:29 am

Hmm. Through the DRM merge I just did, this area actually conflicted, and 
the resolved version is now

        if ((rdev->family >= CHIP_RV380) &&
            (!(rdev->flags & RADEON_IS_IGP))) {

which presumably also fixes your issue?

[ Side note: somebody in the DRM tree seems to be way too used to LISP, 
  and thinks that adding parenthesis always improves the code ;-]

However, I do suspect that we should probably revert the quirk regardless 
as being useless (ie it probably was related to those IGP chips that 
apparently don't do MSI anyway).

So the patch that reverts the quirk by Clemens (to replace it with 
disabling MSI entirely when the AMD NB doesn't accept them) seems to be a 
good idea regardless, since it's apparently not just about gfx. Jesse?

			Linus
--

From: Alex Deucher
Date: Thursday, April 1, 2010 - 10:07 am

On Thu, Apr 1, 2010 at 12:29 PM, Linus Torvalds


Clemems' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
right approach I think.  Note that it's only devices hung off the int
gfx pci to pci bridge that have broken MSI (gfx and audio).  MSI works
fine on the PCIE slots.  I have a similar patch for rs400 chips on bug
15626:
https://bugzilla.kernel.org/show_bug.cgi?id=15626

--

From: Linus Torvalds
Date: Thursday, April 1, 2010 - 10:24 am

Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that 
pci_no_msi() quirk disable MSI for PCIE too? I think it will trigger for 
PCIE drivers too.

Put another way: it sounds like the quirk now disables MSI for all 
devices. Maybe there would some more targeted mode?

		Linus
--

From: Clemens Ladisch
Date: Thursday, April 1, 2010 - 10:50 am

A quirk that used pci_no_msi() would disable all MSI for all devices.
However, these patches (and that in bug 15626) use PCI_BUS_FLAGS_NO_MSI
so that only the internal GPU devices are affected.

That "completely" in my patch title should better read "unconditionally".


Regards,
Clemens
--

From: Alex Deucher
Date: Thursday, April 1, 2010 - 10:53 am

On Thu, Apr 1, 2010 at 1:24 PM, Linus Torvalds

What I meant to say was MSI works fine on bridges other than the
bridge the internal gfx lives on.  quirk_disable_msi() just disables
MSI on the devices on that particular bridge as far as I understand
it, but I'm by no means an expert on the PCI code.  E.g., on my RS780
board, MSIs are only problematic on the integrated gfx chip.  MSIs
work fine on PCI/PCIE add-on cards and the integrated Ethernet.

--

From: Linus Torvalds
Date: Thursday, April 1, 2010 - 1:17 pm

Yes, it disabled MSI only on devices under that bridge. But if it's the 
northbridge, that would be everything, no?

But I don't know what devices those

	PCI_VENDOR_ID_AMD, 0x9602,
	PCI_VENDOR_ID_ASUSTEK, 0x9602,

things are. If they are just a PCIE->PCI bridge rather than the root 
bridge, then everything looks fine to me.

			Linus
--

From: Alex Deucher
Date: Thursday, April 1, 2010 - 1:23 pm

On Thu, Apr 1, 2010 at 4:17 PM, Linus Torvalds

Yup, those are just the pci to pci bridges used for the internal gfx.
Really there's only one, 0x9602, but some asus oem boards have the
--

From: Rafael J. Wysocki
Date: Thursday, April 1, 2010 - 12:46 pm

Yes, it does.

Rafael
--

From: Jesse Barnes
Date: Thursday, April 1, 2010 - 3:48 pm

On Thu, 1 Apr 2010 09:29:23 -0700 (PDT)

Yeah, that sounds fine.  I can include it in my next pull req or you
can just pick it up directly.

-- 
Jesse Barnes, Intel Open Source Technology Center
--

From: Rafael J. Wysocki
Date: Thursday, April 1, 2010 - 4:23 pm

Not exactly that one, please, it's missing a quirk for the affected system.

I've just sent a corrected version, here:
https://patchwork.kernel.org/patch/90275/

Rafael
--

From: Borislav Petkov
Date: Friday, April 2, 2010 - 10:59 am

Hi,

I've got the following oopsie two times now when hibernating - this
means, I don't get it everytime I hibernate but only sometimes, say once
in a blue moon.

And yeah, I couldn't catch it over serial console so I had to make ugly
pictures. By the way, the numbers in the filenames increment as I scroll
down the whole oops (yep, it hadn't completely frozen and I still could
do Shift->PgUp or Shift->PgDn on the console):

http://www.kernel.org/pub/linux/kernel/people/bp/

So, here's what I could decipher from the oopsie, someone else who's
more knowledgeable in mm, rmap and anon_vma's list traversal should be
able to tell what goes wrong there.

EIP is at page_referenced+0xee

which is

<disasm>
    10c4:	41 01 c4             	add    %eax,%r12d
    10c7:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
    10cb:	74 19                	je     10e6 <page_referenced+0xff>
    10cd:	4d 8b 6d 20          	mov    0x20(%r13),%r13
    10d1:	49 83 ed 20          	sub    $0x20,%r13

    10d5:	49 8b 45 20          	mov    0x20(%r13),%rax		    <--------------

    10d9:	0f 18 08             	prefetcht0 (%rax)
    10dc:	49 8d 45 20          	lea    0x20(%r13),%rax
    10e0:	48 39 45 80          	cmp    %rax,-0x80(%rbp)
</disasm>


Corresponding asm:

<asm>
	.loc 1 496 0
	movq	32(%r13), %r13	# <variable>.same_anon_vma.next, __mptr.451
.LVL295:
	subq	$32, %r13	#, avc
.LVL296:
.L184:
.LBE1278:
	movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next			<----------------
	prefetcht0	(%rax)	# <variable>.same_anon_vma.next
	leaq	32(%r13), %rax	#, tmp97
	cmpq	%rax, -128(%rbp)	# tmp97, %sfp
	jne	.L187	#,
.L186:
	.loc 1 514 0
	movq	%r14, %rdi	# anon_vma,
	call	page_unlock_anon_vma	#
</asm>


and the NULL pointer in question is being written into %r13 and then 32
is subtracted from it (I'm guessing container_of()). This is consistent
with the register snapshot - %r13 contains 0xffffffffffffffe0 which is
-32 and with the code dump in the oops, in ...
From: Linus Torvalds
Date: Friday, April 2, 2010 - 11:09 am

I think this is likely due to the new scalable anon_vma linking by Rik. 
Nothing else I can imagine should have introduced anything like it.

Rik: the picures have the information, but you need to look at several to 
see both the oops and the backtrace. Here's a condensed version:

 shrink_all_memory ->
   do_try_to_free_pages ->
     shrink_zone ->
       shrink_inactive_list ->
         shrink_page_list ->
           page_referenced

where page_referenced() oopses due page_referenced_anon() as per 
Borislav's description below.

Added all the usual suspects to the Cc list. Left the full report appended 
so that the new people don't have to search for it on lkml.

		Linus

--

From: Andrew Morton
Date: Friday, April 2, 2010 - 8:24 am

From: Linus Torvalds
Date: Friday, April 2, 2010 - 11:37 am

Yup, looks like the same thing, except that bugzilla entry was due to 
swapping rather than hibernation and memory shrinking. But same end 
result, just different reasons for why we were trying to shrink the page 
lists.

		Linus
--

From: Rik van Riel
Date: Friday, April 2, 2010 - 3:01 pm

Interesting that it is a null pointer dereference, given
that we do not zero out the anon_vma_chain structs before
freeing them.

Page_referenced_anon() takes the anon_vma->lock before
walking the list.  The three places where we modify the
anon_vma_chain->same_anon_vma list, we also hold the
lock.

No doubt something in mm/ is doing something silly, but
I have not found anything yet :(

If I had to guess, I'd say maybe we got one of the
mprotect & vma_adjust cases wrong.  Maybe a page stayed
around in the LRU (and in a process?) after its anon_vma
already got freed?

There has to be a reason why a very heavy AIM7 workload
and some other stress tests did not trigger it, but a few
people are able to trigger it on their systems...
--

From: Linus Torvalds
Date: Friday, April 2, 2010 - 5:19 pm

So let's look at the individual anon_vma_chain entries instead.

What is the protection of the 'vma->anon_vma_chain' list? In 
anon_vma_prepare(), the code implies that it is the page_table_lock, but 
what about anon_vma_clone()? If I'm reading it correctly, it is some odd 
mix of "mmap_sem held for writing" or "mmap_sem held for reading _and_ 
page_table_lock". And then we have the exit case that apparently has no 
locking at all, but that should hopefully be single-threaded.

That thing is subtle. A few more comments about the locking would be good, 
so that people like me wouldn't have to try to guess the rules from 

I don't think AIM7 is at all a very interesting workload, and not likely 
to stress anything at all. Did your AIM7 test actually cause heavy 
swapping? I doubt it. 

Page swapout is where a lot of the magic happens, since that happens 
without mmap_sem held etc. 

			Linus
--

From: Minchan Kim
Date: Sunday, April 4, 2010 - 9:12 am

Hi, Rik. 


While I review the code again due to this BUG, I found some strange
thing. 

In anon_vma_fork, if anon_vma_clone is successful but anon_vma_alloc is
failed, what happens? Parent VMA's anon_vmas have anon_vma_chain which
has vma which is destroyed. 
I couldn't find any clean routine to remove this garbage. 
I am missing something?

But I think it isn't related to this bug because oops point is not
vma_address but anon_vma_chain.next.



-- 
Kind regards,
Minchan Kim


--

From: Rik van Riel
Date: Sunday, April 4, 2010 - 10:24 am

Good catch.  The parent VMA's anon_vmas will get delinked
eventually, but we need to get rid of the newly allocated
child anon_vmas.  You found a hopefully rare memory leak...

We need a call to unlink_anon_vmas(vma) at the error label

Agreed, it's probably not it.
--

From: Rik van Riel
Date: Sunday, April 4, 2010 - 4:09 pm

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

Reported-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..fb7ce99 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 
  out_error_free_anon_vma:
 	anon_vma_free(anon_vma);
+	unlink_anon_vmas(vma);
  out_error:
 	return -ENOMEM;
 }

--

From: Minchan Kim
Date: Sunday, April 4, 2010 - 4:56 pm

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim
--

From: Linus Torvalds
Date: Monday, April 5, 2010 - 8:37 am

This looks _very_ wrong to me.

Shouldn't the unlink_anon_vmas() be in the "out_error" case? IOW, we 
should do it even if the "anon_vma_alloc()" failed, nbot just if the 
"anon_vma_chain_alloc()" failed?

No?

What am I missing?

			Linus
--

From: Minchan Kim
Date: Monday, April 5, 2010 - 8:48 am

On Tue, Apr 6, 2010 at 12:37 AM, Linus Torvalds

Indeed. You're right.
I should have been reviewed more carefully.



-- 
Kind regards,
Minchan Kim
--

From: Rik van Riel
Date: Monday, April 5, 2010 - 9:04 am

Indeed it should.  I've had my mind somewhere else this weekend :/

New patch in the next mail.
--

From: Rik van Riel
Date: Monday, April 5, 2010 - 9:13 am

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

This bug also has the potential to leave behind anon_vma_chain structs
with pointers to invalid memory.

Reported-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..eaa7a09 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -232,6 +232,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
  out_error_free_anon_vma:
 	anon_vma_free(anon_vma);
  out_error:
+	unlink_anon_vmas(vma);
 	return -ENOMEM;
 }
 
--

From: KOSAKI Motohiro
Date: Tuesday, April 6, 2010 - 1:53 am

Today, I've reviewed this patch carefully. but I haven't found any bug.

1) anon_vma->list is alwasys protected anon_vma->lock.
2) If anyone forget to take lock, list_add() and/or list_del() never
   assign to NULL.

then, NULL mean either three possibility.

 a) we see uninitialized data
 b) we see after freed data
 c) we see memory corruption by another bug

but (a) can't happen because 

	static inline void __list_add()
	{
	        next->prev = new;
	        new->next = next;
	        new->prev = prev;
	        prev->next = new;  (*)
	}

If uninitialized var is linked to avc list, new->next was already !NULL.

(b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma 
freeing until next rcu period. It mean rcu_read_lock()+page_mapped() 
can see kfree()ed page. but it is safe. noone corrupt it.

now I doubt (c) ;-)



Also, I've runned stress workload with shrink_all_memory() today. but
I couldn't reproduce the issue. hmm..  (perhaps I'm no lucky guy. 
I'm frequently fail to reproduce)

I'll continue to work.


--

From: KOSAKI Motohiro
Date: Tuesday, April 6, 2010 - 3:09 am

by the way: I haven't understand why rik's per process anon_vma concept
works correctly with ksm. ksm increase anon_vma->ksm_refcount. but it seems
not guranteed vma->anon_vma and page->anon_vma are the same.

but I guess bug reporter doesn't use ksm, it's minor feature.



--

From: Rik van Riel
Subject:
Date: Tuesday, April 6, 2010 - 7:34 am

KSM removes the page from its original anon_vma.

If the page gets reinstantiated (copy on write), it will be
created in the vma->anon_vma.

Am I overlooking something?
--

From: Rik van Riel
Subject:
Date: Tuesday, April 6, 2010 - 7:38 am

My status with this bug is the same - I have gone through
the code from all angles, but have not found any other bugs
yet (except for that leak - which could leave invalid
pointers behind).

This makes me wonder if perhaps the bug is a side effect
of something Borislav (and the other reproducers) have
in their kernel configuration, which we do not have.

Another (unlikely) thing is that the fix for the leak
makes the bug go away.  Yes, very unlikely.

Borislav, could you please send us your .config ?

Also, if you have the time, could you try out the
patch (-v2) I mailed in a little up this thread
that fixes the memory leak in anon_vma_fork?

I suspect it should not change anything, but it
could be useful to rule out anyway.
--

From: Minchan Kim
Subject:
Date: Tuesday, April 6, 2010 - 8:34 am

Let's see the unlink_anon_vmas. 

1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
2. 	anon_vma_unlink
3. 		spin_lock(anon_vma->lock) <-- HERE LOCK.
4.		list_del(anon_vma_chain->same_anon_vma);

What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
anon_vma object between 2 and 3?
I mean how to make sure 3) does lock valid anon_vma? 

I hope it is culprit.


-- 
Kind regards,
Minchan Kim


--

From: Rik van Riel
Subject:
Date: Tuesday, April 6, 2010 - 8:40 am

How can the anon_vma get destroyed and reused, when this
anon_vma_chain still has a reference to it (and the
anon_vma has not been freed yet)?

What combination of circumstances is necessary for
your bug hypothetical to happen?
--

From: Minchan Kim
Subject:
Date: Tuesday, April 6, 2010 - 8:58 am

AFAIK, anon_vma can be reused without free by SLAB_XXX_RCU.
So we always use it carefully by page_lock_anon_vma or manual check
with RCU and page_mapped. 



	CPU A				CPU B

unlink_anon_vmas
list_for_each_entry 
				
					free_pgtable
					anon_vma_unlink
  <crazy stall>				spin_lock(anon_vma);
					list_del(same_anon_vma)
					spin_unlock(anon_vma)
anon_vma_unlink
					anon_vma_free
					reuse for another anon_vma
spin_lock(another anon_vma)
list_del(another anon_vma)

If my assumption is wrong, please correct me. 
Thanks, Rik. 

-- 
Kind regards,
Minchan Kim


--

From: Linus Torvalds
Subject:
Date: Tuesday, April 6, 2010 - 8:55 am

I don't think so. That isn't the racy case. We're working with a 
anon_vma_chain, so the anonvma is all there.

The racy case is when we look up an anonvma by the page, and the page gets 
unmapped at the same time because somebody else is travelling over the LRU 
list of the page itself, isn't it?

I do wonder if "page_lock_anon_vma()" should check the whole 
"page_mapped()" case _after_ taking the anon_vma lock. Because if the race 
happens, we're following a anon_vma list that has nothing to do with that 
page (it's stilla _valid_ list, since we locked the anon_vma, but will it 
be ok?)

IOW, what is it that really keeps the anon_vma list reliable _and_ 
relevant wrt the page? We know we may get a stale anon_vma, are we ok if 
that anon_vma list doesn't actually have anything to do with the page any 
more?

I think the first check in "page_address_in_vma()" protects us, but 
whatever.

However, that made me look at the PAGE_MIGRATION case. That seems to be 
just broken. It's doing that page_anon_vma() + spin_lock without holding 
any RCU locks, so there is no guarantee that anon_vma there is at all 
valid.

Is that function always called with rcu_read_lock()? 

			Linus
--

From: Minchan Kim
Subject:
Date: Tuesday, April 6, 2010 - 9:23 am

Hi, Linus. 


But the anon_vma is using for another anon_vma. 
Nonetheless, anon_vma_unlink does list_del(anon_vma's same_anon_vma).

Yes. but I thought page might travel with anon_vmas which have

So we always use it with (vma_address and page_check_address) to make
sure validation of anon_vma.
But I think it's not good design. I want to hold lock ahead checking of

FYI, recently there is a patch about migration case. 


-- 
Kind regards,
Minchan Kim


--

From: Linus Torvalds
Subject:
Date: Tuesday, April 6, 2010 - 9:28 am

No, I'm talking about rmap_walk_anon():

        anon_vma = page_anon_vma(page);
        if (!anon_vma)
                return ret;
        spin_lock(&anon_vma->lock);

which seems to be simply buggy. The anon_vma may not exist any more, 
because an RCU event might have really freed the page between looking it 
up and locking it.

			Linus
--

From: Minchan Kim
Subject:
Date: Tuesday, April 6, 2010 - 9:45 am

unmap_and_move
	remove_migration_ptes
		rmap_walk
			rmap_walk_anon

We always has rcu_read_lock about anon page in unmap_and_move.
So I think it's not buggy. What am I missing?


-- 
Kind regards,
Minchan Kim


--

From: Linus Torvalds
Subject:
Date: Tuesday, April 6, 2010 - 9:53 am

Ok, in that case it's fine.

However, it does bring back my comment about all those anonvma changes: 
the locking is totally undocumented. 

Why isn't there a thing _saying_ that it's ok because of this?

Why is there no comment about the locking of that 'same_vma' / 
'vma->anon_vma_chain' except for the totally nonsensical one about 
page_table_lock (which doesn't protect _any_ of the other cases)?

		Linus
--

From: Rik van Riel
Subject:
Date: Tuesday, April 6, 2010 - 10:04 am

Which other cases? When do we ever walk the "same_vma" list
not from the context of the process owning the vma?

This bug in page_referenced is walking the "same_anon_vma" list,
which is locked with the anon_vma->lock.
--

From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 11:28 am

That's the point. What does 'owning the vma' mean? That's exactly what I'm 
asking to be documented.

Quite frankly, the thing is a mess. There is _no_ comment on why it's ok 
to modify the list or walk the list, except for the one totally misleading 
one, since the page_table_lock has at most a _secondary_ meaning in the 
whole ownership (ie it is used only when we do _not_ own the vma chain 
exclusively).

So your very comment shows the whole confusion. No, we do not "own the 

Umm. Wake the hell up, Rik!

It's walking a _corrupt_ same_anon_vma list.  In other words, we _know_ 
that the 'anon_vma_chain' entry is crap. We know that exactly because it 
contains "impossible" values with regard to the list.

And what's the easiest way to get such a corrupt list, considering that 
the locking looks correct for that particular list?

That's right: by having something like anon_vma_clone() do something bad 
when it walks the same avc entries using the 'same_vma' list and creates 
copies of it.

You can't just say "but but but same_anon_vma list is always locked 
properly". Because it doesn't matter if that list is locked properly if 
walking _another_ list doesn't work right.

I really don't understand why you keep on harping on thatr same_anon_vma 
list. The fact that that was the corrupt list IN ABSOLUTELY NO WAY implies 
that that is the list that caused the corruption.

For example, let's say that the 'anon_vma_chain' list is corrupted. Never 
mind how. So what could happen is that you'd have vma->anon_vma pointing 
to one thing, and one or more entries on the 'vma->anon_vma_chain' list 
pointing to _another_ anon_vma.

What happens then? I have no idea. Maybe nothing bad. But the point is, if 
one avc list is corrupted and you may end up referencing those avc's in 
unexpected cases, how can you trust the other list that is in the same 
data structure?

For example, maybe some list corruption causes us to do that 
"anon_vma_chain_link()" _twice_ on the same avc ...
From: Andrew Morton
Date: Tuesday, April 6, 2010 - 12:03 pm

On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)

The lib/list_debug.c stuff might detect such things.  I wonder if
either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?
--

From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 12:10 pm

Well, even without CONFIG_LIST_DEBUG we'd catch _some_ things, and 
conversely, even with LIST_DEBUG on we don't catch everything.

For example, doing list_del() twice on the same entry will die with a 
really nice pattern due to poisoning even without LIST_DEBUG.

But list_add() twice on the same entry will sadly silently succeed both 
with and without list debugging (the list debugging will check the target 
list head, but there is no way to check the "new->next/prev" entries).

Anyway, I've not actually found anything wrong in the same_vma locking. 
And I'm not at all convinced there is any list corruption there. My point 
was really only that
 (a) the locking rules seem very unclear and certainly not documented and 
 (b) corruption of one list could easily be the cause of corruption of 
     another list of the same structure.
but I don't actually see anything wrong anywhere.

			Linus
--

From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 12:35 pm

I _have_ found what looks like a few clues, though.

In particular, the disassembly in Steinar Gunderson's case looks much more 
like the disassembly I get, and if I read that correctly, it's actually 
the _first_ iteration of the for_each_entry() loop that crashes.

Why do I think so?

In Steinar's oops, we have "RAX: ffff880169111fc8", which is clearly a 
kernel pointer. However, the code from Steinar's oops decodes to:

   0:	3b 56 10             	cmp    0x10(%rsi),%edx
   3:	73 1e                	jae    0x23
   5:	48 83 fa f2          	cmp    $0xfffffffffffffff2,%rdx
   9:	74 18                	je     0x23
   b:	4d 89 f8             	mov    %r15,%r8
   e:	48 8d 4d cc          	lea    -0x34(%rbp),%rcx
  12:	4c 89 e7             	mov    %r12,%rdi
  15:	e8 44 f2 ff ff       	callq  0xfffffffffffff25e
  1a:	41 01 c5             	add    %eax,%r13d
  1d:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
  21:	74 19                	je     0x3c
  23:	48 8b 43 20          	mov    0x20(%rbx),%rax
  27:	48 8d 58 e0          	lea    -0x20(%rax),%rbx
  2b:*	48 8b 43 20          	mov    0x20(%rbx),%rax     <-- trapping instruction
  2f:	0f 18 08             	prefetcht0 (%rax)
  32:	48 8d 43 20          	lea    0x20(%rbx),%rax
  36:	48 39 45 88          	cmp    %rax,-0x78(%rbp)
  3a:	75 a7                	jne    0xffffffffffffffe3
  3c:	41 fe 06             	incb   (%r14)
  3f:	e9                   	.byte 0xe9

which matches my code pretty well, and the point is, _if_ it went through 
the loop, then %rbx should be %rax+20. And it's not.

IOW, the code you see above before the trapping instruction is the end of 
the loop: it's the

		referenced += page_referenced_one(page, vma, address,
				&mapcount, vm_flags);
		if (!mapcount)
			break;
	}

part (the "callq" and "add %eax" is that "referenced +=", and %r13d is 
"referenced").

What you cannot see from the code decode is the loop setup and _entry_, 
which looks like this for me:

        movl    12(%rbx), %eax  # ...
From: Steinar H. Gunderson
Date: Tuesday, April 6, 2010 - 12:10 pm

Not set on my kernel.

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--

From: Borislav Petkov
Date: Tuesday, April 6, 2010 - 12:42 pm

From: Andrew Morton <akpm@linux-foundation.org>

No, it is off in my .config. I'll turn it on and retest to see whether
it screams something. In the meantime, I've been testing current git
(v2.6.34-rc3-288-gab195c5), and especially Rik's mem leak fix which
Linus already committed (4946d54cb55e86a156216fcfeed5568514b0830f) and
tried to retrigger the bug by hibernating the machine several times.

Now, this machine has 8G of memory so I thought maybe if starting
several assorted guests on it would put some pressure on anon_vma lists
but no, the machine habernated happily by creating almost a 600Mb
hibernation image and having all three guests loaded.

Then, I said, well, let's have another last test run and started firefox
which went into reloading the last session. And I remember that firefox
still hadn't finished loading all pages when I hibernated and boom, it
oopsed.

So, it definitely is some anon_vma lists concurrency issue ... The good
thing is, I was able to catch the oops in its sheer magnificence over
netconsole this time:


[ 2995.478125] PM: Preallocating image memory... 
[ 2995.713692] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2995.714001] IP: [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] PGD 22d1b8067 PUD 22dd85067 PMD 0 
[ 2995.714001] Oops: 0000 [#1] PREEMPT SMP 
[ 2995.714001] last sysfs file: /sys/power/state
[ 2995.714001] CPU 0 
[ 2995.714001] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.714001] 
[ 2995.714001] Pid: 7440, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
[ 2995.714001] RIP: 0010:[<ffffffff810c194d>]  [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] RSP: 0018:ffff88022fa038b8  EFLAGS: 00010283
[ 2995.714001] RAX: ffff88022d747098 RBX: ffffea00078efb70 RCX: ...
From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 1:02 pm

So again, I can show that the code has never actually been through the 
loop. The above code decodes to:

   0:	3b 56 10             	cmp    0x10(%rsi),%edx
   3:	73 1e                	jae    0x23
   5:	48 83 fa f2          	cmp    $0xfffffffffffffff2,%rdx
   9:	74 18                	je     0x23
   b:	48 8d 4d cc          	lea    -0x34(%rbp),%rcx
   f:	4d 89 f8             	mov    %r15,%r8
  12:	48 89 df             	mov    %rbx,%rdi
  15:	e8 4d f2 ff ff       	callq  0xfffffffffffff267
  1a:	41 01 c4             	add    %eax,%r12d
  1d:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
  21:	74 19                	je     0x3c
  23:	4d 8b 6d 20          	mov    0x20(%r13),%r13
  27:	49 83 ed 20          	sub    $0x20,%r13
  2b:*	49 8b 45 20          	mov    0x20(%r13),%rax     <-- trapping instruction
  2f:	0f 18 08             	prefetcht0 (%rax)
  32:	49 8d 45 20          	lea    0x20(%r13),%rax
  36:	48 39 45 80          	cmp    %rax,-0x80(%rbp)
  3a:	75 aa                	jne    0xffffffffffffffe6
  3c:	4c 89 f7             	mov    %r14,%rdi
  3f:	e8                   	.byte 0xe8

and in your case, if we had gone through the loop, then %rax would still 
contain the return value from page_referenced_one(). 

But %rax is a kernel pointer, and %r12d is 0.

So again, it's actually anon_vma.head.next that is NULL, not any of the 
entries on the list itself.

Now, I can see several cases for this:

 - the obvious one: anon_vma just wasn't correctly initialized, and is 
   missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we 
   don't have a whole lot of coverage of constructors), or somebody 
   allocated an anon_vma without using the anon_vma_cachep.

 - Related to the above: perhaps the RCU freeing isn't working, or 
   slub/slab/slob ends up reusing the allocations for something else than 
   anonvma's, so together with the race _and_ an unlucky re-use, you get 
   some odd crud.

   I haven't looked at the kernel config files: do they perhaps ...
From: Steinar H. Gunderson
Date: Tuesday, April 6, 2010 - 1:46 pm

No KSM for me.

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--

From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 1:56 pm

Ok, not anything odd there either, and you're not using any odd RCU setup 
either. Nothing odd at all strikes me about your config, in fact. Lots and 
lots of modules, but I guess it comes from some distro default config..

		Linus
--

From: Steinar H. Gunderson
Date: Tuesday, April 6, 2010 - 2:05 pm

I think it was originally some distro config, yes, but that «config fork» was
at 2.6.16 or something...

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--

From: Borislav Petkov
Date: Tuesday, April 6, 2010 - 1:51 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

I've added code to verify this and am suspend/resuming now... Wait a
minute, Linus, you're good! :) :

[  873.083074] PM: Preallocating image memory... 
[  873.254359] NULL anon_vma->head.next, page 2182681

This is the page_to_pfn number.

Now, how do we track back to the place which is missing anon_vma->head
init? Can we use the struct page *page arg to page_referenced_anon()
somehow?

[  873.254654] Pid: 3642, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5-dirty #3
[  873.254904] Call Trace:
[  873.255063]  [<ffffffff810c0c28>] page_referenced+0xd3/0x219
[  873.255212]  [<ffffffff810c5fb0>] ? swapcache_free+0x37/0x3c
[  873.255364]  [<ffffffff810ab782>] shrink_page_list+0x14a/0x477
[  873.255512]  [<ffffffff810aa6e0>] ? isolate_pages_global+0xc4/0x1f0
[  873.255662]  [<ffffffff813f8a76>] ? _raw_spin_unlock_irq+0x30/0x58
[  873.255811]  [<ffffffff810abe06>] shrink_inactive_list+0x357/0x5e5
[  873.255960]  [<ffffffff810ab626>] ? shrink_active_list+0x232/0x244
[  873.256112]  [<ffffffff810ac39e>] shrink_zone+0x30a/0x3d4
[  873.256264]  [<ffffffff810acf79>] do_try_to_free_pages+0x176/0x27f
[  873.256416]  [<ffffffff810ad117>] shrink_all_memory+0x95/0xc4
[  873.256564]  [<ffffffff810aa61c>] ? isolate_pages_global+0x0/0x1f0
[  873.256713]  [<ffffffff81076e4c>] ? count_data_pages+0x65/0x79
[  873.256862]  [<ffffffff810770b3>] hibernate_preallocate_memory+0x1aa/0x2cb
[  873.257036]  [<ffffffff813f4f75>] ? printk+0x41/0x44
[  873.257186]  [<ffffffff81075a53>] hibernation_snapshot+0x36/0x1e1
[  873.257337]  [<ffffffff81075ccc>] hibernate+0xce/0x172
[  873.257485]  [<ffffffff81074a39>] state_store+0x5c/0xd3
[  873.257634]  [<ffffffff81184eff>] kobj_attr_store+0x17/0x19
[  873.257783]  [<ffffffff81125d43>] sysfs_write_file+0x108/0x144
[  873.257932]  [<ffffffff810d560f>] vfs_write+0xb2/0x153
[  873.258084]  [<ffffffff81063bd9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  873.258237]  [<ffffffff810d5773>] ...
From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 2:27 pm

Yeah, I was pretty sure of that thing.

I still don't see _how_ it happens, though. That 'struct anon_vma' is very 
simple, and contains literally just the lock and that list_head.

Now, 'head.next' is kind of magical, because it contains that magic 
low-bit "have I been locked" thing (see "vm_lock_anon_vma()" in 
mm/mmap.c). But I'm not seeing anything else touching it.

And if you allocate a anon_vma the proper way, the SLUB constructor should 
have made sure that the head is initialized. And no normal list operation 
ever sets any list pointer to zero, although a "list_del()" on the first 

You might enable SLUB debugging (both SLUB_DEBUG _and_ SLUB_DEBUG_ON), and 
then make the "object_err()" function in mm/slub.c be non-static. You 
could call it when you see the problem, perhaps.

Or you could just add tests to both alloc_anon_vma() and free_anon_vma() 

Probably anything but the default SLUB these days.  But Steinar already 

Yeah, wasn't for Steinar either. So it doesn't look like it's any odd 
corner case that depends on some odd configuration.

		Linus
--

From: Borislav Petkov
Date: Tuesday, April 6, 2010 - 3:59 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Ok, I tried doing all you suggested and here's what came out. Please,
take this with a grain of salt because I'm almost falling asleep - even
the coffee is not working anymore so it could be just as well that I've
made a mistake somewhere (the new OOPS is a #GP, by the way), just
watch:

Source changes locally:

--
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 4884462..0c11dfb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -108,6 +108,8 @@ unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 
+void object_err(struct kmem_cache *s, struct page *page, u8 *object, char *reason);
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..7b35b3f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -66,11 +66,24 @@ static struct kmem_cache *anon_vma_chain_cachep;
 
 static inline struct anon_vma *anon_vma_alloc(void)
 {
-	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	struct anon_vma *ret;
+	ret = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+
+	if (!ret->head.next) {
+		printk("%s NULL anon_vma->head.next\n", __func__);
+		dump_stack();
+	}
+
+	return ret;
 }
 
 void anon_vma_free(struct anon_vma *anon_vma)
 {
+	if (!anon_vma->head.next) {
+		printk("%s NULL anon_vma->head.next\n", __func__);
+		dump_stack();
+	}
+
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -494,6 +507,18 @@ static int page_referenced_anon(struct page *page,
 		return referenced;
 
 	mapcount = page_mapcount(page);
+
+	if (!anon_vma->head.next) {
+		printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n",
+				page_to_pfn(page));
+
+		object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next");
+
+		dump_stack();
+
+		return ...
From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 4:27 pm

Hey ho, yeah.

The reason it's a #GP fault is that it's not a NULL pointer dereference 
any more, but a wild pointer that is not in the legal region of pointers 
on x86-64. That is also why your debugging code didn't catch it: the 
pointer isn't NULL, so you got the #GP fault on the same old instruction:

	  2b:*	49 8b 45 20          	mov    0x20(%r13),%rax     <-- trapping instruction

for all the same old reasons.

But now %r13 has a non-zero value: 0x002e2e2e002e2e0e, which I do _not_ 


No, you're mis-reading the asm. It's again the first iteration, and the 
code above it is again the end of the loop. And %rax is once more a kernel 
pointer, not the return value of 'page_referenced_one()'. 

So it once more is 'anon_vma->head.next' that is crap, but now it's not 
NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 
subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).

What does '0x2e' mean? It's ASCII '.', but that doesn't really mean 
anything either.

			Linus
--

From: Rik van Riel
Date: Tuesday, April 6, 2010 - 4:54 pm

When a new VMA has a mergeable anon_vma with a neighboring VMA,
make sure all of the neighbor's old anon_vma structs are also
linked in.

This is necessary because at some point the VMAs could get merged,
and we want to ensure no anon_vma structs get freed prematurely,
while the system still has anonymous pages that belong to those
structs.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
 include/linux/mm.h |    2 +-
 mm/mmap.c          |    6 +++---
 mm/rmap.c          |   20 +++++++++++++-------
 3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..90ac50e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
 	struct mempolicy *);
-extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
+extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..bf0600c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
  * anon_vmas being allocated, preventing vma merge in subsequent
  * mprotect.
  */
-struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
+struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *near;
 	unsigned long vm_flags;
@@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 			can_vma_merge_before(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff +
 				((vma->vm_end - vma->vm_start) >> ...
From: KOSAKI Motohiro
Date: Wednesday, April 7, 2010 - 12:00 am

Ahhhh, I'm shame myself. sure, neighbor vma might have lots avc ;-)

Hmm.. probably I'm moron.
I'm also confusing this locking rule as same as linus said.

after this patch, new locking order are

	down_read(mmap_sem)
		anon_vma_clone(vma, merge_vma)
			list_add(&avc->same_vma, &vma->anon_vma_chain);
			spin_lock(&anon_vma->lock);
			list_add_tail(&avc->same_anon_vma, &anon_vma->head);
			spin_unlock(&anon_vma->lock);
		spin_lock(&anon_vma->lock);
		spin_lock(&mm->page_table_lock);

So, Why mmap_sem read lock can protect vma->anon_vma_chain?
An another threads seems to be able to change avc list concurrentlly and freely.

plus, Why don't we need "vma->anon_vma = merge_vma->anon_vma" assignment?
if vma->anon_vma keep NULL, I think anon_vma_prepare() call anon_vma_clone() 



--

From: Rik van Riel
Date: Wednesday, April 7, 2010 - 7:48 am

You are right, the code needs to take the pagetable_lock
around the call to anon_vma_clone, so other threads
get locked out.

This means the locking order has now been inverted,
with the pagetable_lock on the outside and the
anon_vma locks on the inside.

I have checked all the other call sites to the
anon_vma code.  The direct callers of anon_vma_clone
and anon_vma_fork already hold the mmap_sem for
write.  The callers of anon_vma_prepare hold the
mmap_sem for read - so excluding other callers of
anon_vma_prepare with the page_table_lock is enough.

mm_take_all_locks has the mmap_sem for write.

There seem to be no other traversals of the same_vma
list, so changing the locking order to have the
page_table_lock on the outside of the anon_vma locks

Added in the new version.  See the next email.

--

From: Rik van Riel
Date: Wednesday, April 7, 2010 - 7:54 am

When a new VMA has a mergeable anon_vma with a neighboring VMA,
make sure all of the neighbor's old anon_vma structs are also
linked in.

This is necessary because at some point the VMAs could get merged,
and we want to ensure no anon_vma structs get freed prematurely,
while the system still has anonymous pages that belong to those
structs.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
v2:
 - fix the locking issues spotted by Kosaki Motohiro
 - set vma->anon_vma correctly

 include/linux/mm.h |    2 +-
 mm/mmap.c          |    6 +++---
 mm/rmap.c          |   27 ++++++++++++++++++---------
 3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..90ac50e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
 	struct mempolicy *);
-extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
+extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..bf0600c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
  * anon_vmas being allocated, preventing vma merge in subsequent
  * mprotect.
  */
-struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
+struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *near;
 	unsigned long vm_flags;
@@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 ...
From: Linus Torvalds
Date: Wednesday, April 7, 2010 - 8:30 am

No, they're broken.

And Rik, please explain the locking rather than make even more of these 
kinds of random ad-hoc locking rules.

I've said this now _three_ times, but let me repeat once more:

 - the locking rules for that anon_vma_chain are very unclear. I _think_ 
   you mean for them to be "mmap_sem held for writing, _or_ mmap_sem held 
   for reading and page_table_lock held", but nowhere is that actually 
   documented.

Why is it so hard for you to just admit that? Especially after you 

Rik, the above is obviously total crap.

anon_vma_clone() needs to allocate memory, and it does so with GFP_KERNEL. 
You can't do that with a spinlock held.

		Linus
--

From: Rik van Riel
Date: Wednesday, April 7, 2010 - 8:52 am

You are right, the idea was to continue use the locking that
the anon_vma code was already using, without introducing any
new locking with the anon_vma patches.

However, it has become clear that this is no longer possible,
due to the need to hold a secondary lock across anon_vma_clone,

Looks like we'll either have to introduce a per-mm semaphore for
the same_vma anon_vma chains, or move the complexity of solving
this bug to anon_vma_merge, where we can ensure that the resulting
VMA has the sum of the anon_vmas of each VMA.
--

From: Linus Torvalds
Date: Wednesday, April 7, 2010 - 9:56 am

I do wonder if we could possibly simplify this a _lot_ by just requiring 
that the anon_vma gets allocated at vma creation time (ie mmap), rather 
than doing it on-demand when we actually do the page fault.

That would make all of this crap happen under mmap_sem held for writing, 
and it would simplify the faulting code (which is the much more critical 
code) a lot.

And it would make all your locking problems go away. Now all anon_vma code 
really _would_ run with mmap_sem held exclusively, without any races.

When I tried to do a "fill in multiple page table entries in one go" 
patch, that annoying anon_vma issue was a problem as well. Allocating the 
anon_vma up-front would have simplified that code too.

I can't imagine that we ever really have mappings without an anon_vma in 
practice _anyway_, so why delay the allocation until page fault time?

Maybe I'm missing something subtle. 

			Linus
--

From: Linus Torvalds
Date: Wednesday, April 7, 2010 - 2:19 pm

Here is a patch that boots for me (but has had _zero_ serious testing: 
caveat emptor etc etc).

It basically moves "anon_vma_prepare()" to be called in vma_link and in 
__insert_vm_struct() - which I _think_ should cover all normal vma 
creation events. I did a "WARN_ONCE(!vma->anon_vma)" just to check, I 
haven't triggered one yet.

Now, this clearly will create anon_vma's that may never get used at all, 
ie for things like shared mappings etc that never have anonymous memory 
associated with them. But that structure is pretty small, so I don't find 
it in myself to care too deeply.

And with this, all the anon_vma games shuld all happen with mmap_sem held 
for writing, which should hopefully simplify things a lot. Rik, can you 
use this to make a new version of your fixing patch?

Comments?

		Linus

---
 mm/memory.c |   10 +---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..0abefd8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3115,6 +3106,7 ...
From: Rik van Riel
Date: Wednesday, April 7, 2010 - 2:52 pm

I remember there being an "unfixable" spot with this
approach when I originally wrote the new anon_vma
linking code.

However, I can't for the life of me find that spot.
I am starting to believe I made it fixable as a side
effect of one of the changes I made :)

One of the issues with your patch is that anon_vma_prepare
can fail and this patch ignores its return value.

Having anon_vma-prepare fail after an mremap or mprotect
might result in messing up the VMAs of a process, or having
to undo the VMA changes that were made.

In fact, this may be the problem I was running into - not
wanting to add even more complex error paths to the vma
shuffling code.
--

From: Linus Torvalds
Date: Wednesday, April 7, 2010 - 3:09 pm

Yes. The failure point is too late to do anything really interesting with, 
and the old code also just causes a SIGBUS. My intention was to change the 

	WARN_ONCE(!vma->anon_vma);

into returning that SIGBUS - which is not wonderful, but is no different 
from old failures.

In the long run, it would be nicer to actually return an error from the 
mmap() that fails, but that's more complicated, and as mentioned, it's not 
what the old code used to do either (since the failure point was always at 

We really aren't any worse off than we have always been.

If anon_vma_prepare() fails, the vma list will be valid, but no new pages 
can be added to that vma. That used to be true before too.

			Linus
--

From: Linus Torvalds
Date: Wednesday, April 7, 2010 - 3:15 pm

Put another way: I'm not proud of it, but the new code isn't any worse 
than what we used to have, and I think the new code is _fixable_.

The easiest way to do that would likely be to pre-allocate the anon_vma 
struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That 
way anon_vma_prepare() itself can never fail, and all we need to do is a 
simple allocation earlier in the call-chain.

			Linus
--

From: Rik van Riel
Date: Wednesday, April 7, 2010 - 5:38 pm

Agreed, it is no worse than what we had before.

As to fixable, I supect both situations are fixable.
The new code by getting the error paths right, the
old code by completely bailing out of the page fault
and retrying it (the pageout code should trigger an

That may not work, because we may want to merge the anon_vma
with the anon_vma in an adjacant VMA ... and that adjacant
VMA could be chained onto multiple anon_vmas.

That means allocating a single anon_vma_chain may not be
enough.
--

From: Linus Torvalds
Date: Wednesday, April 7, 2010 - 4:37 pm

Not SIGBUS, but VM_FAULT_OOM, of course.

IOW, something like this should be no worse than what we have now, and has 
the much nicer locking semantics.

Having done some more digging, I can point to a downside: we do end up 
having about twice as many anon_vma entries. It seems about half of the 
vma's never need an anon_vma entry, probably because they end up being 
read-only file mappings, and thus never trigger the anonvma case.

That said:

 - I don't really think you can fix the locking problem you have in a 
   saner way

 - the anon_vma entry is much smaller than the vm_area_struct, so we're 
   still using much less memory for them than for vma's.

 - We _could_ avoid allocating anonvma entries for shared mappings or for 
   mappings that are read-only. That might force us to allocate some of 
   them at mprotect time, and/or when doing a forced COW event with 
   ptrace, but we have the mmap_sem for writing for the one case, and we 
   could decide to get it for the other.

   So it's not a _fundamental_ problem if we decide we want to recover 
   most of the memory lost by doing unconditional allocations.

There are alternative models. For example, the VM layer _could_ decide to 
just release the mmap_sem, and re-do it and take it for writing if the vma 
doesn't have an anon_vma. 

I dunno. I like how this patch makes things so much less subtle, though.  
For example: with this in place, we could further simplify 
anon_vma_prepare(), since it would now never have the re-entrancy issue 
and wouldn't need to worry about taking that page_table_lock and 
re-testing vma->anon_vma for races.

		Linus

---
 mm/memory.c |   12 +++---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..b5efe76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if ...
From: KOSAKI Motohiro
Date: Wednesday, April 7, 2010 - 7:03 pm

Hi

Wow, your patch is very cool. I'm surprising such 20 lines patch makes

Now pagefault don't insert anon_vma anymore, right? if so, SIGBUS is better.
Now SIGBUS and VM_FAULT_OOM make different result.

SIGBUS       -> kill current task
VM_FAULT_OOM -> invoke oom-killer (see pagefault_out_of_memory())

If current task can't recover proper anon_vma. we should just kill current
instead random highest badness process. otherwise !anon_vma process continue
to randomly invoke oom-killer.

Perhaps, I'm missing something.



--

From: Linus Torvalds
Date: Wednesday, April 7, 2010 - 7:33 pm

Yeah, maybe VM_FAULT_SIGBUS works ok instead of VM_FAULT_OOM. But the 
cause of it is the system having been oom when themappign was created, so 

Yes, that is a good point.

Anyway, I think it might be interesting to test my anon_vma_prepare() 
locking change patch together with Rik's _first_ version of his "fix 
anon_vma_prepare" thing (the one without the spinlock). They should apply 
independently of each other, and maybe it all even works together.

		Linus
--

From: Borislav Petkov
Date: Wednesday, April 7, 2010 - 10:47 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
mappings while we might sleep in anon_vma_prepare():

[    9.386929] BUG: sleeping function called from invalid context at mm/rmap.c:119
[    9.387188] in_atomic(): 1, irqs_disabled(): 0, pid: 1068, name: modprobe
[    9.387343] 3 locks held by modprobe/1068:
[    9.387524]  #0:  (&p->cred_guard_mutex){+.+.+.}, at: [<ffffffff810d97fc>] prepare_bprm_creds+0x29/0x5a
[    9.387959]  #1:  (&mm->mmap_sem){++++++}, at: [<ffffffff81110ee2>] elf_map+0x70/0x190
[    9.388416]  #2:  (&(&inode->i_data.i_mmap_lock)->rlock){+.+...}, at: [<ffffffff810bcbdf>] vma_adjust+0x190
/0x3ca
[    9.388848] Pid: 1068, comm: modprobe Not tainted 2.6.34-rc3-00290-ge4b2849 #6
[    9.389102] Call Trace:
[    9.389256]  [<ffffffff810630f6>] ? __debug_show_held_locks+0x22/0x24
[    9.389418]  [<ffffffff8102c288>] __might_sleep+0x117/0x11b
[    9.389570]  [<ffffffff810c0f2e>] anon_vma_prepare+0x30/0x132
[    9.389722]  [<ffffffff810bcd95>] vma_adjust+0x346/0x3ca
[    9.389874]  [<ffffffff810bcf68>] __split_vma+0x14f/0x1b9
[    9.390027]  [<ffffffff810bd143>] do_munmap+0x171/0x315
[    9.390181]  [<ffffffff81110ee2>] ? elf_map+0x70/0x190
[    9.390335]  [<ffffffff81110f9d>] elf_map+0x12b/0x190
[    9.390493]  [<ffffffff81111b35>] load_elf_binary+0xb33/0x170e
[    9.390645]  [<ffffffff8102d529>] ? sub_preempt_count+0xa3/0xb6
[    9.390800]  [<ffffffff810d945a>] search_binary_handler+0x166/0x30e
[    9.390952]  [<ffffffff810d92ab>] ? copy_strings+0x1d4/0x1e5
[    9.391111]  [<ffffffff81111002>] ? load_elf_binary+0x0/0x170e
[    9.391265]  [<ffffffff810dadff>] do_execve+0x1fc/0x2f5
[    9.391424]  [<ffffffff8100a379>] sys_execve+0x43/0x61
[    9.391576]  [<ffffffff810025fa>] stub_execve+0x6a/0xc0


-- 
Regards/Gruss,
Boris.
--

From: Linus Torvalds
Date: Thursday, April 8, 2010 - 7:11 am

Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in 
__insert_vm_struct.

It should be simple enough to just move it into the caller, just after it 
releases that lock. There's only one user of that __insert_vm_struct() 
anyway. You can do it yourself, or you can replace my previous patch with 
this..

[ The patch below also makes it warn once and return SIGBUS for the case 
  where there is no anon_vma.  I decided I still want to hear about it if 
  there might be some path that tries to insert a vma on its own ]

		Linus

---
 mm/memory.c |   12 +++---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..08d4423 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma"))
+		return VM_FAULT_SIGBUS;
+
 	__set_current_state(TASK_RUNNING);
 
 ...
From: Rik van Riel
Date: Thursday, April 8, 2010 - 11:25 am

Reviewed-by: Rik van Riel <riel@redhat.com>

I haven't seen any places that insert VMAs by itself.
Several strange places that allocate them, but they
all appear to use the standard functions to insert them.
--

From: Linus Torvalds
Date: Thursday, April 8, 2010 - 11:32 am

Yeah, I think I'll commit it as-is, assuming we get confirmation that it 
(along with your patch) actually ends up fixing the original problem.

I had actually had lockdep etc on with that patch, but for some reason I'd 
overlooked the SPINLOCK_SLEEP debugging, so I hadn't seen the stupid issue 
that Borislav pointed out. I wonder if LOCKDEP or spinlock debugging hould 
just select it. Small detail, but I should have caught that obvious bug 

Yeah, it's complicated enough to add a vma with all the rbtree etc stuff 
that I hope nobody actually cooks their own. But I too grepped for vma 
allocations, and there were more of them than I expected, so...

		Linus
--

From: Borislav Petkov
Date: Thursday, April 8, 2010 - 1:31 pm

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, Apr 08, 2010 at 11:32:06AM -0700

Here we go, another night of testing starts... got more caffeine this

... and of course, I just hit that WARN_ONCE on the first suspend (it did
suspend ok though):

[   88.078958] ------------[ cut here ]------------
[   88.079007] WARNING: at mm/memory.c:3110 handle_mm_fault+0x56/0x67c()
[   88.079032] Hardware name: System Product Name
[   88.079056] Mapping with no anon_vma
[   88.079082] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod k10temp 8250_pnp 8250 serial_core edac_core ohci_hcd pcspkr
[   88.079637] Pid: 1965, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9 #7
[   88.079676] Call Trace:
[   88.079713]  [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[   88.079744]  [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[   88.079774]  [<ffffffff810b857d>] handle_mm_fault+0x56/0x67c
[   88.079805]  [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[   88.079838]  [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[   88.079866]  [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[   88.079898]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   88.079929]  [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   88.079960]  [<ffffffff813f91ff>] page_fault+0x1f/0x30
[   88.079988] ---[ end trace 154dd7f6249e1cc3 ]---

and then sysfs triggered that lockdep circular locking warning - I
thought it was fixed already :(


[  256.831204] =======================================================
[  256.831210] [ INFO: possible circular locking dependency detected ]
[  256.831216] 2.6.34-rc3-00290-g2156db9 #7
[  256.831221] -------------------------------------------------------
[  256.831226] hib.sh/2464 is trying to acquire lock:
[  256.831231]  (s_active#80){++++.+}, at: [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f
[  256.831250] 
[  ...
From: Borislav Petkov
Date: Thursday, April 8, 2010 - 2:00 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

And this happens quite often - I changed the WARN_ONCE to WARN and can't
start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up
upon boot too:

[   55.814570] ------------[ cut here ]------------
[   55.814623] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a()
[   55.814648] Hardware name: System Product Name
[   55.814671] Mapping with no anon_vma
[   55.814693] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[   55.815249] Pid: 1936, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9-dirty #8
[   55.815290] Call Trace:
[   55.815327]  [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[   55.815362]  [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[   55.815391]  [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a
[   55.815420]  [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[   55.815452]  [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[   55.815483]  [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[   55.815518]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   55.815553]  [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   55.815585]  [<ffffffff813f91ff>] page_fault+0x1f/0x30
[   55.815613] ---[ end trace fa59f67cbfeeca44 ]---
[   60.801651] ------------[ cut here ]------------
[   60.801672] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a()
[   60.801681] Hardware name: System Product Name
[   60.801689] Mapping with no anon_vma
[   60.801702] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[   60.802156] Pid: 2008, comm: iceowl-bin Tainted: G        W  2.6.34-rc3-00290-g2156db9-dirty #8
[   60.802169] Call ...
From: Linus Torvalds
Date: Thursday, April 8, 2010 - 4:16 pm

Hmm. I tried console-kit-daemon, which I had installed, but didn't get 
anything like that. Probably some setup difference.

I also went through every user of 'vm_area_cachep', and saw nothing 
suspicious at least for the mmu case (I didn't check the nommu.c code). I 
must have missed something.

One thing you could do is to add some more debugging info when that "no 
anon_vma" warning happens. In particular, if you still have the SLUB 
debugging on, you could try to do that

	page = virt_to_head_page(vma);
	object_err(vm_area_cachep, page, (void *)vma, "NULL anon_vma");

and it should give you _which_ routine did the kmem_cache_alloc() for the 
vma that doesn't have an anon_vma.

		Linus
--

From: Borislav Petkov
Date: Thursday, April 8, 2010 - 4:47 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Yep, looks good: its mmap_region()...


[   88.237326] ------------[ cut here ]------------
[   88.237377] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x6ab()
[   88.237403] Hardware name: System Product Name
[   88.237428] Mapping with no anon_vma
[   88.237451] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[   88.237938] Pid: 1978, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9-dirty #9
[   88.237980] Call Trace:
[   88.239269]  [<ffffffff81037ec0>] warn_slowpath_common+0x7c/0x94
[   88.239320]  [<ffffffff81037f2f>] warn_slowpath_fmt+0x41/0x43
[   88.239378]  [<ffffffff810b8582>] handle_mm_fault+0x43/0x6ab
[   88.239440]  [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[   88.239471]  [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[   88.239517]  [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[   88.239548]  [<ffffffff813f9463>] ? error_sti+0x5/0x6
[   88.239597]  [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   88.239626]  [<ffffffff813f927f>] page_fault+0x1f/0x30
[   88.239674] ---[ end trace 42d53170a0d3ccef ]---
[   88.239699] =============================================================================
[   88.239750] BUG vm_area_struct: NULL anon_vma
[   88.239790] -----------------------------------------------------------------------------
[   88.239794] 
[   88.239805] INFO: Allocated in mmap_region+0x23d/0x500 age=2 cpu=0 pid=1978
[   88.239815] INFO: Slab 0xffffea0007a0f0e8 objects=17 used=1 fp=0xffff88022dfbb0f0 flags=0x80000000000000c2
[   88.239823] INFO: Object 0xffff88022dfbb000 @offset=0 fp=0xffff88022dfbb0f0
[   88.239827] 
[   88.239832]   Object 0xffff88022dfbb000:  00 32 53 2b 02 88 ff ff 00 20 ab 29 d1 7f 00 00 .2S+..ÿÿ..«)Ñ...
[   88.239861]   Object 0xffff88022dfbb010:  00 30 ac ...
From: Linus Torvalds
Date: Thursday, April 8, 2010 - 5:50 pm

Can you double-check your current diffs - maybe something got corrupted.

mmap_region installs the vma with vma_link(), and the last thing 
vma_link() does with my patch is that "anon_vma_prepare()".

Maybe with all the patches flying around, you had a reject or something, 
and you lost that one anon_vma_prepare()?

Or maybe I screwed up somewhere and sent you the wrong patch. Here it is 
again, just in case.

[ I have a horrible cold, and can hardly think straight. So who knows, 
  maybe I'm missing something. But if you have lost one of the 
  'anon_vma_prepare()' call sites, that would certainly explain why you 
  get NULL anon_vma's ]

		Linus

---
 mm/memory.c |   12 +++---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..08d4423 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (WARN_ONCE(!vma->anon_vma, "Mapping with no ...
From: Borislav Petkov
Date: Thursday, April 8, 2010 - 6:30 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Right, it looks like it. I'll add some more debugging calls there
tomorrow - it might give us more clues in case someone hasn't caught it

Doesn't look like it - here's the diff between yours and what I have
applied here (yep, only minor fuzz but no code differences) Also, I've
added my version at the end:

--- a.diff      2010-04-09 03:03:35.000000000 +0200
+++ b.diff      2010-04-09 03:03:52.000000000 +0200
@@ -1,8 +1,8 @@
 diff --git a/mm/memory.c b/mm/memory.c
-index 1d2ea39..bd7ea7f 100644
+index 833952d..08d4423 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
-@@ -2224,9 +2224,6 @@ reuse:
+@@ -2223,9 +2223,6 @@ reuse:
  gotten:
        pte_unmap_unlock(page_table, ptl);
  
@@ -12,7 +12,7 @@ index 1d2ea39..bd7ea7f 100644
        if (is_zero_pfn(pte_pfn(orig_pte))) {
                new_page = alloc_zeroed_user_highpage_movable(vma, address);
                if (!new_page)
-@@ -2767,8 +2764,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
        /* Allocate our own private page. */
        pte_unmap(page_table);
  
@@ -21,7 +21,7 @@ index 1d2ea39..bd7ea7f 100644
        page = alloc_zeroed_user_highpage_movable(vma, address);
        if (!page)
                goto oom;
-@@ -2864,10 +2859,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        if (flags & FAULT_FLAG_WRITE) {
                if (!(vma->vm_flags & VM_SHARED)) {
                        anon = 1;
@@ -32,7 +32,7 @@ index 1d2ea39..bd7ea7f 100644
                        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
                                                vma, address);
                        if (!page) {
-@@ -3116,6 +3107,9 @@ int handle_mm_fault(struct mm_struct *mm, struct ...
From: Borislav Petkov
Date: Friday, April 9, 2010 - 2:21 am

From: Borislav Petkov <bp@alien8.de>

So I went and reapplied the three patches (3rd is the object_err export
for SLUB debugging) on a new branch of today's git - same results, the
same processes crap up in the WARN(!vma->anon_vma) check so it should be
something else we're missing.

More code staring later...

-- 
Regards/Gruss,
Boris.
--

From: Linus Torvalds
Date: Friday, April 9, 2010 - 9:35 am

Can you try with _just_ my patch? Or add a 

	vma->anon_vma = merge_vma->anon_vma;

to Rik's "merge_vma" case in anon_vma_prepare().

Because I'm starign at Rik's patch, and one thing strikes me: it does that 
"anon_vma_clone()" in anon_vma_prepare(), and maybe I'm blind, but I don't 
see where that actually sets vma->anon_vma.

As far as I can tell, anon_vma_clone() was designed purely for the fork() 
case, which has done

	*new = *vma;

which will set new->anon_vma to the same vma. But Rik's patch never does 
that for the anon_vma_prepare() case.

And maybe we should do it in anon_vma_clone() itself, just to make it 
impossible to mistakenly leave it out, the way I think Rik's patch did.

Anyway, I'm still groggy from allt he flu medication, so take everything I 
say with a grain of salt.

In fact, the more I look at this, the less I think I like Rik's patch in 
the first place. I think the real bug that Rik tried to fix is that 
apparently anon_vma_merge() doesn't necessarily merge everything right. 

and I think that _this_ is the real bug to begin with. The real fix should 
be in vma_adjust/anon_vma_merge, not in how we set up the anon_vma in the 
first place. I do _not_ think we should require that we always merged 
things at mmap() time, because we may _never_ be able to merge perfectly 
(ie start out with to disjoing mmaps, and fill in the middle).

			Linus
--

From: Borislav Petkov
Date: Friday, April 9, 2010 - 10:40 am

From: Linus Torvalds <torvalds@linux-foundation.org>

Yep, yours along with the SLUB debugging piece just survived one
hibernation cycle without a problem. Also, no SIGBUS-killed processes,
all seems fine. Will continue stressing it though...

Let me know what you want me to do next.

-- 
Regards/Gruss,
Boris.
--

From: Linus Torvalds
Date: Friday, April 9, 2010 - 10:50 am

Continue stress-testing it. I don't think my patch on its own should fix 
the original problem, but at least we now know why you got those NULL 
anon_vma's.

So what I _think_ will happen is that you'll be able to re-create the 
problem that started this all.  But I'd like to verify that, just because 
I'm anal and I'd like these things to be tested independently.

So assuming that the original problem happens again, if you can then apply 
Rik's patch, but add a

	dst->anon_vma = src->anon_vma;

to just before the success case (the "return 0") in anon_vma_clone(), 
that would be good.

			Linus
--

From: Borislav Petkov
Date: Friday, April 9, 2010 - 12:14 pm

From: Linus Torvalds <torvalds@linux-foundation.org>


It looks like this way we mangle the anon_vma chains somehow. From
what I can see and if I'm not mistaken, we save the anon_vmas alright
but end up in what seems like an endless list_for_each_entry()
loop having grabbed anon_vma->lock in page_lock_anon_vma() and we
can't seem to yield it through page_unlock_anon_vma() at the end of
page_referenced_anon() so it has to be that code in between iterating
over each list entry...

I could be completely wrong though...


[  373.683545] PM: Syncing filesystems ... done.
[  373.950289] Freezing user space processes ... (elapsed 0.04 seconds) done.
[  373.998878] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  374.011121] PM: Preallocating image memory... 
[  439.161126] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617]
[  439.161315] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  439.162302] irq event stamp: 0
[  439.162302] hardirqs last  enabled at (0): [<(null)>] (null)
[  439.162302] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[  439.163297] softirqs last  enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[  439.163297] softirqs last disabled at (0): [<(null)>] (null)
[  439.163297] CPU 1 
[  439.163297] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  439.165297] 
[  439.165297] Pid: 3617, comm: hib.sh Tainted: G        W  2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name
[  439.165297] RIP: 0010:[<ffffffff8118b731>]  [<ffffffff8118b731>] delay_tsc+0x0/0xca
[  439.165297] RSP: 0018:ffff8801f68b77f0  EFLAGS: 00000202
[  ...
From: Linus Torvalds
Date: Friday, April 9, 2010 - 12:32 pm

Ok. So scratch Rik's patch. It doesn't work even with the anon_vma set up.

Rik? I think it's back to you. I'm not going to bother committing the 
change to the anon_vma locking unless you actually need the locking 
guarantees for anon_vma_prepare().

And I've got the feeling that the proper fix is in the vma_adjust() 
handling if your original idea was right.

Anybody?

We're at the point where I've already delayed -rc4 several days because 
it's pointless cutting it without fixing this. One option is to just say 
"f*ck it, we'll revert it all and try again later". But it feels so 
close..

		Linus
--

From: Rik van Riel
Date: Friday, April 9, 2010 - 1:03 pm

We can fix it on the other side, by changing anon_vma_merge
to actually link all the anon_vma structs into the VMA.

An added benefit is that we are already holding the required
lock (mmap_sem) exclusively in that code path.

I'll cook up a patch and I'll mail it out after a little
testing.
--

From: Johannes Weiner
Date: Friday, April 9, 2010 - 1:43 pm

Okay, I think I got it working.  I first thought we would need an
m^n loop to properly merge the anon_vma_chains, but we can actually
be cleverer than that:

---
Subject: mm: properly merge anon_vma_chains when merging vmas

Merging can happen when two VMAs were split from one root VMA or
a mergeable VMA was instantiated and reused a nearby VMA's anon_vma.

In both cases, none of the VMAs can grow any more anon_vmas and forked
VMAs can no longer get merged due to differing primary anon_vmas for
their private COW-broken pages.

In the split case, the anon_vma_chains are equal and we can just drop
the one of the VMA that is going away.

In the other case, the VMA that was instantiated later has only one
anon_vma on its chain: the primary anon_vma of its merge partner (due
to anon_vma_prepare()).

If the VMA that came later is going away, its anon_vma_chain is a
subset of the one that is staying, so it can be dropped like in the
split case.

Only if the VMA that came first is going away, its potential parent
anon_vmas need to be migrated to the VMA that is staying.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

It compiles and boots but I have not really excercised this code.
Boris, could you give it a spin?  Thanks!

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..ecef882 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -114,13 +114,7 @@ int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 void anon_vma_free(struct anon_vma *);
-
-static inline void anon_vma_merge(struct vm_area_struct *vma,
-				  struct vm_area_struct *next)
-{
-	VM_BUG_ON(vma->anon_vma != next->anon_vma);
-	unlink_anon_vmas(next);
-}
+void anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *);
 
 /*
  * rmap interfaces called when adding or removing pte of page
diff --git a/mm/rmap.c ...
From: Rik van Riel
Date: Friday, April 9, 2010 - 1:57 pm

I've looked it over 5 times, can't find anything wrong
with it.  Your approach looks like it should work just
fine.


Reviewed-by: Rik van Riel <riel@redhat.com>
--

From: Borislav Petkov
Date: Friday, April 9, 2010 - 2:33 pm

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, Apr 09, 2010 at 10:43:28PM +0200


ok, I got this ontop of mainline (no other patches from this thread)
but unfortunately it breaks at the same spot while under heavy page
reclaiming when trying to hibernate while booting 3 guests.

[  322.171120] PM: Preallocating image memory... 
[  322.477374] BUG: unable to handle kernel NULL pointer dereference at (null)
[  322.477376] IP: [<ffffffff810c0c87>] page_referenced+0xee/0x1dc
[  322.477376] PGD 2014e8067 PUD 221b4e067 PMD 0 
[  322.477376] Oops: 0000 [#1] PREEMPT SMP 
[  322.477376] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
[  322.477376] CPU 3 
[  322.477376] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core k10temp ohci_hcd edac_core
[  322.477376] 
[  322.477376] Pid: 2750, comm: hib.sh Tainted: G        W  2.6.34-rc3-00411-ga7247b6 #13 M3A78 PRO/System Product Name
[  322.477376] RIP: 0010:[<ffffffff810c0c87>]  [<ffffffff810c0c87>] page_referenced+0xee/0x1dc
[  322.477376] RSP: 0018:ffff88020936d8b8  EFLAGS: 00010283
[  322.477376] RAX: ffff88022de91af0 RBX: ffffea0006dcb488 RCX: 0000000000000000
[  322.477376] RDX: ffff88020936dcf8 RSI: ffff88022de91ac8 RDI: ffff88022ced0000
[  322.477376] RBP: ffff88020936d938 R08: 0000000000000002 R09: 0000000000000000
[  322.477376] R10: 0000000000000246 R11: 0000000000000003 R12: 0000000000000000
[  322.477376] R13: ffffffffffffffe0 R14: ffff88022de91ab0 R15: ffff88020936da00
[  322.477376] FS:  00007f286493e6f0(0000) GS:ffff88000a600000(0000) knlGS:0000000000000000
[  322.477376] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  322.477376] CR2: 0000000000000000 CR3: 00000001f8354000 CR4: 00000000000006e0
[  322.477376] DR0: 0000000000000090 DR1: 00000000000000a4 DR2: 00000000000000ff
[  322.477376] DR3: 000000000000000f DR6: 00000000ffff0ff0 DR7: ...
From: Linus Torvalds
Date: Friday, April 9, 2010 - 4:22 pm

This comment makes my head hurt. In fact, the whole anon_vma thing hurts 
my head.

Can we have some better high-level documentation on what happens for all 
the cases.

 - split (mprotect, or munmap in the middle):

	anon_vma_clone: the two vma's will have the same anon_vma, and the 
	anon_vma chains will be equivalent. 

 - merge (mprotect that creates a mergeable state):

	anon_vma_merge: we're supposed to have a anon_vma_chain that is 
	a superset of the two chains of the merged entries.

 - fork:

	anon_vma_fork: each new vma will have a _new_ anon_vma as it's 
	primary one, and will link to the old primary trough the 
	anon_vma_chain. It's doing this with a anon_vma_clone() followed 
	by adding an entra entry to the new anon_vma, and setting 
	vma->anon_vma to the new one.

 - create/mmap:

	anon_vma_prepare: find a mergeable anon_vma and use that as a 
	singleton, because the other entries on the anon_vma chain won't 
	matter, since they cannot be associated with any pages associated 
	with the newly created vma..

Correct?

Quite frankly, just looking at that, I can't see how we get to your rules. 
At least not trivially. Especially with multiple merges, I don't see 
how "singleton" is such a special case.

		Linus
--

From: Rik van Riel
Date: Friday, April 9, 2010 - 4:45 pm

The trick is in the fact that anon_vma_merge is only called
when vma->anon_vma == vma1->anon_vma.

If the top anon_vmas are different, then anon_vma_merge will
not be called.

This means that VMAs which have recently passed through fork
will not be passed to anon_vma_merge, because their top
anon_vmas are different.

That leaves just the split & create cases, which will be
passed to anon_vma_merge when they are merged.

In case of split, they will have identical anon_vma chains.

In case of create + merge, one of the two VMAs will have
the whole anon_vma chain, while the other one has just
the top anon_vma.
--

From: Linus Torvalds
Date: Friday, April 9, 2010 - 5:03 pm

Sure sure. I still think it's _way_ too complex. See my previous email 
where I suggested one single simple additional rule that I think makes 

Right. The case of different anon_vma's is the trivial one. I don't worry 

And yes, split is fundamentally simple. Split guarantees that the chains 
look identical.


THIS is where I think you simplified a lot and said "and magic happens".

The thing is, in the case of create, we create a different chain. That 
simple fact just makes merging fundamentally complicated.  And we now have 
two different chains, and both of those can split, so those differences 
can "spread out". And you need to guarantee that "merge" really works. It 
didn't work in your original code, and quite frankly, I do _not_ think 
it's entirely obvious that it works in Johannes' code either.

Don't get me wrong: _maybe_ Johannes' code works fine. I just don't think 
it's obvious at all. And if it doesn't work fine, now you're just 
spreading the differences even further.

This is why I suggest that we limit the "re-use an existing vma for a new 
case" to the singleton case, which means that now you _never_ have 
differences at all. There's no spreading on splitting. Merging is trivial. 

Now, admittedly, I'm really hopped up on cough medication, so the feeling 
of this solving all the problems in the universe may not be entirely 
accurate. But it feels so _right_.

I hope if feels right when I'm off my meds too.

			Linus
--

From: Rik van Riel
Date: Friday, April 9, 2010 - 5:11 pm

I am not on any cough meds, and your patch looks right.
OTOH, maybe I should be on some kind of cold meds, because
I haven't been feeling right all week...
--

From: Johannes Weiner
Date: Friday, April 9, 2010 - 4:54 pm

The key is that merging is only possible if the primary anon_vmas are
equivalent.

This only happens if we split a vma in two and clone the old vma's
anon_vma_chain into the new vma.  So the chains are equivalent.

Or anon_vma_prepare() finds a mergeable anon_vma, in which case this
will be the singleton on the vma's chain.

If a split vma is merged, the old anon_vma_chains are equivalent, we
drop one completely and the one that stays has not changed.

If a mergeable vma (singleton anon_vma) is merged into another one,
this singleton is the primary anon_vma of the swallowing vma, thus
already linked and the swallowing vma's anon_vma_chain stays unchanged.

If it's the other way round and the singleton vma swallows the other
one, every anon_vma of the vanishing vma is moved over (except your
singleton anon_vma, you already have that).  The result should look
exactly like the chain we swallowed.  So in all this merging, no
unique and new combination of anon_vma_chains should have been
created!  Thus you can merge as much as you want, either you swallow
singletons and don't change yourself or you are the singleton and
after the merger have an equivalent anon_vma_chain to the vma you
swallowed.

Again: no new anon_vmas should enter the game for mergeable vmas and
no _new_ anon_vma_chains should be created while merging.  Thus it
is always true that you either merge with a singleton or the chains
are equivalent.

At least those are my assumptions.  Maybe they are crap, but I don't
see how right now.

And according to Boris' test, somewhere we still drop anon_vmas
where we let pages in the field pointing at them.

	Hannes
--

From: Linus Torvalds
Date: Friday, April 9, 2010 - 4:56 pm

Ok, so I don't know if the above is correct, but if it is, let's ignore 
the "merge" case as being complex, and look at the other cases.

With fork, the main anon_vma becomes different, so let's ignore that. That 
always means that the resulting list is not comparable or compatible, and 
we'll never mix them up.

If we make one very _simple_ rule for the create/mmap case, namely that we 
only re-use another _singleton_ anon_vma, then split and create case will 
look exactly the same. And in particular, we get a very simple and 
powerful rule: if the anon_vma matches, then the _list_ will also always 
match.

And that, in turn, would make 'merge' trivial too: you really can always 
drop the side that goes away. There's never any question about how to 
merge the lists, or which to pick, because every single operation that 
leaves the anon_vma the same will guarantee that the list will be 
identical too.

So now the simple rule is that if the anon_vma is the same, then the list 
of associated anon_vma's will always be the same - across all of merge, 
split and create.

Isn't that a _much_ simpler model to think about?

So _instead_ of all the patches that have floated about, I would suggest 
this simple change to "find_mergeable_anon_vma()" instead..

Oh, and maybe it's the meds talking again. I'm feeling better than 
yesterday, but am still a bit lightheaded. 

		Linus

---
 mm/mmap.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..462a8ca 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -850,7 +850,8 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
 	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
 
-	if (near->anon_vma && vma->vm_end == near->vm_start &&
+	if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
+			vma->vm_end == near->vm_start &&
  			mpol_equal(vma_policy(vma), vma_policy(near)) ...
From: Rik van Riel
Date: Friday, April 9, 2010 - 5:19 pm

Boris, this is your chance to really ruin our week :)

If the bug persists with Linus's patch, we've been fixing
the wrong bug all week long, and you are experiencing
something else...


--

From: Johannes Weiner
Date: Friday, April 9, 2010 - 5:31 pm

That leaves the chance that my code was correct and we leave a conceptual
error around somewhere that can materialize again.  But I am at a point
where simplification never sounded more blissful, so yeah, I like it :)

Let's hope it fixes Boris's issue.
--

From: Linus Torvalds
Date: Friday, April 9, 2010 - 5:32 pm

Absolutely. I really don't know whether your merge routine works or not. 
I'd just rather not have to even _try_ to understand it.

I have a fairly simple rule for most of the code I see: if I have a hard 


I'm going to just guess that it won't, and that Boris' issue was actually 
due to something else entirely, and we've all been staring at totally the 
wrong code.

But we can hope.

			Linus
--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 12:27 am

From: Linus Torvalds <torvalds@linux-foundation.org>


Now why would you go and jinx it like that... :)

Hibernation runs back-to-back:

1. light system load after boot... ok
2. 3 kvm guests, 3Gb mem free of 8Gb total acc. to /proc/meminfo... ok			[ this was the fireproof way to trigger the bug, btw]
3. kvm guests down, firefox loading a 4Mb html page... ok
4. start ubuntu guest, firefox keeps loading the 4Mb html page after previous resume... ok
5. ubuntu guest booting done, firefox done, play video... ok
6. video broken after resume due to:

[AO_ALSA] Pcm in suspend mode, trying to resume. 212%  2%  1.7% 1 0 
[AO_ALSA] alsa-lib: pcm_hw.c:709:(snd_pcm_hw_resume) SNDRV_PCM_IOCTL_RESUME failed: Function not implemented

i.e., unrelated... still ok

7. ubuntu guest downloading a 100Mb file causing allocation of a bunch of anon memory in the host... ok
8. all guests off, firefox off, back to light load... ok

No oopsies or problems in dmesg except the old lockdep sysfs warning.

I will keep running that kernel in the next couple of days and keep you
informed in case this is the fix we're gonna use.

-- 
Regards/Gruss,
    Boris.
--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 4:26 am

From: Borislav Petkov <bp@alien8.de>

Yep, you jinxed it :)

This time we got stuck on the anon_vma->lock (yep, we've seen that
oopsie before). So, it might be that we _really_ are staring at the
wrong code... Back to square one.


[18969.797126] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:5605]
[18969.797126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd pcspkr serial_core k10temp edac_core
[18969.798029] irq event stamp: 0
[18969.798029] hardirqs last  enabled at (0): [<(null)>] (null)
[18969.798029] hardirqs last disabled at (0): [<ffffffff8103657c>] copy_process+0x3c1/0x10cc
[18969.798029] softirqs last  enabled at (0): [<ffffffff8103657c>] copy_process+0x3c1/0x10cc
[18969.798029] softirqs last disabled at (0): [<(null)>] (null)
[18969.798029] CPU 1 
[18969.798029] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd pcspkr serial_core k10temp edac_core
[18969.798029] 
[18969.798029] Pid: 5605, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0 #1 M3A78 PRO/System Product Name
[18969.798029] RIP: 0010:[<ffffffff8118b7f4>]  [<ffffffff8118b7f4>] delay_tsc+0x33/0xca
[18969.798029] RSP: 0018:ffff8801aebdf7b8  EFLAGS: 00000206
[18969.798029] RAX: 00000000fc6fc9e8 RBX: ffff8801aebdf7e8 RCX: 0000000000001200
[18969.798029] RDX: 0000000000002806 RSI: ffff8801aebdf848 RDI: 0000000000000001
[18969.798029] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[18969.798029] R10: ffff8801aebdf8a8 R11: 0000000000000001 R12: 0000000000000014
[18969.798029] R13: ffff88000a200000 R14: ffff8801aebde000 R15: ffff8801aebdffd8
[18969.798029] FS:  00007f2c86c656f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[18969.798029] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[18969.798029] ...
From: Rik van Riel
Date: Saturday, April 10, 2010 - 7:45 am

This is a different bug, though.

If the null pointer dereference is gone, Linus's patch
fixed that bug and we can move forward to fixing the
anon_vma->lock bug.

I'll start auditing the code to see if we forget to
unlock the anon_vma in some unlikely error path...
--

From: Linus Torvalds
Date: Saturday, April 10, 2010 - 8:24 am

No, I think we're good. I suspect this is a different issue. Do you have 
lockdep enabled, along with mutex and spinlock debugging etc? That might 
help pinpoint what triggers this.

But I think the fact that you are apparently not able to get the list 
corruption is a good sign. Of course, it might just be harder to trigger, 
and these things could all be a sign of a different bug, but my gut feel 
is that we did fix something, and you are just damn good at stressing the 
new code. Kudos.

		Linus
--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 9:38 am

From: Linus Torvalds <torvalds@linux-foundation.org>


Yep, even my mom says I'm good at breaking things :) But seriously,
thanks - means a lot coming from you.

And I got an oops again, this time the #GP from couple of days ago.

<thinking out loud>

I'm starting to think that maybe there could be something wrong with the
machine I'm running it on. Especially since there are only two people
who reported this issue, Steinar and me, so how probable is it that
maybe those two machines have failing RAM module somewhere? Or some
other data corrupting thing? Although I should be getting mchecks...
Hmm...

</thinking out loud>

Im going to run the stress test on 2.6.33.2 to verify whether this is
actually software-related. Just in case. Oh, yes, I almost forgot, the
latest and greatest in the world of oopsies:


[  452.351588] general protection fault: 0000 [#1] PREEMPT SMP 
[  452.352119] last sysfs file: /sys/power/state
[  452.352131] CPU 1 
[  452.352131] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core ohci_hcd pcspkr k10temp
[  452.352131] 
[  452.352131] Pid: 2929, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0 #4 M3A78 PRO/System Product Name
[  452.352131] RIP: 0010:[<ffffffff810c5f00>]  [<ffffffff810c5f00>] page_referenced+0xee/0x1dc
[  452.352131] RSP: 0018:ffff88022adb18b8  EFLAGS: 00010206
[  452.352131] RAX: ffff88022ad5c468 RBX: ffffea0007598558 RCX: 0000000000000000
[  452.352131] RDX: ffff88022adb1cf8 RSI: ffff88022ad5c440 RDI: ffff88022e7d38a0
[  452.352131] RBP: ffff88022adb1938 R08: 0000000000000002 R09: 0000000000000000
[  452.352131] R10: ffff88022be83868 R11: ffffffff00000012 R12: 0000000000000000
[  452.352131] R13: 0032323200323212 R14: ffff88022ad5c428 R15: ffff88022adb1a00
[  452.352131] FS:  00007f056a1e36f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[  452.352131] CS:  0010 DS: ...
From: Linus Torvalds
Date: Saturday, April 10, 2010 - 10:05 am

Oh damn. So the list corruption really does happen still.

And the pattern is similar, but not the same: now it's 0032323200323232, 
rather than 002e2e2e002e2e2e. Very intriguing. 0x32 instead of 0x2e, but 
the same pattern of duplicated bytes. And not very helpful in that it 

No. Just the fact that there are two people who reported the same 
thing is already a pretty strong sign that it's real. Also, hardware 
problems don't tend to be as consistent in the details as yours have 
been. 

And in fact I have seen it personally (but couldn't reproduce it) on the 
kids mac mini after you reported it.

So I'm convinced the problem is real, and just not so easily 
triggered, and you're being a great tester.

			Linus
--
Here's the one I've seen, in case you care.  I haven't posted it, because 
it doesn't really add anything new.

 BUG: unable to handle kernel NULL pointer dereference at (null)
 IP: [<c02850cf>] page_referenced+0xd6/0x199
 *pde = 21d73067 *pte = 00000000 
 Oops: 0000 [#2] SMP 
 last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sda/uevent
 Modules linked in: [last unloaded: scsi_wait_scan]
 
 Pid: 14440, comm: firefox Tainted: G      D    2.6.34-rc2-00391-gfc1203c #3 Mac-F4208EC8/Macmini1,1
 EIP: 0060:[<c02850cf>] EFLAGS: 00210287 CPU: 1
 EIP is at page_referenced+0xd6/0x199
 EAX: f59e65d4 EBX: c10b5480 ECX: 00000000 EDX: fffffff0
 ESI: f59e65d0 EDI: 00000000 EBP: d8f77cd8 ESP: d8f77ca0
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 Process firefox (pid: 14440, ti=d8f76000 task=cb795440 task.ti=d8f76000)
 Stack:
  f59e65d4 00000000 fffffff0 c15ba000 d8f77cbc c02885b8 c07972c4 d8f77cdc
  c0276712 00000000 00000001 c10b5498 c10b5480 d8f77e94 d8f77d58 c0276b53
  d8f77d48 00000000 00000000 00000000 0000001d d8f77de8 00000001 c07972c4
 Call Trace:
  [<c02885b8>] ? swapcache_free+0x1b/0x24
  [<c0276712>] ? __remove_mapping+0x90/0xb2
  [<c0276b53>] ? shrink_page_list+0x109/0x3ba
  [<c0277099>] ? ...
From: Linus Torvalds
Date: Saturday, April 10, 2010 - 11:21 am

Ho humm.

Maybe I'm crazy, but something started bothering me. And I started 
wondering: when is the 'page->mapping' of an anonymous page actually 
cleared?

The thing is, the mapping of an anonymous page is actually cleared only 
when the page is _freed_, in "free_hot_cold_page()". 

Now, let's think about that. And in particular, let's think about how that 
relates to the freeing of the 'anon_vma' that the page->mapping points to. 

The way the anon_vma is freed is when the mapping is torn down, and we do 
roughly:

	tlb = tlb_gather_mmu(mm,..)
	..
	unmap_vmas(&tlb, vma ..
	..
	free_pgtables()
	..
	tlb_finish_mmu(tlb, start, end);

and we actually unmap all the pages in "unmap_vmas()", and then _after_ 
unmapping all the pages we do the "unlink_anon_vmas(vma);" in 
"free_pgtables()". Fine so far - the anon_vma stay around until after the 
page has been happily unmapped.

But "unmapped all the pages" is _not_ actually the same as "free'd all the 
pages". The actual _freeing_ of the page happens generally in 
tlb_finish_mmu(), because we can free the page only after we've flushed 
any TLB entries.

So what we have in that tlb_gather structure is a list of _pending_ pages 
to be freed, while we already actually free'd the anon_vmas earlier!

Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because 
we use a per-cpu variable), but as far as I can tell it is _not_ an 
RCU-safe region.

So I think we might actually get a real RCU freeing event while this all
happens. So now the 'anon_vma' that 'page->mapping' points to has not just 
been released back to the SLUB caches, the page itself might have been 
released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the above might fix it if I'm not just raving. I 
really might be missing something here.

		Linus

---
 include/asm-generic/tlb.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/tlb.h ...
From: Linus Torvalds
Date: Saturday, April 10, 2010 - 11:26 am

Btw, if this turns out to be accurate, the real fix is to probably just 
have a separate phase at the very end to actually release all the vma's, 
rather than do it in "free_page_tables()". We don't want to make the 
tlb-gather any more atomic than it already is. In fact, Nick is trying to 
make it preemptible.

So the patch included in that mail was meant very much as a "let's test my 
crazy theory" patch, rather than as the real solution.

The patch is also untested. Maybe it doesn't work at all and introduces 
new bugs. Caveat emptor.

			Linus
--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 11:51 am

From: Linus Torvalds <torvalds@linux-foundation.org>

So, if I understand you correctly, the list_head anon_vma gets freed
_before_ the page descriptor itself, therefore we still get a valid
page->mapping in page_lock_anon_vma(). Maybe that explains the funny
patterns in %r13. But how do they come to exist when the anon_vma is
freed, shouldn't there be LIST_POISON or something recognizable?

Anyways, testing...

-- 
Regards/Gruss,
Boris.
--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 11:58 am

From: Borislav Petkov <bp@alien8.de>

Nope, still b0rked. And this time is not a funny pattern but
ffffffffffffffe0 we had originally.

[  521.306972] BUG: unable to handle kernel NULL pointer dereference at (null)
[  521.307126] IP: [<ffffffff810c60b4>] page_referenced+0xee/0x1dc
[  521.307126] PGD 22d952067 PUD 2291db067 PMD 0 
[  521.307126] Oops: 0000 [#1] PREEMPT SMP 
[  521.307126] last sysfs file: /sys/power/state
[  521.307126] CPU 1 
[  521.307126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core ohci_hcd edac_core k10temp
[  521.307126] 
[  521.307126] Pid: 2896, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0-dirty #5 M3A78 PRO/System Product Name
[  521.307126] RIP: 0010:[<ffffffff810c60b4>]  [<ffffffff810c60b4>] page_referenced+0xee/0x1dc
[  521.307126] RSP: 0018:ffff88022bd9f8b8  EFLAGS: 00010283
[  521.307126] RAX: ffff88022af8c338 RBX: ffffea00067e2998 RCX: 0000000000000000
[  521.307126] RDX: ffff88022bd9fcf8 RSI: ffff88022af8c310 RDI: ffff88022c0c5e60
[  521.307126] RBP: ffff88022bd9f938 R08: 0000000000000002 R09: 0000000000000000
[  521.307126] R10: ffff88022b4454d8 R11: ffffffff00000012 R12: 0000000000000000
[  521.307126] R13: ffffffffffffffe0 R14: ffff88022af8c2f8 R15: ffff88022bd9fa00
[  521.307126] FS:  00007ff70fb586f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[  521.307126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  521.307126] CR2: 0000000000000000 CR3: 000000022e19c000 CR4: 00000000000006e0
[  521.307126] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  521.307126] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  521.307126] Process hib.sh (pid: 2896, threadinfo ffff88022bd9e000, task ffff88022c0c5e60)
[  521.307126] Stack:
[  521.307126]  ffff88022af8c338 00000000810c5dd3 ffff88022bd9f918 ffffffff810c5f3c
[  521.307126] <0> ...
From: Linus Torvalds
Date: Saturday, April 10, 2010 - 1:05 pm

Ok, I think that just depends on who happens to re-use the allocation and 
how it does it.

I'm pretty sure it's a use-after-free issue, where we have free'd an 
anon_vma too early, even though it has pages associated with it.

If it wasn't the RCU case, it's just something else.

I think it's worth looking at "vma_adjust()", because as I already 
mentioned to Rik earlier - the code is very hard to understand, and it's 
accrued crud over many many years.

And vma_adjust is the one place that does that anon_vma_merge(), which is 
apart from the actual unmapping sequence the only other place that 
actually free's anon_vmas. So there are reasons to be very suspicious of 
that code.

And I think that code can actually lose an anon_vma chain. It's totally 
screwing up the "import anonvma" case: when it does

                        if (anon_vma_clone(importer, vma)) {
                                return -ENOMEM;
                        }
                        importer->anon_vma = anon_vma;

we can actually have "importer == vma", but "anon_vma = next->anon_vma". 

In which case we actually end up with an _empty_ chain (because importer 
didn't have a chain to begin with!) but "importer->anon_vma" points to an 
anon_vma.

And then when we do that "remove_next", we actually get rid of the only 
chain we ever had, and have lost all our references to the anon_vma.

That looks _horribly_ buggy.

Also, the conditional nesting makes no sense (the whole anon_vma_clone() 
only makes sense if importer is set, and it is only ever set _inside_ the 
earlier if-statement, so the whole code should be moved inside there), nor 
does some of the comments.

This patch is scary and untested, but the more I look at that code, the 
more convinced I am that vma_adjust was _really_ badly screwed up. The 
patch below may make things worse. I'll test it myself too, but I'm 
sending it out first, since I was writing the email as I was looking at 
the piece of cr*p.

		Linus

---
 ...
From: Linus Torvalds
Date: Saturday, April 10, 2010 - 1:12 pm

Ok, it boots. Which means it must be bug-free and perfect. And I really am 
convinced that the old vma_adjust() use of anon_vma_clone() was _totally_ 
broken, so this really could explain everything.

The RCU grace period thing for the TLB flush does look like a real bug 
too, but it's one that is probably impossible to hit in practice.

A broken vma_adjust(), however, would seem to be trivial to hit once you 
just get the right memory freeing patterns going, because the anon_vma 
would easily be _loong_ gone because we didn't create a chain to it at 
all, so the anon_vma code decided that it's not used any more.

So I'm actually pretty optimistic that this really is it.

		Linus
--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 1:36 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Ok, let me verify what/in which order should be tested before I test
something wrongly. The RCU-safe fix for the TLB flush can stay for
correctness reasons, this last patch, obviosly, what happens with the
find_mergeable_anon_vma() changes to use only singleton lists for
merging? Should I keep those too?

-- 
Regards/Gruss,
Boris.
--

From: Linus Torvalds
Date: Saturday, April 10, 2010 - 1:40 pm

Yes. So the patches I actually think are important are:

 - the RCU fix is real, although admittedly the race window is probably 
   too small to ever really hit.

 - the simplification rule to find_mergeable_anon_vma's is required, 
   because otherwise our anon_vma_merge() will do the wrong thing (maybe 
   Johannes' patch would be an alternative, but quite frankly, I think we 
   want the simpler code, and I don't think we even _want_ to share 
   anon_vma's that are complex due to forking)

   I like my "cleanup" version (the bigger one with lots of comments) more 
   than the two-liner version, but they should be equivalent.

 - the vma_adjust() fix is the one that I think may actually end up fixing 
   your problems for good. Knock wood.

So I think they are all required, but I suspect that the vma_adjust() one 
is finally the most direct explanation of the problem you've seen.

		Linus
--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 2:25 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Damn, nope, still no joy :(. It looked like it was fixed but one of the
test was to hibernate right after the 3 kvm guests were shut down and I
guess the mem freeing pattern kinda hits it where it most hurts.

Anyways, I'm going to bed soon, will test whatever you come up with guys
tomorrow morning when I can think again.

By the way, do we want to create a new thread - the mailchain is off the
screen limits of my netbook :)

Thanks.

p.s. Oopsie:


[  647.288638] PM: Syncing filesystems ... done.
[  647.307459] Freezing user space processes ... (elapsed 0.01 seconds) done.
[  647.320981] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  647.334152] PM: Preallocating image memory... 
[  647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
[  647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
[  647.493001] PGD 22a1d1067 PUD 1cb6a9067 PMD 0 
[  647.493001] Oops: 0000 [#1] PREEMPT SMP 
[  647.493001] last sysfs file: /sys/power/state
[  647.493001] CPU 0 
[  647.493001] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp ohci_hcd 8250 serial_core pcspkr k10temp edac_core
[  647.493001] 
[  647.493001] Pid: 3231, comm: hib.sh Not tainted 2.6.34-rc3-00503-g8b3334b #6 M3A78 PRO/System Product Name
[  647.493001] RIP: 0010:[<ffffffff810c60a0>]  [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
[  647.493001] RSP: 0018:ffff880223b6f8b8  EFLAGS: 00010283
[  647.493001] RAX: ffff88022aa316c8 RBX: ffffea0006882fc0 RCX: 0000000000000000
[  647.493001] RDX: ffff880223b6fcf8 RSI: ffff88022aa316a0 RDI: ffff88022de6de60
[  647.493001] RBP: ffff880223b6f938 R08: 0000000000000002 R09: 0000000000000000
[  647.493001] R10: ffff880228cb03a8 R11: ffffffff00000012 R12: 0000000000000000
[  647.493001] R13: ffffffffffffffe0 R14: ffff88022aa31688 R15: ...
From: Linus Torvalds
Date: Saturday, April 10, 2010 - 2:30 pm

Damn, I really hoped that was it. Three independent bugs found and fixed, 

I prefer to keep it in one thread so that they all show up together if I 

Well, it sure is consistent. I'll start to think about what else could go 
wrong..

		Linus
--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 2:51 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Yep, I'll redo the testing tomorrow, so that we are sure that even with

I'll keep the thread then - I didn't know it mattered. Mine was just a

Which could mean that even with those issues fixed, the real issue is
yet something else. Because obviously the fixes you throw at it don't
seem to change it - even the traces remain consistent across tests.
And if it is use-after-free case, the funny patterns could be some
shifted SLUB poison values which we happen to "see" through the dangling
pointer...  I dunno.

Hmm.

-- 
Regards/Gruss,
Boris.
--

From: Borislav Petkov
Date: Sunday, April 11, 2010 - 6:08 am

From: Borislav Petkov <bp@alien8.de>

Ok, I could verify that the three patches we were talking about still
can't fix the issue. However, just to make sure I'm sending the versions
of the patches I used for you guys to check.

[  529.667108] PM: Preallocating image memory... 
[  529.930881] BUG: unable to handle kernel NULL pointer dereference at (null)
[  529.931275] IP: [<ffffffff810c603c>] page_referenced+0xee/0x1dc
[  529.931377] PGD 22e33d067 PUD 22ddc1067 PMD 0 
[  529.931377] Oops: 0000 [#1] PREEMPT SMP 
[  529.931377] last sysfs file: /sys/power/state
[  529.931377] CPU 3 
[  529.931377] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[  529.931377] 
[  529.931377] Pid: 3354, comm: hib.sh Tainted: G        W  2.6.34-rc3-00503-g0fcc334 #1 M3A78 PRO/System Product Name
[  529.931377] RIP: 0010:[<ffffffff810c603c>]  [<ffffffff810c603c>] page_referenced+0xee/0x1dc
[  529.931377] RSP: 0018:ffff880105a118b8  EFLAGS: 00010283
[  529.931377] RAX: ffff88022dc896c8 RBX: ffffea0007a15e10 RCX: 0000000000000000
[  529.931377] RDX: ffff880105a11cf8 RSI: ffff88022dc896a0 RDI: ffff88022b760000
[  529.931377] RBP: ffff880105a11938 R08: 0000000000000002 R09: 0000000000000000
[  529.931377] R10: 0000000000000000 R11: ffffffff00000012 R12: 0000000000000000
[  529.931377] R13: ffffffffffffffe0 R14: ffff88022dc89688 R15: ffff880105a11a00
[  529.931377] FS:  00007f21045876f0(0000) GS:ffff88000a600000(0000) knlGS:0000000000000000
[  529.931377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  529.931377] CR2: 0000000000000000 CR3: 000000022b33f000 CR4: 00000000000006e0
[  529.931377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  529.931377] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  529.931377] Process hib.sh (pid: 3354, threadinfo ffff880105a10000, task ...
From: Borislav Petkov
Date: Sunday, April 11, 2010 - 6:19 am

From: Linus Torvalds <torvalds@linux-foundation.org>


Ho humm.

Maybe I'm crazy, but something started bothering me. And I started
wondering: when is the 'page->mapping' of an anonymous page actually
cleared?

The thing is, the mapping of an anonymous page is actually cleared only
when the page is _freed_, in "free_hot_cold_page()".

Now, let's think about that. And in particular, let's think about how that
relates to the freeing of the 'anon_vma' that the page->mapping points to.

The way the anon_vma is freed is when the mapping is torn down, and we do
roughly:

	tlb = tlb_gather_mmu(mm,..)
	..
	unmap_vmas(&tlb, vma ..
	..
	free_pgtables()
	..
	tlb_finish_mmu(tlb, start, end);

and we actually unmap all the pages in "unmap_vmas()", and then _after_
unmapping all the pages we do the "unlink_anon_vmas(vma);" in
"free_pgtables()". Fine so far - the anon_vma stay around until after the
page has been happily unmapped.

But "unmapped all the pages" is _not_ actually the same as "free'd all the
pages". The actual _freeing_ of the page happens generally in
tlb_finish_mmu(), because we can free the page only after we've flushed
any TLB entries.

So what we have in that tlb_gather structure is a list of _pending_ pages
to be freed, while we already actually free'd the anon_vmas earlier!

Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
we use a per-cpu variable), but as far as I can tell it is _not_ an
RCU-safe region.

So I think we might actually get a real RCU freeing event while this all
happens. So now the 'anon_vma' that 'page->mapping' points to has not just
been released back to the SLUB caches, the page itself might have been
released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the above might fix it if I'm not just raving. I
really might be missing something here.

		Linus
---
 include/asm-generic/tlb.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git ...
From: Borislav Petkov
Date: Sunday, April 11, 2010 - 6:19 am

From: Linus Torvalds <torvalds@linux-foundation.org>


Ok, I think that just depends on who happens to re-use the allocation and
how it does it.

I'm pretty sure it's a use-after-free issue, where we have free'd an
anon_vma too early, even though it has pages associated with it.

If it wasn't the RCU case, it's just something else.

I think it's worth looking at "vma_adjust()", because as I already
mentioned to Rik earlier - the code is very hard to understand, and it's
accrued crud over many many years.

And vma_adjust is the one place that does that anon_vma_merge(), which is
apart from the actual unmapping sequence the only other place that
actually free's anon_vmas. So there are reasons to be very suspicious of
that code.

And I think that code can actually lose an anon_vma chain. It's totally
screwing up the "import anonvma" case: when it does

                        if (anon_vma_clone(importer, vma)) {
                                return -ENOMEM;
                        }
                        importer->anon_vma = anon_vma;

we can actually have "importer == vma", but "anon_vma = next->anon_vma".

In which case we actually end up with an _empty_ chain (because importer
didn't have a chain to begin with!) but "importer->anon_vma" points to an
anon_vma.

And then when we do that "remove_next", we actually get rid of the only
chain we ever had, and have lost all our references to the anon_vma.

That looks _horribly_ buggy.

Also, the conditional nesting makes no sense (the whole anon_vma_clone()
only makes sense if importer is set, and it is only ever set _inside_ the
earlier if-statement, so the whole code should be moved inside there), nor
does some of the comments.

This patch is scary and untested, but the more I look at that code, the
more convinced I am that vma_adjust was _really_ badly screwed up. The
patch below may make things worse. I'll test it myself too, but I'm
sending it out first, since I was writing the email as I was looking at
the ...
From: Borislav Petkov
Date: Sunday, April 11, 2010 - 6:19 am

From: Linus Torvalds <torvalds@linux-foundation.org>


Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated
checks for prev/next compatibility that I just made even more complex.

So I'm actually inclined to want to write my simple two-liner fix as a
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and
some of it is just because one routine got split up into three), and I
think it makes the result a lot more readable.

It also splits off the decision of whether we can reuse an non_vma from
the decision of whether we can merge the vma's - the two are kind of
related, but they are not really the same, and they have different issues.
I think it's good to try to keep separate issues separate.

This is UNTESTED! It's meant to be an "obvious cleanup" with no real
semantic difference, but if I did something wrong it won't work. Also note
the comment about the lack of locking between two adjacent anon_vma's
taking a page fault at the same time: the ACCESS_ONCE() is unlikely to
ever matter (anon_vma's are stable once they are set, so it's really just
that you could first load a NULL, and then if you re-load the value you
might get a non-NULL thing).

Also note that when checking whether the anon_vma is a singleton, we don't
hold any lock that protects the list we are checking. But
"list_is_singular()" is safe and won't oops even if the pointers in the
list are crap, because it only _compares_ the prev/next pointers, it
doesn't dereference them.

In short, what I'm saying is that there is a pretty subtle race in the
very very unlikely case that two anon_vma's get prepared concurrently, but
from a correctness standpoint it doesn't matter. We might sometimes - once
in a blue moon - reject an anon_vma that could in theory have been merged,
but that won't hurt.

Comments? Rik, Johannes?

			Linus
---
 mm/mmap.c |   86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, ...
From: Borislav Petkov
Date: Sunday, April 11, 2010 - 6:25 am

From: Linus Torvalds <torvalds@linux-foundation.org>


Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated
checks for prev/next compatibility that I just made even more complex.

So I'm actually inclined to want to write my simple two-liner fix as a
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and
some of it is just because one routine got split up into three), and I
think it makes the result a lot more readable.

It also splits off the decision of whether we can reuse an non_vma from
the decision of whether we can merge the vma's - the two are kind of
related, but they are not really the same, and they have different issues.
I think it's good to try to keep separate issues separate.

This is UNTESTED! It's meant to be an "obvious cleanup" with no real
semantic difference, but if I did something wrong it won't work. Also note
the comment about the lack of locking between two adjacent anon_vma's
taking a page fault at the same time: the ACCESS_ONCE() is unlikely to
ever matter (anon_vma's are stable once they are set, so it's really just
that you could first load a NULL, and then if you re-load the value you
might get a non-NULL thing).

Also note that when checking whether the anon_vma is a singleton, we don't
hold any lock that protects the list we are checking. But
"list_is_singular()" is safe and won't oops even if the pointers in the
list are crap, because it only _compares_ the prev/next pointers, it
doesn't dereference them.

In short, what I'm saying is that there is a pretty subtle race in the
very very unlikely case that two anon_vma's get prepared concurrently, but
from a correctness standpoint it doesn't matter. We might sometimes - once
in a blue moon - reject an anon_vma that could in theory have been merged,
but that won't hurt.

Comments? Rik, Johannes?

			Linus
---
 mm/mmap.c |   86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, ...
From: Linus Torvalds
Date: Sunday, April 11, 2010 - 10:07 am

Yup, the patches are the ones I wanted you to try.

So either my fixes were buggy (possible, especially for the vma_adjust 
case), or there are other bugs still lurking.

The scary part is that the _old_ anon_vma code didn't really care about 
the anon_vma all that deeply. It was just a placeholder, if you got some 
of it wrong the worst that would probably happen would be that a page 
could never find all the mappings it had. So it was a possible swap 
efficiency problem when we cannot get rid of all mapped pages, but if it 
only happens for some small and unusual special case, nobody would ever 
have noticed.

With the new code, when you have a page that is associated with a stale 
anon_vma, you get the page_referenced() oops instead.

And I can't find the bug. Everything I've looked at looks fine. So I'm 
going to ask you to start applying "validation patches" - code to check 
some internal consistency, and seeing if we break that internal 
consistency somewhere.

It may be that Rik has some patches like this from his development work, 
but here's the first one. This patch should have caught the vma_adjust() 
problem, but all it caught for me was that "anon_vma_clone()" ended up 
cloning the avc entries in the wrong order so the lists didn't actually 
look exactly the same.

The patch fixes that case, so if this triggers any warnings for you, I 
think it's a real bug.

But I'm pretty sure that the problem is that we have a "page->mapping" 
that points to an anon_vma that no longer exists, and you can easily get 
that while still having valid vma chains - they just aren't necessarily 
the complete _set_ of chains they should be.

[ In particular, I think that the _real_ problem is that we don't clear 
  "page->mapping" when we unmap a page.

  See the comment at the end of page_remove_rmap(), and it also explains 
  the test for "page_mapped()" in page_lock_anon_vma().

  But I think the bug you see might be exactly the race between 
  page_mapped() and ...
From: Linus Torvalds
Date: Sunday, April 11, 2010 - 10:16 am

Actually, so if it's that race, then we might get rid of the oops with 
this total hack.

NOTE! If this is the race, then the hack really is just a hack, because it 
doesn't really solve anything. We still take the spinlock, and if bad 
things has happened, _that_ can still very much fail, and you get the 
watchdog lockup message instead. So this doesn't really fix anything.

But if this patch changes behavior, and you no longer see the oops, that 
tells us _something_. I'm not sure how useful that "something" is, but it 
at least means that there are no _mapped_ pages that have that stale 
anon_vma pointer in page->mapping.

Conversely, if you still see the oops (rather than the watchdog), that 
means that we actually have pages that are still marked mapped, and that 
despite that mapped state have a stale page->mapping pointer. I actually 
find that the more likely case, because otherwise the window is _so_ small 
that I don't see how you can hit the oops so reliably.

Anyway - probably worth testing, along with the verify_vma() patch. If 
nothing else, if there is no new behavior, even that tells us something. 
Even if that "something" is not a huge piece of information.

		Linus

---
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
-	return anon_vma;
+
+	if (page_mapped(page))
+		return anon_vma;
+
+	spin_unlock(&anon_vma->lock);
 out:
 	rcu_read_unlock();
 	return NULL;
--

From: Borislav Petkov
Date: Sunday, April 11, 2010 - 11:55 am

From: Linus Torvalds <torvalds@linux-foundation.org>

Ok, did test with the all 5 patches applied. It oopsed with the same
trace, see below. Except one kernel/sched.c:3555 warning checking
spinlock count overflowing, nothing else. :(

I tried to see whether the page->mapping pointer is stale, I dunno,
maybe there could be something in the register dump which could tell us
what's happening. This is how I see it, I could very well be wrong and
missing something though:


So, yes, we oops at the same place, however, a bit early we do

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return referenced;

which compiles here to

	.loc 1 496 0
	movq	%rbx, %rdi	# page,
	call	page_lock_anon_vma	#
.LVL288:
	.loc 1 497 0
	testq	%rax, %rax	# anon_vma
.LVL289:
	.loc 1 496 0
	movq	%rax, %r14	#, anon_vma

and I checked that on the path before the instruction where we oops we
don't touch %r14 so the value in the register dump below should be that
anon_vma. Which looks like valid kernel pointer. We dereference it later
to get anon_vma->head.next with

	.loc 1 501 0
	movq	64(%r14), %r13	# <variable>.head.next, <variable>.head.next
.LBE1287:
	leaq	64(%r14), %rax	#,
	movq	%rax, -128(%rbp)	#, %sfp
.LBB1288:
	subq	$32, %r13	#, avc

which ends up in %r13 as ffffffffffffffe0.

So, it really looks like at least that list_head in anon_vma is
bollocks, or even the whole anon_vma. So if this is correct, it is
highly likely that the anon_vma is already freed material or not
initialized at all.

Hm...


[  616.317201] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  616.329964] PM: Preallocating image memory... 
[  616.586463] BUG: unable to handle kernel NULL pointer dereference at (null)
[  616.586851] IP: [<ffffffff810c614f>] page_referenced+0xee/0x1dc
[  616.587045] PGD 225dcf067 PUD 22627f067 PMD 0 
[  616.587126] Oops: 0000 [#1] PREEMPT SMP 
[  616.587126] last sysfs file: /sys/power/state
[  616.587126] CPU 1 
[  616.587126] Modules linked in: ...
From: Linus Torvalds
Date: Sunday, April 11, 2010 - 5:13 pm

Ok, that preempt-count thing is a real problem, but should be unrelated to 
your issues.

Anyway, so this all means that we definitely have lost sight of an 
'anon_vma', even if page->mapping still points to it, and even though the 
page is still mapped.

I'll see if I can come up with a patch to do the same kind of validation 

Sadly, you cannot tell by the pointer. A stale pointer still is a 
perfectly fine kernel pointer, it's just that we've long since released 
the anon_vma it used to point to, and now it points to some random other 

Yes, it's pretty certain it is long free'd, and re-allocated to something 
else.

		Linus
--

From: Linus Torvalds
Date: Sunday, April 11, 2010 - 6:04 pm

Ok, this may or may not work. It hasn't triggered for me, which may be 
because it's broken, but maybe it's because I'm not doing whatever it is 
you are doing to break our VM.

It checks each anonymous page at unmap time against the vma it gets 
unmapped from. It depends on the previous vma_verify debugging patch, and 
it would be interesting to hear whether this patch causes any new warnngs 
for you..

If the warnings do happen, they are not going to be printing out any 
hugely informative data apart from the fact that the bad case happened at 
all. But If they do trigger, I can try to improve on them - it's just not 
worth trying to make them any more interesting if they never trigger.

		Linus

---
 mm/memory.c |   21 +++++++++++++++++++++
 mm/mmap.c   |    2 +-
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..5d2df59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -890,6 +890,25 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return ret;
 }
 
+extern void verify_vma(struct vm_area_struct *);
+
+static void verify_anon_page(struct vm_area_struct *vma, struct page *page)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma *need_anon_vma = page_anon_vma(page);
+	struct anon_vma_chain *avc;
+
+	verify_vma(vma);
+	if (WARN_ONCE(!anon_vma, "anonymous page in vma without anon_vma"))
+		return;
+	list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) {
+		WARN_ONCE(avc->vma != vma, "anon_vma_chain vma entry doesn't match");
+		if (avc->anon_vma == need_anon_vma)
+			return;
+	}
+	WARN_ONCE(1, "page->mapping does not exist in vma chain");
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -940,6 +959,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+			if ...
From: Borislav Petkov
Date: Monday, April 12, 2010 - 12:20 am

From: Linus Torvalds <torvalds@linux-foundation.org>

Haa, I think you're gonna want to improve them :)

	WARN_ONCE(1, "page->mapping does not exist in vma chain");

triggered on the first resume showing a rather messy 4 WARN_ONCEs. Had I
more cores, there maybe would've been more of them :) Maybe need locking
if clean output is of interest (see below).

So, anyway, if I can read this correctly, there is a page->mapping
anon_vma which is _not_ in the anon_vmas chain of the vma
(avc->same_vma).

And the spot we oops on is in page_referenced_anon():

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {

which is actually where we iterate over all vmas associated with this
anon_vma.

So if that previous anon_vma pointed to by the page_mapping has been
falsely unlinked at some point, no wonder we boom on that later.

By the way, I completely understand when you say that your head hurts
from looking at this :).


[  486.580872] Restarting tasks ... done.
[  494.167242] [drm] Resetting GPU
[  495.422354] ------------[ cut here ]------------
[  495.422407] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[  495.422442] Hardware name: System Product Name
[  495.422474] page->mapping does not exist in vma chain
[  495.422504] Modules linked in:
[  495.422545] ------------[ cut here ]------------
[  495.422555] ------------[ cut here ]------------
[  495.422565]  powernow_k8
[  495.422583] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[  495.422591]  cpufreq_ondemand
[  495.422597] Hardware name: System Product Name
[  495.422602] page->mapping does not exist in vma chain cpufreq_powersave
[  495.422612] Modules linked in: cpufreq_userspace powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table freq_table cpufreq_conservative cpufreq_conservative binfmt_misc binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt kvm_amd dm_mod 8250_pnp kvm 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  495.422676]  ipv6Pid: 2919, comm: udevd Tainted: G     ...
From: Linus Torvalds
Date: Monday, April 12, 2010 - 9:02 am

Goodie.

I can't trigger this on my machine (not that I tried very hard - but I did 
do some swapping loads etc by limiting my memory to just 1GB etc). So I'm 
pretty sure my verification code is "correct", and verifies things that 
should be right.

And the fact that it triggers under the exact load that you use to then 
trigger the bug is a damn good thing.  That means that we are finally on 
the right track, and we have somethign that correlates well with the 

Yes, and that is supposed to be a no-no. The page is clearly associated 
with the vma in question (since we are unmapping it through that vma), but 
the vma list of 'anon_vma's doesn't actually have the one that 
'page->mapping' points to.

And that, in turn, means that we've lost sight of the 'page->mapping' 
anon_vma, and THAT in turn means that it could well have been free'd as 
being no longer referenced.

And if it was free'd, it could be re-allocated as something else (after 

Well, I have to say that I'm happy I've spent the time on it, because this 
way I got to learn all the new rules. It's just that I really wish I 
wouldn't have _had_ to.

Anyway, I'll have to think way more about this to see if I can come up 
with a debugging patch that shows more details about what actually caused 
this to happen in the first place. But we definitely have a smoking gun.

		Linus
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 9:26 am

I have a new theory. And this new theory is completely different from all 
the other things we've been looking at.

The new theory is really simple: 'page->mapping' has been re-set to the 
wrong mapping. 

Now, there is one case where we reset page->mapping _intentionally_, 
namely in the COW-breaking case of having the last user 
("page_move_anon_rmap"). And that looks fine, and happens under normal 
loads all the time. We _want_ to do it there.

But there is a _much_ more subtle case that involved swapping.

So guys, here's my fairly simple theory on what happens:

 - page gets allocated/mapped by process A. Let's call the anon_vma we 
   associate the page with 'A' to keep it easy to track.

 - Process A forks, creating process B. The anon_vma in B is 'B', and has 
   a chain that looks like 'B' -> 'A'. Everything is fine.

 - Swapping happens. The page (with mapping pointing to 'A') gets swapped 
   out (perhaps not to disk - it's enough to assume that it's just not 
   mapped any more, and lives entirely in the swap-cache)

 - Process B pages it in, which goes like this:

	do_swap_page ->
	  page = lookup_swap_cache(entry);
	 ...
	  set_pte_at(mm, address, page_table, pte);
	  page_add_anon_rmap(page, vma, address);

   And think about what happens here!

In particular, what happens is that this will now be the "first" mapping 
of that page, so page_add_anon_rmap() will do

	if (first)
		__page_set_anon_rmap(page, vma, address);

and notice what anon_vma it will use? It will use the anon_vma for process 
B!

So now page->mapping actually points to anon_vma 'B', not 'A' like it used 
to.

What happens then? Trivial: process 'A' also pages it in (nothing happens, 
it's not the first mapping), and then process 'B' execve's or exits or 
unmaps, making anon_vma B go away.

End result: process A has a page that points to anon_vma B, but anon_vma B 
does not exist any more. This can go on forever. Forget about RCU grace 
periods, forget about locking, ...
From: Rik van Riel
Date: Monday, April 12, 2010 - 11:40 am

That bug looks entirely possible.  Given that Borislav
has heavy swapping going on, it is quite possible that

The patch would help avoid the bug you described.

It does have the drawback of moving all the pages of
child processes back into the anon_vma of the parent
process after swapin, even if they are privately owned
pages by the child process.

I am guessing it may need a check to see whether the
page and swap slot are exclusively owned by the current
process.

Page or swap slot shared?      => oldest anon_vma
Page and swap slot exclusive?  => newest anon_vma

I suspect the easiest way to achieve this would be to
pass a flag in from do_swap_page, where we already
check this, a few lines above calling page_add_anon_rmap:

         if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
                 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
                 flags &= ~FAULT_FLAG_WRITE;
         }



--

From: Borislav Petkov
Date: Monday, April 12, 2010 - 12:00 pm

From: Rik van Riel <riel@redhat.com>

Yeah, about that. I dunno whether you guys saw that but the machine has
8Gb of RAM and shouldn't be swapping, AFAIK. The largest mem usage I
saw was 5Gb used, most of which pagecache. So I was kinda doubtful when
Linus came up with the swapping theory earlier. I'll pay attention to
the SwapCached in /proc/meminfo more to see whether we do any swapping.
It could be that there is a small amount which is swapped out for
whatever reason... Maybe that's the bug...

But I'll give the patch a run anyway in an hour or so anyway.

-- 
Regards/Gruss,
Boris.
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 12:17 pm

Thanks. I suspect you will find that even if there is no actual disk IO 
swapping going on during any of the normal loads, the shrink_all_memory() 
thing in your hibernation event will cause swap to happen. Or at least 
swap-cache entries to be done.

Oh, and I've decided that my rcu_read_lock() patch for the tlb_gather() 
thing for unmapping is bogus. Exactly because the critical issue isn't 
when the page is free'd (and page->mapping is cleared), but when the page 
is unmapped (and page_mapped() clears).

And that is done correctly even with the delayed frees in tlb_gather. So 
addign the rcu_read_lock/rcu_read_unlock around it all doesn't actually 
matter or help.

So the patches that I think fix real bugs are

 - the anon_vma_prepare() fix to only share anon_vma's if they are 
   singletons.

 - the vma_adjust() fix to copy the right anon_vma chains

 - the anon_vma_clone() fix to traverse the avc's in reverse order, so 
   that the resulting cloned chain is the same as the original chain

   You got this patch as part of the "verify_vma()" patch, but the only 
   part of that patch that matters is the one-liner that changes a 
   "for_each_list_entry" to use the "_reverse()" version..

 - and that last patch to pick the right anon_vma when mapping a page 
   (which could still be improved: the "insert new page" case does _not_ 
   have to take the oldest anon_vma, and Rik is correct that if we have an 
   exclusive swap cache entry we could also take the top one)

I think I'll re-post all four patches with real commit messages, to get 
ack's for them. I'd like to finally get the much delayed -rc4 out the 
door.

Oh, and if that "pick the right anon_vma" patch doesn't fix it, I suspect 
we'll have to revert the whole anon_vma changes for 2.6.34. It's getting 
pretty late in the -rc series to fix this bug. I'm _hoping_ that I really 
nailed it this time, and that we're ok, but if Borislav reports it still 
happening, and people not having any other ideas, I ...
From: Linus Torvalds
Date: Monday, April 12, 2010 - 1:23 pm

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 10 Apr 2010 15:22:30 -0700
Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains

When we move the boundaries between two vma's due to things like
mprotect, we need to make sure that the anon_vma of the pages that got
moved from one vma to another gets properly copied around.  And that was
not always the case, in this rather hard-to-follow code sequence.

Clarify the code, and fix it so that it copies the anon_vma from the
right source.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/mmap.c |   24 ++++++++----------------
 1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct address_space *mapping = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
-	struct anon_vma *anon_vma = NULL;
 	long adjust_next = 0;
 	int remove_next = 0;
 
 	if (next && !insert) {
+		struct vm_area_struct *exporter = NULL;
+
 		if (end >= next->vm_end) {
 			/*
 			 * vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 */
 again:			remove_next = 1 + (end > next->vm_end);
 			end = next->vm_end;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end > next->vm_start) {
 			/*
@@ -527,7 +528,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 5 shifting the boundary up.
 			 */
 			adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end < vma->vm_end) {
 			/*
@@ -536,28 +537,19 @@ again:			remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 4 shifting the boundary down.
 			 */
 			adjust_next = - ((vma->vm_end - end) >> ...
From: Linus Torvalds
Date: Monday, April 12, 2010 - 1:23 pm

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 11 Apr 2010 17:15:03 -0700
Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order

We want to walk the chain in reverse order when cloning it, so that the
order of the result chain will be the same as the order in the source
chain.  When we add entries to the chain, they go at the head of the
chain, so we want to add the source head last.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/rmap.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..ee97d38 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -182,7 +182,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 
-	list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) {
+	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		avc = anon_vma_chain_alloc();
 		if (!avc)
 			goto enomem_failure;
-- 
1.7.1.rc1.dirty

--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 1:23 pm

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 12 Apr 2010 12:44:29 -0700
Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

Otherwise we might be mapping in a page in a new mapping, but that page
(through the swapcache) would later be mapped into an old mapping too.
The page->mapping must be the case that works for everybody, not just
the mapping that happened to page it in first.

This can be improved in certain cases: if we know the page is private to
just this particular mapping (for example, it's a new page, or it is the
only swapcache entry), we could pick the top (most specific) anon_vma.

But that's a future optimization. Make it _work_ reliably first.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/rmap.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
 static void __page_set_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma_chain *avc;
+	struct anon_vma *anon_vma;
+
+	BUG_ON(!vma->anon_vma);
+
+	/*
+	 * We must use the _oldest_ possible anon_vma for the page mapping!
+	 *
+	 * So take the last AVC chain entry in the vma, which is the deepest
+	 * ancestor, and use the anon_vma from that.
+	 */
+	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+	anon_vma = avc->anon_vma;
 
-	BUG_ON(!anon_vma);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);
-- 
1.7.1.rc1.dirty

--

From: Rik van Riel
Date: Monday, April 12, 2010 - 2:03 pm

Agreed.  I'll send an incremental for that later, you
can judge whether or not it's something you'll want to

Reviewed-by: Rik van Riel <riel@redhat.com>
--

From: Johannes Weiner
Date: Monday, April 12, 2010 - 5:41 pm

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Would you mind pasting that nice description of the error case from your
other email into that changelog?  I skimmed over the description but when
I read this patch several hours later, I had to go back to that previous

I think the key here is not that it's the oldest (past) but also the one with
the longest extent (future), so that it's bound to stay until the last possible
mapping for this page vanishes.

Maybe it's just me, but I doubt the comment as it is would help me understand

Hannes
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 6:08 pm

It now looks like this..

		Linus
---
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 12 Apr 2010 12:44:29 -0700
Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

Otherwise we might be mapping in a page in a new mapping, but that page
(through the swapcache) would later be mapped into an old mapping too.
The page->mapping must be the case that works for everybody, not just
the mapping that happened to page it in first.

Here's the scenario:

 - page gets allocated/mapped by process A. Let's call the anon_vma we
   associate the page with 'A' to keep it easy to track.

 - Process A forks, creating process B. The anon_vma in B is 'B', and has
   a chain that looks like 'B' -> 'A'. Everything is fine.

 - Swapping happens. The page (with mapping pointing to 'A') gets swapped
   out (perhaps not to disk - it's enough to assume that it's just not
   mapped any more, and lives entirely in the swap-cache)

 - Process B pages it in, which goes like this:

        do_swap_page ->
          page = lookup_swap_cache(entry);
         ...
          set_pte_at(mm, address, page_table, pte);
          page_add_anon_rmap(page, vma, address);

   And think about what happens here!

   In particular, what happens is that this will now be the "first"
   mapping of that page, so page_add_anon_rmap() used to do

        if (first)
                __page_set_anon_rmap(page, vma, address);

   and notice what anon_vma it will use? It will use the anon_vma for
   process B!

   What happens then? Trivial: process 'A' also pages it in (nothing
   happens, it's not the first mapping), and then process 'B' execve's
   or exits or unmaps, making anon_vma B go away.

   End result: process A has a page that points to anon_vma B, but
   anon_vma B does not exist any more.  This can go on forever.  Forget
   about RCU grace periods, forget about locking, forget anything like
   that.  The bug is simply that ...
From: Minchan Kim
Date: Monday, April 12, 2010 - 9:23 pm

On Tue, Apr 13, 2010 at 10:08 AM, Linus Torvalds
Reviewed-by: Minchan Kim <minchan.kim>

It was great hunting and was a chance to learn many things
from LKML smart guys.
I feel again about OSS's power and great procedure of linux evolution

Thanks for everybody.

-- 
Kind regards,
Minchan Kim
--

From: Minchan Kim
Date: Monday, April 12, 2010 - 9:26 pm

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Sorry for mistake.
I was extremely excited. :)

-- 
Kind regards,
Minchan Kim
--

From: Rik van Riel
Date: Monday, April 12, 2010 - 1:57 pm

Reviewed-by: Rik van Riel <riel@redhat.com>

--

From: Johannes Weiner
Date: Monday, April 12, 2010 - 5:18 pm

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
--

From: Minchan Kim
Date: Monday, April 12, 2010 - 9:16 pm

On Tue, Apr 13, 2010 at 5:23 AM, Linus Torvalds
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim
--

From: Rik van Riel
Date: Monday, April 12, 2010 - 1:54 pm

Reviewed-by: Rik van Riel <riel@redhat.com>

--

From: Johannes Weiner
Date: Monday, April 12, 2010 - 4:59 pm

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--

From: Minchan Kim
Date: Monday, April 12, 2010 - 9:15 pm

On Tue, Apr 13, 2010 at 5:23 AM, Linus Torvalds
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 1:22 pm

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 10 Apr 2010 10:36:19 -0700
Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

This changes the anon_vma reuse case to require that we only reuse
simple anon_vma's - ie the case when the vma only has a single anon_vma
associated with it.

This means that a reuse of an anon_vma from an adjacent vma will always
guarantee that both vma's are associated not onyl with the same
anon_vma, they will also have the same anon_vma chain (of just a single
entry in this case).

And since anon_vma re-use was the only case where the same anon_vma
might be associated with different chains of anon_vma's, we now have the
case that every vma that shares the same vma will always also have the
same chain.  That makes it much easier to think about merging vma's that
share the same anon_vma's: you can always just drop the other anon_vma
chain in anon_vma_merge() since you know that they are always identical.

This also splits up the function to validate the anon_vma re-use, and
adds a lot of commentary about the possible races.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Ok, so I'm sending out this series of four patches, in the (perhaps 
futile) hope that they will finally fix the problem that Borislav has been 
so great at reporting.

I'd like to gather ack's, nak's and perhaps changelog improvement 
suggestions while doing this.

 mm/mmap.c |   86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..acb023e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 }
 
 /*
+ * Rough compatbility check to quickly see if it's even worth looking
+ * at sharing an anon_vma.
+ *
+ * They need to have the same vm_file, and the flags can only differ
+ * in things that mprotect may change.
+ *
+ * ...
From: Rik van Riel
Date: Monday, April 12, 2010 - 1:54 pm

Reviewed-by: Rik van Riel <riel@redhat.com>
--

From: Johannes Weiner
Date: Monday, April 12, 2010 - 4:54 pm

Hi Linus,



I like to think of 'incomplete' and 'complete' versions of the same
chain and that this new rule of yours simplifies things by limiting
reuse to the cases where the incomplete and the complete version

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

That said, I still don't like that the vma comparisons differ depending
on whether we reuse an anon_vma or merge vmas.  In my happy-place, the
same vma comparison function is predicate for both cases, so I actually
liked that aspect of the old code, but I also see that code reuse is a
PITA in that file...  Ah well, that can still be cleaned up later.
--

From: Minchan Kim
Date: Monday, April 12, 2010 - 9:04 pm

On Tue, Apr 13, 2010 at 5:22 AM, Linus Torvalds

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>



-- 
Kind regards,
Minchan Kim
--

From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 2:51 am

Maybe write that as:

static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
{
	if (a->vm_end != b->vm_start)
		return 0;

	if (!mpol_equal(vma_policy(a), vma_policy(b))
		return 0;

	if (a->vm_file != b->vm_file)
		return 0;

	if ((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC))
		return 0;

	if (a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT) != b->vm_pgoff)
		return 0;

	return 1;
}


--

From: Borislav Petkov
Date: Monday, April 12, 2010 - 2:50 pm

From: Linus Torvalds <torvalds@linux-foundation.org>


Linus, are you trying to give me a heart-attack? This sh*t just survived
20(!) hibernation runs without a problem (well, there is this nagging
/sysfs lockdep warning) but apart from that, it survived! I even did my
all time best when hitting on it. Normally, it used to crap up on the
6th cycle as latest. Now we're rock solid. And yes, there were something
like ~64Mb in the swap cache.

Also, I have your verification stuff in addition to the 4 patches you
sent before. Not a single WARN_ONCE got triggered. So I have a gut
feeling that it is fixed but you never know with these beasts.

As before, I'll rebuild and reapply everything in the morning and retest
just in case. And I guess I'll have to test all following -rc's so that
we can be absolutely sure.

So cheers!

-- 
Regards/Gruss,
    Boris.
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 3:11 pm

Hey, all my other theories made sense too.. They just didn't work.

But as Edison said: I didn't fail, I just found three other ways to not 

Ok. That does sound very positive. Of course, last time you sounded 
positive, I had an email from you half an hour later that said "oh no, it 
oopsed again". So I'll take it with a bit of salt, but on the whole I'll 
be optimistic about it.

				Linus
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 3:18 pm

Oh, btw, I like your email gateway. Only noticed now:

	mail.skyhub.de (SuperMail on ZX Spectrum 128k)

that's a tough little machine.

			Linus
--

From: Borislav Petkov
Date: Monday, April 12, 2010 - 3:29 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Yeah, and it can handle all that mail traffic just fine :)

-- 
Regards/Gruss,
    Boris.
--

From: Borislav Petkov
Date: Tuesday, April 13, 2010 - 2:38 am

From: Linus Torvalds <torvalds@linux-foundation.org>

Ok, just finished testing -rc4 - no problems so far. Let's just go out
on a limb here and say with a greater certainty that this really got
fixed but be smart about it and keep an eye open if it happens again -
you never know.

Where is the champagne?

-- 
Regards/Gruss,
    Boris.
--

From: Rik van Riel
Date: Wednesday, April 14, 2010 - 2:59 pm

The recent anon_vma fixes cause many anonymous pages to end up
in the parent process anon_vma, even when the page is exclusively
owned by the current process.

Adding exclusively owned anonymous pages to the top anon_vma
reduces rmap scanning overhead, especially in workloads with
forking servers.

This patch adds a parameter to __page_set_anon_rmap that can
be used to indicate whether or not the added page is exclusively
owned by the current process.

Pages added through page_add_new_anon_rmap are exclusively
owned by the current process, and can be added to the top
anon_vma.

Pages added through page_add_anon_rmap can be either shared
or exclusively owned, so we do the conservative thing and
add it to the oldest anon_vma.

A next step would be to add the exclusive parameter to
page_add_anon_rmap, to be used from functions where we do
know for sure whether a page is exclusively owned.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
Borislav, I audited the code before making this change, but would
still appreciate your testing of this patch :)

Linus, once this patch survives Borislav's testing, I'll start
looking at the next step. I'd like to do things one step at a
time so I won't cause another regression...

 mm/rmap.c |   30 +++++++++++++++++++-----------
 1 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 4bad326..12ac0f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -730,23 +730,31 @@ void page_move_anon_rmap(struct page *page,
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
+ * @exclusive:  the page is exclusively owned by the current process
  */
 static void __page_set_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, int exclusive)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
 
 	BUG_ON(!vma->anon_vma);
 
-	/*
-	 ...
From: Johannes Weiner
Date: Wednesday, April 14, 2010 - 4:20 pm

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
--

From: Borislav Petkov
Date: Thursday, April 15, 2010 - 1:34 am

From: Rik van Riel <riel@redhat.com>

Just did some light hammering and it looks ok so far. I'll keep watching
out for oopsies/issues.

Lightly-tested-by: Borislav Petkov <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.
--

From: Minchan Kim
Date: Thursday, April 15, 2010 - 9:02 am

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim
--

From: Linus Torvalds
Date: Thursday, April 15, 2010 - 1:01 pm

I really dislike your coding style.

If we do this conditionally, we're _much_ better off declaring the 
variables we only use inside that conditional block inside the block 
itself. And since we access "vma->anon_vma" in either case, just move that 
case outside the conditional statement, and avoid a pointless 
if/then/else.

IOW, something like this. Totally untested.

		Linus

---
 mm/rmap.c |   26 +++++++++++++++-----------
 1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 4bad326..78d4730 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -732,21 +732,25 @@ void page_move_anon_rmap(struct page *page,
  * @address:	the user virtual address mapped
  */
 static void __page_set_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, int exclusive)
 {
-	struct anon_vma_chain *avc;
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma = vma->anon_vma;
 
-	BUG_ON(!vma->anon_vma);
+	BUG_ON(!anon_vma);
 
 	/*
-	 * We must use the _oldest_ possible anon_vma for the page mapping!
+	 * If the page isn't exclusively mapped into this vma,
+	 * we must use the _oldest_ possible anon_vma for the
+	 * page mapping!
 	 *
-	 * So take the last AVC chain entry in the vma, which is the deepest
-	 * ancestor, and use the anon_vma from that.
+	 * So take the last AVC chain entry in the vma, which is
+	 * the deepest ancestor, and use the anon_vma from that.
 	 */
-	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
-	anon_vma = avc->anon_vma;
+	if (!exclusive) {
+		struct anon_vma_chain *avc;
+		avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+		anon_vma = avc->anon_vma;
+	}
 
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
@@ -802,7 +806,7 @@ void page_add_anon_rmap(struct page *page,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(address < ...
From: Felipe Balbi
Date: Thursday, April 15, 2010 - 11:09 pm

Hi,


while at that, would it make sense to first provide list_last_entry() 
since we already have list_first_entry() ??

totally unrelated to this patch, sorry

-- 
balbi
--

From: Linus Torvalds
Date: Friday, April 16, 2010 - 7:48 am

Yeah, it probably would make sense. Especially as doing a simple grep for 
'list_entry.*prev' does seem to imply that there might be quite a few 
places that would be able to use it. Although some of them do seem to be 
about finding the previous entry rather than the last in a list.

That said, doing the same grep for 'next' shows that a lot of places don't 
use the list_first_entry() that we _do_ have, so..

		Linus
--

From: Rik van Riel
Date: Sunday, April 11, 2010 - 12:49 pm

Looking around the code some more, zap_pte_range()
calls page_remove_rmap(), which leaves the
page->mapping in place and has this comment:

         /*
          * It would be tidy to reset the PageAnon mapping here,
          * but that might overwrite a racing page_add_anon_rmap
          * which increments mapcount after us but sets mapping
          * before us: so leave the reset to free_hot_cold_page,
          * and remember that it's only reliable while mapped.
          * Leaving it set also helps swapoff to reinstate ptes
          * faster for those pages still in swapcache.
          */

I wonder if we can clear page->mapping here, if
list_is_singular(anon_vma->head).  That way we
will not leave stale pointers behind.

Adding another VMA to the anon_vma can happen
at fork time - which will not happen simultaneously
with exit or munmap, because the mmap_sem is taken
for write during either code path.

Am I overlooking something here?
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 8:44 am

See my earlier email about this exact issue. It's well-known that there 
are stale page->mapping pointers. The "page_mapped()" check _should_ have 

What does that help? What if list _isn't_ singular?

		Linus
--

From: Rik van Riel
Date: Monday, April 12, 2010 - 8:51 am

Good point.  I wonder if we have some SMP reordering

Yeah, that was a bad idea.  Looking at the same code for
11 days straight seems to have put some knots in my brain :)
--

From: Rik van Riel
Date: Sunday, April 11, 2010 - 2:45 pm

Another thing I just thought of.

The anon_vma struct will not be reused for something completely
different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
is created with.

The anon_vma_chain structs are allocated from a slab without that
flag, so they can be reused for something else in the middle of
an RCU section.

Is that something worth fixing, or is this so subtle that we'd
rather not have the code rely on this kind of behaviour at all?
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 8:51 am

Rik, we _know_ it got re-used by something totally different. That's 
clearly the problem. The page->mapping pointer does _not_ point to an 
anon_vma any more. That's the problem here.

What we need to figure out is how we have a page on the LRU list that is 
still marked as 'mapped' that has that stale mapping pointer.

I can easily see how the stale mapping pointer happens for a non-mapped 
page. That part is trivial. Here's a simple case:

 - vmscan does that whole "isolate LRU pages", and one of them is a (at 
   that time mapped) anonymous page. It's now not on any LRU lists at all.

 - vmscan ends up waiting for pageout and/or writeback while holding that 
   list of pages.

 - in the meantime, the process that had the page exists or unmaps, 
   unmapping the page and freeing the vma and the anon_vma.

 - vmscan eventually gets to the page, and does that page_referenced() 
   dance. page->mapping points to something that is long long gone (as in 
   "IO access lifetimes", so we're talking something that has been freed 
   literally milliseconds ago, rather than any RCU delays)

So I can see the stale page->mapping pointer happening. That part is even 
trivial. What I don't see is how the page would be still marked 'mapped'. 
Everything that actually free's the vma/anon_vmas should also have 
unmapped the page before that - even if it didn't _free_ the page.

			Linus
--

From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 3:36 am

Sorry, Now I'm lost what discuss in this crazy long thread.
IIUC, If the page->mapping was freed millisecns ago, following (1)
check returen false and we never touch page->mapping literally.

Am I missing something?


===================================================================
struct anon_vma *page_lock_anon_vma(struct page *page)
{
        struct anon_vma *anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))       /* (1) here */ 
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        spin_lock(&anon_vma->lock);
        return anon_vma;
out:
        rcu_read_unlock();
        return NULL;
}
=================================================


And, I think your following patch seems incorrect.
The added page_mapped() is called after spinlock(anon_vma->lock),
it mean check-after-dereference. such check doesn't prevent invalid
pointer dereference, I think.

perhaps, I'm missing anything. I have to reread this thread at all from
first.

---
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
-	return anon_vma;
+
+	if (page_mapped(page))
+		return anon_vma;
+
+	spin_unlock(&anon_vma->lock);
 out:
 	rcu_read_unlock();
 	return NULL;







--

From: Rik van Riel
Date: Saturday, April 10, 2010 - 1:24 pm

It frees anon_vma_chain structures, but not actual anon_vmas.

Walking the anon_vma (from rmap) requires the anon_vma->lock,

A few lines up from that code, we have:

         if (vma->anon_vma && (insert || importer || start != 
vma->vm_start))
                 anon_vma = vma->anon_vma;

So anon_vma should always be vma->anon_vma.

If we have already imported an anon_vma, we will not
do so twice, because of the !importer->anon_vma check.


If we import a chain, from vma to importer, importer->anon_vma
will be equal to vma->anon_vma.


No argument there, vma_adjust is very hard to read and it took
me a few days to convince myself that my changes kept things
equivalent to how they were before.

--

From: Linus Torvalds
Date: Saturday, April 10, 2010 - 1:34 pm

Rik, I think you're ignoring the fact that the anon_vma_chain is also the 
implicit refcount.

So when you don't create the chains, you implicitly end up freeing the 
anon_vma too early. In fact, it might well happen at that 
'anon_vma_merge()': when it does the unlink_anon_vmas(), it may be 
unlinking the last remaining anon_vma ref, and then anon_vma_unlink 
_will_ in fact free the anon_vma.

Even though we have a 'vma->anon_vma' pointer that points to it - because 

None of that matters. If the dang thing got free'd, the lock isn't 

No. vma->anon_vma is NULL, so the above lines are total no-ops. We're 
trying to _fill_ it. But we're doing it wrong.

So we end up with:

	anon_vma = next->anon-vma
	importer = vma

and we do:

	if (anon_vma_clone(importer, vma)) {
		return -ENOMEM;
	}
	importer->anon_vma = anon_vma;

do you see?

The "anon_vma_clone(importer, vma)" does NOTHING, because it is cloning 
from the wrong source (from 'vma', rather than from 'next', so it leaves 
the vma chains empty.

And then, despite having empty chains, we do that

	importer->anon_vma = anon_vma;

which sets the anon_vma to the (non-NULL) next->anon_vma.

And then, a bit later, we'll do

	anon_vma_merge(vma, next);

which will happily notice that the anon_vma's of both vma and next match 
(because we just _set_ them to match), and then frees the ONLY REMAINING 
CHAIN - the one in next. The one we DID NOT CORRECTLY COPY, because we got 


The thing you seem to miss is that we aren't supposed to import the chain 

Stop worrying about 'vma'. Start worrying about 'next'.

			Linus
--

From: Rik van Riel
Date: Saturday, April 10, 2010 - 1:43 pm

Yeah, after reading through your patch it became obvious.
It's the code above this code that sets up the problem.

It's a small miracle it worked before...
--

From: Rik van Riel
Date: Saturday, April 10, 2010 - 1:32 pm

Your patch looks correct.  Gotta love how before,
"vma" could be either exporter or importer!

I'm guessing that it did not break before my
changes, because of plain old luck...

Acked-by: Rik van Riel <riel@redhat.com>
--

From: Rik van Riel
Date: Saturday, April 10, 2010 - 12:36 pm

Which is also where they are removed from the LRU.

Looks like we should move the anon_vma freeing from free_pgtables
over to remove_vma?

This code is just below the tlb_finish_mmu in exit_mmap:

         /*
          * Walk the list again, actually closing and freeing it,
          * with preemption enabled, without holding any MM locks.
          */
         while (vma)
                 vma = remove_vma(vma);

This comment in free_pgtables is a little suspect:

                 /*
                  * Hide vma from rmap and truncate_pagecache before freeing
                  * pgtables
                  */
                 unlink_anon_vmas(vma);
                 unlink_file_vma(vma);

After all, the rmap code will quickly notice that there either are
no page tables, or the page tables no longer have anything in them.

It looks like we may have had this use-after-free bug in the VM for
quite a while...  I am not entirely sure what exposed the bug, but
I can see how it works.

--

From: Peter Zijlstra
Date: Monday, April 12, 2010 - 7:40 am

Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the preempt-disable
== RCU read lock assumption does hold.

But even with your patch it doesn't close all holes because while
zap_pte_range() can remove the last mapcount of the page, the
page_remove_tlb() et al. don't need to be the last use count of the
page.

Concurrent reclaim/gup/whatever could still have a count out on the page
delaying the actual free beyond the tlb gather RCU section.

So the reason page->mapping isn't cleared in page_remove_rmap() isn't
detailed beyond a (possible) race with page_add_anon_rmap() (which I
guess would be reclaim trying to unmap the page and a fault re-instating
it).

This also complicates the whole page_lock_anon_vma() thing, so it would
be nice to be able to remove this race and clear page->mapping in
page_remove_rmap().
--

From: Minchan Kim
Date: Monday, April 12, 2010 - 8:17 am

anon_vma lock is just valid in case of page_mapped.
if reclaim/gup/whatever want to use anon_vma, it should check with page_mapped.
And last put_page doesn't touch anon_vma for freeing the page so I

BTW, I totally agree with you.
Now anon_vma is very complicated.
SLAB_DESTROY_BY_RCU, vma merge, when page->mapping is cleared,
anon_vma_chain and so on.. :(


-- 
Kind regards,
Minchan Kim
--

From: Peter Zijlstra
Date: Monday, April 12, 2010 - 8:33 am

Hmm, I think you're right. The race I was thinking of makes the
page_lock_anon_vma() RCU section overlap with that of the mmu_gather,
which ensures the thing is long enough, or hits the !_mapcount case.

I'm not sure there are other page->mapping users that are interesting.

--

From: Rik van Riel
Date: Monday, April 12, 2010 - 8:19 am

For anonymous pages, I don't see where the race comes from.

Both do_swap_page and the reclaim code hold the page lock
across the entire operation, so they are already excluding
each other.

Hugh, do you remember what the race between page_remove_rmap
and page_add_anon_rmap is/was all about?

I don't see a race in the current code...
--

From: Peter Zijlstra
Date: Monday, April 12, 2010 - 9:01 am

Something like the below would be nice if possible.


---
 mm/rmap.c |   44 +++++++++++++++++++++++++++++++-------------
 1 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..241f75d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -286,7 +286,22 @@ void __init anon_vma_init(void)
 
 /*
  * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * tricky: 
+ *
+ *  page_add_anon_vma()
+ *   atomic_add_negative(page->_mapcount);
+ *   page->mapping = anon_vma;
+ *
+ *
+ *  page_remove_rmap()
+ *   atomic_add_negative();
+ *   page->mapping = anon_vma;
+ *
+ * So we have to first read page->mapping(), and then verify
+ * _mapcount, and make sure we order them correctly.
+ *
+ * We take anon_vma->lock in between so that if we see the anon_vma
+ * with a mapcount we know it won't go away on us.
  */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {
@@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
-	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		goto out;
-	if (!page_mapped(page))
-		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
+
+	/*
+	 * Order the reading of page->mapping and page->_mapcount against the
+	 * mb() implied by the atomic_add_negative() in page_remove_rmap().
+	 */
+	smp_rmb();
+	if (!page_mapped(page)) {
+		spin_unlock(&anon_vma->lock);
+		anon_vma = NULL;
+		goto out;
+	}
+
 	return anon_vma;
 out:
 	rcu_read_unlock();
@@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
 	}
-	/*
-	 * It would be tidy to reset the ...
From: Rik van Riel
Date: Monday, April 12, 2010 - 9:06 am

That would be a bug for file pages :)

I could see how it could work for anonymous memory, though.
--

From: Linus Torvalds
Date: Monday, April 12, 2010 - 9:46 am

I think it's scary for anonymous pages too. The _common_ case of 
page_remove_rmap() is from unmap/exit, which holds no locks on the page 
what-so-ever. So assuming the page could be reachable some other way (swap 
cache etc), I think the above is pretty scary. 

Also do note that the bug we've been chasing has _always_ had that test 
for "page_mapped(page)". See my other email about why the unmapped case 
isn't even interesting, because it's so easy to see how page->mapping can 
be stale for unmapped pages.

It's the _mapped_ case that is interesting, not the unmapped one. So 
setting page->mapping to NULL when unmapping is perhaps a nice consistency 
issue ("never have stale pointers"), but it's missing the fact that it's 
not really the case we care about.

			Linus
--

From: Peter Zijlstra
Date: Monday, April 12, 2010 - 11:40 am

Yes, I don't think this is the problem that has been plaguing us for
over a week now.

But while staring at that code it did get me worried that the current
code (page_lock_anon_vma):

- is missing the smp_read_barrier_depends() after the ACCESS_ONCE
- isn't properly ordered wrt page->mapping and page->_mapcount.
- doesn't appear to guarantee much at all when returning an anon_vma
  since it locks after checking page->_mapcount so:
    * it can return !NULL for an unmapped page (your patch cures that)
    * it can return !NULL but for a different anon_vma
      (my earlier patch checking page_rmapping() after the spin_lock
       cures that, but doesn't cure the above):

        [ highly unlikely but not impossible race ]

        page_referenced(page_A)

			try_to_unmap(page_A)

					unrelated fault

							fault page_A

	CPU0		CPU1		CPU2		CPU3

	rcu_read_lock()
	anon_vma = page->mapping;
	if (!anon_vma & ANON_BIT)
	  goto out
	if (!page_mapped(page))
	  goto out

			page_remove_rmap()
			...
			anon_vma_free()-----\
					    v
					anon_vma_alloc()
					
							anon_vma_alloc()
							page_add_anon_rmap()
					   ^
	spin_lock(anon_vma->lock)----------/


    Now I don't think the above can happen due to how our slab
    allocators work, they won't share a slab page between cpus like
    that, but once we make the whole thing preemptible this race
    becomes a lot more likely.


So a page_lock_anon_vma(), that looks a little like the below should
(I think) cure all our problems with it.


struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
again:
	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);

	/*
	 * The RCU read lock ensures we can safely dereference anon_vma
	 * since it ensures the backing slab ...
From: Peter Zijlstra
Date: Monday, April 12, 2010 - 12:30 pm

On Mon, 2010-04-12 at 20:40 +0200, Peter Zijlstra wrote:


page_remove_rmap()
anon_vma_unlink()
  anon_vma_free()


page_add_anon_rmap(), so that the page_mapped() test below would be


--

From: Peter Zijlstra
Date: Monday, April 12, 2010 - 12:44 pm

OK, so non of the users of page_lock_anon_vma() with exception of the
memory-failure.c one could really care. And all of them seem to be safe
enough wrt dealing with a dead one.

So unless people care, I'm going to not spend more time on trying to
make page_lock_anon_vma() behave.

Instead I'll try and see wth it is that migrate.c and rmap_walk_anon are
doing.

--

From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 3:53 am

Does anon->lock dereference is guranteed if page->_mapcount==-1?
It can be freed miliseconds ago, rcu_read_lock() doesn't provide such
gurantee.

perhaps, I'm missing your point.


--

From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 4:30 am

No you're right, I got my head hopelessly twisted up trying to make
page_lock_anon_vma() do something reliable, but there really isn't much
that can be done.

Luckily most users (with exception of the memory-failure.c one) don't
really care and all take steps to verify the page is indeed in any of
the vmas it might find.

So I've given up on this and will only submit a patch like the below,
which hopefully does still make sense...

I do think there's a missing barrier in there as well, but I've made
enough of a fool of myself.

[ with the preemptible mmu_gather patches I introduce a refcount to
  the anon_vma, and then with atomic_inc_not_zero() we can add a
  guarantee that the returned anon_vma is alive ]

---
 mm/rmap.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..49a2533 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -285,8 +285,22 @@ void __init anon_vma_init(void)
 }
 
 /*
- * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * Getting a lock on a stable anon_vma from a page off the LRU is tricky!
+ *
+ * Since there is no serialization what so ever against page_remove_rmap()
+ * the best this function can do is return a locked anon_vma that might
+ * have been relevant to this page.
+ *
+ * The page might have been remapped to a different anon_vma or the anon_vma
+ * returned may already be freed (and even reused).
+ *
+ * All users of this function must be very careful when walking the anon_vma
+ * chain and verify that the page in question is indeed mapped in it
+ * [ something equivalent to page_mapped_in_vma() ].
+ *
+ * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap()
+ * that the anon_vma pointer from page->mapping is valid if there is a
+ * mapcount, we can dereference the anon_vma after observing those.
  */
 struct anon_vma *page_lock_anon_vma(struct page ...
From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 5:00 am

Indeed. refcount is best way. anon_vma DESTROY_BY_RCU stuff seems
overengineering, I think. this is fastest, but anon_vma allocation is not
(and was not) fork/exit bottleneck point. So, I guess most simply way is
best.


Also following patch looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Thanks for that. I've thought this is really necessary. but my (very) poor



--

From: Peter Zijlstra
Date: Wednesday, April 14, 2010 - 7:27 am

Well, that refcount stuff still relies on DESTROY_BY_RCU :-)

Anyway, it also looks like a lot of races are avoided by ordering the
rmap_add/remove calls wrt to adding/removing the page to/from the LRU.

Rmap calls come from LRU pages, and it looks like rmap state is only
changed for pages that are not on the LRU.

I still have to go through all that code again to make sure, but I
couldn't find a race between page_add_anon_rmap() and
page_lock_anon_vma() due to that.

If there is, we need to look at page_mapped() before page->mapping
because page_add_anon_rmap() first increments the mapcount and only then
adjusts the mapping, so the existing order in page_anon_lock_vma() can
end up dereferencing a long dead anon_vma.






--

From: Borislav Petkov
Date: Saturday, April 10, 2010 - 10:07 am

From: Borislav Petkov <bp@alien8.de>

Just did a bunch of hibernation runs - 2.6.33.2 feels rock solid - no
issues whatsoever. So in the face of such results a hw failure is kinda
unprobable... Hmm...

-- 
Regards/Gruss,
Boris.
--

From: Linus Torvalds
Date: Saturday, April 10, 2010 - 9:41 am

Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated 
checks for prev/next compatibility that I just made even more complex.

So I'm actually inclined to want to write my simple two-liner fix as a 
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and 
some of it is just because one routine got split up into three), and I 
think it makes the result a lot more readable.

It also splits off the decision of whether we can reuse an non_vma from 
the decision of whether we can merge the vma's - the two are kind of 
related, but they are not really the same, and they have different issues. 
I think it's good to try to keep separate issues separate.

This is UNTESTED! It's meant to be an "obvious cleanup" with no real 
semantic difference, but if I did something wrong it won't work. Also note 
the comment about the lack of locking between two adjacent anon_vma's 
taking a page fault at the same time: the ACCESS_ONCE() is unlikely to 
ever matter (anon_vma's are stable once they are set, so it's really just 
that you could first load a NULL, and then if you re-load the value you 
might get a non-NULL thing).

Also note that when checking whether the anon_vma is a singleton, we don't 
hold any lock that protects the list we are checking. But 
"list_is_singular()" is safe and won't oops even if the pointers in the 
list are crap, because it only _compares_ the prev/next pointers, it 
doesn't dereference them.

In short, what I'm saying is that there is a pretty subtle race in the 
very very unlikely case that two anon_vma's get prepared concurrently, but 
from a correctness standpoint it doesn't matter. We might sometimes - once 
in a blue moon - reject an anon_vma that could in theory have been merged, 
but that won't hurt.

Comments? Rik, Johannes?

			Linus

---
 mm/mmap.c |   86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 62 insertions(+), 24 ...
From: Johannes Weiner
Date: Saturday, April 10, 2010 - 3:49 pm

I am all in favor of only doing singletons, so that we don't have to
inflict my psycho-active merging routine on civilians.

I am not convinced it's a good idea to share an anon_vma, however, when
we know beforehand the vmas will never merge, because it will increase
rmap overhead of walking unrelated vmas for every page in every vma that
is part of the reused anon_vma.

So we usually take that as a trade-off when there is a chance the vmas
could still reunite and we don't want to spoil that through differing
anon_vmas.

But if it's already clear that they won't, it appears to me it would
be more efficient in the long run to just allocate our own anon_vma.

Did you have something in mind that I missed?

	Hannes
--

From: Linus Torvalds
Date: Saturday, April 10, 2010 - 4:31 pm

Mostly that the corner cases will never matter, and I'd prefer to keep the 
code simpler than to care deeply.

For example, the only case you'd see vm_ops->close() is for special device 
mappings. It's true that they cannot have their vma's merged, but it's 
also true that they (a) will seldom have anon_vma's anyway and (b) would 
never get mapped very many times so that anon_vma merging would be an 
issue.

In other words, it's a "don't care" situation, where to keep the code 
simpler we just document that we don't care.

		Linus
--

From: Minchan Kim
Date: Wednesday, April 7, 2010 - 8:55 am

Hi, Rik.


At last, you might find culprit.

AFAIU your descriptoin, don't we have to care vma_merge case, too?
Sorry if it is dumb question.

-- 
Kind regards,
Minchan Kim
--

From: Borislav Petkov
Date: Wednesday, April 7, 2010 - 12:29 am

From: Linus Torvalds <torvalds@linux-foundation.org>

No, maybe I expressed myself wrong (it was late an' all) - I was
basically trying to confirm your assessment that anon_vma->head.next
is crap but the code had changed since I had added the debugging 'if
(!anon_vma->head.next)' and that was the value that was already in %r13
before iterating over the list chain.

Yeah, just a minor nitpick and not that it matters. Nevermind though,
we're on the same page.

-- 
Regards/Gruss,
Boris.
--

From: Paulo Marques
Date: Wednesday, April 7, 2010 - 7:05 am

Just a wild shot in the dark: it can be a couple of gray pixels with
intensity 0x2e at some 32 bits per pixel mode. I say this because of the
zero bytes there and someone mentioning seeing the problem when starting X.

-- 
Paulo Marques - www.grupopie.com

"Don't worry, you'll be fine; I saw it work in a cartoon once..."
--

From: Borislav Petkov
Date: Wednesday, April 7, 2010 - 7:13 am

From: Paulo Marques <pmarques@grupopie.com>

I don't think those are related: the problem when X was starting happens
with Rik' newest patch and the funny %r13 value happened after enabling
SLUB debugging last night.

Thanks.

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
--

From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 4:37 pm

Oh, and since the debugging code never triggered ('head.next' wasn't 
actually NULL), you never got here, but the 'page' you passed in to 
object_error() should be the page of the slab allocation, not the page 
associated with the anon_vma.

So it should be something like "virt_to_head_page(anon_vma)" that you pass 
in to object_err().

Not that it matters. I assume it is the fact that SLAB debugging is on 
that actually turns the NULL into a non-NULL thing. Poisoning is not 
active for SLUb's with constructors or RCU-freeing, but things like 
redzoning still are. So enabling SLUB debugging will change the offsets 
within the pages of all the SLUB allocations.  I wonder if that's just 
what caused it to now have that 0x002e2e2e002e2e2e instead of NULL.

		Linus
--

From: Rik van Riel
Date: Tuesday, April 6, 2010 - 4:22 pm

It gets more fun.  It looks like the anon_vma is only
allocated through anon_vma_alloc() and only handled
by the functions in rmap.c

By themselves, all of those functions look alright.

However, I think I may have found a possible bug in
the interplay between anon_vma_prepare() and vma_adjust(),
across several mprotect invocations.

Let me explain what I think may be going on in small
steps, since it is quite subtle (assuming I am right).

1) a process forks, creating a second "layer" of
    anon_vma objects for the VMAs that have anon pages

2) a new VMA is created adjacant to an existing one,
    with different permissions

3) anon_vma_prepare is called on the new VMA, this
    only links the "top" anon_vma to the new VMA, since
    that is the anon_vma where all new pages get
    instantiated anyway   (this would be part of the bug)

4) mprotect changes the permission of one of the VMAs,
    causing the old and the new VMAs to get merged

5) vma_adjust calls anon_vma_merge, causing the anon_vma
    chain of one of the VMAs to get nuked - with bad luck,
    this is the original one, leaving just the new anon_vma
    attached to the VMA

6) if the parent process quits, the old anon_vma structs
    get freed

7) meanwhile, we may still have some anonymous pages
    stick around in memory that have their page->mapping
    point to a freed anon_vma struct

Does this look like it could happen?

If so, I'll cook up a patch to change anon_vma_prepare
and find_mergeable_anon_vma to attach the whole chain
of anon_vmas to the new VMA, using anon_vma_clone().
--

From: Linus Torvalds
Date: Tuesday, April 6, 2010 - 5:10 pm

Sounds at least possible. Way more likely than any of the "trivially 
obvious" code being buggy, or the SLUB layer suddenly having a serious bug 
that only the new user could trigger.

That said, the code that _really_ confuses me is the stuff that uses 
"anon_vma_clone()". Could you please also explain the code flow of 
vma_adjust() to mere mortals, please?

I suspect Borislav is sleeping. But at least we have a patch for him to 
test when he wakes up ;)

		Linus
--

From: Rik van Riel
Date: Tuesday, April 6, 2010 - 6:18 pm

That's easier said than done.  I spent 3 days with pen and paper,
going over that code before I made the anon_vma changes, first
verifying that the code is indeed correct and then figuring out
how I could make the anon_vma changes safely.

I am not happy with the complexity of the code around vma_adjust,
but could not find a way to simplify it and still keep merging
VMAs the way we do.

My largest change to vma_adjust was moving some code closer to
the beginning of the function, so I could bail out if the

I am looking forward to the test results.

--

From: Borislav Petkov
Date: Wednesday, April 7, 2010 - 12:22 am

From: Rik van Riel <riel@redhat.com>
Date: Tue, Apr 06, 2010 at 09:18:28PM -0400

Hi Rik,


This happens when starting X, I haven't even started hibernating.

  [By the way, further testing will have to wait till tonight since I
   have a job, you know :) ]

Also, mm/rmap.c:745 is

	BUG_ON(!anon_vma);

in __page_set_anon_rmap().

---
[   43.142371] ------------[ cut here ]------------
[   43.142411] kernel BUG at mm/rmap.c:745!
[   43.142436] invalid opcode: 0000 [#1] PREEMPT SMP 
[   43.142514] last sysfs file: /sys/devices/virtual/vtconsole/vtcon0/uevent
[   43.142537] CPU 0 
[   43.142559] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr
[   43.142997] 
[   43.143012] Pid: 1940, comm: console-kit-dae Not tainted 2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name
[   43.143012] RIP: 0010:[<ffffffff810c08e7>]  [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[   43.143012] RSP: 0000:ffff88022c019da8  EFLAGS: 00010246
[   43.143012] RAX: 0000000000000000 RBX: ffffea000774ff78 RCX: 000000002ce900f4
[   43.143012] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740
[   43.143012] RBP: ffff88022c019dc8 R08: 00007f29e3cfd928 R09: 000000000062c318
[   43.143012] R10: 0000000000000000 R11: 0000000000000002 R12: ffff88022bbad960
[   43.143012] R13: 00007f29e3cfd928 R14: 00007f29e3cfd928 R15: 80000002216d9067
[   43.143012] FS:  00007f29e3d0f790(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[   43.143012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   43.143012] CR2: 00007f29e3cfd928 CR3: 000000022dfd3000 CR4: 00000000000006f0
[   43.143012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   43.143012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   43.143012] Process console-kit-dae (pid: 1940, threadinfo ...
From: Pekka Enberg
Date: Wednesday, April 7, 2010 - 3:09 am

Hi Linus,

On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds

I haven't followed the discussion at all but if someone wants to
investigate that angle more, the most likely suspect are the recent
per-cpu changes. That said, I'd expect the problem to be more
widespread if SLUB is to blame here.

                        Pekka
--

From: KOSAKI Motohiro
Date: Wednesday, April 7, 2010 - 3:12 am

Nope. We don't doubt SLUB nor per-cpu anymore. Rik found the bug in his patch.

thanks.



--

From: Peter Zijlstra
Date: Wednesday, April 7, 2010 - 1:41 am

Right, so anon_vma uses SLAB_DESTROY_BY_RCU and as the huge comment in
rmap.c explains, that doesn't mean the objects themself get RCU grace
period delays in freeing, only the SLAB that backs these objects does.

So the moment you do kmem_cache_free() on the anon_vma it can be re-used
for another allocation. The only guarantee given by RCU is that the
backing storage doesn't go away and hence you can 'safely' deref
pointers, you still very much have to revalidate you got the object you
were looking for.
--

From: Peter Zijlstra
Date: Wednesday, April 7, 2010 - 1:36 am

I think it does, the anon_vma thing has an RCU destroyed slab, but that
doesn't mean the anon_vma object itself is rcu delayed. The moment we
free it it can be re-used. So the above use after free is a bug.
--

From: Johannes Weiner
Date: Wednesday, April 7, 2010 - 2:16 am

It frees avc->anon_vma, not avc.  So the sequence is

	free(avc->anon_vma) in anon_vma_unlink()
	list_del(&avc->same_vma) in unlink_anon_vmas()

It's not a use-after free.  A problem would be if somebody should find the
avc through this list (it is the vma->anon_vma_chain list) when its anon_vma
pointer is invalid.

I don't think this can happen, however.  Both the unlinking and the looking
at the list happen under vma->vm_mm's mmap_sem held for writing.
--

From: Peter Zijlstra
Date: Wednesday, April 7, 2010 - 2:37 am

What I was worried about was it freeing anon_vma and then still having
the avc on list. But I guess that cannot happen because it only frees if
its actually empty.


--

From: Rik van Riel
Date: Wednesday, April 7, 2010 - 7:12 am

Peter, the avc is an anon_vma_chain, which is a different
object than the anon_vma itself.  There is no use after free
of an anon_vma object in unlink_anon_vmas + anon_vma_unlink.
--

From: Linus Torvalds
Date: Wednesday, April 7, 2010 - 8:46 am

Well, it's not really a "use after free" - it's just that a stale pointer 
still exists in a live data structure that is linked into the list. I 
don't think there is a real bug there, simply because I don't think 
anybody will be accessing that list (we should hopefully have all the 
sufficient mutual exclusion in place).

So I just think it is bad form to potentially free something before we get 
rid of all pointers to it. So to me it's a cleanliness issue: good code 
shouldn't do things like that, and it would be much cleaner to remove the 
AVC entry that has a pointer to the anon_vma _before_ we might be freeing 
the anon_vma.

Maybe I'm just anal.

		Linus
--

From: Linus Torvalds
Subject:
Date: Tuesday, April 6, 2010 - 9:32 am

No, that can only happen if somebody has done "anon_vma_free()" on it. And 
nobody does that if the anonvma still has a non-empty'&anon_vma->head'.

So as long as the anon_vma has a anon_vma_chain entry associated with it 
(or a ksm refcount, but that's a separate issue), it's not going to be 
re-allocated for any other use, because it's not going to be free'd.
	
		Linus
--

From: Minchan Kim
Subject:
Date: Tuesday, April 6, 2010 - 9:54 am

That's what I am missing.
Thanks, Linus. 

I will think over the problem. :)

-- 
Kind regards,
Minchan Kim


--

From: Peter Zijlstra
Date: Wednesday, April 7, 2010 - 1:37 am

When doing the whole make i_mmap_lock/anon_vma->lock a mutex thing last
week I ran into the same issue and its on my todo list to find out wth
is happening there. 

So yes I think we should move that validation check inside
page_lock_anon_vma().

I'll cook up a patch once I'm done staring at the various funny arch
mmu_gather implementations.
--

From: Borislav Petkov
Subject:
Date: Tuesday, April 6, 2010 - 10:05 am

From: Rik van Riel <riel@redhat.com>


Sure, building ontop of v2.6.34-rc3-288-gab195c5.

Will try to trigger it but let me remind you that it will take a while
since it doesn't happen everytime I suspend.

Any other printks or debug output which might be helpful to slap at the

-- 
Regards/Gruss,
    Boris.
Previous thread: Re: kernel decompressor interface by H. Peter Anvin on Tuesday, March 30, 2010 - 10:45 am. (1 message)

Next thread: [PATCH] FS-Cache: Order the debugfs stats correctly by David Howells on Tuesday, March 30, 2010 - 11:02 am. (1 message)