Re: [PATCH -mmotm 05/30] mm: sl[au]b: add knowledge of reserve pages

Previous thread: [PATCH RESEND] tools/usb: Add Makefile by Davidlohr Bueso on Tuesday, July 13, 2010 - 2:50 am. (3 messages)

Next thread: [PATCH -tip] rbtree: avoid func twice when node doesn't have a right sibling by Xiaotian Feng on Tuesday, July 13, 2010 - 3:30 am. (1 message)
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:16 am

Hi,

Here's the latest version of swap over NFS series since -v20 last October. We decide to push
this feature as it is useful for NAS or virt environment.

The patches are against the mmotm-2010-07-01. We can split the patchset into following parts:

Patch 1 - 12: provides a generic reserve framework. This framework
could also be used to get rid of some of the __GFP_NOFAIL users.

Patch 13 - 15: Provide some generic network infrastructure needed later on.

Patch 16 - 21: reserve a little pool to act as a receive buffer, this allows us to
inspect packets before tossing them.

Patch 22 - 23: Generic vm infrastructure to handle swapping to a filesystem instead of a block
device.

Patch 24 - 27: convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

Patch 28 - 30: minor bug fixing with latest -mmotm.

[some history]
v19: http://lwn.net/Articles/301915/
v20: http://lwn.net/Articles/355350/

Changes since v20:
	- rebased to mmotm-2010-07-01
	- dropped the null pointer deref patch for the root cause is wrong SWP_FILE enum
	- some minor build fixes
	- fix a null pointer deref with mmotm-2010-07-01
	- fix a bug when swap with multi files on the same nfs server

Regards
Xiaotian
--

From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:17 am

From 8a554f86ab2fd4d8a1daecaef4cc5bb3901c4423 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Mon, 12 Jul 2010 17:59:52 +0800
Subject: [PATCH 03/30] mm: expose gfp_to_alloc_flags()

Expose the gfp to alloc_flags mapping, so we can use it in other parts
of the vm.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 mm/internal.h   |   15 +++++++++++++++
 mm/page_alloc.c |   16 +---------------
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 6a697bb..3e2cc3a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -185,6 +185,21 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+/* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_MIN		WMARK_MIN
+#define ALLOC_WMARK_LOW		WMARK_LOW
+#define ALLOC_WMARK_HIGH	WMARK_HIGH
+#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
+
+/* Mask to get the watermark bits */
+#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
+
+#define ALLOC_HARDER		0x10 /* try to alloc harder */
+#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebf0af7..72a6be5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1345,19 +1345,6 @@ failed:
 	return NULL;
 }
 
-/* The ALLOC_WMARK bits are used as an index to zone->watermark */
-#define ALLOC_WMARK_MIN		WMARK_MIN
-#define ALLOC_WMARK_LOW		WMARK_LOW
-#define ALLOC_WMARK_HIGH	WMARK_HIGH
-#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
-
-/* Mask to get the watermark bits */
-#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
-
-#define ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:17 am

From 14f7823986429b8374d18e3f648cd84575296e03 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Mon, 12 Jul 2010 18:00:37 +0800
Subject: [PATCH 04/30] mm: tag reseve pages

Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.

Since low-memory situations are transient, and unrelated the the actual
page (any page can be on the freelist when we run low), don't mark the
page in any permanent way - just pass along the information to the
allocatee.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/mm_types.h |    1 +
 mm/page_alloc.c          |    4 +++-
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a95a202 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -71,6 +71,7 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
+		int reserve;		/* page_alloc: page is a reserve page */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 72a6be5..b9989c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1656,8 +1656,10 @@ zonelist_scan:
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
 						gfp_mask, migratetype);
-		if (page)
+		if (page) {
+			page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 			break;
+		}
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
-- 
1.7.1.1

--

From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:17 am

From bc55bacd6bcc0f8a69c0d7e0d554c78237233e07 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Mon, 12 Jul 2010 17:58:34 +0800
Subject: [PATCH 01/30] mm: serialize access to min_free_kbytes

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_wmarks(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 50a6d10..ebf0af7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -165,6 +165,7 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 static unsigned long __meminitdata nr_kernel_pages;
@@ -4839,13 +4840,13 @@ static void setup_per_zone_lowmem_reserve(void)
 }
 
 /**
- * setup_per_zone_wmarks - called when min_free_kbytes changes
+ * __setup_per_zone_wmarks - called when min_free_kbytes changes
  * or when memory is hot-{added|removed}
  *
  * Ensures that the watermark[min,low,high] values for each zone are set
  * correctly with respect to min_free_kbytes.
  */
-void setup_per_zone_wmarks(void)
+static void __setup_per_zone_wmarks(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -4943,6 +4944,15 @@ static void __init setup_per_zone_inactive_ratio(void)
 		calculate_zone_inactive_ratio(zone);
 }
 
+void setup_per_zone_wmarks(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_wmarks();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4978,7 +4988,7 @@ static int __init init_per_zone_wmark_min(void)
 ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:17 am

From 8c68e4dc644be32cd82ba9711ba3ef89cb687cdf Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Mon, 12 Jul 2010 17:59:16 +0800
Subject: [PATCH 02/30] Swap over network documentation

Document describing the problem and proposed solution

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 Documentation/network-swap.txt |  268 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 268 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/network-swap.txt

diff --git a/Documentation/network-swap.txt b/Documentation/network-swap.txt
new file mode 100644
index 0000000..760af2e
--- /dev/null
+++ b/Documentation/network-swap.txt
@@ -0,0 +1,268 @@
+
+Problem:
+   When Linux needs to allocate memory it may find that there is
+   insufficient free memory so it needs to reclaim space that is in
+   use but not needed at the moment.  There are several options:
+
+   1/ Shrink a kernel cache such as the inode or dentry cache.  This
+      is fairly easy but provides limited returns.
+   2/ Discard 'clean' pages from the page cache.  This is easy, and
+      works well as long as there are clean pages in the page cache.
+      Similarly clean 'anonymous' pages can be discarded - if there
+      are any.
+   3/ Write out some dirty page-cache pages so that they become clean.
+      The VM limits the number of dirty page-cache pages to e.g. 40%
+      of available memory so that (among other reasons) a "sync" will
+      not take excessively long.  So there should never be excessive
+      amounts of dirty pagecache.
+      Writing out dirty page-cache pages involves work by the
+      filesystem which may need to allocate memory itself.  To avoid
+      deadlock, filesystems use GFP_NOFS when allocating memory on the
+      write-out path.  When this is used, cleaning dirty page-cache
+      pages is not an option so if the filesystem finds ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:17 am

From fba0bdebc34d3db41a2c975eb38e9548ea5c2ed1 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:40:05 +0800
Subject: [PATCH 05/30] mm: sl[au]b: add knowledge of reserve pages

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it. This is done to ensure reserve pages don't
leak out and get consumed.

The basic pattern used for all # allocators is the following, for each active
slab page we store if it came from an emergency allocation. When we find it
did, make sure the current allocation context would have been able to allocate
page from the emergency reserves as well. In that case allow the allocation. If
not, force a new slab allocation. When that works the memory pressure has
lifted enough to allow this context to get an object, otherwise fail the
allocation.

[mszeredi@suse.cz: Fix use of uninitialized variable in cache_grow]
[dfeng@redhat.com: Minor fix related with SLABDEBUG]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/slub_def.h |    1 +
 mm/slab.c                |   62 +++++++++++++++++++++++++++++++++++++++------
 mm/slob.c                |   16 +++++++++++-
 mm/slub.c                |   42 ++++++++++++++++++++++++++-----
 4 files changed, 104 insertions(+), 17 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 6447a72..9ef61f4 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -39,6 +39,7 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+	int reserve;		/* Did the current page come from the reserve */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
diff ...
From: Pekka Enberg
Date: Tuesday, July 13, 2010 - 1:33 pm

Hi Xiaotian!

I would actually prefer that the SLAB, SLOB, and SLUB changes were in
separate patches to make reviewing easier.

Looking at SLUB:


OK, so assume that:

  (1) c->reserve is set to one

  (2) GFP flags don't allow dipping into the reserves

  (3) we've managed to free enough pages so normal
       allocations are fine

  (4) the page from reserves is not yet empty

we will call flush_slab() and put the "emergency page" on partial list
and clear c->reserve. This effectively means that now some other
allocation can fetch the partial page and start to use it. Is this OK?
Who makes sure the emergency reserves are large enough for the next
--

From: Xiaotian Feng
Date: Thursday, July 15, 2010 - 5:37 am

Good catch. I'm just wondering if above check is necessary. For
"emergency page", we don't set c->freelist. How can we get a

--

From: Neil Brown
Date: Monday, August 2, 2010 - 6:44 pm

On Tue, 13 Jul 2010 23:33:14 +0300

Yes, this is OK.  The emergency reserves are maintained at a lower level -
within alloc_page.
The fact that (3) normal allocations are fine means that there are enough
free pages to satisfy any swap-out allocation - so any pages that were
previously allocated as 'emergency' pages can have their emergency status
forgotten (the emergency has passed).

This is a subtle but important aspect of the emergency reservation scheme in
swap-over-NFS.  It is the act-of-allocating that is emergency-or-not.  The
memory itself, once allocated, is not special.

c->reserve means "the last page allocated required an emergency allocation".
This means that parts of that page, or any other page, can only be given as
emergency allocations.  Once the slab succeeds at a non-emergency allocation,
the flag should obviously be cleared.

Similarly the page->reserve flag does not mean "this is a reserve page", but
simply "when this page was allocated, it was an emergency allocation".  The
flag is often soon lost as it is in a union with e.g. freelist.  But that
doesn't matter as it is only really meaningful at the moment of allocation.

I hope that clarifies the situation,


--

From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:17 am

From 4a2dff5cf02e9d7f6ee9345c337697c4ab66c6dc Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:41:22 +0800
Subject: [PATCH 06/30] mm: kmem_alloc_estimate()

Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.

This lays the foundation for a generic reserve framework as presented in
a later patch in this series. This framework needs to convert object demand
(kmalloc() bytes, kmem_cache_alloc() objects) to pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/slab.h |    4 ++
 mm/slab.c            |   75 +++++++++++++++++++++++++++++++++++++++++++
 mm/slob.c            |   67 ++++++++++++++++++++++++++++++++++++++
 mm/slub.c            |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 233 insertions(+), 0 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 49d1247..b57b9ca 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -108,6 +108,8 @@ unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kern_ptr_validate(const void *ptr, unsigned long size);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
+			gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -144,6 +146,8 @@ void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 void kzfree(const void *);
 size_t ksize(const void *);
+unsigned kmalloc_estimate_objs(size_t, gfp_t, int);
+unsigned kmalloc_estimate_bytes(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
diff --git a/mm/slab.c b/mm/slab.c
index d8cd757..2a0dd0d 100644
--- a/mm/slab.c
+++ ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:18 am

From 6d9ae36dfe34cdbc6e1fc52e6db0b27286eb4b58 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:46:50 +0800
Subject: [PATCH 09/30] mm: system wide ALLOC_NO_WATERMARK

The reserve is proportionally distributed over all (!highmem) zones in the
system. So we need to allow an emergency allocation access to all zones. In
order to do that we need to break out of any mempolicy boundaries we might
have.

In my opinion that does not break mempolicies as those are user oriented
and not system oriented. That is, system allocations are not guaranteed to be
within mempolicy boundaries. For instance IRQs don't even have a mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency allocations,
which are always system allocations (as opposed to user) is ok.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 mm/page_alloc.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6c85e9e..21d64f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1995,6 +1995,11 @@ restart:
 rebalance:
 	/* Allocate without watermarks if the context allows */
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
+		/*
+		 * break out mempolicy boundaries
+		 */
+		zonelist = node_zonelist(numa_node_id(), gfp_mask);
+
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-- 
1.7.1.1

--

From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:18 am

From 973ce2ee797025f055ccb61359180e53c3ecc681 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:45:48 +0800
Subject: [PATCH 08/30] mm: emergency pool

Provide means to reserve a specific amount of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/mmzone.h |    3 ++
 mm/page_alloc.c        |   84 ++++++++++++++++++++++++++++++++++++++++++------
 mm/vmstat.c            |    6 ++--
 3 files changed, 80 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9ed9c45..75a0871 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -279,6 +279,7 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
+	unsigned long           pages_emerg;    /* emergency pool */
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
@@ -770,6 +771,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 
+int adjust_memalloc_reserve(int pages);
+
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 extern char numa_zonelist_order[];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b1fa10..6c85e9e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -167,6 +167,8 @@ static char * const zone_names[MAX_NR_ZONES] = {
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 static unsigned long __meminitdata nr_kernel_pages;
 static unsigned ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:18 am

From 78618cc83e6a21b876c30a8fb3940ccc1f5b99e1 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:49:16 +0800
Subject: [PATCH 10/30] mm: __GFP_MEMALLOC

__GFP_MEMALLOC will allow the allocation to disregard the watermarks,
much like PF_MEMALLOC.

It allows one to pass along the memalloc state in object related allocation
flags as opposed to task related flags, such as sk->sk_allocation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/gfp.h |    3 ++-
 mm/page_alloc.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 975609c..c608e26 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -47,6 +47,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* See above */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* See above */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -98,7 +99,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control slab gfp mask during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 21d64f7..f7d3060 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1930,7 +1930,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:18 am

From 32f474cffc76551ddb792454845bd473634219b5 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:41:53 +0800
Subject: [PATCH 07/30] mm: allow PF_MEMALLOC from softirq context

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC due to it not being associated
with a task, and therefore not having task flags to fiddle with - thus the gfp
to alloc flag mapping ignores the task flags when in interrupts (hard or soft)
context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some trickery.
We basically borrow the task flags from whatever process happens to be
preempted by the softirq.

So we modify the gfp to alloc flags mapping to not exclude task flags in
softirq context, and modify the softirq code to save, clear and restore the
PF_MEMALLOC flag.

The save and clear, ensures the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag cannot
leak back into the preempted process.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/sched.h |    7 +++++++
 kernel/softirq.c      |    3 +++
 mm/page_alloc.c       |    7 ++++---
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1f25798..85b74b0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1765,6 +1765,13 @@ static inline void rcu_copy_process(struct task_struct *p)
 
 #endif
 
+static inline void tsk_restore_flags(struct task_struct *p,
+				     unsigned long pflags, unsigned long mask)
+{
+	p->flags &= ~mask;
+	p->flags |= pflags & mask;
+}
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed_ptr(struct task_struct *p,
 				const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:18 am

From c598d3bbb8b5e900ad0f8397b1acbe922003ac5d Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:01:48 +0800
Subject: [PATCH 11/30] mm: memory reserve management

Generic reserve management code.

It provides methods to reserve and charge. Upon this, generic alloc/free style
reserve pools could be build, which could fully replace mempool_t
functionality.

It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.

[dfeng@redhat.com: build fix for CONFIG_SLAB]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/reserve.h |  198 +++++++++++++++
 include/linux/slab.h    |   25 +-
 mm/Makefile             |    2 +-
 mm/reserve.c            |  637 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c               |    2 +-
 5 files changed, 851 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/reserve.h
 create mode 100644 mm/reserve.c

diff --git a/include/linux/reserve.h b/include/linux/reserve.h
new file mode 100644
index 0000000..e1d3dbb
--- /dev/null
+++ b/include/linux/reserve.h
@@ -0,0 +1,198 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007-2008 Red Hat, Inc.,
+ *			    Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+
+struct mem_reserve {
+	struct mem_reserve *parent;
+	struct list_head children;
+	struct list_head siblings;
+
+	const char *name;
+
+	long pages;
+	long limit;
+	long usage;
+	spinlock_t lock;	/* protects limit and usage */
+
+	wait_queue_head_t waitqueue;
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+		      struct ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:19 am

From 6c3a91091b2910c23908a9f9953efcf3df14e522 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:02:41 +0800
Subject: [PATCH 12/30] selinux: tag avc cache alloc as non-critical

Failing to allocate a cache entry will only harm performance not correctness.
Do not consume valuable reserve pages for something like that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 security/selinux/avc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index 3662b0f..9029395 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -284,7 +284,7 @@ static struct avc_node *avc_alloc_node(void)
 {
 	struct avc_node *node;
 
-	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
 	if (!node)
 		goto out;
 
-- 
1.7.1.1

--

From: Mitchell Erblich
Date: Tuesday, July 13, 2010 - 3:55 am

Why not just replace GFP_ATOMIC with GFP_NOWAIT?

This would NOT consume the valuable last pages.


--

From: Xiaotian Feng
Date: Thursday, July 15, 2010 - 4:51 am

But replace GFP_ATOMIC with GFP_NOWAIT can not prevent avc_alloc_node 

--

From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:19 am

From 8d908090b5314bed0c3318d82891b8c3bbf27815 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:03:55 +0800
Subject: [PATCH 13/30] net: packet split receive api

Add some packet-split receive hooks.

For one this allows to do NUMA node affine page allocs. Later on these hooks
will be extended to do emergency reserve allocations for fragments.

Thanks to Jiri Bohac for fixing a bug in bnx2.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 drivers/net/bnx2.c             |    9 +++------
 drivers/net/e1000e/netdev.c    |    7 ++-----
 drivers/net/igb/igb_main.c     |    6 +-----
 drivers/net/ixgbe/ixgbe_main.c |   14 ++++++--------
 drivers/net/sky2.c             |   16 ++++++----------
 5 files changed, 18 insertions(+), 34 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index a5dd81f..f6f83d0 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2670,7 +2670,7 @@ bnx2_alloc_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
 	struct sw_pg *rx_pg = &rxr->rx_pg_ring[index];
 	struct rx_bd *rxbd =
 		&rxr->rx_pg_desc_ring[RX_RING(index)][RX_IDX(index)];
-	struct page *page = alloc_page(GFP_ATOMIC);
+	struct page *page = netdev_alloc_page(bp->dev);
 
 	if (!page)
 		return -ENOMEM;
@@ -2700,7 +2700,7 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
 	pci_unmap_page(bp->pdev, dma_unmap_addr(rx_pg, mapping), PAGE_SIZE,
 		       PCI_DMA_FROMDEVICE);
 
-	__free_page(page);
+	netdev_free_page(bp->dev, page);
 	rx_pg->page = NULL;
 }
 
@@ -3035,7 +3035,7 @@ bnx2_rx_skb(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, struct sk_buff *skb,
 			if (i == pages - 1)
 				frag_len -= 4;
 
-			skb_fill_page_desc(skb, i, rx_pg->page, 0, frag_len);
+			skb_add_rx_frag(skb, i, rx_pg->page, 0, frag_len);
 ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:19 am

From 3bc4f5211d8716267891ff85385177f181e418ea Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:04:45 +0800
Subject: [PATCH 14/30] net: sk_allocation() - concentrate socket related allocations

Introduce sk_allocation(), this function allows to inject sock specific
flags to each sock related allocation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/skbuff.h |    3 +++
 include/net/sock.h     |    5 +++++
 net/ipv4/tcp.c         |    3 ++-
 net/ipv4/tcp_output.c  |   11 ++++++-----
 net/ipv6/tcp_ipv6.c    |   15 +++++++++++----
 5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ac74ee0..988a4dc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1119,6 +1119,9 @@ static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
 extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
 			    int off, int size);
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+			    int off, int size);
+
 #define SKB_PAGE_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb) 	BUG_ON(skb_has_frags(skb))
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))
diff --git a/include/net/sock.h b/include/net/sock.h
index 4f26f2f..9ddb37b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -563,6 +563,11 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+	return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4e6ddfb..8ffe2c8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -683,7 +683,8 @@ struct sk_buff ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:19 am

From e8a09c013cf8416ece804ddcbc0d016d2f936e6d Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:06:54 +0800
Subject: [PATCH 15/30] netvm: network reserve infrastructure

Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

1)  total network reserve
2)    network TX reserve
3)      protocol TX pages
4)    network RX reserve
5)      SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy
manipulation.

[2] and [4] are merely for eastetic reasons.

The TX pages reserve [3] is assumed bounded by it being the upper bound of
memory that can be used for sending pages (not quite true, but good enough)

The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.

The consumers for these reserves are sockets marked with:
  SOCK_MEMALLOC

Such sockets are to be used to service the VM (iow. to swap over). They
must be handled kernel side, exposing such a socket to user-space is a BUG.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/net/sock.h |   43 ++++++++++++++++++++-
 net/Kconfig        |    3 +
 net/core/sock.c    |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 152 insertions(+), 1 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 9ddb37b..1de14b6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -52,6 +52,7 @@
 #include <linux/mm.h>
 #include <linux/security.h>
 #include <linux/slab.h>
+#include <linux/reserve.h>
 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
@@ -532,6 +533,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:19 am

From e1a39da88a7b093474c48bb5f22f3b715e5ec205 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:17:23 +0800
Subject: [PATCH 16/30] netvm: INET reserves.

Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Reserve the route cache under generic RX reserve, its usage is bounded by
the high reclaim watermark, and thus does not need further accounting.

Reserve the ip-fragement caches under SKB data reserve, these add to the
SKB RX limit. By ensuring we can at least receive as much data as fits in
the reassmbly line we avoid fragment attack deadlocks.

Adds to the reserve tree:

  total network reserve
    network TX reserve
      protocol TX pages
    network RX reserve
+     IPv6 route cache
+     IPv4 route cache
      SKB data reserve
+       IPv6 fragment cache
+       IPv4 fragment cache

[jeffm@suse.de: PROCFS typo fix]
[dfeng@redhat.com: build fix for removing .strategy]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/net/inet_frag.h  |    7 ++++++
 include/net/netns/ipv6.h |    4 +++
 net/ipv4/inet_fragment.c |    3 ++
 net/ipv4/ip_fragment.c   |   34 +++++++++++++++++++++++++++-
 net/ipv4/route.c         |   42 ++++++++++++++++++++++++++++++++++-
 net/ipv6/reassembly.c    |   55 ++++++++++++++++++++++++++++++++++++++++++++-
 net/ipv6/route.c         |   47 +++++++++++++++++++++++++++++++++++++-
 7 files changed, 186 insertions(+), 6 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 16ff29a..958ee27 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -1,6 +1,9 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+
 struct netns_frags {
 	int			nqueues;
 	atomic_t		mem;
@@ -10,6 +13,10 @@ struct ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:20 am

From 3e824860934af5aa0150608314693c5e0e3608b6 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:30:27 +0800
Subject: [PATCH 17/30] netvm: hook skb allocation to reserves

Change the skb allocation api to indicate RX usage and use this to fall back to
the reserve when needed. SKBs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have a different sharing semantic than that
indicated by skb_shinfo()->dataref.

Note that the decision to distinguish between regular and emergency SKBs allows
the accounting overhead to be limited to the later kind.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/mm_types.h |    1 +
 include/linux/skbuff.h   |   27 +++++++--
 net/core/skbuff.c        |  137 +++++++++++++++++++++++++++++++++++++---------
 3 files changed, 133 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a95a202..73e0526 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -72,6 +72,7 @@ struct page {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
 		int reserve;		/* page_alloc: page is a reserve page */
+		atomic_t frag_count;	/* skb fragment use count */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 988a4dc..4ac45ad 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -385,9 +385,12 @@ struct sk_buff {
 #else
 	__u8			deliver_no_wcard:1;
 #endif
+#ifdef CONFIG_NETVM
+	__u8   ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:20 am

From b0240dd1e2ee0b4dc30f98c67cfe35e8c1833753 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:36:53 +0800
Subject: [PATCH 18/30] netvm: filter emergency skbs.

Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

The correctness of this approach relies on the fact that networks must be
assumed lossy.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 net/core/filter.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 52b051f..bdcbc14 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -82,6 +82,9 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 	int err;
 	struct sk_filter *filter;
 
+	if (skb_emergency(skb) && !sk_has_memalloc(sk))
+		return -ENOMEM;
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;
-- 
1.7.1.1

--

From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:20 am

From 97cab9c6b5964ba48bee576214d71edbef74d0a6 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:37:56 +0800
Subject: [PATCH 19/30] netvm: prevent a stream specific deadlock

It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_MEMALLOC buffers
from receiving data, which will prevent userspace from running, which is needed
to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/net/sock.h   |    7 ++++---
 net/core/sock.c      |    2 +-
 net/ipv4/tcp_input.c |   12 ++++++------
 net/sctp/ulpevent.c  |    2 +-
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1de14b6..ac87f6f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -977,12 +977,13 @@ static inline int sk_wmem_schedule(struct sock *sk, int size)
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
-static inline int sk_rmem_schedule(struct sock *sk, int size)
+static inline int sk_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
-		__sk_mem_schedule(sk, size, SK_MEM_RECV);
+	return skb->truesize <= sk->sk_forward_alloc ||
+		__sk_mem_schedule(sk, skb->truesize, SK_MEM_RECV) ||
+		skb_emergency(skb);
 }
 
 static inline void sk_mem_reclaim(struct sock *sk)
diff --git a/net/core/sock.c b/net/core/sock.c
index 6bd5765..f24560c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -399,7 +399,7 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	if (err)
 		return err;
 
-	if (!sk_rmem_schedule(sk, skb->truesize)) {
+	if (!sk_rmem_schedule(sk, skb)) {
 		atomic_inc(&sk->sk_drops);
 		return -ENOBUFS;
 ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:20 am

From 6c5ccad4c45a73a6d9ebecbfcb1bce8ff3ca462f Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 11:38:29 +0800
Subject: [PATCH 20/30] netfilter: NF_QUEUE vs emergency skbs

Avoid memory getting stuck waiting for userspace, drop all emergency packets.
This of course requires the regular storage route to not include an NF_QUEUE
target ;-)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 net/netfilter/core.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 78b505d..cc04549 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -176,9 +176,12 @@ next_hook:
 	if (verdict == NF_ACCEPT || verdict == NF_STOP) {
 		ret = 1;
 	} else if (verdict == NF_DROP) {
+drop:
 		kfree_skb(skb);
 		ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+		if (skb_emergency(skb))
+			goto drop;
 		if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))
 			goto next_hook;
-- 
1.7.1.1

--

From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:20 am

From 15437174f171e197ecdfa5fe71ae89334bb58fd2 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:07:28 +0800
Subject: [PATCH 21/30] netvm: skb processing

In order to make sure emergency packets receive all memory needed to proceed
ensure processing of emergency SKBs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/net/sock.h |    5 ++++
 net/core/dev.c     |   55 ++++++++++++++++++++++++++++++++++++++++++++++++---
 net/core/sock.c    |   16 +++++++++++++++
 3 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ac87f6f..aadf15c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -680,8 +680,13 @@ static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *s
 	return 0;
 }
 
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
+	if (skb_emergency(skb))
+		return __sk_backlog_rcv(sk, skb);
+
 	return sk->sk_backlog_rcv(sk, skb);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index e85cc5f..7169b9b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2801,6 +2801,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	struct net_device *orig_or_bond;
 	int ret = NET_RX_DROP;
 	__be16 type;
+	unsigned long pflags = current->flags;
 
 	if (!netdev_tstamp_prequeue)
 		net_timestamp_check(skb);
@@ -2808,9 +2809,21 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	if (vlan_tx_tag_present(skb) && vlan_hwaccel_do_receive(skb))
 		return NET_RX_SUCCESS;
 
+	/* Emergency skb are special, they should
+	 *  - be delivered to SOCK_MEMALLOC sockets only
+	 *  - stay away from ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:20 am

From 0c3f6a5db5c61a222135286e8a6ada7411b3ac3b Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:08:45 +0800
Subject: [PATCH 22/30] mm: add support for non block device backed swap files

New addres_space_operations methods are added:
  int swapon(struct file *);
  int swapoff(struct file *);
  int swap_out(struct file *, struct page *, struct writeback_control *);
  int swap_in(struct file *, struct page *);

When during sys_swapon() the ->swapon() method is found and returns no error
the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
make use of ->swap_{out,in}() to write/read swapcache pages.

The ->swapon() method will be used to communicate to the file that the VM
relies on it, and the address_space should take adequate measures (like
reserving memory for mempools or the like). The ->swapoff() method will be
called on sys_swapoff() when ->swapon() was found and returned no error.

This new interface can be used to obviate the need for ->bmap in the swapfile
code. A filesystem would need to load (and maybe even allocate) the full block
map for a file into memory and pin it there on ->swapon() so that
->swap_{out,in}() have instant access to it. It can be released on ->swapoff().

The reason to provide ->swap_{out,in}() over using {write,read}page() is to
 1) make a distinction between swapcache and pagecache pages, and
 2) to provide a struct file * for credential context (normally not needed
    in the context of writepage, as the page content is normally dirtied
    using either of the following interfaces:
      write_{begin,end}()
      {prepare,commit}_write()
      page_mkwrite()
    which do have the file context.

[miklos@szeredi.hu: cleanups]
[dfeng@redhat.com: fix get_swap_info return value]
[dfeng@redhat.com: fix wrong SWP_FILE enum]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:21 am

From 743090cf0c129f3c83506260866f525a9f181f99 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:10:26 +0800
Subject: [PATCH 24/30] nfs: teach the NFS client how to treat PG_swapcache pages

Replace all relevant occurences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 fs/nfs/dir.c           |    4 ++--
 fs/nfs/file.c          |   14 +++++++-------
 fs/nfs/fscache-index.c |    2 +-
 fs/nfs/fscache.c       |   14 +++++++-------
 fs/nfs/internal.h      |    7 ++++---
 fs/nfs/pagelist.c      |    6 +++---
 fs/nfs/read.c          |    6 +++---
 fs/nfs/write.c         |   45 +++++++++++++++++++++++----------------------
 8 files changed, 50 insertions(+), 48 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 782b431..0305786 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -182,7 +182,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page *page)
 
 	dfprintk(DIRCACHE, "NFS: %s: reading cookie %Lu into page %lu\n",
 			__func__, (long long)desc->entry->cookie,
-			page->index);
+			page_file_index(page));
 
  again:
 	timestamp = jiffies;
@@ -207,7 +207,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page *page)
 	 * Note: assumes we have exclusive access to this mapping either
 	 *	 through inode->i_mutex or some other mechanism.
 	 */
-	if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
+	if (invalidate_inode_pages2_range(inode->i_mapping, page_file_index(page) + 1, -1) < 0) {
 		/* Should never happen */
 		nfs_zap_mapping(inode, inode->i_mapping);
 	}
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 36a5e74..8f066fe 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -480,9 +480,9 @@ static void nfs_invalidate_page(struct page ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:21 am

From ee72952409a0b811d61f435682e6d161e3b5189b Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:10:49 +0800
Subject: [PATCH 25/30] nfs: disable data cache revalidation for swapfiles

Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we cannot set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus
augment the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 fs/nfs/inode.c |    6 +++++
 fs/nfs/write.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 099b351..45293af 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -798,6 +798,12 @@ int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int ret = 0;
 
+	/*
+	 * swapfiles are not supposed to be shared.
+	 */
+	if (IS_SWAPFILE(inode))
+		goto out;
+
 	if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
 			|| nfs_attribute_cache_expired(inode)
 			|| NFS_STALE(inode)) {
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 109a970..0d7ea95 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -109,25 +109,62 @@ static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
 	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:21 am

From 61388a8872071bb5b0015b9f5e3183410a98d949 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:11:13 +0800
Subject: [PATCH 26/30] nfs: enable swap on NFS

Implement all the new swapfile a_ops for NFS. This will set the NFS socket to
SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset
SOCK_MEMALLOC before engaging the protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects
and the early (re)setting of SOCK_MEMALLOC should allow us to receive the
packets required for the TCP connection buildup.

(swapping continues over a server reset during heavy network traffic)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 fs/nfs/Kconfig              |   10 ++++++
 fs/nfs/file.c               |   18 +++++++++++
 fs/nfs/write.c              |   22 ++++++++++++++
 include/linux/nfs_fs.h      |    2 +
 include/linux/sunrpc/xprt.h |    5 ++-
 mm/page_io.c                |    1 +
 net/sunrpc/Kconfig          |    5 +++
 net/sunrpc/sched.c          |    9 ++++-
 net/sunrpc/xprtsock.c       |   68 +++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 137 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index a43d07e..b022858 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -74,6 +74,16 @@ config NFS_V4
 
 	  If unsure, say N.
 
+config NFS_SWAP
+	bool "Provide swap over NFS support"
+	default n
+	depends on NFS_FS
+	select SUNRPC_SWAP
+	help
+	  This option enables swapon to work on files located on NFS mounts.
+
+	  For more details, see Documentation/network-swap.txt
+
 config NFS_V4_1
 	bool "NFS client support for NFSv4.1 (DEVELOPER ONLY)"
 	depends on NFS_V4 && EXPERIMENTAL
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8f066fe..319c2b3 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -524,6 +524,18 @@ ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:22 am

From fd03848cadf5719228f617b72039cc8302d892ef Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 14:00:02 +0800
Subject: [PATCH 30/30] fix mess up on swap with multi files from same nfs server

xs_swapper() will set xprt->swapper when swapon nfs files, unset xprt->swapper
when swapoff nfs files. This will lead a bug if we swapon multi files from
the same nfs server, they had the same xprt, then the reserved memory could
not be disconnected when we swapoff all files.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/sunrpc/xprt.h |    4 ++--
 net/sunrpc/xprtsock.c       |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index ba2330d..bc49091 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -171,8 +171,8 @@ struct rpc_xprt {
 	unsigned int		max_reqs;	/* total slots */
 	unsigned long		state;		/* transport state */
 	unsigned char		shutdown   : 1,	/* being shut down */
-				resvport   : 1, /* use a reserved port */
-				swapper    : 1; /* we're swapping over this
+				resvport   : 1;	/* use a reserved port */
+	unsigned int		swapper;	/* we're swapping over this
 						   transport */
 	unsigned int		bind_index;	/* bind function index */
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 5c8b918..30bb8ce 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1662,11 +1662,11 @@ int xs_swapper(struct rpc_xprt *xprt, int enable)
 		 */
 		err = sk_adjust_memalloc(1, RPC_RESERVE_PAGES);
 		if (!err) {
-			xprt->swapper = 1;
+			xprt->swapper++;
 			xs_set_memalloc(xprt);
 		}
 	} else if (xprt->swapper) {
-		xprt->swapper = 0;
+		xprt->swapper--;
 		sk_clear_memalloc(transport->inet);
 		sk_adjust_memalloc(-1, -RPC_RESERVE_PAGES);
 	}
-- 
1.7.1.1

--

From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:21 am

From df0106f58d7ac2337f74efb1d8caaf27f635e050 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:11:32 +0800
Subject: [PATCH 27/30] nfs: fix various memory recursions possible with swap over NFS.

GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
just not of any filesystem data.

The problem is that previuosly NOFS was correct because that avoids
recursion into the NFS code, it now is not, because also IO (swap) can
lead to this recursion.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 fs/nfs/pagelist.c |    2 +-
 fs/nfs/write.c    |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 2be94bb..c0247e9 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
 	struct nfs_page	*p;
-	p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+	p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
 	if (p) {
 		memset(p, 0, sizeof(*p));
 		INIT_LIST_HEAD(&p->wb_list);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5852b20..dfa08cb 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -50,7 +50,7 @@ static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commitdata_alloc(void)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -69,7 +69,7 @@ void nfs_commit_free(struct nfs_write_data *p)
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -79,7 +79,7 @@ struct nfs_write_data ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:21 am

From 50e813068c51de733bbbdd04eb4af9c43919cd57 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:09:50 +0800
Subject: [PATCH 23/30]  mm: methods for teaching filesystems about PG_swapcache pages

In order to teach filesystems to handle swap cache pages, three new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  loff_t page_file_offset(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index() - gives the offset of this page in the file in
PAGE_CACHE_SIZE blocks. Like page->index is for mapped pages, this function
also gives the correct index for PG_swapcache pages.

page_file_offset() - uses page_file_index(), so that it will give the expected
result, even for PG_swapcache pages.

page_file_mapping() - gives the mapping backing the actual page; that is for
swap cache pages it will give swap_file->f_mapping.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/mm.h      |   25 +++++++++++++++++++++++++
 include/linux/pagemap.h |    5 +++++
 mm/swapfile.c           |   19 +++++++++++++++++++
 3 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32033ba..0cf97fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -663,6 +663,17 @@ static inline void *page_rmapping(struct page *page)
 	return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
 }
 
+extern struct address_space *__page_file_mapping(struct page *);
+
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+	if (unlikely(PageSwapCache(page)))
+		return __page_file_mapping(page);
+
+	return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -679,6 +690,20 @@ static inline pgoff_t page_index(struct ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:22 am

From 50d2e72527b3e821544cc97c4dd5b1e5a44b6659 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:21:10 +0800
Subject: [PATCH 28/30] build fix for skb_emergency_protocol

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 net/core/dev.c |   48 ++++++++++++++++++++++++------------------------
 1 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7169b9b..fd7f8ac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2791,6 +2791,30 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 }
 EXPORT_SYMBOL(__skb_bond_should_drop);
 
+/*
+ * Filter the protocols for which the reserves are adequate.
+ *
+ * Before adding a protocol make sure that it is either covered by the existing
+ * reserves, or add reserves covering the memory need of the new protocol's
+ * packet processing.
+ */
+static int skb_emergency_protocol(struct sk_buff *skb)
+{
+	if (skb_emergency(skb))
+		switch (skb->protocol) {
+		case __constant_htons(ETH_P_ARP):
+		case __constant_htons(ETH_P_IP):
+		case __constant_htons(ETH_P_IPV6):
+		case __constant_htons(ETH_P_8021Q):
+			break;
+
+		default:
+			return 0;
+		}
+
+	return 1;
+}
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
@@ -2942,30 +2966,6 @@ out:
 	return ret;
 }
 
-/*
- * Filter the protocols for which the reserves are adequate.
- *
- * Before adding a protocol make sure that it is either covered by the existing
- * reserves, or add reserves covering the memory need of the new protocol's
- * packet processing.
- */
-static int skb_emergency_protocol(struct sk_buff *skb)
-{
-	if (skb_emergency(skb))
-		switch (skb->protocol) {
-		case __constant_htons(ETH_P_ARP):
-		case __constant_htons(ETH_P_IP):
-		case __constant_htons(ETH_P_IPV6):
-		case __constant_htons(ETH_P_8021Q):
-			break;
-
-		default:
-			return 0;
-		}
-
-	return 1;
-}
-
 /**
  ...
From: Xiaotian Feng
Date: Tuesday, July 13, 2010 - 3:22 am

From ea7b13006f42f7dcadd1bfb874d5e525b4c259e3 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 13:44:08 +0800
Subject: [PATCH 29/30] fix null pointer deref in swap_entry_free

Commit b3a27d uses p->bdev->bd_disk, this will lead a null pointer
deref with swap over nfs.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 mm/swapfile.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index d8a05e4..3eb53fc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -577,7 +577,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 
 	/* free if no reference */
 	if (!usage) {
-		struct gendisk *disk = p->bdev->bd_disk;
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit)
@@ -587,9 +586,11 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 			swap_list.next = p->type;
 		nr_swap_pages++;
 		p->inuse_pages--;
-		if ((p->flags & SWP_BLKDEV) &&
-				disk->fops->swap_slot_free_notify)
-			disk->fops->swap_slot_free_notify(p->bdev, offset);
+		if (p->flags & SWP_BLKDEV) {
+			struct gendisk *disk = p->bdev->bd_disk;
+			if (disk->fops->swap_slot_free_notify)
+				disk->fops->swap_slot_free_notify(p->bdev, offset);
+		}
 	}
 
 	return usage;
-- 
1.7.1.1

--

From: Américo Wang
Date: Tuesday, July 13, 2010 - 5:53 am

Please use the "From:" line correctly, as stated in
Documentation/SubmittingPatches:

The "from" line must be the very first line in the message body,
and has the form:

        From: Original Author <author@example.com>

The "from" line specifies who will be credited as the author of the
patch in the permanent changelog.  If the "from" line is missing,
then the "From:" line from the email header will be used to determine
the patch author in the changelog.


I think you are using git format-patch to generate those patches, please supply
--author=<author> to git commit when you commit them to your local
tree. (or git am
if the patches you received already had the correct From: line.)

Thanks.
--

Previous thread: [PATCH RESEND] tools/usb: Add Makefile by Davidlohr Bueso on Tuesday, July 13, 2010 - 2:50 am. (3 messages)

Next thread: [PATCH -tip] rbtree: avoid func twice when node doesn't have a right sibling by Xiaotian Feng on Tuesday, July 13, 2010 - 3:30 am. (1 message)