Re: 2.6.26-rc: nfsd hangs for a few sec

Previous thread: [PATCH] ramfs: enable splice write by Octavian Purdila on Saturday, June 21, 2008 - 5:36 am. (1 message)

Next thread: Contact Mr Richard Smith by Free Lotto Email Promo on Saturday, June 21, 2008 - 5:58 am. (1 message)
From: Alexander Beregalov
Date: Saturday, June 21, 2008 - 5:57 am

One more try, added some CC's.

--

From: Linus Torvalds
Date: Saturday, June 21, 2008 - 11:36 am

are kind of scary, because they are both filesystem memory allocation 
paths that don't have GFP_NOFS, so they cause a callback back into the 
filesystem to free things.

Which in general isn't necessarily wrong: under inode pressure, it may 
well make sense to try to shrink the inode caches when allocating a new 
inode, or things may well blow up out of proportion, but it does make me a 
big nervous.

However, it's not clear why things apparently bisected down to the commit 
it did (54a6eb5c4765aa573a030ceeba2c14e3d2ea5706: "mm: use two zonelist 
that are filtered by GFP mask"). That part makes me worry that that commit 
screwed up the freeing pressure logic. 

Mel?

			Linus
--

From: Mel Gorman
Date: Saturday, June 21, 2008 - 3:41 pm

Thanks for the report. I was offline for two weeks and I would have missed
this without a direct cc. Today is my first day back online so if I miss
any context, sorry about that. What I have is;

1. This appeared in 2.6.26-rc1 (http://lkml.org/lkml/2008/5/10/60 is a copy
   of the original report)

2. The circular lock itself was considered to be a false positive by David
   Chinner (http://lkml.org/lkml/2008/5/11/253). I've added David to the
   cc. I hate to ask the obvious, but is it possible that LOCKDEP checking
   was not turned on for the kernels before 2.6.26-rc1?

3. The bisect shows commit 54a6eb5c4765aa573a030ceeba2c14e3d2ea5706 to trigger
   the circular locking logic. Even if the deadlock warning is a false
   positive, it's possible that reclaim has been altered in some way.

For each stack listed in the report, I'm going to look at how the patch affects
that path and see can I spot where the problem alteration happened. I'm still
dozy after holidays so a double check of reasoning from anyone watching would
be a plus as this is not a trivial revert.

I spotted at least one problem in the patch in a change made to SLAB that
needs to be fixed but it is not relevant to the problem at hand as I believe
Alexandar is using SLUB instead of SLAB.  That patch is at the end of the
mail. Christoph, can you double check that patch please?


I'm assuming that the few seconds are being spent in reclaim rather than
working out lock dependency logic. Any chance there is profile information
showing where all the time is being spent? Just in case, does the stall
still occur with lockdep turned off?

In the questionable patch, the first relevant change is how buffers are
freed up here;

diff --git a/fs/buffer.c b/fs/buffer.c
index 7135849..9b5434a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -360,16 +360,18 @@ void invalidate_bdev(struct block_device *bdev)
  */
 static void free_more_memory(void)
 {
-	struct zonelist *zonelist;
+	struct zone **zones;
 	int nid;
 
 ...
From: Christoph Lameter
Date: Saturday, June 21, 2008 - 4:46 pm

Right we have a significant memory leak here. Potentially one object for 
each zone is allocated and abandoned. May trigger more allocations
and therefore trigger more frequent reclaim because the free objects are
rapidly consumed on a system that relies on fallback allocations 
(memoryless nodes f.e.). Not a direct explanation for the problem but the
memory wastage could certainly can heretofore undiscovered locking 

Ok. That would work but its better to put the check into the if branch:


Subject: Slab: Fix memory leak in fallback_alloc()

The zonelist patches caused the loop that checks for available
objects in permitted zones to not terminate immediately. One object
per zone per allocation may be allocated and then abandoned.

Break the loop when we have successfully allocated one object.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slab.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c	2008-06-21 16:39:04.336377178 -0700
+++ linux-2.6/mm/slab.c	2008-06-21 16:40:07.637834699 -0700
@@ -3263,9 +3263,12 @@ retry:
 
 		if (cpuset_zone_allowed_hardwall(zone, flags) &&
 			cache->nodelists[nid] &&
-			cache->nodelists[nid]->free_objects)
+			cache->nodelists[nid]->free_objects) {
 				obj = ____cache_alloc_node(cache,
 					flags | GFP_THISNODE, nid);
+				if (obj)
+					break;
+		}
 	}
 
 	if (!obj) {

--

From: Linus Torvalds
Date: Saturday, June 21, 2008 - 4:54 pm

Well, not for these traces, no. The trace contains __slab_alloc() in the 
call chain, which definitely fingers SLUB, not slab, despite the name 
(slab calls its allocation routines "cache_alloc", while slub calls them 
"slab_alloc" ;)

So the patch looks fine, and I applied it, but as Mel already mentioned, 
it looks like it won't be making any difference for Alexander.

		Linus
--

From: Christoph Lameter
Date: Saturday, June 21, 2008 - 5:18 pm

After the change we walk only zones for GFP_KERNEL. Meaning no HIGHMEM 
and MOVABLE zones. Doesnt that mean that reclaim is limited to ZONE_DMA 
and ZONE_NORMAL? Is that really intended?

If not then the following patch should return us to old behavior:

---
 mm/vmscan.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2008-06-21 17:15:45.597627317 -0700
+++ linux-2.6/mm/vmscan.c	2008-06-21 17:17:16.273293260 -0700
@@ -1249,13 +1249,12 @@ static unsigned long shrink_zone(int pri
 static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long nr_reclaimed = 0;
 	struct zoneref *z;
 	struct zone *zone;
 
 	sc->all_unreclaimable = 1;
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
 		if (!populated_zone(zone))
 			continue;
 		/*
--

From: Mel Gorman
Date: Saturday, June 21, 2008 - 6:38 pm

Yeah, but the zonelist is for GFP_KERNEL so it should not include the HIGHMEM
zones, right? The key change is that after the patch there are fewer zonelists


I think the effect of that patch is that zones get shrunk that have

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Saturday, June 21, 2008 - 9:13 pm

But the HIGHMEM zones etc were included before. There was no check for 

Right. AFAICT That was the behavior before the change.
--

From: Mel Gorman
Date: Sunday, June 22, 2008 - 10:07 am

Well, the mask is not totally ignored, it's part of the scan_control and
used later when deciding what can and can't be done as part of reclaim.
However, you are right in that it is apparently ignored for zone
selection.

However, try_to_free_pages() received a struct zone **zones which was
a zonelist which is a zonelist->zones selected based on the gfp_mask in
__alloc_pages. By the time shrink_zones() is called, it can ignore the
mask because only relevant zones are in there. For GFP_KERNEL, that would


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Alexander Beregalov
Date: Saturday, June 21, 2008 - 7:10 pm

Yes, I bisected it with about the same config (as much as possible
with changing Kconfig),
Hm, I cannot forecast the time when I will have this message, I
gathered readprofile statistics
right after the message:
   102 add_preempt_count                          1.0303
   102 net_rx_action                              0.2991
   104 __flush_tlb_all                            2.3111
   115 __tcp_push_pending_frames                  0.0626
   129 e1000_clean_rx_irq                         0.1549
   132 _read_unlock_irq                           1.7838
   136 __rcu_read_unlock                          1.3878
   137 native_read_tsc                            7.2105
   153 tcp_ack                                    0.0261
   155 __rcu_advance_callbacks                    0.8960
   197 local_bh_enable                            0.8277
   205 e1000_clean                                0.4092
   206 free_hot_cold_page                         0.5754
   308 _write_unlock_irq                          4.1622
   352 get_page_from_freelist                     0.3157
   448 lock_acquired                              0.8854
   564 acpi_pm_read                              28.2000
   618 vprintk                                    0.6897
   738 _spin_unlock_irq                           9.9730
   863 __do_softirq                               5.2945
   993 lock_release                               2.4458
  1166 netpoll_setup                              1.5630
  1633 _spin_unlock_irqrestore                   17.0104
  1712 lock_acquire                              12.7761
  1714 kfree                                      7.6178
  1724 __kmalloc_track_caller                     7.7309
  2308 kmem_cache_free                           12.3422
  3189 kmem_cache_alloc                          18.7588
 18261 default_idle                             214.8353
 43758 total                                      0.0175

Could it be useful? I am afraid it is not.
I can try to gather it for lesser time ...
Previous thread: [PATCH] ramfs: enable splice write by Octavian Purdila on Saturday, June 21, 2008 - 5:36 am. (1 message)

Next thread: Contact Mr Richard Smith by Free Lotto Email Promo on Saturday, June 21, 2008 - 5:58 am. (1 message)