2.6.23-rc8 network problem. Mem leak? ip1000a?

Previous thread: [PATCH] netns: CLONE_NEWNET don't use the same clone flag as the pid namespace. by Eric W. Biederman on Thursday, September 27, 2007 - 3:40 pm. (2 messages)

Next thread: Re: [PATCH] e1000: Add device IDs of blade version of the 82571 quad port by Jeff Garzik on Thursday, September 27, 2007 - 8:38 pm. (1 message)
From: linux
Date: Thursday, September 27, 2007 - 7:06 pm

Uniprocessor Althlon 64, 64-bit kernel, 2G ECC RAM,
2.6.23-rc8 + linuxpps (5.0.0) + ip1000a driver.
(patch from http://marc.info/?l=linux-netdev&m=118980588419882)

After a few hours of operation, ntp loses the ability to send packets.
sendto() returns -EAGAIN to everything, including the 24-byte UDP packet
that is a response to ntpq.

-EAGAIN on a sendto() makes me think of memory problems, so here's
meminfo at the time:

### FAILED state ###
# cat /proc/meminfo 
MemTotal:      2059384 kB
MemFree:         15332 kB
Buffers:        665608 kB
Cached:          18212 kB
SwapCached:          0 kB
Active:         380384 kB
Inactive:       355020 kB
SwapTotal:     5855208 kB
SwapFree:      5854552 kB
Dirty:           28504 kB
Writeback:           0 kB
AnonPages:       51608 kB
Mapped:          11852 kB
Slab:          1285348 kB
SReclaimable:   152968 kB
SUnreclaim:    1132380 kB
PageTables:       3888 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   6884900 kB
Committed_AS:   590528 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    265628 kB
VmallocChunk: 34359472059 kB


Killing and restarting ntpd gets it running again for a few hours.
Here's after about two hours of successful operation.  (I'll try to
remember to run slabinfo before killing ntpd next time.)

### WORKING state ###
# cat /proc/meminfo
MemTotal:      2059384 kB
MemFree:         20252 kB
Buffers:        242688 kB
Cached:          41556 kB
SwapCached:        200 kB
Active:         285012 kB
Inactive:       147348 kB
SwapTotal:     5855208 kB
SwapFree:      5854212 kB
Dirty:              36 kB
Writeback:           0 kB
AnonPages:      148052 kB
Mapped:          12756 kB
Slab:          1582512 kB
SReclaimable:   134348 kB
SUnreclaim:    1448164 kB
PageTables:       4500 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   6884900 kB
Committed_AS:   689956 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    265628 kB
VmallocChunk: 34359472059 kB
# ...
From: Andrew Morton
Date: Friday, September 28, 2007 - 2:20 am

CONFIG_DEBUG_SLAB_LEAK?
-

From: linux
Date: Sunday, September 30, 2007 - 12:59 am

> ntpd.  Sounds like pps leaking to me.

That's what I'd think, except that pps does no allocation in the normal
running state, so there's nothing to leak.  The interrupt path just
records the time in some preallocated, static buffers and wakes up
blocked readers.  The read path copies the latest data out of those
static buffers.  There's allocation when the PPS device is created,

Ah, thanks you; I've been using SLUB which doesn't support this option.
Here's what I've extracted.  I've only presented the top few
slab_allocators and a small subset of the oom-killer messages, but I
have full copies if desired.  Unfortunately, I've discovered that the
machine doesn't live in this unhappy state forever.  Indeed, I'm not
sure if killing ntpd "fixes" anything; my previous observations
may have been optimistic ignorance.

(For my own personal reference looking for more oom-kill, I nuked ntpd
at 06:46:56.  And the oom-kills are continuing, with the latest at
07:43:52.)

Anyway, I have a bunch of information from the slab_allocators file, but
I'm not quire sure how to make sense of it.


With a machine in the unhappy state and firing the OOM killer, the top
20 slab_allocators are:
$ sort -rnk2 /proc/slab_allocators | head -20
skbuff_head_cache: 1712746 __alloc_skb+0x31/0x121
size-512: 1706572 tcp_send_ack+0x23/0x102
skbuff_fclone_cache: 149113 __alloc_skb+0x31/0x121
size-2048: 148500 tcp_sendmsg+0x1b5/0xae1
sysfs_dir_cache: 5289 sysfs_new_dirent+0x4b/0xec
size-512: 2613 sock_alloc_send_skb+0x93/0x1dd
Acpi-Operand: 2014 acpi_ut_allocate_object_desc_dbg+0x34/0x6e
size-32: 1995 sysfs_new_dirent+0x29/0xec
vm_area_struct: 1679 mmap_region+0x18f/0x421
size-512: 1618 tcp_xmit_probe_skb+0x1f/0xcd
size-512: 1571 arp_create+0x4e/0x1cd
vm_area_struct: 1544 copy_process+0x9f1/0x1108
anon_vma: 1448 anon_vma_prepare+0x29/0x74
filp: 1201 get_empty_filp+0x44/0xcd
UDP: 1173 sk_alloc+0x25/0xaf
size-128: 1048 r1bio_pool_alloc+0x23/0x3b
size-128: 1024 nfsd_cache_init+0x2d/0xcf
Acpi-Namespace: ...
From: Andrew Morton
Date: Sunday, September 30, 2007 - 2:23 am

Absolutely.  Normally a driver's transmit completion interrupt handler will
run dev_kfree_skb_irq() against the skbs which have been fully sent.

However it'd be darned odd if the driver was leaking only tcp acks.

I can find no occurrence of "dev_kfree_skb" in drivers/net/ipg.c, which is
suspicious.


I assume that meminfo was not captured when the system was ooming?  There
isn't much slab there.

-

From: linux
Date: Sunday, September 30, 2007 - 4:40 am

> OK.  Did you try to reproduce it without the pps patch applied?

No.  But I've yanked the ip1000a driver (using old crufy vendor-supplied

It's leaking lots of things... you can see ARP packets in there and
all sorts of stuff.  But the big traffic hog is BackupPC doing inbound
rsyncs all night long, which generates a lot of acks.  Those are the

Look for "IPG_DEV_KFREE_SKB", which is a wrapper macro.  (Or just add
"-i" to your grep.)  It should probably be deleted (it just expands to

As I wrote originally, I got it from
http://marc.info/?l=linux-netdev&m=118980588419882
which was a reuqest for mainline submission.

If there are other patches floating around, I'm happy to try them.

Oops, sorry.  I captured slabinfo but not meminfo.


Thank you very much!  Sorry to jump the gun and post a lot before I had
all the data, but if it WAS a problem in -rc8, I wanted to mention it
before -final.

Now, the rush is to get the ip1000a driver fixed before the merge
window opens.  I've added all the ip1000a developers to the Cc: list in
an attempt to speed that up.
-

From: linux
Date: Monday, January 7, 2008 - 11:52 pm

Just to keep the issue open, drivers/net/ipg.c currently in 2.6.24-rc6
still leaks skbuffs like a sieve.  Run it for a few hours with network
traffic and the machine swaps like crazy while the oom killer goes nuts.

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index d9107e5..4fa392c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -172,6 +172,10 @@ config IP1000
 	select MII
 	---help---
 	  This driver supports IP1000 gigabit Ethernet cards.
+	  It works, but suffers from a memory leak.  Signifcant
+	  use will consume unswappable kernel memory until the
+	  machine runs out of memory and crashes.  Thus, this
+	  driver cannot be considered usable at the the present time.
 
 	  To compile this driver as a module, choose M here: the module
 	  will be called ipg.  This is recommended.

Or should it be demoted to BROKEN?  It compiles, and sends and receives
packets, which is better than a lot of BROKEN drivers.
--

From: David Miller
Date: Tuesday, January 8, 2008 - 12:07 am

From: linux@horizon.com

This is not how we handle and track bugs.

Such a patch is inappropriate, and I'd like to ask that you just be
patient until someone has a chance to try and figure out what the
problem is.  Or even better, you can try to track down the problem
yourself since you seem to have a specific interest in this problem.

Thanks.
--

From: David Miller
Date: Tuesday, January 8, 2008 - 12:14 am

From: David Miller <davem@davemloft.net>

Actually, the bug is amazingly obvious after a quick scan of this
driver.

ipg_nic_rx_free_skb() is called from various places and is given zero
context to work with.  It assumes that the caller wants
"sp->rx_current % IPG_RFCLIST_LENGTH" to be freed.

But that's not right in most cases.  For example, consider the call in
ipg_nic_rx_with_end().  This function is invoked from ipg_nic_rx()
like so:

	unsigned int curr = sp->rx_current;
 ...
	for (i = 0; i < IPG_MAXRFDPROCESS_COUNT; i++, curr++) {
		unsigned int entry = curr % IPG_RFDLIST_LENGTH;
		struct ipg_rx *rxfd = sp->rxd + entry;

		if (!(rxfd->rfs & le64_to_cpu(IPG_RFS_RFDDONE)))
			break;

		switch (ipg_nic_rx_check_frame_type(dev)) {
 ...
		case Frame_WithEnd:
			ipg_nic_rx_with_end(dev, tp, rxfd, entry);
			break;
 ...
		}
	}

	sp->rx_current = curr;

So sp->rx_current does not correspond to the packet being processed
currently, so ipg_nic_rx_free_skb() will only look at and try to free
only the first packet the above loop tries to processe.

WOW!!!!  Amazing!!!

I invested 30 seconds of code reading to figure out the leak.  A much
better investment of time than adding bogus comments to the Kconfig
help text don't you think? :-)

--

From: Francois Romieu
Date: Tuesday, January 8, 2008 - 12:51 am

David Miller <davem@davemloft.net> :

Thanks for the hint David.

I'll roll up a patch for it after the day work.

-- 
Ueimor
--

From: linux
Date: Tuesday, January 8, 2008 - 5:28 am

Prompted by davem, this attempt at fixing the memory leak
actually appears to work.  At least, leaving ping -f -s1472 -l64
running doesn't drop packets and doesn't show up in /proc/slabinfo.
---
diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index dbd23bb..a0dfba5 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -1110,10 +1110,9 @@ enum {
 	Frame_WithStart_WithEnd = 11
 };
 
-inline void ipg_nic_rx_free_skb(struct net_device *dev)
+inline void ipg_nic_rx_free_skb(struct net_device *dev, unsigned entry)
 {
 	struct ipg_nic_private *sp = netdev_priv(dev);
-	unsigned int entry = sp->rx_current % IPG_RFDLIST_LENGTH;
 
 	if (sp->RxBuff[entry]) {
 		struct ipg_rx *rxfd = sp->rxd + entry;
@@ -1308,7 +1307,7 @@ static void ipg_nic_rx_with_end(struct net_device *dev,
 		jumbo->CurrentSize = 0;
 		jumbo->skb = NULL;
 
-		ipg_nic_rx_free_skb(dev);
+		ipg_nic_rx_free_skb(dev, entry);
 	} else {
 		IPG_DEV_KFREE_SKB(jumbo->skb);
 		jumbo->FoundStart = 0;
@@ -1337,7 +1336,7 @@ static void ipg_nic_rx_no_start_no_end(struct net_device *dev,
 				}
 			}
 			dev->last_rx = jiffies;
-			ipg_nic_rx_free_skb(dev);
+			ipg_nic_rx_free_skb(dev, entry);
 		}
 	} else {
 		IPG_DEV_KFREE_SKB(jumbo->skb);
--

From: linux
Date: Tuesday, January 8, 2008 - 6:19 am

I take that back.  This patch does NOT fix the leak, at least if
ping: sendmsg: No buffer space available
is any indication...

I think I was reading slabinfo wrong.
kmalloc-2048       42111  42112   2048    4    2 : tunables    0    0    0 : slabdata  10528  10528      0

Sorry for the false hope.
--

From: Francois Romieu
Date: Tuesday, January 8, 2008 - 2:36 pm

Can you try the patch below ?

diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index dbd23bb..c304e5c 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -860,7 +860,7 @@ static void ipg_nic_txfree(struct net_device *dev)
 	void __iomem *ioaddr = sp->ioaddr;
 	unsigned int curr;
 	u64 txd_map;
-	unsigned int released, pending;
+	unsigned int released, pending, dirty;
 
 	txd_map = (u64)sp->txd_map;
 	curr = ipg_r32(TFD_LIST_PTR_0) -
@@ -869,9 +869,9 @@ static void ipg_nic_txfree(struct net_device *dev)
 	IPG_DEBUG_MSG("_nic_txfree\n");
 
 	pending = sp->tx_current - sp->tx_dirty;
+	dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
 
 	for (released = 0; released < pending; released++) {
-		unsigned int dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
 		struct sk_buff *skb = sp->TxBuff[dirty];
 		struct ipg_tx *txfd = sp->txd + dirty;
 
@@ -898,6 +898,7 @@ static void ipg_nic_txfree(struct net_device *dev)
 
 			sp->TxBuff[dirty] = NULL;
 		}
+		dirty = (dirty + 1) % IPG_TFDLIST_LENGTH;
 	}
 
 	sp->tx_dirty += released;
-- 
1.5.3.3

--

From: David Miller
Date: Tuesday, January 8, 2008 - 4:00 pm

From: Francois Romieu <romieu@fr.zoreil.com>

Same kind of bug as the RX side :-)  I bet this fixes his
problem...
--

From: Francois Romieu
Date: Tuesday, January 8, 2008 - 4:28 pm

David Miller <davem@davemloft.net> :

I am not sure but the Rx side is probably just here to distract
from the real problem. Please don't ask... :o)

Anyway I'll poke an adapter in the test computer and give it a
try tomorrow. Nobody will complain if I crash it.

-- 
Ueimor
--

From: linux
Date: Tuesday, January 8, 2008 - 5:38 pm

> Can you try the patch below ?

Testing now... (I presume you noticed the one-character typo in my
earlier patch.  That should be "mc = mc->next", not "mv = mc->next".)

That doesn't seem to do it.  Not entirely, at least.  After downloading
and partially re-uploading an 800M file, slabtop reports:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
341576 341574  99%    0.50K  42697        8    170788K kmalloc-512
342006 341953  99%    0.19K  16286       21     65144K kmalloc-192
 30592  30575  99%    2.00K   7648        4     61184K kmalloc-2048
 30213  30193  99%    0.44K   3357        9     13428K skbuff_fclone_cache
  7650   7643  99%    0.08K    150       51       600K sysfs_dir_cache
  4000   3938  98%    0.12K    125       32       500K kmalloc-128
   258    258 100%    1.15K     43        6       344K raid5-md5
   232    221  95%    1.00K     58        4       232K kmalloc-1024
  3136   3110  99%    0.06K     49       64       196K kmalloc-64
   264     80  30%    0.68K     24       11       192K ext3_inode_cache

The "kmalloc-2048" was down in the noise before the upload started.
This is in single-user mode, after sync and echo 3 > /proc/sys/vm/drop_caches.


I'll have to try this after this evening's social plans, but I'm thinking
of implementing more rapid bug detection: explicitly zero the sp->TxBuff
slot when the skb is freed, and check that it is zero before putting
anything else in there.  (And likewise for RxBuff.)

That way, I don't have to use up a noticeable amount of memory to see
the bug and reboot to clear up the damage each test cycle.
--

From: David Miller
Date: Wednesday, January 9, 2008 - 1:39 am

From: linux@horizon.com

Ok, I'll let you and Francois work out how to fix this for
good.

Please submit just the outright leak bug fixes once this is
all resolved.  All of that code cleanup stuff needs to wait
until later, let's fix bugs before adding new ones. :-)

Thanks.
--

From: Francois Romieu
Date: Wednesday, January 9, 2008 - 4:34 pm

David Miller <davem@davemloft.net> :

Yes.

I should be able to test your r8169 NAPI changes tomorrow.

-- 
Ueimor
--

From: David Miller
Date: Wednesday, January 9, 2008 - 4:56 pm

From: Francois Romieu <romieu@fr.zoreil.com>

Thank you.
--

From: Francois Romieu
Date: Wednesday, January 9, 2008 - 4:30 pm

linux@horizon.com <linux@horizon.com> :

Ok, enjoy this one. It is definitely better wrt the current problem.

More work tomorrow.

diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index dbd23bb..42f300d 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -860,18 +860,18 @@ static void ipg_nic_txfree(struct net_device *dev)
 	void __iomem *ioaddr = sp->ioaddr;
 	unsigned int curr;
 	u64 txd_map;
-	unsigned int released, pending;
+	unsigned int released, pending, dirty;
 
 	txd_map = (u64)sp->txd_map;
 	curr = ipg_r32(TFD_LIST_PTR_0) -
 		do_div(txd_map, sizeof(struct ipg_tx)) - 1;
 
 	IPG_DEBUG_MSG("_nic_txfree\n");
 
 	pending = sp->tx_current - sp->tx_dirty;
+	dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
 
 	for (released = 0; released < pending; released++) {
-		unsigned int dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
 		struct sk_buff *skb = sp->TxBuff[dirty];
 		struct ipg_tx *txfd = sp->txd + dirty;
 
@@ -882,8 +884,11 @@ static void ipg_nic_txfree(struct net_device *dev)
 		 * If the TFDDone bit is set, free the associated
 		 * buffer.
 		 */
-		if (dirty == curr)
+		if (!(txfd->tfc & cpu_to_le64(IPG_TFC_TFDDONE))) {
+			printk(KERN_INFO "%s: released = %d pending = %d\n",
+				dev->name, released, pending);
 			break;
+		}
 
 		/* Setup TFDDONE for compatible issue. */
 		txfd->tfc |= cpu_to_le64(IPG_TFC_TFDDONE);
@@ -898,6 +903,7 @@ static void ipg_nic_txfree(struct net_device *dev)
 
 			sp->TxBuff[dirty] = NULL;
 		}
+		dirty = (dirty + 1) % IPG_TFDLIST_LENGTH;
 	}
 
 	sp->tx_dirty += released;
@@ -1943,10 +1948,7 @@ static int ipg_nic_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 */
 	if (sp->tenmbpsmode)
 		txfd->tfc |= cpu_to_le64(IPG_TFC_TXINDICATE);
-	else if (!((sp->tx_current - sp->tx_dirty + 1) >
-	    IPG_FRAMESBETWEENTXDMACOMPLETES)) {
-		txfd->tfc |= cpu_to_le64(IPG_TFC_TXDMAINDICATE);
-	}
+	txfd->tfc |= cpu_to_le64(IPG_TFC_TXDMAINDICATE);
 	/* Based on compilation option, determine if FCS is to be
 	 * ...
From: linux
Subject: ipg.c bugs
Date: Thursday, January 10, 2008 - 12:28 am

I'm just about to test that second memory leak patch, but I gave the
original code a careful reading, and found a few problems:

* Huge monstrous glaring bug

  In ipg_interrupt_handler the code to habdle a shared interrupt
  not caused by this device:
	if (!(status & IPG_IS_RSVD_MASK))
		goto out_enable
  is *before* spin_lock(&sp->lock), but the code following
  out_enable does spin_unlock(&sp->lock).

  Thus, the sp->lock is all f*ed up.  The lack of any sort of
  locking between the interrupt handler and hard_start_xmit
  could cause all sort of issues.

  I'm not actually sure if it's even necessary; I'd think some
  suitable atomic access to sp->tx_current would suffice.

* Lesser bugs

  There's a general pattern of loops over the range from
  s->rx_current to sp->rx_dirty.  Some of them are call code
  that refers to s->rx_current, even though that hasn't been
  updated yet.

  One instance is in ipg_nic_check_frame_type.
  A second is in ipg_nic_check_error.

  In ipg_nic_set_multicast(), the code to enable the multicast flags
  is of the form "if (dev->flags & IFF_MULTICAST & (dev->mc_count > ...))".
  But IFF_MULTI CAST is not 1, so this will always be false.
  The seond & needs to be && (2x).


  In ipg_io_config(), there's
	/* Transmitter and receiver must be disabled before setting
	 * IFSSelect.
	 */
	ipg_w32((origmacctrl & (IPG_MC_RX_DISABLE | IPG_MC_TX_DISABLE)) &
		IPG_MC_RSVD_MASK, MAC_CTRL);
  I don't know what's going on there, but unless the IPG_MC_RX_DISABLE
  bit is already set in origmacctrl, that's going to write 0, which
  won't disable anything.

  Immediately following, there's some similarly buggy code doing something
  I don't understand with IPG_MC_IFS_96BIT.


  The setting of curr in ipg_nic_txfree, with that bizarre do_div, can't
  possibly be working right.


* Possible bugs

  I'm not very sanguine about the handling in init_rfdlist, of the
  code that handles a failed ipg_get_rxbuff.  In particular, it leaves
  ...
Previous thread: [PATCH] netns: CLONE_NEWNET don't use the same clone flag as the pid namespace. by Eric W. Biederman on Thursday, September 27, 2007 - 3:40 pm. (2 messages)

Next thread: Re: [PATCH] e1000: Add device IDs of blade version of the 82571 quad port by Jeff Garzik on Thursday, September 27, 2007 - 8:38 pm. (1 message)