Uniprocessor Althlon 64, 64-bit kernel, 2G ECC RAM, 2.6.23-rc8 + linuxpps (5.0.0) + ip1000a driver. (patch from http://marc.info/?l=linux-netdev&m=118980588419882) After a few hours of operation, ntp loses the ability to send packets. sendto() returns -EAGAIN to everything, including the 24-byte UDP packet that is a response to ntpq. -EAGAIN on a sendto() makes me think of memory problems, so here's meminfo at the time: ### FAILED state ### # cat /proc/meminfo MemTotal: 2059384 kB MemFree: 15332 kB Buffers: 665608 kB Cached: 18212 kB SwapCached: 0 kB Active: 380384 kB Inactive: 355020 kB SwapTotal: 5855208 kB SwapFree: 5854552 kB Dirty: 28504 kB Writeback: 0 kB AnonPages: 51608 kB Mapped: 11852 kB Slab: 1285348 kB SReclaimable: 152968 kB SUnreclaim: 1132380 kB PageTables: 3888 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 6884900 kB Committed_AS: 590528 kB VmallocTotal: 34359738367 kB VmallocUsed: 265628 kB VmallocChunk: 34359472059 kB Killing and restarting ntpd gets it running again for a few hours. Here's after about two hours of successful operation. (I'll try to remember to run slabinfo before killing ntpd next time.) ### WORKING state ### # cat /proc/meminfo MemTotal: 2059384 kB MemFree: 20252 kB Buffers: 242688 kB Cached: 41556 kB SwapCached: 200 kB Active: 285012 kB Inactive: 147348 kB SwapTotal: 5855208 kB SwapFree: 5854212 kB Dirty: 36 kB Writeback: 0 kB AnonPages: 148052 kB Mapped: 12756 kB Slab: 1582512 kB SReclaimable: 134348 kB SUnreclaim: 1448164 kB PageTables: 4500 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 6884900 kB Committed_AS: 689956 kB VmallocTotal: 34359738367 kB VmallocUsed: 265628 kB VmallocChunk: 34359472059 kB # ...
> ntpd. Sounds like pps leaking to me. That's what I'd think, except that pps does no allocation in the normal running state, so there's nothing to leak. The interrupt path just records the time in some preallocated, static buffers and wakes up blocked readers. The read path copies the latest data out of those static buffers. There's allocation when the PPS device is created, Ah, thanks you; I've been using SLUB which doesn't support this option. Here's what I've extracted. I've only presented the top few slab_allocators and a small subset of the oom-killer messages, but I have full copies if desired. Unfortunately, I've discovered that the machine doesn't live in this unhappy state forever. Indeed, I'm not sure if killing ntpd "fixes" anything; my previous observations may have been optimistic ignorance. (For my own personal reference looking for more oom-kill, I nuked ntpd at 06:46:56. And the oom-kills are continuing, with the latest at 07:43:52.) Anyway, I have a bunch of information from the slab_allocators file, but I'm not quire sure how to make sense of it. With a machine in the unhappy state and firing the OOM killer, the top 20 slab_allocators are: $ sort -rnk2 /proc/slab_allocators | head -20 skbuff_head_cache: 1712746 __alloc_skb+0x31/0x121 size-512: 1706572 tcp_send_ack+0x23/0x102 skbuff_fclone_cache: 149113 __alloc_skb+0x31/0x121 size-2048: 148500 tcp_sendmsg+0x1b5/0xae1 sysfs_dir_cache: 5289 sysfs_new_dirent+0x4b/0xec size-512: 2613 sock_alloc_send_skb+0x93/0x1dd Acpi-Operand: 2014 acpi_ut_allocate_object_desc_dbg+0x34/0x6e size-32: 1995 sysfs_new_dirent+0x29/0xec vm_area_struct: 1679 mmap_region+0x18f/0x421 size-512: 1618 tcp_xmit_probe_skb+0x1f/0xcd size-512: 1571 arp_create+0x4e/0x1cd vm_area_struct: 1544 copy_process+0x9f1/0x1108 anon_vma: 1448 anon_vma_prepare+0x29/0x74 filp: 1201 get_empty_filp+0x44/0xcd UDP: 1173 sk_alloc+0x25/0xaf size-128: 1048 r1bio_pool_alloc+0x23/0x3b size-128: 1024 nfsd_cache_init+0x2d/0xcf Acpi-Namespace: ...
Absolutely. Normally a driver's transmit completion interrupt handler will run dev_kfree_skb_irq() against the skbs which have been fully sent. However it'd be darned odd if the driver was leaking only tcp acks. I can find no occurrence of "dev_kfree_skb" in drivers/net/ipg.c, which is suspicious. I assume that meminfo was not captured when the system was ooming? There isn't much slab there. -
> OK. Did you try to reproduce it without the pps patch applied? No. But I've yanked the ip1000a driver (using old crufy vendor-supplied It's leaking lots of things... you can see ARP packets in there and all sorts of stuff. But the big traffic hog is BackupPC doing inbound rsyncs all night long, which generates a lot of acks. Those are the Look for "IPG_DEV_KFREE_SKB", which is a wrapper macro. (Or just add "-i" to your grep.) It should probably be deleted (it just expands to As I wrote originally, I got it from http://marc.info/?l=linux-netdev&m=118980588419882 which was a reuqest for mainline submission. If there are other patches floating around, I'm happy to try them. Oops, sorry. I captured slabinfo but not meminfo. Thank you very much! Sorry to jump the gun and post a lot before I had all the data, but if it WAS a problem in -rc8, I wanted to mention it before -final. Now, the rush is to get the ip1000a driver fixed before the merge window opens. I've added all the ip1000a developers to the Cc: list in an attempt to speed that up. -
Just to keep the issue open, drivers/net/ipg.c currently in 2.6.24-rc6 still leaks skbuffs like a sieve. Run it for a few hours with network traffic and the machine swaps like crazy while the oom killer goes nuts. diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index d9107e5..4fa392c 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -172,6 +172,10 @@ config IP1000 select MII ---help--- This driver supports IP1000 gigabit Ethernet cards. + It works, but suffers from a memory leak. Signifcant + use will consume unswappable kernel memory until the + machine runs out of memory and crashes. Thus, this + driver cannot be considered usable at the the present time. To compile this driver as a module, choose M here: the module will be called ipg. This is recommended. Or should it be demoted to BROKEN? It compiles, and sends and receives packets, which is better than a lot of BROKEN drivers. --
From: linux@horizon.com This is not how we handle and track bugs. Such a patch is inappropriate, and I'd like to ask that you just be patient until someone has a chance to try and figure out what the problem is. Or even better, you can try to track down the problem yourself since you seem to have a specific interest in this problem. Thanks. --
From: David Miller <davem@davemloft.net>
Actually, the bug is amazingly obvious after a quick scan of this
driver.
ipg_nic_rx_free_skb() is called from various places and is given zero
context to work with. It assumes that the caller wants
"sp->rx_current % IPG_RFCLIST_LENGTH" to be freed.
But that's not right in most cases. For example, consider the call in
ipg_nic_rx_with_end(). This function is invoked from ipg_nic_rx()
like so:
unsigned int curr = sp->rx_current;
...
for (i = 0; i < IPG_MAXRFDPROCESS_COUNT; i++, curr++) {
unsigned int entry = curr % IPG_RFDLIST_LENGTH;
struct ipg_rx *rxfd = sp->rxd + entry;
if (!(rxfd->rfs & le64_to_cpu(IPG_RFS_RFDDONE)))
break;
switch (ipg_nic_rx_check_frame_type(dev)) {
...
case Frame_WithEnd:
ipg_nic_rx_with_end(dev, tp, rxfd, entry);
break;
...
}
}
sp->rx_current = curr;
So sp->rx_current does not correspond to the packet being processed
currently, so ipg_nic_rx_free_skb() will only look at and try to free
only the first packet the above loop tries to processe.
WOW!!!! Amazing!!!
I invested 30 seconds of code reading to figure out the leak. A much
better investment of time than adding bogus comments to the Kconfig
help text don't you think? :-)
--
David Miller <davem@davemloft.net> : Thanks for the hint David. I'll roll up a patch for it after the day work. -- Ueimor --
Prompted by davem, this attempt at fixing the memory leak
actually appears to work. At least, leaving ping -f -s1472 -l64
running doesn't drop packets and doesn't show up in /proc/slabinfo.
---
diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index dbd23bb..a0dfba5 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -1110,10 +1110,9 @@ enum {
Frame_WithStart_WithEnd = 11
};
-inline void ipg_nic_rx_free_skb(struct net_device *dev)
+inline void ipg_nic_rx_free_skb(struct net_device *dev, unsigned entry)
{
struct ipg_nic_private *sp = netdev_priv(dev);
- unsigned int entry = sp->rx_current % IPG_RFDLIST_LENGTH;
if (sp->RxBuff[entry]) {
struct ipg_rx *rxfd = sp->rxd + entry;
@@ -1308,7 +1307,7 @@ static void ipg_nic_rx_with_end(struct net_device *dev,
jumbo->CurrentSize = 0;
jumbo->skb = NULL;
- ipg_nic_rx_free_skb(dev);
+ ipg_nic_rx_free_skb(dev, entry);
} else {
IPG_DEV_KFREE_SKB(jumbo->skb);
jumbo->FoundStart = 0;
@@ -1337,7 +1336,7 @@ static void ipg_nic_rx_no_start_no_end(struct net_device *dev,
}
}
dev->last_rx = jiffies;
- ipg_nic_rx_free_skb(dev);
+ ipg_nic_rx_free_skb(dev, entry);
}
} else {
IPG_DEV_KFREE_SKB(jumbo->skb);
--
I take that back. This patch does NOT fix the leak, at least if ping: sendmsg: No buffer space available is any indication... I think I was reading slabinfo wrong. kmalloc-2048 42111 42112 2048 4 2 : tunables 0 0 0 : slabdata 10528 10528 0 Sorry for the false hope. --
Can you try the patch below ?
diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index dbd23bb..c304e5c 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -860,7 +860,7 @@ static void ipg_nic_txfree(struct net_device *dev)
void __iomem *ioaddr = sp->ioaddr;
unsigned int curr;
u64 txd_map;
- unsigned int released, pending;
+ unsigned int released, pending, dirty;
txd_map = (u64)sp->txd_map;
curr = ipg_r32(TFD_LIST_PTR_0) -
@@ -869,9 +869,9 @@ static void ipg_nic_txfree(struct net_device *dev)
IPG_DEBUG_MSG("_nic_txfree\n");
pending = sp->tx_current - sp->tx_dirty;
+ dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
for (released = 0; released < pending; released++) {
- unsigned int dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
struct sk_buff *skb = sp->TxBuff[dirty];
struct ipg_tx *txfd = sp->txd + dirty;
@@ -898,6 +898,7 @@ static void ipg_nic_txfree(struct net_device *dev)
sp->TxBuff[dirty] = NULL;
}
+ dirty = (dirty + 1) % IPG_TFDLIST_LENGTH;
}
sp->tx_dirty += released;
--
1.5.3.3
--
From: Francois Romieu <romieu@fr.zoreil.com> Same kind of bug as the RX side :-) I bet this fixes his problem... --
David Miller <davem@davemloft.net> : I am not sure but the Rx side is probably just here to distract from the real problem. Please don't ask... :o) Anyway I'll poke an adapter in the test computer and give it a try tomorrow. Nobody will complain if I crash it. -- Ueimor --
> Can you try the patch below ? Testing now... (I presume you noticed the one-character typo in my earlier patch. That should be "mc = mc->next", not "mv = mc->next".) That doesn't seem to do it. Not entirely, at least. After downloading and partially re-uploading an 800M file, slabtop reports: OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 341576 341574 99% 0.50K 42697 8 170788K kmalloc-512 342006 341953 99% 0.19K 16286 21 65144K kmalloc-192 30592 30575 99% 2.00K 7648 4 61184K kmalloc-2048 30213 30193 99% 0.44K 3357 9 13428K skbuff_fclone_cache 7650 7643 99% 0.08K 150 51 600K sysfs_dir_cache 4000 3938 98% 0.12K 125 32 500K kmalloc-128 258 258 100% 1.15K 43 6 344K raid5-md5 232 221 95% 1.00K 58 4 232K kmalloc-1024 3136 3110 99% 0.06K 49 64 196K kmalloc-64 264 80 30% 0.68K 24 11 192K ext3_inode_cache The "kmalloc-2048" was down in the noise before the upload started. This is in single-user mode, after sync and echo 3 > /proc/sys/vm/drop_caches. I'll have to try this after this evening's social plans, but I'm thinking of implementing more rapid bug detection: explicitly zero the sp->TxBuff slot when the skb is freed, and check that it is zero before putting anything else in there. (And likewise for RxBuff.) That way, I don't have to use up a noticeable amount of memory to see the bug and reboot to clear up the damage each test cycle. --
From: linux@horizon.com Ok, I'll let you and Francois work out how to fix this for good. Please submit just the outright leak bug fixes once this is all resolved. All of that code cleanup stuff needs to wait until later, let's fix bugs before adding new ones. :-) Thanks. --
David Miller <davem@davemloft.net> : Yes. I should be able to test your r8169 NAPI changes tomorrow. -- Ueimor --
From: Francois Romieu <romieu@fr.zoreil.com> Thank you. --
linux@horizon.com <linux@horizon.com> : Ok, enjoy this one. It is definitely better wrt the current problem. More work tomorrow. diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c index dbd23bb..42f300d 100644 --- a/drivers/net/ipg.c +++ b/drivers/net/ipg.c @@ -860,18 +860,18 @@ static void ipg_nic_txfree(struct net_device *dev) void __iomem *ioaddr = sp->ioaddr; unsigned int curr; u64 txd_map; - unsigned int released, pending; + unsigned int released, pending, dirty; txd_map = (u64)sp->txd_map; curr = ipg_r32(TFD_LIST_PTR_0) - do_div(txd_map, sizeof(struct ipg_tx)) - 1; IPG_DEBUG_MSG("_nic_txfree\n"); pending = sp->tx_current - sp->tx_dirty; + dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH; for (released = 0; released < pending; released++) { - unsigned int dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH; struct sk_buff *skb = sp->TxBuff[dirty]; struct ipg_tx *txfd = sp->txd + dirty; @@ -882,8 +884,11 @@ static void ipg_nic_txfree(struct net_device *dev) * If the TFDDone bit is set, free the associated * buffer. */ - if (dirty == curr) + if (!(txfd->tfc & cpu_to_le64(IPG_TFC_TFDDONE))) { + printk(KERN_INFO "%s: released = %d pending = %d\n", + dev->name, released, pending); break; + } /* Setup TFDDONE for compatible issue. */ txfd->tfc |= cpu_to_le64(IPG_TFC_TFDDONE); @@ -898,6 +903,7 @@ static void ipg_nic_txfree(struct net_device *dev) sp->TxBuff[dirty] = NULL; } + dirty = (dirty + 1) % IPG_TFDLIST_LENGTH; } sp->tx_dirty += released; @@ -1943,10 +1948,7 @@ static int ipg_nic_hard_start_xmit(struct sk_buff *skb, struct net_device *dev) */ if (sp->tenmbpsmode) txfd->tfc |= cpu_to_le64(IPG_TFC_TXINDICATE); - else if (!((sp->tx_current - sp->tx_dirty + 1) > - IPG_FRAMESBETWEENTXDMACOMPLETES)) { - txfd->tfc |= cpu_to_le64(IPG_TFC_TXDMAINDICATE); - } + txfd->tfc |= cpu_to_le64(IPG_TFC_TXDMAINDICATE); /* Based on compilation option, determine if FCS is to be * ...
I'm just about to test that second memory leak patch, but I gave the original code a careful reading, and found a few problems: * Huge monstrous glaring bug In ipg_interrupt_handler the code to habdle a shared interrupt not caused by this device: if (!(status & IPG_IS_RSVD_MASK)) goto out_enable is *before* spin_lock(&sp->lock), but the code following out_enable does spin_unlock(&sp->lock). Thus, the sp->lock is all f*ed up. The lack of any sort of locking between the interrupt handler and hard_start_xmit could cause all sort of issues. I'm not actually sure if it's even necessary; I'd think some suitable atomic access to sp->tx_current would suffice. * Lesser bugs There's a general pattern of loops over the range from s->rx_current to sp->rx_dirty. Some of them are call code that refers to s->rx_current, even though that hasn't been updated yet. One instance is in ipg_nic_check_frame_type. A second is in ipg_nic_check_error. In ipg_nic_set_multicast(), the code to enable the multicast flags is of the form "if (dev->flags & IFF_MULTICAST & (dev->mc_count > ...))". But IFF_MULTI CAST is not 1, so this will always be false. The seond & needs to be && (2x). In ipg_io_config(), there's /* Transmitter and receiver must be disabled before setting * IFSSelect. */ ipg_w32((origmacctrl & (IPG_MC_RX_DISABLE | IPG_MC_TX_DISABLE)) & IPG_MC_RSVD_MASK, MAC_CTRL); I don't know what's going on there, but unless the IPG_MC_RX_DISABLE bit is already set in origmacctrl, that's going to write 0, which won't disable anything. Immediately following, there's some similarly buggy code doing something I don't understand with IPG_MC_IFS_96BIT. The setting of curr in ipg_nic_txfree, with that bizarre do_div, can't possibly be working right. * Possible bugs I'm not very sanguine about the handling in init_rfdlist, of the code that handles a failed ipg_get_rxbuff. In particular, it leaves ...
