Hi list, Testing 2.6.35-rc1 on my powerpc based routerboard I saw the following page allocation error happening during an apt-get update with a semi loaded wlan interface [309611.189267] __alloc_pages_slowpath: 52 callbacks suppressed [309611.194959] gzip: page allocation failure. order:1, mode:0x4020 [309611.200981] Call Trace: [309611.203547] [c399bc50] [c0008144] show_stack+0x48/0x15c (unreliable) [309611.210041] [c399bc80] [c006268c] __alloc_pages_nodemask+0x3d4/0x52c [309611.216512] [c399bd20] [c008619c] __slab_alloc+0x560/0x570 [309611.222111] [c399bd60] [c0086a98] __kmalloc_track_caller+0xd4/0x104 [309611.228505] [c399bd80] [c01dd220] __alloc_skb+0x64/0x124 [309611.233944] [c399bda0] [c994e034] ath_rxbuf_alloc+0x34/0xbc [ath] [309611.240178] [c399bdc0] [c9a1ec9c] ath_rx_tasklet+0x480/0x7c4 [ath9k] [309611.246658] [c399be80] [c9a1dae0] ath9k_tasklet+0x114/0x13c [ath9k] [309611.253055] [c399bea0] [c002532c] tasklet_action+0x88/0x104 [309611.258746] [c399bec0] [c0025e30] __do_softirq+0xb4/0x134 [309611.264261] [c399bf00] [c0005ec4] do_softirq+0x58/0x5c [309611.269514] [c399bf10] [c0025c20] irq_exit+0x7c/0x9c [309611.274591] [c399bf20] [c0005f64] do_IRQ+0x9c/0xb4 [309611.279509] [c399bf40] [c00117d8] ret_from_except+0x0/0x14 [309611.285112] --- Exception: 501 at 0xff31f0c [309611.285121] LR = 0xff32548 [309611.292536] Mem-Info: [309611.294899] DMA per-cpu: [309611.297528] CPU 0: hi: 42, btch: 7 usd: 18 [309611.302444] active_anon:1040 inactive_anon:1160 isolated_anon:0 [309611.302455] active_file:14871 inactive_file:9440 isolated_file:0 [309611.302467] unevictable:491 dirty:1258 writeback:0 unstable:0 [309611.302478] free:628 slab_reclaimable:832 slab_unreclaimable:2312 [309611.302490] mapped:2254 shmem:36 pagetables:202 bounce:0 [309611.332409] DMA free:2512kB min:1440kB low:1800kB high:2160kB active_anon:4160kB inactive_anon:4640kB active_file:59484kB inactive_file:37760kB unevictable:1964kB isolated(anon):0kB isolated(file):0kB present:130048kB ...
order-1 allocations are unfortunate, since this hardware should use order-0 ones if possible, and it seems it was its goal. 3872 (0xF20) comes from #define IEEE80211_MAX_MPDU_LEN (3840 + FCS_LEN + (IEEE80211_WEP_IVLEN + \ IEEE80211_WEP_KIDLEN + \ IEEE80211_WEP_CRCLEN)) common->rx_bufsize = roundup(IEEE80211_MAX_MPDU_LEN + ah->caps.rx_status_len, min(common->cachelsz, (u16)64)); Then __dev_alloc_skb() adds two more blocs : NET_SKB_PAD (64 bytes on your platform ?) sizeof(struct skb_shared_info) (on 32bit : 0x104 ... oh well that might be the problem : it is rounded to 0x140) And ath driver adds common->cachelsz (I dont know its value) -> more than 4096 bytes 1) Maybe rx_bufsize should not include the roundup() since ath_rxbuf_alloc() also do an alignment adjustment ? 2) We should try to reduce skb_shared_info by four bytes. Could you try this patch ? We make sure rx_bufsize + various overhead <= PAGE_SIZE But I am not sure its legal for the hardware... diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c index ca6065b..0a0dc3a 100644 --- a/drivers/net/wireless/ath/ath9k/recv.c +++ b/drivers/net/wireless/ath/ath9k/recv.c @@ -226,10 +226,10 @@ static int ath_rx_edma_init(struct ath_softc *sc, int nbufs) u32 size; - common->rx_bufsize = roundup(IEEE80211_MAX_MPDU_LEN + - ah->caps.rx_status_len, - min(common->cachelsz, (u16)64)); - + size = roundup(IEEE80211_MAX_MPDU_LEN + ah->caps.rx_status_len, + min(common->cachelsz, (u16)64)); + common->rx_bufsize = max_t(u32, size, + SKB_MAX_ORDER(NET_SKB_PAD + common->cachelsz, 0)); ath9k_hw_set_rx_bufsize(ah, common->rx_bufsize - ah->caps.rx_status_len); --
I applied the patch recompiled and run it on the routerboard, trying to trigger the bug again. Kind regards, Michael --
Hi Eric, Up to now I was not able to reproduce the bug, do you think this patch can be pushed to mainline or is there a "better"/other fix for it? Kind regards, Michael --
Thanks Michael for testing. I'll submit ASAP an official patch, sent to all people involved in this driver to get their Ack (or Nack). IEEE80211_MAX_MPDU_LEN being 3840 + somebits is suspect, since it doesnt match 802.11 specs. It should be more close of 2304 + MAC header (32bytes) + FCS (4 bytes) ? --
Hello. I am running 2.5.35.3 with the above patch, and I still get these failures. Though they are much less often than without the patch. A snippet from dmesg below. Please let me know what other details I should provide. Thanks skbuff alloc of size 3872 failed java: page allocation failure. order:1, mode:0x4020 Pid: 11464, comm: java Not tainted 2.6.35.3 #3 Call Trace: [<c0243d26>] ? __alloc_pages_nodemask+0x3e6/0x513 [<c025c293>] ? __slab_alloc+0x2d7/0x2eb [<c025c946>] ? __kmalloc_track_caller+0x74/0x95 [<d09e801a>] ? ath_rxbuf_alloc+0x1a/0x78 [ath] [<d09e801a>] ? ath_rxbuf_alloc+0x1a/0x78 [ath] [<c0334097>] ? __alloc_skb+0x57/0x100 [<d09e801a>] ? ath_rxbuf_alloc+0x1a/0x78 [ath] [<d0af4100>] ? ath_rx_tasklet+0x2fb/0x808 [ath9k] [<d0cbc89f>] ? br_handle_frame+0x1b3/0x1c3 [bridge] [<c033d18a>] ? __netif_receive_skb+0x141/0x25f [<d0af23c1>] ? ath9k_tasklet+0xcc/0x107 [ath9k] [<c02195bf>] ? tasklet_action+0x5f/0x65 [<c0219873>] ? __do_softirq+0x60/0xc6 [<c0219907>] ? do_softirq+0x2e/0x30 [<c02199f9>] ? irq_exit+0x53/0x55 [<c020392c>] ? do_IRQ+0x3a/0x72 [<c0202be9>] ? common_interrupt+0x29/0x30 Mem-Info: DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 Normal per-cpu: CPU 0: hi: 90, btch: 15 usd: 32 active_anon:6158 inactive_anon:15748 isolated_anon:0 active_file:11477 inactive_file:22926 isolated_file:0 unevictable:468 dirty:5788 writeback:0 unstable:0 free:990 slab_reclaimable:3091 slab_unreclaimable:2349 mapped:1410 shmem:5 pagetables:227 bounce:0 DMA free:1000kB min:124kB low:152kB high:184kB active_anon:1304kB inactive_anon:2304kB active_file:2584kB inactive_file:4900kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15864kB mlocked:0kB dirty:0kB writeback:0kB mapped:284kB shmem:0kB slab_reclaimable:648kB slab_unreclaimable:752kB kernel_stack:152kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 238 238 Normal free:2960kB min:1908kB low:2384kB ...
Le mercredi 25 août 2010 à 13:25 -0700, Nikhil Sethi (निखिल सेठी) a
reading my former patch, I believe I mistakenly used a max_t() instead
of a min_() macro :
diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index ca6065b..0a0dc3a 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -226,10 +226,10 @@ static int ath_rx_edma_init(struct ath_softc *sc, int nbufs)
u32 size;
- common->rx_bufsize = roundup(IEEE80211_MAX_MPDU_LEN +
- ah->caps.rx_status_len,
- min(common->cachelsz, (u16)64));
-
+ size = roundup(IEEE80211_MAX_MPDU_LEN + ah->caps.rx_status_len,
+ min(common->cachelsz, (u16)64));
+ common->rx_bufsize = max_t(u32, size,
+ SKB_MAX_ORDER(NET_SKB_PAD + common->cachelsz, 0));
ath9k_hw_set_rx_bufsize(ah, common->rx_bufsize -
ah->caps.rx_status_len);
Please use instead this one :
diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index ca6065b..0a0dc3a 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -226,10 +226,10 @@ static int ath_rx_edma_init(struct ath_softc *sc, int nbufs)
u32 size;
- common->rx_bufsize = roundup(IEEE80211_MAX_MPDU_LEN +
- ah->caps.rx_status_len,
- min(common->cachelsz, (u16)64));
-
+ size = roundup(IEEE80211_MAX_MPDU_LEN + ah->caps.rx_status_len,
+ min(common->cachelsz, (u16)64));
+ common->rx_bufsize = min_t(u32, size,
+ SKB_MAX_ORDER(NET_SKB_PAD + common->cachelsz, 0));
ath9k_hw_set_rx_bufsize(ah, common->rx_bufsize -
ah->caps.rx_status_len);
--
Thanks Eric. I have not been able to reproduce this problem after applying your new patch in about 12 hours of operation. Will let you know if the bug re-appears. Thanks Nikhil --
This came back. Here is the dmesg output. It might be the problem only happens if I have wmm_enabled=1 in my hostapd.conf. That is the only thing that I changed yesterday, and today I saw these messages in my logs. I will try to conform this. I am running 2.6.35.4 with this patch: *** 225,234 **** int error = 0, i; u32 size; ! ! common->rx_bufsize = roundup(IEEE80211_MAX_MPDU_LEN + ! ah->caps.rx_status_len, ! min(common->cachelsz, (u16)64)); ath9k_hw_set_rx_bufsize(ah, common->rx_bufsize - ah->caps.rx_status_len); --- 225,234 ---- int error = 0, i; u32 size; ! size = roundup(IEEE80211_MAX_MPDU_LEN + ah->caps.rx_status_len, ! min(common->cachelsz, (u16)64)); ! common->rx_bufsize = min_t(u32, size, ! SKB_MAX_ORDER(NET_SKB_PAD + common->cachelsz, 0)); ath9k_hw_set_rx_bufsize(ah, common->rx_bufsize - ah->caps.rx_status_len); dmesg output: skbuff alloc of size 3872 failed ps: page allocation failure. order:1, mode:0x4020 Pid: 16022, comm: ps Tainted: G W 2.6.35.4 #1 Call Trace: [<c0243bfe>] ? __alloc_pages_nodemask+0x3e6/0x513 [<c025be97>] ? slob_new_pages+0xc/0x23 [<c025c0a1>] ? __kmalloc_node+0x39/0x72 [<c03338d8>] ? __alloc_skb+0x5c/0x105 [<d09fa01a>] ? ath_rxbuf_alloc+0x1a/0x78 [ath] [<d0ab7109>] ? ath_rx_tasklet+0x2fb/0x808 [ath9k] [<c030fd49>] ? scsi_next_command+0x27/0x34 [<d0ab53c5>] ? ath9k_tasklet+0xcc/0x107 [ath9k] [<c02193c3>] ? tasklet_action+0x5f/0x65 [<c0219677>] ? __do_softirq+0x60/0xc6 [<c021970b>] ? do_softirq+0x2e/0x30 [<c02197fd>] ? irq_exit+0x53/0x55 [<c020392c>] ? do_IRQ+0x3a/0x72 [<c0202be9>] ? common_interrupt+0x29/0x30 [<c0247874>] ? __remove_mapping+0x4b/0x8d [<c023ea9f>] ? try_to_release_page+0x20/0x30 [<c0247d36>] ? shrink_page_list+0x2b8/0x46c [<c03158ea>] ? sd_prep_fn+0x1a5/0x92b [<c024816b>] ? shrink_inactive_list+0x281/0x552 [<c02445d9>] ? ...
Le dimanche 05 septembre 2010 à 08:51 -0700, Nikhil Sethi (निखिल सेठी) a Unfortunately you use SLOB, and it seems to ask order-1 pages, while we want less than a page. --
Thanks Eric, for all the help. I get very occasional errors with vanilla 2.6.36 kernel with your patch. Error messages are slightly different, and very rare compared to without your patch. 2.6.35.7 seems fine, and does not show this error. rsync: page allocation failure. order:1, mode:0x4020 Pid: 31479, comm: rsync Not tainted 2.6.36-rc1 #1 Call Trace: [<c0246b2c>] ? __alloc_pages_nodemask+0x3f1/0x52e [<c0261330>] ? __slab_alloc+0x40f/0x45a [<c02617ca>] ? __kmalloc_track_caller+0x91/0x97 [<d09db01a>] ? ath_rxbuf_alloc+0x1a/0x78 [ath] [<d09db01a>] ? ath_rxbuf_alloc+0x1a/0x78 [ath] [<c033a8d0>] ? __alloc_skb+0x57/0x100 [<d09db01a>] ? ath_rxbuf_alloc+0x1a/0x78 [ath] [<d0b01f2e>] ? ath_rx_tasklet+0x680/0xcfe [ath9k] [<c02234ae>] ? __queue_work+0xe4/0x1a2 [<d0b00244>] ? ath9k_tasklet+0xd0/0x134 [ath9k] [<c0219a43>] ? tasklet_action+0x5f/0x65 [<c0219cf4>] ? __do_softirq+0x60/0xc6 [<c0219d88>] ? do_softirq+0x2e/0x30 [<c0219e7b>] ? irq_exit+0x54/0x56 [<c02038e0>] ? do_IRQ+0x3a/0x72 [<c0202bc9>] ? common_interrupt+0x29/0x30 [<c02d89a0>] ? cfq_may_queue+0x0/0x92 [<c02d1567>] ? get_request+0x121/0x23e [<c02d16a4>] ? get_request_wait+0x20/0xfd [<c02d8f16>] ? cfq_merge+0x0/0x69 [<c02cfd40>] ? elv_merge+0x14b/0x170 [<c02d2375>] ? __make_request+0x71/0x364 [<c02d1fe2>] ? generic_make_request+0x14f/0x318 [<c0243fb1>] ? mempool_alloc+0x35/0xb9 [<c02d21e9>] ? submit_bio+0x3e/0x90 [<c0281c14>] ? bio_alloc_bioset+0x33/0x9e [<c027e780>] ? submit_bh+0xbc/0xec [<c027f4e2>] ? ll_rw_block+0x72/0x74 [<c027f504>] ? __breadahead+0x20/0x30 [<c02a3a8f>] ? __ext4_get_inode_loc+0x308/0x416 [<c02a8154>] ? ext4_iget+0x52/0x652 [<c02aa327>] ? ext4_lookup+0x65/0xbf [<c026a6bb>] ? d_alloc_and_lookup+0x30/0x58 [<c026b84a>] ? do_lookup+0xc5/0xfe [<c026c239>] ? link_path_walk+0x4d1/0x8d6 [<c026c722>] ? path_walk+0x48/0xa9 [<c026c7be>] ? do_path_lookup+0x3b/0x3f [<c026d06d>] ? user_path_at+0x30/0x64 [<c0266446>] ? cp_new_stat64+0xee/0x100 [<c026664a>] ? ...
Le lundi 08 novembre 2010 à 08:25 -0800, Nikhil Sethi (निखिल सेठी) a Its very hard to follow this thread, with a message every two months. As I said, this patch got a min_t() instead of max_t() http://kerneltrap.org/mailarchive/linux-kernel/2010/8/25/4611322 If you still have kmalloc-8192 allocations, I suspect you dont use right patch. --
As I said, this seems like a regression in 2.6.36 from 2.6.35 (which is why I emailed after 2 months :) ). You mentioned earlier in the thread that you were going to submit the patch to head, if there is a separate bug/thread tracking that, I can send followups to that. I believe I have the right patch: *** 225,234 **** int error = 0, i; u32 size; ! ! common->rx_bufsize = roundup(IEEE80211_MAX_MPDU_LEN + ! ah->caps.rx_status_len, ! min(common->cachelsz, (u16)64)); ath9k_hw_set_rx_bufsize(ah, common->rx_bufsize - ah->caps.rx_status_len); --- 225,234 ---- int error = 0, i; u32 size; ! size = roundup(IEEE80211_MAX_MPDU_LEN + ah->caps.rx_status_len, ! min(common->cachelsz, (u16)64)); ! common->rx_bufsize = min_t(u32, size, ! SKB_MAX_ORDER(NET_SKB_PAD + common->cachelsz, 0)); ath9k_hw_set_rx_bufsize(ah, common->rx_bufsize - ah->caps.rx_status_len); --
