I've recently switched some of my boxes from a 32 to a 64 bit kernel. These are usenet server boxes that do a lot of I/O. They are running 2.6.24 / 2.6.25 Every 15 minutes a cronjob calls a management utility, tw_cli, to read the raid status of the 3ware disk arrays. That often fails with a segmentation violation .. tw_cli: page allocation failure. order:0, mode:0x10d0 Pid: 9296, comm: tw_cli Not tainted 2.6.25.4 #2 Call Trace: [<ffffffff802604b6>] __alloc_pages+0x336/0x390 [<ffffffff80210ff4>] dma_alloc_pages+0x24/0xa0 [<ffffffff80211113>] dma_alloc_coherent+0xa3/0x2e0 [<ffffffff8804a58f>] :3w_9xxx:twa_chrdev_ioctl+0x11f/0x810 [<ffffffff802826c0>] chrdev_open+0x0/0x1c0 [<ffffffff8027d997>] __dentry_open+0x197/0x210 [<ffffffff8028c4ed>] vfs_ioctl+0x7d/0xa0 [<ffffffff8028c584>] do_vfs_ioctl+0x74/0x2d0 [<ffffffff8028c829>] sys_ioctl+0x49/0x80 [<ffffffff8020b29b>] system_call_after_swapgs+0x7b/0x80 Mem-info: DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 usd: 0 CPU 2: hi: 0, btch: 1 usd: 0 CPU 3: hi: 0, btch: 1 usd: 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 60 CPU 1: hi: 186, btch: 31 usd: 185 CPU 2: hi: 186, btch: 31 usd: 176 CPU 3: hi: 186, btch: 31 usd: 165 Normal per-cpu: CPU 0: hi: 186, btch: 31 usd: 120 CPU 1: hi: 186, btch: 31 usd: 164 CPU 2: hi: 186, btch: 31 usd: 177 CPU 3: hi: 186, btch: 31 usd: 182 Active:265929 inactive:1657355 dirty:663189 writeback:62890 unstable:0 free:49079 slab:65923 mapped:1238 pagetables:927 bounce:0 DMA free:12308kB min:184kB low:228kB high:276kB active:0kB inactive:0kB present:11816kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 3255 8053 8053 DMA32 free:94200kB min:52912kB low:66140kB high:79368kB active:440616kB inactive:2505772kB present:3333792kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 4797 4797 Normal free:86792kB min:77968kB low:97460kB high:116952kB active:623100kB ...
probably andi has a better idea on why it was added, since it used to --
d_a_c() tries a couple of zones, and running the oom killer for each is inconvenient. Especially for the 16MB DMA zone which is unlikely to be cleared by the OOM killer anyways because normal user applications don't put pages in there. There was a real report with some problems in this area. Also for the earlier tries you don't want to really bring the system into swap. Mask allocator would clean most of that up. -Andi --
I understand, but I do think using __GFP_NORETRY causes problems. Most drivers call pci_alloc_consistent() which calls dma_alloc_coherent(.... GFP_ATOMIC) which can dip deep into reserves so it won't fail so easily. Just a handful use dma_alloc_coherent() directly. However, in 2.6.26-rc1, dpt_i2o.c was updated for 64 bit support, and all it's kmalloc(.... GFP_KERNEL) + virt_to_bus() calls have been replaced by dma_alloc_coherent(.... GFP_KERNEL). In that case, it's not a very good idea to add __GFP_NORETRY. It will cause problems. It certainly does in 3w-xxxx.c and it probably will cause worse problems in dpt_i2o.c. I think we should do something. How about one of these two patches. # ----- linux-2.6.26-d_a_c-fix-noretry.patch diff -ruN linux-2.6.26-rc3.orig/arch/x86/kernel/pci-dma.c linux-2.6.26-rc3/arch/x86/kernel/pci-dma.c --- linux-2.6.26-rc3.orig/arch/x86/kernel/pci-dma.c 2008-05-18 23:36:41.000000000 +0200 +++ linux-2.6.26-rc3/arch/x86/kernel/pci-dma.c 2008-05-22 21:21:37.000000000 +0200 @@ -398,7 +398,8 @@ return NULL; /* Don't invoke OOM killer */ - gfp |= __GFP_NORETRY; + if (!(gfp & __GFP_WAIT)) + gfp |= __GFP_NORETRY; #ifdef CONFIG_X86_64 /* Why <=? Even when the mask is smaller than 4GB it is often # ----- linux-2.6.26-gfp-no-oom.patch diff -ruN linux-2.6.26-rc3.orig/arch/x86/kernel/pci-dma.c linux-2.6.26-rc3/arch/x86/kernel/pci-dma.c --- linux-2.6.26-rc3.orig/arch/x86/kernel/pci-dma.c 2008-05-18 23:36:41.000000000 +0200 +++ linux-2.6.26-rc3/arch/x86/kernel/pci-dma.c 2008-05-22 20:42:10.000000000 +0200 @@ -398,7 +398,7 @@ return NULL; /* Don't invoke OOM killer */ - gfp |= __GFP_NORETRY; + gfp |= __GFP_NO_OOM; #ifdef CONFIG_X86_64 /* Why <=? Even when the mask is smaller than 4GB it is often diff -ruN linux-2.6.26-rc3.orig/include/linux/gfp.h linux-2.6.26-rc3/include/linux/gfp.h --- linux-2.6.26-rc3.orig/include/linux/gfp.h 2008-05-18 23:36:41.000000000 +0200 +++ ...
And Andi wrote: So how about linux-2.6.26-gfp-no-oom.patch (see previous mail) for 2.6.26 ? Mike. --
Changing the gfp once globally like you did is not right, because the different fallback cases have to be handled differently (see the different cases I discussed in my earlier mail) Especially the 16MB zone allocation should never trigger the OOM killer. That could be special cased, but __GFP_NO_OOM_KILLER is likely better as a short term fix although I'm still not 100% sure what implications it will have to do more VM replies in the early fallbacks. -Andi --
On Sun, 25 May 2008 18:35:39 +0200 That depends how much memory you have. Alan --
No it doesn't because the lower zone protection basically never puts anything that is not GFP_DMA into the 16MB zone. Just check yourself on your machine using sysrq. That was one of the motivations behind the mask allocator design. -Andi --
Can you give some pointers please ? Thanks, tglx --
To the bug report? Memory is fuzzy, but I think it was some SUSE bugzilla report, might have been for SLES. Anyways the reasoning is still valid. Longer term the mask allocator would be the right fix, shorter term a new GFP flag as proposed sounds reasonable. The trick is just that you need different __GFP_ flags for the different allocations. e.g. the first the "higher zone" quick try should continue to use __GFP_NORETRY. And the 16MB one should too. It would only make sense for the main request. In the mask allocator patchkit kernel it should be also ok already. -Andi --
