Re: 2.6.26: x86/kernel/pci_dma.c: gfp |= __GFP_NORETRY ?

Previous thread: [PATCH] i2c: adds support for i2c bus on Freescale CPM1/CPM2 controllers by Jochen Friedrich on Wednesday, May 21, 2008 - 3:43 am. (3 messages)

Next thread: Sony laptop battery has odd behavior with 2.6.26-rc3ish build. by Robin Holt on Wednesday, May 21, 2008 - 4:57 am. (5 messages)
From: Miquel van Smoorenburg
Date: Wednesday, May 21, 2008 - 4:30 am

I've recently switched some of my boxes from a 32 to a
64 bit kernel. These are usenet server boxes that do
a lot of I/O. They are running 2.6.24 / 2.6.25

Every 15 minutes a cronjob calls a management utility, tw_cli,
 to read the raid status of the 3ware disk arrays. That
often fails with a segmentation violation .. 

tw_cli: page allocation failure. order:0, mode:0x10d0
Pid: 9296, comm: tw_cli Not tainted 2.6.25.4 #2

Call Trace:
 [<ffffffff802604b6>] __alloc_pages+0x336/0x390
 [<ffffffff80210ff4>] dma_alloc_pages+0x24/0xa0
 [<ffffffff80211113>] dma_alloc_coherent+0xa3/0x2e0
 [<ffffffff8804a58f>] :3w_9xxx:twa_chrdev_ioctl+0x11f/0x810
 [<ffffffff802826c0>] chrdev_open+0x0/0x1c0
 [<ffffffff8027d997>] __dentry_open+0x197/0x210
 [<ffffffff8028c4ed>] vfs_ioctl+0x7d/0xa0
 [<ffffffff8028c584>] do_vfs_ioctl+0x74/0x2d0
 [<ffffffff8028c829>] sys_ioctl+0x49/0x80
 [<ffffffff8020b29b>] system_call_after_swapgs+0x7b/0x80

Mem-info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
CPU    2: hi:    0, btch:   1 usd:   0
CPU    3: hi:    0, btch:   1 usd:   0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  60
CPU    1: hi:  186, btch:  31 usd: 185
CPU    2: hi:  186, btch:  31 usd: 176
CPU    3: hi:  186, btch:  31 usd: 165
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd: 120
CPU    1: hi:  186, btch:  31 usd: 164
CPU    2: hi:  186, btch:  31 usd: 177
CPU    3: hi:  186, btch:  31 usd: 182
Active:265929 inactive:1657355 dirty:663189 writeback:62890 unstable:0
 free:49079 slab:65923 mapped:1238 pagetables:927 bounce:0
DMA free:12308kB min:184kB low:228kB high:276kB active:0kB inactive:0kB present:11816kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3255 8053 8053
DMA32 free:94200kB min:52912kB low:66140kB high:79368kB active:440616kB inactive:2505772kB present:3333792kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 4797 4797
Normal free:86792kB min:77968kB low:97460kB high:116952kB active:623100kB ...
From: Glauber Costa
Date: Wednesday, May 21, 2008 - 5:49 am

probably andi has a better idea on why it was added, since it used to 

--

From: Andi Kleen
Date: Thursday, May 22, 2008 - 1:47 am

d_a_c() tries a couple of zones, and running the oom killer for each
is inconvenient. Especially for the 16MB DMA zone which is unlikely
to be cleared by the OOM killer anyways because normal user applications
don't put pages in there. There was a real report with some problems
in this area. Also for the earlier tries you don't want to really
bring the system into swap.

Mask allocator would clean most of that up.

-Andi
--

From: Miquel van Smoorenburg
Date: Thursday, May 22, 2008 - 12:25 pm

I understand, but I do think using __GFP_NORETRY causes problems.

Most drivers call pci_alloc_consistent() which calls
dma_alloc_coherent(.... GFP_ATOMIC) which can dip deep into reserves so
it won't fail so easily. Just a handful use dma_alloc_coherent()
directly.

However, in 2.6.26-rc1, dpt_i2o.c was updated for 64 bit support, and
all it's kmalloc(.... GFP_KERNEL) + virt_to_bus() calls have been
replaced by dma_alloc_coherent(.... GFP_KERNEL).

In that case, it's not a very good idea to add __GFP_NORETRY. It will
cause problems. It certainly does in 3w-xxxx.c and it probably will
cause worse problems in dpt_i2o.c.

I think we should do something. How about one of these two patches.

# -----

linux-2.6.26-d_a_c-fix-noretry.patch

diff -ruN linux-2.6.26-rc3.orig/arch/x86/kernel/pci-dma.c linux-2.6.26-rc3/arch/x86/kernel/pci-dma.c
--- linux-2.6.26-rc3.orig/arch/x86/kernel/pci-dma.c	2008-05-18 23:36:41.000000000 +0200
+++ linux-2.6.26-rc3/arch/x86/kernel/pci-dma.c	2008-05-22 21:21:37.000000000 +0200
@@ -398,7 +398,8 @@
 		return NULL;
 
 	/* Don't invoke OOM killer */
-	gfp |= __GFP_NORETRY;
+	if (!(gfp & __GFP_WAIT))
+		gfp |= __GFP_NORETRY;
 
 #ifdef CONFIG_X86_64
 	/* Why <=? Even when the mask is smaller than 4GB it is often


# -----


linux-2.6.26-gfp-no-oom.patch

diff -ruN linux-2.6.26-rc3.orig/arch/x86/kernel/pci-dma.c linux-2.6.26-rc3/arch/x86/kernel/pci-dma.c
--- linux-2.6.26-rc3.orig/arch/x86/kernel/pci-dma.c	2008-05-18 23:36:41.000000000 +0200
+++ linux-2.6.26-rc3/arch/x86/kernel/pci-dma.c	2008-05-22 20:42:10.000000000 +0200
@@ -398,7 +398,7 @@
 		return NULL;
 
 	/* Don't invoke OOM killer */
-	gfp |= __GFP_NORETRY;
+	gfp |= __GFP_NO_OOM;
 
 #ifdef CONFIG_X86_64
 	/* Why <=? Even when the mask is smaller than 4GB it is often
diff -ruN linux-2.6.26-rc3.orig/include/linux/gfp.h linux-2.6.26-rc3/include/linux/gfp.h
--- linux-2.6.26-rc3.orig/include/linux/gfp.h	2008-05-18 23:36:41.000000000 +0200
+++ ...
From: Miquel van Smoorenburg
Date: Saturday, May 24, 2008 - 12:38 pm

And Andi wrote:


So how about linux-2.6.26-gfp-no-oom.patch (see previous mail) for
2.6.26 ?

Mike.

--

From: Andi Kleen
Date: Sunday, May 25, 2008 - 9:35 am

Changing the gfp once globally like you did is not right, because
the different fallback cases have to be handled differently
(see the different cases I discussed in my earlier mail)

Especially the 16MB zone allocation should never trigger the OOM killer.

That could be special cased, but __GFP_NO_OOM_KILLER is likely better
as a short term fix although I'm still not 100% sure what implications
it will have to do more VM replies in the early fallbacks.

-Andi
--

From: Alan Cox
Date: Sunday, May 25, 2008 - 12:55 pm

On Sun, 25 May 2008 18:35:39 +0200

That depends how much memory you have.

Alan
--

From: Andi Kleen
Date: Sunday, May 25, 2008 - 2:23 pm

No it doesn't because the lower zone protection basically never puts
anything that is not GFP_DMA into the 16MB zone.

Just check yourself on your machine using sysrq.

That was one of the motivations behind the mask allocator design.

-Andi
--

From: Alan Cox
Date: Sunday, May 25, 2008 - 3:02 pm

Try a 16MB embedded PC 

Alan
--

From: Thomas Gleixner
Date: Thursday, May 22, 2008 - 12:58 pm

Can you give some pointers please ?

Thanks,
	tglx


--

From: Andi Kleen
Date: Thursday, May 22, 2008 - 3:59 pm

To the bug report? Memory is fuzzy, but I think it was some SUSE bugzilla
report, might have been for SLES.

Anyways the reasoning is still valid. Longer term the mask allocator
would be the right fix, shorter term a new GFP flag as proposed 
sounds reasonable.

The trick is just that you need different __GFP_ flags for the different
allocations. e.g. the first the "higher zone" quick try should
continue to use __GFP_NORETRY. And the 16MB one should too. It would
only make sense for the main request.

In the mask allocator patchkit kernel it should be also ok already.

-Andi
--

Previous thread: [PATCH] i2c: adds support for i2c bus on Freescale CPM1/CPM2 controllers by Jochen Friedrich on Wednesday, May 21, 2008 - 3:43 am. (3 messages)

Next thread: Sony laptop battery has odd behavior with 2.6.26-rc3ish build. by Robin Holt on Wednesday, May 21, 2008 - 4:57 am. (5 messages)