Re: [PATCH] Bias the location of pages freed for min_free_kbytes in the same MAX_ORDER_NR_PAGES blocks

From: Andrew Morton
Date: Wednesday, March 7, 2007 - 9:18 pm

Temporarily at

  http://userweb.kernel.org/~akpm/2.6.21-rc3-mm1/

Will appear later at

  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc3/2.6.21-rc3-mm1/



- The wireless changes in here need a lot of testers, please.  It is major
  rework.

  Of course the config files got all changed around so `make oldconfig'
  breaks everything.  I was able to get ipw2200 working after some fumbling,
  but perhaps John can tell people what has been changed in there?  What has
  happened, from a big picture perspective?

- This patchset contains Con's rip-up-and-rewrite of the CPU scheduling
  algorithm.  It oopsed for me on one machine so I'll do an rc3-mm2 without
  those changes shortly.  If 2.6.21-rc3-mm1 crashes and 2.6.21-rc3-mm2 does not,
  don't forget to Cc: Con Kolivas <kernel@kolivas.org> on the report ;)

  Feedback on this change is sought.  Especially from the
  enterprise-database and volanomark loonies: this stuff might be headed your
  way so don't tell us afterwards that it hurt.

- Added Nick's lock-the-page-in-the-pagefault-handler patches.  These reduce
  the incidence of one bug and increase the incidence of another.  VM is fun. 

- Re-added the ext4 development tree to the -mm lineup.  It has stuff in
  it.  



Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
  mm-commits mailing list.

        echo "subscribe mm-commits" | mail majordomo@vger.kernel.org

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
  most valuable if you can perform a bisection search to identify which patch
  introduced the bug.  Instructions for this process are at

        ...
From: Jeff Garzik
Date: Thursday, March 8, 2007 - 2:59 am

The big, big picture is that Linux is getting a new wireless stack, one 
that active developers seem to agree on.  It /should/ support all the 
old wireless-tools binaries out there, except the wext/netlink stuff 

IMO I've been a bit disappointed at the rate of development.  I had 
hoped that patches to ext4 would be making it into mainline more rapidly.

	Jeff


-

From: John W. Linville
Date: Thursday, March 8, 2007 - 7:50 am

Specifically regarding 'make oldconfig', it is mostly a clean-up
to make a distinction between 802.11 wireless LANs and pre-standard
wireless LANs.  Here is the commit message:

	commit 1a9e0dd0bd60474465e0b0f1bca774d8c042d879
	Author: Johannes Berg <johannes@sipsolutions.net>
	Date:   Sat Mar 3 13:06:15 2007 +0100
	
	    [PATCH] rework wireless Kconfig
	
	    This patch
	     * kills NET_RADIO
	     * adds a new "Wireless LAN" menu
	     * adds two new options WLAN_PRE80211 and WLAN_80211 that drivers
	       depend on
	     * makes WIRELESS_EXT visible (to avoid the arguments we had in
	       commit c1783454a31e05b94774951b0b5d1eb9075ebfb4)
	     * changes everything that depended on NET_RADIO to select WIRELESS_EXT
	       and to depend on WLAN_PRE80211 or WLAN_80211
	
	    By removing NET_RADIO, these changes pave the way to making wireless
	    extensions optional when cfg80211 can fully take over for some drivers
	    and you don't have any older drivers that still require wext.
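
As a rough illustration of why `make oldconfig' stumbles over this change,
the sketch below scans a config file for the symbols named in the commit
message.  The function name check_wireless_config and the default path
.config are illustrative assumptions, not part of the patch.

```shell
# Hypothetical helper: report which of the wireless Kconfig symbols named
# in the commit message are enabled (=y or =m) in a given config file.
check_wireless_config() {
    config_file="${1:-.config}"
    for symbol in NET_RADIO WLAN_PRE80211 WLAN_80211 WIRELESS_EXT; do
        if grep -q "^CONFIG_${symbol}=[ym]" "$config_file"; then
            echo "$symbol: set"
        else
            echo "$symbol: not set"
        fi
    done
}
```

A config carried over from before the rework would be expected to show
NET_RADIO set and both WLAN_* symbols unset, which is presumably the case
where drivers that now depend on WLAN_80211 or WLAN_PRE80211 drop out until
the new options are enabled by hand.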

Honestly, I'm tempted to add the pre-802.11 stuff to the features
removal list.  I wonder if any of it still actually works...

As to the larger question of "what is happening w/ wireless in -mm",
I'll add a few words for those who don't know.

As the commit referenced earlier suggests, work is underway on a
new configuration regime for wireless LANs.  This should result in
a cleaner API for driver and userland tool developers, and hopefully
better matches the expected semantics for wireless LAN configuration.
An optional sub-component of that is a compatibility layer for
existing WEXT-based tools, so there should be no need for a wireless
tools "flag day".  Still, hopefully this enables better wireless
configuration/management tools in the future.

In addition, we are adding a new component: mac80211.  This component
implements the higher-layer wireless MAC functionality for those
cards that don't do it in hardware or firmware, as is true for many
new cards.  Traditionally cards ...
From: Dave Jones
Date: Thursday, March 8, 2007 - 9:37 am

On Thu, Mar 08, 2007 at 09:50:43AM -0500, John W. Linville wrote:
 > On Wed, Mar 07, 2007 at 08:18:39PM -0800, Andrew Morton wrote:
 > 	    By removing NET_RADIO, these changes pave the way to making wireless
 > 	    extensions optional when cfg80211 can fully take over for some drivers
 > 	    and you don't have any older drivers that still require wext.
 > 
 > Honestly, I'm tempted to add the pre-802.11 stuff to the features
 > removal list.  I wonder if any of it still actually works...

FWIW, I've built these drivers in Fedora for aeons, and never had a single
bug filed against them. Either they're perfect, or no-one has that junk
any more.  I should turn them off for a build and see if anyone complains :)

	Dave

-- 
http://www.codemonkey.org.uk
-

From: Valdis.Kletnieks
Date: Thursday, March 8, 2007 - 10:56 am

Working on it - the new MAC80211 stack landed in the -mm tree, but the matching
iwlwifi driver for the Intel 3945ABG is still out-of-tree and acting wonky
for me.  The card comes up, 'iwlist scanning' sees 4 access points, but it
won't associate.  Not sure what I borked up.
From: John W. Linville
Date: Thursday, March 8, 2007 - 11:34 am

FWIW, I have had best results w/ that driver by manually selecting
the freq and ap as well as essid.

Hth!

John
-- 
John W. Linville
linville@tuxdriver.com
-

From: Valdis.Kletnieks
Date: Thursday, March 8, 2007 - 1:27 pm

Confirmed - if I used 'iwconfig ap <mac address> channel <n>' to match something
I found via 'iwlist scan', it was able to associate and connect.
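
The manual association described here can be sketched as a dry-run helper
that only prints the iwconfig command it would run.  The interface name,
ESSID, access point address and channel are placeholders; essid, ap and
channel are standard iwconfig parameters, and manual_assoc_cmd is a
hypothetical name.

```shell
# Dry-run sketch: print the iwconfig invocation instead of executing it.
manual_assoc_cmd() {
    iface="$1"; essid="$2"; bssid="$3"; channel="$4"
    printf 'iwconfig %s essid "%s" ap %s channel %s\n' \
        "$iface" "$essid" "$bssid" "$channel"
}

# Placeholder values; take the real ones from 'iwlist scan'.
manual_assoc_cmd eth1 ExampleNet 00:11:22:33:44:55 6
```
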
From: Michal Piotrowski
Date: Thursday, March 8, 2007 - 1:50 pm

cpu_hotplug (AutoTest) hangs at this

=============================================
[ INFO: possible recursive locking detected ]
2.6.21-rc3-mm1 #2
---------------------------------------------
sh/7213 is trying to acquire lock:
 (sched_hotcpu_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f

but task is already holding lock:
 (sched_hotcpu_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f

other info that might help us debug this:
4 locks held by sh/7213:
 #0:  (cpu_add_remove_lock){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
 #1:  (sched_hotcpu_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
 #2:  (cache_chain_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
 #3:  (workqueue_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f

stack backtrace:
 [<c0105256>] show_trace_log_lvl+0x1a/0x2f
 [<c010597b>] show_trace+0x12/0x14
 [<c0105a3d>] dump_stack+0x16/0x18
 [<c013fc73>] __lock_acquire+0x1aa/0xceb
 [<c014082d>] lock_acquire+0x79/0x93
 [<c03385dc>] __mutex_lock_slowpath+0x107/0x349
 [<c033883a>] mutex_lock+0x1c/0x1f
 [<c011d924>] sched_getaffinity+0x14/0x91
 [<c015796d>] __synchronize_sched+0x11/0x5f
 [<c011d257>] detach_destroy_domains+0x2c/0x30
 [<c011fc1a>] update_sched_domains+0x27/0x3a
 [<c012fe7a>] notifier_call_chain+0x2b/0x4a
 [<c012fec6>] __raw_notifier_call_chain+0x19/0x1e
 [<c0145756>] _cpu_down+0x70/0x282
 [<c014598e>] cpu_down+0x26/0x38
 [<c0272714>] store_online+0x27/0x5a
 [<c026f610>] sysdev_store+0x20/0x25
 [<c01b7a8e>] sysfs_write_file+0xc1/0xe9
 [<c0180052>] vfs_write+0xd1/0x15a
 [<c0180682>] sys_write+0x3d/0x72
 [<c0104270>] syscall_call+0x7/0xb

l *0xc033883a
0xc033883a is in mutex_lock (/mnt/md0/devel/linux-mm/kernel/mutex.c:92).
87              /*
88               * The locking fastpath is the 1->0 transition from
89               * 'unlocked' into 'locked' state.
90               */
91              __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
92      }
93
94      EXPORT_SYMBOL(mutex_lock);
95
96      static ...
From: Andrew Morton
Date: Friday, March 9, 2007 - 7:18 pm

That's pretty useless, isn't it?  We need to know the mutex_lock() caller.

I can't immediately spot the bug.  Probably it's caused by rcu-preempt's
changes to synchronize_sched(): that function now does a heap more than it
used to, including taking sched_hotcpu_mutex.

So, what to do about this.  Paul, I'm thinking that I should drop
rcu-preempt for now - I don't think we ended up being able to identify any
particular benefit which it brings to current mainline, and I suspect that
things will become simpler if/when we start using the process freezer for
CPU hotplug.

-

From: Paul E. McKenney
Date: Saturday, March 10, 2007 - 8:45 am

It certainly makes sense for Michal to try backing out rcu-preempt using
your broken-out list of patches.  If that makes the problem go away,
then I would certainly have a hard time arguing with you.  We are working
on getting measurements showing benefit of rcu-preempt, but aren't there
yet.

						Thanx, Paul
-

From: Frederik Deweerdt
Date: Friday, March 9, 2007 - 4:40 am

Any details on the symptoms? I'm unable to boot rc3-mm2, and it hangs
right after printing the ipw2200 driver message. I'll investigate that
this week-end.

Regards,
Frederik
-

From: Frederik Deweerdt
Date: Thursday, March 15, 2007 - 2:22 am

Sorry for the delay, I could give it a try today. It appears
that it doesn't hang, it just spends a lot of time in
ipw_init:pci_register_driver, due to a firmware loading failure

[   12.296637] RAMDISK driver initialized: 16 RAM disks of 32000K size 1024 blocksize
[   12.322581] ipw2200: Intel(R) PRO/Wireless 2200/2915 Network Driver, 1.2.0kdmprq
[   12.348822] ipw2200: Copyright(c) 2003-2006 Intel Corporation
[   12.366936] PCI: Found IRQ 10 for device 0000:03:03.0
[   12.376280] PCI: Sharing IRQ 10 with 0000:00:1d.1
[   12.385729] PCI: Sharing IRQ 10 with 0000:00:1e.3
[   12.395134] PCI: Sharing IRQ 10 with 0000:00:1f.2
[   12.404385] ipw2200: Detected Intel PRO/Wireless 2200BG Network Connection
[   72.391870] ipw2200: ipw2200-bss.fw request_firmware failed: Reason -2
[   72.400563] ipw2200: Unable to load firmware: -2
[   72.408956] ipw2200: failed to register network device
[   72.417178] ipw2200: probe of 0000:03:03.0 failed with error -5

(Booted with acpi=off due to some framebuffer problems mangling the
console output, but the problem persists even without that kernel
parameter)
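
For anyone decoding the log above: the negative numbers are kernel errno
values, and the gap between the "Detected" line and the request_firmware
failure is roughly 60 seconds.  That looks like a timeout expiring rather
than a hang, though this reading is an inference from the timestamps and
not something stated in the thread.  A small sketch, with errno_name as a
hypothetical helper that covers only the two codes seen here:

```shell
# Map the two errno values from the ipw2200 log to their symbolic names.
errno_name() {
    case "${1#-}" in            # strip a leading minus sign
        2) echo "ENOENT (No such file or directory)" ;;
        5) echo "EIO (Input/output error)" ;;
        *) echo "unknown" ;;
    esac
}

errno_name -2
errno_name -5

# Seconds between "Detected ..." and "request_firmware failed".
awk 'BEGIN { printf "%.1f\n", 72.391870 - 12.404385 }'   # prints 60.0
```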

Regards,
Frederik
-

From: Mariusz Kozlowski
Date: Saturday, March 10, 2007 - 1:33 am

Works for me ... so far ;-) Anyway to the point:

When moving my laptop I reattached the usb mouse. Then I found this in syslog:

usb 2-1: new low speed USB device using uhci_hcd and address 3
usb 2-1: new device found, idVendor=046d, idProduct=c00e
usb 2-1: new device strings: Mfr=1, Product=2, SerialNumber=0
usb 2-1: Product: USB-PS/2 Optical Mouse
usb 2-1: Manufacturer: Logitech
usb 2-1: configuration #1 chosen from 1 choice
khubd: page allocation failure. order:5, mode:0xd0
 [<c01045c4>] show_trace_log_lvl+0x1a/0x30
 [<c0104d50>] show_trace+0x12/0x14
 [<c0104e07>] dump_stack+0x16/0x18
 [<c014acc6>] __alloc_pages+0x2e4/0x303
 [<c01604df>] cache_alloc_refill+0x2e4/0x516
 [<c01608e4>] kmem_cache_zalloc+0x78/0x7c
 [<c0341343>] hid_parse_report+0xce/0x26b
 [<c032d99b>] hid_probe+0x264/0xdba
 [<c030feb9>] usb_probe_interface+0x5a/0x89
 [<c02bf74f>] driver_probe_device+0x86/0x178
 [<c02bf849>] __device_attach+0x8/0xa
 [<c02beadc>] bus_for_each_drv+0x4a/0x68
 [<c02bfb39>] device_attach+0x8b/0xd2
 [<c02bea4e>] bus_attach_device+0x40/0x84
 [<c02bd835>] device_add+0x5d0/0x6c8
 [<c030e8bd>] usb_set_configuration+0x2d6/0x4c6
 [<c03158ab>] generic_probe+0x15c/0x251
 [<c030fb87>] usb_probe_device+0x36/0x3c
 [<c02bf74f>] driver_probe_device+0x86/0x178
 [<c02bf849>] __device_attach+0x8/0xa
 [<c02beadc>] bus_for_each_drv+0x4a/0x68
 [<c02bfb39>] device_attach+0x8b/0xd2
 [<c02bea4e>] bus_attach_device+0x40/0x84
 [<c02bd835>] device_add+0x5d0/0x6c8
 [<c030a078>] usb_new_device+0x128/0x196
 [<c030aedd>] hub_thread+0x28a/0xb4a
 [<c0126c75>] kthread+0xa2/0xc9
 [<c010422f>] kernel_thread_helper+0x7/0x18
 =======================
Mem-info:
DMA per-cpu:
CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
Active:82948 inactive:29211 dirty:18 writeback:0 unstable:0
 free:1631 slab:3634 mapped:21032 pagetables:390 bounce:0
DMA free:2316kB min:92kB low:112kB ...
From: Mariusz Kozlowski
Date: Saturday, March 10, 2007 - 1:48 am

Ok I was wrong. Able to reproduce quite easily. Let me know if you need
anything more.

usb 2-1: new low speed USB device using uhci_hcd and address 11
usb 2-1: new device found, idVendor=046d, idProduct=c00e
usb 2-1: new device strings: Mfr=1, Product=2, SerialNumber=0
usb 2-1: Product: USB-PS/2 Optical Mouse
usb 2-1: Manufacturer: Logitech
usb 2-1: configuration #1 chosen from 1 choice
khubd: page allocation failure. order:5, mode:0xd0
 [<c01045c4>] show_trace_log_lvl+0x1a/0x30
 [<c0104d50>] show_trace+0x12/0x14
 [<c0104e07>] dump_stack+0x16/0x18
 [<c014acc6>] __alloc_pages+0x2e4/0x303
 [<c01604df>] cache_alloc_refill+0x2e4/0x516
 [<c01608e4>] kmem_cache_zalloc+0x78/0x7c
 [<c0341343>] hid_parse_report+0xce/0x26b
 [<c032d99b>] hid_probe+0x264/0xdba
 [<c030feb9>] usb_probe_interface+0x5a/0x89
 [<c02bf74f>] driver_probe_device+0x86/0x178
 [<c02bf849>] __device_attach+0x8/0xa
 [<c02beadc>] bus_for_each_drv+0x4a/0x68
 [<c02bfb39>] device_attach+0x8b/0xd2
 [<c02bea4e>] bus_attach_device+0x40/0x84
 [<c02bd835>] device_add+0x5d0/0x6c8
 [<c030e8bd>] usb_set_configuration+0x2d6/0x4c6
 [<c03158ab>] generic_probe+0x15c/0x251
 [<c030fb87>] usb_probe_device+0x36/0x3c
 [<c02bf74f>] driver_probe_device+0x86/0x178
 [<c02bf849>] __device_attach+0x8/0xa
 [<c02beadc>] bus_for_each_drv+0x4a/0x68
 [<c02bfb39>] device_attach+0x8b/0xd2
 [<c02bea4e>] bus_attach_device+0x40/0x84
 [<c02bd835>] device_add+0x5d0/0x6c8
 [<c030a078>] usb_new_device+0x128/0x196
 [<c030aedd>] hub_thread+0x28a/0xb4a
 [<c0126c75>] kthread+0xa2/0xc9
 [<c010422f>] kernel_thread_helper+0x7/0x18
 =======================
Mem-info:
DMA per-cpu:
CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
Active:71402 inactive:8979 dirty:610 writeback:0 unstable:0
 free:23528 slab:13478 mapped:18407 pagetables:424 bounce:0
DMA free:2160kB min:92kB low:112kB high:136kB active:9264kB inactive:0kB ...
From: Andrew Morton
Date: Saturday, March 10, 2007 - 1:58 am

hid_parse_report() is doing kmalloc(128 kbytes).  We cannot sanely support
that and the code should be rewritten to not do that.  A simple though
somewhat lame fix would be to switch to vmalloc().

It's been this way for some time, so it's odd that the failures have just
popped up now.
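
The size follows from the "order:5" in the failure message: an order-n
request asks for 2^n physically contiguous pages.  A quick check, assuming
the 4 KiB page size usual on i386 (the thread does not state it);
order_to_kib is a hypothetical helper:

```shell
# Size in KiB of an order-N page allocation for a given page size.
order_to_kib() {
    order="$1"
    page_kib="${2:-4}"          # assumed 4 KiB pages
    echo $(( (1 << order) * page_kib ))
}

order_to_kib 5   # 32 pages -> prints 128
order_to_kib 1   # 2 pages  -> prints 8
```

vmalloc() sidesteps the problem because it only needs the pages to be
virtually contiguous, so it does not depend on finding 32 adjacent free
physical pages.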
-

From: Greg KH
Date: Saturday, March 10, 2007 - 2:18 am

Jiri is the person to ask about this now.  Jiri, any thoughts about
this?

thanks,

greg k-h
-

From: Jiri Kosina
Date: Saturday, March 10, 2007 - 5:43 am

Hi,

I have just queued the patch below to HID tree for the next upstream 
merge. Mariusz, I guess it solves your issue, right?

I was already talking with Vojtech some time ago about how rewriting the 
hid parser so that it would use less memory (but probably slightly 
more CPU) would be a good thing to do, and it's been sitting on my TODO 
list for quite some time already. It's really not a straightforward 
rewrite, so I would be inclined to use the vmalloc() solution until the parser 
code has been rewritten. The hid_parser structure in question lives 
for a very short time anyway, so it shouldn't be that big an issue.

Thanks.


From: Jiri Kosina <jkosina@suse.cz>
Subject: [PATCH] HID: allocate hid_parser through vmalloc()

hid_parser is a non-trivially large structure, so it should be allocated
using vmalloc() to avoid unsuccessful allocations when memory fragmentation
is too high.
This structure has a very short life; it's destroyed as soon as the report
descriptor has been completely parsed.

This should be considered a temporary solution, until the hid_parser is
rewritten to consume less memory during report descriptor parsing.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---
 drivers/hid/hid-core.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/hid/hid-core.c b/drivers/hid/hid-core.c
index f4ee1af..e5894a7 100644
--- a/drivers/hid/hid-core.c
+++ b/drivers/hid/hid-core.c
@@ -26,6 +26,7 @@
 #include <asm/byteorder.h>
 #include <linux/input.h>
 #include <linux/wait.h>
+#include <linux/vmalloc.h>
 
 #include <linux/hid.h>
 #include <linux/hiddev.h>
@@ -654,12 +655,13 @@ struct hid_device *hid_parse_report(__u8 *start, unsigned size)
 	memcpy(device->rdesc, start, size);
 	device->rsize = size;
 
-	if (!(parser = kzalloc(sizeof(struct hid_parser), GFP_KERNEL))) {
+	if (!(parser = vmalloc(sizeof(struct hid_parser)))) {
 		kfree(device->rdesc);
 		kfree(device->collection);
 		kfree(device);
 		return ...
From: Mariusz Kozlowski
Date: Saturday, March 10, 2007 - 8:36 am

Right. Can't be 100% sure but without the patch it would have probably
failed by now so I guess the patch is ok. Not sure how to make usb mouse
plugging/unplugging process automatic ;-)

Thanks,

	Mariusz Kozlowski
-

From: Alan Stern
Date: Saturday, March 10, 2007 - 9:00 am

	echo FOO >/sys/bus/usb/drivers/usbhid/unbind

to simulate an unplug (actually, to do an unbind), and

	echo FOO >/sys/bus/usb/drivers/usbhid/bind

to do a bind, where FOO is the name of the USB mouse device link present
in the /sys/bus/usb/drivers/usbhid directory.

Alan Stern

-

From: Mariusz Kozlowski
Date: Saturday, March 10, 2007 - 9:36 am

# ls -al /sys/bus/usb/drivers/usbhid
total 0
drwxr-xr-x 2 root root    0 Mar 10 17:30 .
drwxr-xr-x 8 root root    0 Mar 10 17:14 ..
lrwxrwxrwx 1 root root    0 Mar 10 17:30 2-2:1.0 -> ../../../../devices/pci0000:00/0000:00:0c.0/usb2/2-2/2-2:1.0
--w------- 1 root root 4096 Mar 10 17:17 bind
lrwxrwxrwx 1 root root    0 Mar 10 17:17 module -> ../../../../module/usbhid
--w------- 1 root root 4096 Mar 10 17:17 new_id
--w------- 1 root root    0 Mar 10 17:22 unbind

# echo "2-2:1.0" > /sys/bus/usb/drivers/usbhid/unbind
bash: echo: write error: No such device

Any thoughts?

Regards,

	Mariusz Kozlowski
-

From: Alan Stern
Date: Saturday, March 10, 2007 - 12:02 pm

Another mistake on my part.  The correct command is

	echo -n '2-2:1.0' >/sys/bus/usb/drivers/usbhid/unbind

Without the "-n", the system thinks that the newline character at the end 
of the line written by "echo" is part of the filename.
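
The difference is easy to see by counting bytes, and the same idea gives a
way to automate the unplug/replug test.  In the sketch below, rebind_device
is a hypothetical helper; the driver directory is a parameter so that it
can be pointed at /sys/bus/usb/drivers/usbhid on a real system (as root).
It uses printf '%s', which like "echo -n" writes no trailing newline.  The
delay between writes is an arbitrary choice.

```shell
echo '2-2:1.0' | wc -c          # 8 bytes: 7 characters plus a newline
printf '%s' '2-2:1.0' | wc -c   # 7 bytes: no newline

# Unbind and rebind a device COUNT times, pausing DELAY seconds between writes.
rebind_device() {
    driver_dir="$1"; device="$2"; count="${3:-1}"; delay="${4:-1}"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s' "$device" > "$driver_dir/unbind" || return 1
        sleep "$delay"
        printf '%s' "$device" > "$driver_dir/bind" || return 1
        sleep "$delay"
        i=$((i + 1))
    done
}
```

For example, "rebind_device /sys/bus/usb/drivers/usbhid 2-2:1.0 100" would
cycle the mouse from the listing above a hundred times.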

Alan Stern

-

From: Mariusz Kozlowski
Date: Monday, March 12, 2007 - 1:50 pm

Nice tip. Thanks. I've run some tests and as expected -> no failure so far.



Regards,

	Mariusz Kozlowski
-

From: Jiri Kosina
Date: Monday, March 12, 2007 - 1:53 pm

Thanks for testing. The patch fixing this already went to Linus in today's 
HID/USB HID update (which has not yet been merged).

Thanks,

-- 
Jiri Kosina
-

From: Mariusz Kozlowski
Date: Saturday, March 10, 2007 - 6:32 am

Hello,

	iMac G3 build fails:
	  
CC      drivers/macintosh/adbhid.o
drivers/macintosh/adbhid.c: In function 'adbhid_init':
drivers/macintosh/adbhid.c:1275: error: too many arguments to function 'register_sysctl_table'
make[2]: *** [drivers/macintosh/adbhid.o] Error 1
make[1]: *** [drivers/macintosh] Error 2
make: *** [drivers] Error 2

processor       : 0
cpu             : 740/750
temperature     : 43-45 C (uncalibrated)
clock           : 400MHz
revision        : 2.2 (pvr 0008 0202)
bogomips        : 796.67
machine         : PowerMac2,1
motherboard     : PowerMac2,1 MacRISC2 MacRISC Power Macintosh
detected as     : 66 (iMac FireWire)
pmac flags      : 00000005
L2 cache        : 512K unified
memory          : 256MB
pmac-generation : NewWorld

Gnu C                  4.1.2
Gnu make               3.81
binutils               2.17
util-linux             2.12r
mount                  2.12r
module-init-tools      3.3-pre2
e2fsprogs              1.40-WIP
Linux C Library        2.3.6
Dynamic linker (ldd)   2.3.6
Procps                 3.2.7
Net-tools              1.60
Console-tools          0.2.3
Sh-utils               5.97
Modules Loaded         ipv6 af_packet tsdev eth1394 tulip crc32 ohci1394 ieee1394 uninorth_agp agpgart dm_snapshot dm_mirror dm_mod snd_powermac snd_pcm_oss snd_pcm snd_page_alloc snd_mixer_oss snd_seq_oss snd_seq_device snd_seq_midi_event snd_seq snd_timer snd soundcore ide_cd cdrom joydev evdev ext3 jbd mbcache usbhid uhci_hcd ohci_hcd usbcore ide_disk unix

From: Radoslaw Szkodzinski
Date: Monday, March 12, 2007 - 11:14 am

And broken stuff too :-)
The nanoseconds patch is broken on x86_64 - makes mtimes from the future:
e.g. year 2431. I suspect an endianness issue.
x86 works fine according to my sources.

The files themselves have correct mtimes, as booting previous kernel
or one w/o the nanoseconds patch works fine.
-

From: Mariusz Kozlowski
Date: Wednesday, March 14, 2007 - 12:06 pm

Hello,

	Today after +- 24h of uptime I found some more page allocation
failures ('eth1: Can't allocate skb for Rx'). You'll find more here:

http://tuxland.pl/misc/2.6.21-rc3-mm1-page-allocation-failure.txt

System wasn't doing anything unusual, as usual ;-) X, some p2p 
software, firefox+flash playing music.

Regards,

	Mariusz Kozlowski
From: Andrew Morton
Date: Wednesday, March 14, 2007 - 6:07 pm

Do other kernels do this, or is 2.6.21-rc3-mm1 worse?

It is of course a non-fatal problem and will inevitably happen sometimes,
but we would like the VM to be able to minimise the occurrence of this
problem.

I think we were rather hoping that Mel's anti-fragmentation work would
improve things.
-

From: Mariusz Kozlowski
Date: Wednesday, March 14, 2007 - 11:09 pm

I've never seen page allocation failures before 2.6.21-rc3-mm1 (first


Thanks,

	Mariusz Kozlowski
-

From: Mel Gorman
Date: Thursday, March 15, 2007 - 3:16 am

I'm looking at this now. In the vanilla allocator, min_free_kbytes has the
effect of keeping largest blocks free for as long as possible. This means
that if high-order allocations are rare, they'll tend to succeed up to a point.

In the case of the earlier report on HID failing the order-5 allocation, it's
because we didn't reclaim for long enough because the order was too high. It
would need to reclaim for quite a long time before it would get an order-5
block free. That is what Andy's lumpy-reclaim patches address - reclaiming

I'm still optimistic it will. This is the widest it's been tested so far
so there were going to be bases

In this case, the order is high but GFP_ATOMIC. These allocations get grouped
together but with a low min_free_kbytes, as is the case on this machine,
the high-order atomic blocks eventually get used by others. I had assumed
that if high-order atomic allocations were important (e.g. e1000) that the
machine would be configured with min_free_kbytes of at least 16384 i.e.
1 MAX_ORDER_NR_PAGES per MIGRATE_TYPE.
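
The figure of 16384 can be reproduced with a little arithmetic, under
assumptions the thread does not spell out: 4 KiB pages, a MAX_ORDER of 11
(so a MAX_ORDER_NR_PAGES block is 2^10 = 1024 pages), and four migrate
types.  The count of four is chosen here because it makes the numbers agree
with the value quoted; reserve_kib is a hypothetical helper.

```shell
# KiB needed to hold one MAX_ORDER_NR_PAGES block per migrate type.
reserve_kib() {
    max_order="${1:-11}"        # assumed MAX_ORDER
    page_kib="${2:-4}"          # assumed 4 KiB pages
    migrate_types="${3:-4}"     # assumed number of migrate types
    block_pages=$(( 1 << (max_order - 1) ))
    echo $(( block_pages * page_kib * migrate_types ))
}

reserve_kib          # 1024 pages * 4 KiB * 4 types -> prints 16384
```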

Mariusz, I would be interested in finding out if this problem still occurs when
you set min_free_kbytes to 16384 via /proc/sys/vm/min_free_kbytes. I understand
that the problem is not easily reproduced and requiring configuration changes
is far from ideal but it'd allow me to find out if options 2 or 3 below make
sense in advance.
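
The tunable can be changed at runtime as root.  A minimal sketch:
set_min_free_kbytes is a hypothetical wrapper, and its optional path
argument exists only so that the function can be exercised against an
ordinary file.

```shell
# Write a new min_free_kbytes value and show the value before and after.
set_min_free_kbytes() {
    value="$1"
    path="${2:-/proc/sys/vm/min_free_kbytes}"
    echo "previous: $(cat "$path")"
    echo "$value" > "$path" || return 1
    echo "current:  $(cat "$path")"
}
```

Running "set_min_free_kbytes 16384" as root would apply the value suggested
above; "sysctl -w vm.min_free_kbytes=16384" does the same thing.  Either
way the setting is lost at the next reboot.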

I'm looking at three ways of addressing this:

1. Anti-fragmentation currently favours breaking larger blocks to
   group-by-mobility as opposed to the vanilla allocator which always favours
   using the smallest possible block no matter where it is.  I'm looking
   at preserving the vanilla allocator's behaviour of keeping larger
   blocks free as much as possible to see how adverse an effect that
   subsequently has on fragmentation.

2. Set min_free_kbytes higher automatically at boot time when
   CONFIG_PAGE_GROUP_BY_MOBILITY is set.

3. Disable PAGE_GROUP_BY_MOBILITY when min_free_kbytes is set too ...
From: Mariusz Kozlowski
Date: Thursday, March 15, 2007 - 8:37 am

After a few hours I can confirm that this happens with 

$ cat /proc/sys/vm/min_free_kbytes 
16384

as well. See the syslog output below. Feel free to mail me to do some more tests.

Regards,

	Mariusz Kozlowski


Mar 15 12:00:47: echo 16384 > /proc/sys/vm/min_free_kbytes
Mar 15 12:38:33: eth1: MAC controller error (WTERR). Ignoring.
Mar 15 13:29:31: eth1: MAC controller error (WTERR). Ignoring.
Mar 15 13:34:34: eth1: MAC controller error (WTERR). Ignoring.
Mar 15 13:52:58: swapper: page allocation failure. order:1, mode:0x80020
Mar 15 13:52:58:  [<c0104664>] show_trace_log_lvl+0x1a/0x30
Mar 15 13:52:58:  [<c0104e00>] show_trace+0x12/0x14
Mar 15 13:52:58:  [<c0104eb7>] dump_stack+0x16/0x18
Mar 15 13:52:58:  [<c0150e8e>] __alloc_pages+0x2e6/0x2fd
Mar 15 13:52:58:  [<c0167556>] cache_alloc_refill+0x350/0x65d
Mar 15 13:52:58:  [<c0167957>] __kmalloc_track_caller+0xf4/0xf9
Mar 15 13:52:58:  [<c0396213>] __alloc_skb+0x6e/0x122
Mar 15 13:52:58:  [<ded154a9>] orinoco_interrupt+0x986/0x10a4 [orinoco]
Mar 15 13:52:58:  [<c0144d43>] handle_IRQ_event+0x28/0x59
Mar 15 13:52:58:  [<c01462f7>] handle_level_irq+0x6e/0xe7
Mar 15 13:52:58:  [<c0105e7e>] do_IRQ+0x3d/0x7f
Mar 15 13:52:58:  [<c01041d6>] common_interrupt+0x2e/0x34
Mar 15 13:52:58:  [<c0102352>] cpu_idle+0x46/0x74
Mar 15 13:52:58:  [<c0101131>] rest_init+0x37/0x46
Mar 15 13:52:58:  [<c0546bc1>] start_kernel+0x33c/0x3cb
Mar 15 13:52:58:  [<00000000>] 0x0
Mar 15 13:52:58:  =======================
Mar 15 13:52:58: Mem-info:
Mar 15 13:52:58: DMA per-cpu:
Mar 15 13:52:58: CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Mar 15 13:52:58: Normal per-cpu:
Mar 15 13:52:58: CPU    0: Hot: hi:  186, btch:  31 usd: 171   Cold: hi:   62, btch:  15 usd:  53
Mar 15 13:52:58: Active:44723 inactive:54877 dirty:1323 writeback:0 unstable:0
Mar 15 13:52:58:  free:5602 slab:11245 mapped:7463 pagetables:284 bounce:0
Mar 15 13:52:58: DMA free:2728kB min:544kB low:680kB high:816kB active:952kB inactive:3900kB ...
From: Mel Gorman
Date: Thursday, March 15, 2007 - 12:59 pm

Ok, great. Well, not great because it's broken, but I know what's going
on. I was able to reproduce the problem based on your report on my desktop
and put together a fix for it. Full regression tests are still running but
it should be in good enough state for you to test.

Without this patch, I got allocation failures within 15 minutes by stressing
the machine. With the patch below, it's been up an hour and 15 minutes and
I'm seeing no problems so far. Will keep the machine running a few days to
see what happens.

For people watching, this patch is potentially better than MIGRATE_HIGHALLOC
for preserving areas for atomic allocations - particularly if
the size of the reserve is based on min_free_kbytes instead of
MIGRATE_TYPES*MAX_ORDER_NR_PAGES. If high-order allocation reports disappear
altogether, I'll put together a patch that gets rid of MIGRATE_HIGHALLOC
altogether and see if anyone reacts. That will bring the number of free
lists down and reduce the number of bits required in pageblock flags again.

Mariusz, please try the following patch. It should not be necessary to
adjust your min_free_kbytes again but if you see a failure, please try
with min_free_kbytes set to 16384. Thanks a lot.

===== Candidate fix as follows =====
The standard buddy allocator always favours the smallest block of pages. The
effect of this is that the min_free_kbytes reserved tends to be preserved at
the same location of memory for a very long time and often as a contiguous
block. When an administrator sets the reserve at 16384, it tends to be the
same MAX_ORDER blocks that remain free. This allows the occasional high atomic
allocation to succeed. In practice, it is difficult to split these blocks
with any load but when they do split, the benefit of having min_free_kbytes
for contiguous blocks disappears.

On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large
blocks when there are no free pages of the appropriate type available. A
side-effect of this is that all blocks in ...
From: Mariusz Kozlowski
Date: Thursday, March 15, 2007 - 11:43 pm

Works for me. min_free_kbytes was left at default 2791. I left the laptop with
X + aMule + azureus + firefox&flash (playing music) + kernel compilation so
the box was pushed a bit. Uptime close to 9 hours and no page allocation
failures. I'll leave it running some more. If anything pops up you'll know it :-)

Thanks,

	Mariusz Kozlowski
-

From: Mel Gorman
Date: Friday, March 16, 2007 - 3:03 am

Excellent news.

This patch has a few flaws in it and it failed regression tests on 
sparsemem so it's far from ready but I know the basic idea appears sound 
now. I'll work on bringing the patch up to scratch and hopefully send on 
another version later today.

Thanks a million for testing.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Mel Gorman
Date: Saturday, March 17, 2007 - 11:26 am

Begorrah - patches from Irish people on Paddy's day, who would have thought it.
Clearly, this patch is full of lucky charms.

On a more serious note, this patch appears to address the page allocation
problem reported by Mariusz Kozlowski.

Changelog since v1
o Select the number of blocks to mark MIGRATE_RESERVE intelligently
o Take bootmem into account when placing MIGRATE_RESERVE blocks

The standard buddy allocator always favours the smallest block of pages. The
effect of this is that the pages free to satisfy min_free_kbytes tend to be
preserved since boot time at the same location of memory for a very long time
and as a contiguous block. When an administrator sets the reserve at 16384 at
boot time, it tends to be the same MAX_ORDER blocks that remain free. This
allows the occasional high atomic allocation to succeed up until the point
the blocks are split. In practice, it is difficult to split these blocks
but when they do split, the benefit of having min_free_kbytes for contiguous
blocks disappears. Additionally, increasing min_free_kbytes once the system
has been running for some time has no guarantee of creating contiguous blocks.

On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large blocks
when there are no free pages of the appropriate type available. A side-effect
of this is that all blocks in memory tend to be used up and the contiguous
free blocks from boot time are not preserved like in the vanilla allocator.
This can cause a problem if a new caller is unwilling to reclaim or does
not reclaim for long enough.

A failure scenario was found for a wireless network device allocating order-1
atomic allocations but the allocations were not intense or frequent enough
for a whole block of pages to be preserved for MIGRATE_HIGHALLOC. This was
reproduced on a desktop by booting with mem=256mb, forcing the driver to
allocate at order-1, running a bittorrent client (downloading a debian ISO)
and building a kernel with -j2.

This patch addresses the ...
From: Andrew Morton
Date: Sunday, March 18, 2007 - 1:22 am

I still haven't got onto reviewing all your mm patches :(   Nor has anyone
else, afaik.


Why does this config item exist?  It's not good to have some mysterious
knob which affects mm behaviour at compile time.  We need to make up our
minds and stick with it.

-


They have been reviewed at various points in their development and most of 
the feedback was fed back in. Marcelo Tosatti commented on version 19 for 
example. Joel Schopp commented heavily on earlier versions around the 
v15-v19 mark as well as Dave Hansen around the same time. Christoph 
Lameter has commented on recent versions but I'm not sure how detailed his 
review was or if the comments were based on the patch description. There 
have been comments from various other people as well, mainly around the 
v19-v20 mark. Andy Whitcroft has reviewed most versions of the patches 
including the most recent ones.

I suspect there may not have been detailed review recently because so many 

The configuration item exists because there were concerns over the memory 
footprint and cache line footprint. It was introduced to address that 
concern and also so that it would be possible to compare the performance 
behavior of anti-fragmentation. Your comment rang a bell though so I 
searched the archives to see this comment from Andi Kleen;

===
If anything this should be a boot time option or perhaps sysctl, not a 
config. In general CONFIGs that change runtime behaviour are evil - just 
makes changing the option more painful, causes problems for distribution 
users, doesn't make much sense, etc.etc.

Also #ifdef as a documentation device is a really really scary concept.
Yuck.
===

A sysctl would avoid any cache line footprint but not the memory overhead 
because the freelists in struct zone would still exist. 
I could make the option depend on CONFIG_EMBEDDED for the zone overhead. 
Would that make sense or would it be preferable to ditch the option 
altogether?

I'll start looking at doing a sysctl so it can be disabled at runtime if 
necessary. I strongly suspect that it cannot be enabled again once 
disabled but I don't see that as a problem as such.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of ...
From: Andrew Morton
Date: Sunday, March 18, 2007 - 11:12 am

How much additional memory consumption are we expecting here?

Whether it's runtime or compile-time, the optionality is not good.
-

From: Mel Gorman
Date: Sunday, March 18, 2007 - 12:05 pm

Short answer, about 1.5KB on a 1GB system of which 1.3KB is statically 
defined in the 3 struct zones on a 1 node x86 system.

Longer answer that I hopefully have not made any mistakes in - There is 
the zone overhead which is statically sized and a runtime overhead which 
depends on the amount of memory in the system. The additional zone 
overhead is the overhead for additional freelists (larger struct 
free_area) and is as follows;

(MIGRATE_TYPES-1) * sizeof(list_head) * (MAX_ORDER-1)

so, on 32 bit in general, that's

4 * 8 * 10 = 320 bytes per zone (would be 240 bytes if MIGRATE_RESERVE is
 				sufficient for higher order allocations
 				instead of MIGRATE_HIGHALLOC)

on x86 with DMA, Normal and HighMem, that's 1280 bytes. On a NUMA system, 
it's 1280 bytes per node. On 64 bit, it would be double because of the 
larger pointer size. At worst, I guess you are looking at 3KB per node.
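
A minimal sketch of the per-zone arithmetic above, written as a user-space
C function rather than kernel code. The function name and its parameters are
hypothetical; the list_head size is passed in explicitly (8 bytes for two
32-bit pointers, 16 bytes for two 64-bit pointers) so that other
configurations can be tried.

```c
#include <stddef.h>

/*
 * Extra free-list bytes per zone, following the formula quoted above:
 *   (MIGRATE_TYPES-1) * sizeof(list_head) * (MAX_ORDER-1)
 * Hypothetical helper for illustration; nothing here is a kernel API.
 */
size_t extra_freelist_bytes(size_t migrate_types,
                            size_t list_head_size,
                            size_t max_order)
{
    return (migrate_types - 1) * list_head_size * (max_order - 1);
}
```

extra_freelist_bytes(5, 8, 11) reproduces the 320-byte 32-bit figure,
(4, 8, 11) gives the 240-byte variant without MIGRATE_HIGHALLOC, and
(5, 16, 11) gives the doubled 64-bit case. Multiply by the number of zones
per node to get the per-node cost.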

The size of the bitmap for the flags depends on the amount of memory. On 
FLATMEM and DISCONTIG, it's 3 bits per MAX_ORDER_NR_PAGES spanned by the 
zones (note spanned, not present). On a 1GiB system without holes on an 
x86 with MAX_ORDER of 11, that would be 96 bytes. The calculation is more 
complex for sparsemem but will be at least 4 bytes per active section for 
a pointer and another 4 bytes for most bitmaps. I have a patch that uses 
the pointer itself when the bitmap is small enough to be contained in 4 
bytes (which it usually is) so it could be just 4 bytes per active section 

The main situation where I think it would be desirable to disable 
anti-fragmentation is for zones that are smaller than 
MIGRATE_TYPES*MAX_ORDER_NR_PAGES and this is for performance reasons to 
avoid regularly entering __rmqueue_fallback(). However, that situation 
could be automatically detected and handled by having 
allocflags_to_migratetype() always return MIGRATE_UNMOVABLE for small 
zones and similar for get_pageblock_migratetype(). There would be no need 
for a sysctl.
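
The automatic detection described above could look roughly like the
following user-space model. It is only a sketch under stated assumptions:
the MODEL_* names, struct zone_model and both functions are hypothetical
stand-ins, the order of the enum values is a guess, and whether the check
should use spanned or present pages is left open.

```c
#include <stdbool.h>

/* Illustrative stand-ins; the real values depend on the kernel config. */
enum migrate_type_model {
    MODEL_MIGRATE_UNMOVABLE,
    MODEL_MIGRATE_RECLAIMABLE,
    MODEL_MIGRATE_MOVABLE,
    MODEL_MIGRATE_HIGHALLOC,
    MODEL_MIGRATE_RESERVE,
    MODEL_MIGRATE_TYPES
};

#define MODEL_MAX_ORDER          11
#define MODEL_MAX_ORDER_NR_PAGES (1UL << (MODEL_MAX_ORDER - 1))

/* Hypothetical zone with only the field this sketch needs. */
struct zone_model {
    unsigned long nr_pages;
};

/* A zone is "small" if it cannot hold one full block per migrate type. */
bool zone_too_small(const struct zone_model *zone)
{
    return zone->nr_pages <
           (unsigned long)MODEL_MIGRATE_TYPES * MODEL_MAX_ORDER_NR_PAGES;
}

/* Small zones ignore the requested type and always use UNMOVABLE. */
enum migrate_type_model pick_migratetype(const struct zone_model *zone,
                                         enum migrate_type_model requested)
{
    return zone_too_small(zone) ? MODEL_MIGRATE_UNMOVABLE : requested;
}
```

With these illustrative constants the threshold is 5 * 1024 = 5120 pages,
so a 4096-page zone would be treated as small and a 262144-page zone would
keep the requested migrate type.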

If a sysctl was ...
From: Andrew Morton
Date: Sunday, March 18, 2007 - 12:28 pm

That's a very modest overhead - not worth the config option, IMO.

The runtime overhead might be a concern - is it possible to quantify
it?
-


Do you mean performance wise or memory wise?

Memory-wise,  something like

===
FLATMEM Case
bits = 0;
for_each_zone(zone) {
 	bits += (zone->spanned_pages >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
}
bytes_consumed = bits / 8;

=== SPARSEMEM Case, a rough approximation is
((vm_total_pages * PAGE_SIZE) >> SECTION_SIZE_BITS) * 8

The consumption could be stored in a zone variable similar to 
zone->present_pages and visible through /proc/zoneinfo. Would that be 
useful?
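
The two estimates above can be turned into a small user-space sketch so
the numbers can be checked by hand. The function names and the
MODEL_NR_PAGEBLOCK_BITS constant are hypothetical; the value 3 comes from
the "3 bits per MAX_ORDER_NR_PAGES" figure earlier in the thread, and the
page size and SECTION_SIZE_BITS are passed in as parameters because they
vary by architecture.

```c
#include <stddef.h>

#define MODEL_NR_PAGEBLOCK_BITS 3

/* FLATMEM: bits = blocks spanned by each zone * bits per block. */
unsigned long flatmem_bitmap_bytes(const unsigned long *spanned_pages,
                                   size_t nr_zones,
                                   unsigned int max_order)
{
    unsigned long bits = 0;
    size_t i;

    for (i = 0; i < nr_zones; i++)
        bits += (spanned_pages[i] >> (max_order - 1)) *
                MODEL_NR_PAGEBLOCK_BITS;

    return bits / 8;
}

/* SPARSEMEM: rough approximation of 8 bytes per section. */
unsigned long long sparsemem_bitmap_bytes(unsigned long long total_pages,
                                          unsigned long long page_size,
                                          unsigned int section_size_bits)
{
    return ((total_pages * page_size) >> section_size_bits) * 8;
}
```

For a single 1GiB zone of 4KiB pages (262144 pages) and MAX_ORDER of 11,
the FLATMEM function returns 96 bytes, matching the figure given earlier.
The SPARSEMEM result depends entirely on the section size chosen; with an
assumed SECTION_SIZE_BITS of 26 the same 1GiB comes to 128 bytes.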

Performance wise is harder to quantify. There are three places where 
issues can show up. The first is with allocation fallbacks where 
__rmqueue_fallback() is called. Fallbacks are expensive but fallbacks are 
rare except when the zone is too small which is why I probably should be 
catching that case explicitly. I used to have a counters patch for 
fallbacks. I could bring it up to date to use __count_vm_events() to 
quantify fallbacks if you think it would be useful?

The second hotpoint is where the per-cpu lists are searched for a page of 
the suitable migrate type. When I looked at this, an instruction-level 
profile on x86 showed about 2-4% of the time spent in 
get_page_from_freelist() was searching the per-cpu lists for a page of a 
suitable type. IIRC, something like 85% of the time there was clearing the 
pages although I'd need to double check this to be 100% sure.

The last potential performance hotpoint is where the pageblock flags are 
read on every free in get_pageblock_flags_group(). There is probably room 
for optimisation there. I haven't an exact quantification available at the 
moment but I remember seeing it far down the list of functions where time 
was spent when I was last looking at this.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Andrew Morton
Date: Sunday, March 18, 2007 - 1:45 pm

CPU load.  From your earlier email I'd decided memory consumption was a

hm, well.  It'd be good to drill down, quantify and, where needed, fix
these things.  Because the existence of that config option is quite
undesirable.


-


I figured that was the case but thought I would try pin it down more and 
offer storing the overhead in a counter just in case there is a situation 

After my last mail, I turned on my main desktop (Intel(R) Pentium(R) D on 
32 bit) and set it going with 2.6.21-rc3-mm2 with the fix for Mariusz 
applied. I built a kernel while music was playing and I went off watching 
someone else make my dinner - a scientific test to be sure.

4.2% of the time in get_page_from_freelist() is spent in this 
list_for_each_entry;

                 /* Find a page of the appropriate migrate type */
                 list_for_each_entry(page, &pcp->list, lru) {

Something like 3% is spent on one instruction

  84768  0.0334 :c014c535:       lea    0x0(%esi),%esi

Maybe I can avoid some of this by optimistically checking if the first 
entry is suitable before entering into a loop, prefetching data and the 
like.
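
One way to picture the "check the first entry before looping" idea is the
user-space sketch below. It is not the kernel's per-cpu list code: struct
page_model, its singly linked next pointer and find_page_of_type() are
hypothetical, and whether the fast path saves anything measurable is
exactly what would need profiling.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page on a per-cpu list. */
struct page_model {
    int migratetype;
    struct page_model *next;
};

struct page_model *find_page_of_type(struct page_model *head,
                                     int migratetype)
{
    struct page_model *page;

    /* Fast path: test the first entry without setting up the loop. */
    if (head && head->migratetype == migratetype)
        return head;

    /* Slow path: walk the rest of the list. */
    for (page = head ? head->next : NULL; page; page = page->next)
        if (page->migratetype == migratetype)
            return page;

    /* No page of the wanted type on this list. */
    return NULL;
}
```

The result is the same as a plain loop over the whole list; only the order
of the checks changes, so any gain depends on how often the head of the
list already has the wanted type.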

To put the loop into perspective though, 82% of the time was spent on one 
instruction within __constant_c_and_count_memset() called from 
prep_new_page() here;

2059075  0.8117 :c014c6d8:      rep stos %eax,%es:(%edi)

On architectures with a cheaper prep_new_page(), the list search may be 
more noticeable. I'll see can I check what this looks like on ppc64 during 
the week because I believe ppc64 is able to zero pages faster.

__rmqueue_fallback didn't even appear in readprofile or oprofile even 
though it has to have been executed during boot time. I guess it just 
wasn't sampled enough. The overhead should be visible on ia64 with low 
memory machines because MAX_ORDER_NR_PAGES is so ridiculously large there. 
I have access to an IA64 machine with 1GB so I'll experiment with forcing 
allocflags_to_migratetype() to return MIGRATE_UNMOVABLE when 
vm_total_pages < (MAX_ORDER_NR_PAGES * MIGRATE_TYPES).

get_pageblock_flags_group() was the 40th most executed function according 
to oprofile. 70% of the time in that function was spent here;

 	for (; start_bitidx <= end_bitidx; ...