Temporarily at

  http://userweb.kernel.org/~akpm/2.6.21-rc3-mm1/

Will appear later at

  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc3/2.6.21-rc3-mm1/

- The wireless changes in here need a lot of testers, please. It is major rework. Of course the config files got all changed around so `make oldconfig' breaks everything. I was able to get ipw2200 working after some fumbling, but perhaps John can tell people what has been changed in there? What has happened, from a big picture perspective?

- This patchset contains Con's rip-up-and-rewrite of the CPU scheduling algorithm. It oopsed for me on one machine so I'll do an rc3-mm2 without those changes shortly. If 2.6.21-rc3-mm1 crashes and 2.6.21-rc3-mm2 does not, don't forget to Cc: Con Kolivas <kernel@kolivas.org> on the report ;) Feedback on this change is sought. Especially from the enterprise-database and volanomark loonies: this stuff might be headed your way so don't tell us afterwards that it hurt.

- Added Nick's lock-the-page-in-the-pagefault-handler patches. These reduce the incidence of one bug and increase the incidence of another. VM is fun.

- Re-added the ext4 development tree to the -mm lineup. It has stuff in it.

Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the mm-commits mailing list.

  echo "subscribe mm-commits" | mail majordomo@vger.kernel.org

- If you hit a bug in -mm and it is not obvious which patch caused it, it is most valuable if you can perform a bisection search to identify which patch introduced the bug. Instructions for this process are at ...
The big, big picture is that Linux is getting a new wireless stack, one that active developers seem to agree on. It /should/ support all the old wireless-tools binaries out there, except the wext/netlink stuff, IMO.

I've been a bit disappointed at the rate of development. I had hoped that patches to ext4 would be making it into mainline more rapidly. Jeff -
Specifically regarding 'make oldconfig', it is mostly a clean-up to make a distinction between 802.11 wireless LANs and pre-standard wireless LANs. Here is the commit message: commit 1a9e0dd0bd60474465e0b0f1bca774d8c042d879 Author: Johannes Berg <johannes@sipsolutions.net> Date: Sat Mar 3 13:06:15 2007 +0100 [PATCH] rework wireless Kconfig This patch * kills NET_RADIO * adds a new "Wireless LAN" menu * adds two new options WLAN_PRE80211 and WLAN_80211 that drivers depend on * makes WIRELESS_EXT visible (to avoid the arguments we had in commit c1783454a31e05b94774951b0b5d1eb9075ebfb4) * changes everything that depended on NET_RADIO to select WIRELESS_EXT and to depend on WLAN_PRE80211 or WLAN_80211 By removing NET_RADIO, these changes pave the way to making wireless extensions optional when cfg80211 can fully take over for some drivers and you don't have any older drivers that still require wext. Honestly, I'm tempted to add the pre-802.11 stuff to the features removal list. I wonder if any of it still actually works... As to the larger question of "what is happening w/ wireless in -mm", I'll add a few words for those who don't know. As the commit referenced earlier suggests, work is underway on a new configuration regime for wireless LANs. This should result in a cleaner API for driver and userland tool developers, and hopefully better matches the expected semantics for wireless LAN configuration. An optional sub-component of that is a compatibility layer for existing WEXT-based tools, so there should be no need for a wireless tools "flag day". Still, hopefully this enables better wireless configuration/management tools in the future. In addition, we are adding a new component: mac80211. This component implements the higher-layer wireless MAC functionality for those cards that don't do it in hardware or firmware, as is true for many new cards. Traditionally cards ...
On Thu, Mar 08, 2007 at 09:50:43AM -0500, John W. Linville wrote: > On Wed, Mar 07, 2007 at 08:18:39PM -0800, Andrew Morton wrote: > By removing NET_RADIO, these changes pave the way to making wireless > extensions optional when cfg80211 can fully take over for some drivers > and you don't have any older drivers that still require wext. > > Honestly, I'm tempted to add the pre-802.11 stuff to the features > removal list. I wonder if any of it still actually works... FWIW, I've built these drivers in Fedora for aeons, and never had a single bug filed against them. Either they're perfect, or no-one has that junk any more. I should turn them off for a build and see if anyone complains :) Dave -- http://www.codemonkey.org.uk -
Working on it - the new MAC80211 stack landed in the -mm tree, but the matching iwlwifi driver for the Intel 3945ABG is still out-of-tree and acting wonky for me. The card comes up, 'iwlist scanning' sees 4 access points, but it won't associate. Not sure what I borked up.
FWIW, I have had best results w/ that driver by manually selecting the freq and ap as well as essid. Hth! John -- John W. Linville linville@tuxdriver.com -
Confirmed - if I used 'iwconfig ap <mac address> channel <n>' to match something I found via 'iwlist scan', it was able to associate and connect.
cpu_hotplug (AutoTest) hangs at this
=============================================
[ INFO: possible recursive locking detected ]
2.6.21-rc3-mm1 #2
---------------------------------------------
sh/7213 is trying to acquire lock:
(sched_hotcpu_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
but task is already holding lock:
(sched_hotcpu_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
other info that might help us debug this:
4 locks held by sh/7213:
#0: (cpu_add_remove_lock){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
#1: (sched_hotcpu_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
#2: (cache_chain_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
#3: (workqueue_mutex){--..}, at: [<c033883a>] mutex_lock+0x1c/0x1f
stack backtrace:
[<c0105256>] show_trace_log_lvl+0x1a/0x2f
[<c010597b>] show_trace+0x12/0x14
[<c0105a3d>] dump_stack+0x16/0x18
[<c013fc73>] __lock_acquire+0x1aa/0xceb
[<c014082d>] lock_acquire+0x79/0x93
[<c03385dc>] __mutex_lock_slowpath+0x107/0x349
[<c033883a>] mutex_lock+0x1c/0x1f
[<c011d924>] sched_getaffinity+0x14/0x91
[<c015796d>] __synchronize_sched+0x11/0x5f
[<c011d257>] detach_destroy_domains+0x2c/0x30
[<c011fc1a>] update_sched_domains+0x27/0x3a
[<c012fe7a>] notifier_call_chain+0x2b/0x4a
[<c012fec6>] __raw_notifier_call_chain+0x19/0x1e
[<c0145756>] _cpu_down+0x70/0x282
[<c014598e>] cpu_down+0x26/0x38
[<c0272714>] store_online+0x27/0x5a
[<c026f610>] sysdev_store+0x20/0x25
[<c01b7a8e>] sysfs_write_file+0xc1/0xe9
[<c0180052>] vfs_write+0xd1/0x15a
[<c0180682>] sys_write+0x3d/0x72
[<c0104270>] syscall_call+0x7/0xb
l *0xc033883a
0xc033883a is in mutex_lock (/mnt/md0/devel/linux-mm/kernel/mutex.c:92).
87 /*
88 * The locking fastpath is the 1->0 transition from
89 * 'unlocked' into 'locked' state.
90 */
91 __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
92 }
93
94 EXPORT_SYMBOL(mutex_lock);
95
96 static ...
That's pretty useless, isn't it? We need to know the mutex_lock() caller. I can't immediately spot the bug. Probably it's caused by rcu-preempt's changes to synchronize_sched(): that function now does a heap more than it used to, including taking sched_hotcpu_mutex. So, what to do about this? Paul, I'm thinking that I should drop rcu-preempt for now - I don't think we ended up being able to identify any particular benefit which it brings to current mainline, and I suspect that things will become simpler if/when we start using the process freezer for CPU hotplug. -
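The lockdep report above boils down to one task asking for a lock class that is already in its list of held locks. A toy user-space model of that check is sketched below. It assumes nothing about lockdep's real data structures; `struct held_locks` and `toy_acquire()` are names made up for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_HELD 8

/* Toy per-task record of the lock classes currently held, in acquisition order. */
struct held_locks {
	const char *name[MAX_HELD];
	size_t depth;
};

/*
 * Record the acquisition of lock class `name`.
 * Returns 0 on success, -1 if the same class is already held (the
 * "possible recursive locking" case), -2 if the table is full.
 */
int toy_acquire(struct held_locks *t, const char *name)
{
	assert(t != NULL && name != NULL);

	/* Scan everything this task already holds for the same class. */
	for (size_t i = 0; i < t->depth; i++)
		if (strcmp(t->name[i], name) == 0)
			return -1;

	if (t->depth == MAX_HELD)
		return -2;

	t->name[t->depth++] = name;
	return 0;
}
```

Feeding it the four locks listed in the report and then sched_hotcpu_mutex a second time makes the last call return -1, which mirrors the situation where sched_getaffinity() is reached with sched_hotcpu_mutex already held.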
It certainly makes sense for Michal to try backing out rcu-preempt using your broken-out list of patches. If that makes the problem go away, then I would certainly have a hard time arguing with you. We are working on getting measurements showing benefit of rcu-preempt, but aren't there yet. Thanx, Paul -
Any details on the symptoms? I'm unable to boot rc3-mm2, and it hangs right after printing the ipw2200 driver message. I'll investigate that this week-end. Regards, Frederik -
Sorry for the delay, I could give it a try today. It appears that it doesn't hang, it just spends a lot of time in ipw_init:pci_register_driver, due to a firmware loading failure [ 12.296637] RAMDISK driver initialized: 16 RAM disks of 32000K size 1024 blocksize [ 12.322581] ipw2200: Intel(R) PRO/Wireless 2200/2915 Network Driver, 1.2.0kdmprq [ 12.348822] ipw2200: Copyright(c) 2003-2006 Intel Corporation [ 12.366936] PCI: Found IRQ 10 for device 0000:03:03.0 [ 12.376280] PCI: Sharing IRQ 10 with 0000:00:1d.1 [ 12.385729] PCI: Sharing IRQ 10 with 0000:00:1e.3 [ 12.395134] PCI: Sharing IRQ 10 with 0000:00:1f.2 [ 12.404385] ipw2200: Detected Intel PRO/Wireless 2200BG Network Connection [ 72.391870] ipw2200: ipw2200-bss.fw request_firmware failed: Reason -2 [ 72.400563] ipw2200: Unable to load firmware: -2 [ 72.408956] ipw2200: failed to register network device [ 72.417178] ipw2200: probe of 0000:03:03.0 failed with error -5 (Booted with acpi=off due to some framebuffer problems mangling the console output, but the problem persists even without that kernel parameter) Regards, Frederik -
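For readers decoding the log above: the "Reason -2" and "error -5" values follow the kernel convention of returning a negated errno, and on Linux errno 2 is ENOENT and errno 5 is EIO. The timestamps also show a gap of roughly 60 seconds between device detection and the request_firmware failure. A minimal user-space sketch for turning such codes into text follows; `describe_kernel_error()` is a hypothetical helper, not a kernel or libc function.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Given a negated errno value as printed in kernel logs (e.g. -2),
 * return the C library's description of the corresponding errno.
 */
const char *describe_kernel_error(int ret)
{
	assert(ret < 0);	/* only negative codes denote errors here */
	return strerror(-ret);
}
```

Calling `describe_kernel_error(-2)` yields the libc text for ENOENT, which fits a firmware file (ipw2200-bss.fw) that could not be found.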
Works for me ... so far ;-) Anyway to the point: When moving my laptop I reattached the usb mouse. Then I found this in syslog: usb 2-1: new low speed USB device using uhci_hcd and address 3 usb 2-1: new device found, idVendor=046d, idProduct=c00e usb 2-1: new device strings: Mfr=1, Product=2, SerialNumber=0 usb 2-1: Product: USB-PS/2 Optical Mouse usb 2-1: Manufacturer: Logitech usb 2-1: configuration #1 chosen from 1 choice khubd: page allocation failure. order:5, mode:0xd0 [<c01045c4>] show_trace_log_lvl+0x1a/0x30 [<c0104d50>] show_trace+0x12/0x14 [<c0104e07>] dump_stack+0x16/0x18 [<c014acc6>] __alloc_pages+0x2e4/0x303 [<c01604df>] cache_alloc_refill+0x2e4/0x516 [<c01608e4>] kmem_cache_zalloc+0x78/0x7c [<c0341343>] hid_parse_report+0xce/0x26b [<c032d99b>] hid_probe+0x264/0xdba [<c030feb9>] usb_probe_interface+0x5a/0x89 [<c02bf74f>] driver_probe_device+0x86/0x178 [<c02bf849>] __device_attach+0x8/0xa [<c02beadc>] bus_for_each_drv+0x4a/0x68 [<c02bfb39>] device_attach+0x8b/0xd2 [<c02bea4e>] bus_attach_device+0x40/0x84 [<c02bd835>] device_add+0x5d0/0x6c8 [<c030e8bd>] usb_set_configuration+0x2d6/0x4c6 [<c03158ab>] generic_probe+0x15c/0x251 [<c030fb87>] usb_probe_device+0x36/0x3c [<c02bf74f>] driver_probe_device+0x86/0x178 [<c02bf849>] __device_attach+0x8/0xa [<c02beadc>] bus_for_each_drv+0x4a/0x68 [<c02bfb39>] device_attach+0x8b/0xd2 [<c02bea4e>] bus_attach_device+0x40/0x84 [<c02bd835>] device_add+0x5d0/0x6c8 [<c030a078>] usb_new_device+0x128/0x196 [<c030aedd>] hub_thread+0x28a/0xb4a [<c0126c75>] kthread+0xa2/0xc9 [<c010422f>] kernel_thread_helper+0x7/0x18 ======================= Mem-info: DMA per-cpu: CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 Normal per-cpu: CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 Active:82948 inactive:29211 dirty:18 writeback:0 unstable:0 free:1631 slab:3634 mapped:21032 pagetables:390 bounce:0 DMA free:2316kB min:92kB low:112kB ...
Ok I was wrong. Able to reproduce quite easily. Let me know if you need anything more. usb 2-1: new low speed USB device using uhci_hcd and address 11 usb 2-1: new device found, idVendor=046d, idProduct=c00e usb 2-1: new device strings: Mfr=1, Product=2, SerialNumber=0 usb 2-1: Product: USB-PS/2 Optical Mouse usb 2-1: Manufacturer: Logitech usb 2-1: configuration #1 chosen from 1 choice khubd: page allocation failure. order:5, mode:0xd0 [<c01045c4>] show_trace_log_lvl+0x1a/0x30 [<c0104d50>] show_trace+0x12/0x14 [<c0104e07>] dump_stack+0x16/0x18 [<c014acc6>] __alloc_pages+0x2e4/0x303 [<c01604df>] cache_alloc_refill+0x2e4/0x516 [<c01608e4>] kmem_cache_zalloc+0x78/0x7c [<c0341343>] hid_parse_report+0xce/0x26b [<c032d99b>] hid_probe+0x264/0xdba [<c030feb9>] usb_probe_interface+0x5a/0x89 [<c02bf74f>] driver_probe_device+0x86/0x178 [<c02bf849>] __device_attach+0x8/0xa [<c02beadc>] bus_for_each_drv+0x4a/0x68 [<c02bfb39>] device_attach+0x8b/0xd2 [<c02bea4e>] bus_attach_device+0x40/0x84 [<c02bd835>] device_add+0x5d0/0x6c8 [<c030e8bd>] usb_set_configuration+0x2d6/0x4c6 [<c03158ab>] generic_probe+0x15c/0x251 [<c030fb87>] usb_probe_device+0x36/0x3c [<c02bf74f>] driver_probe_device+0x86/0x178 [<c02bf849>] __device_attach+0x8/0xa [<c02beadc>] bus_for_each_drv+0x4a/0x68 [<c02bfb39>] device_attach+0x8b/0xd2 [<c02bea4e>] bus_attach_device+0x40/0x84 [<c02bd835>] device_add+0x5d0/0x6c8 [<c030a078>] usb_new_device+0x128/0x196 [<c030aedd>] hub_thread+0x28a/0xb4a [<c0126c75>] kthread+0xa2/0xc9 [<c010422f>] kernel_thread_helper+0x7/0x18 ======================= Mem-info: DMA per-cpu: CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 Normal per-cpu: CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 Active:71402 inactive:8979 dirty:610 writeback:0 unstable:0 free:23528 slab:13478 mapped:18407 pagetables:424 bounce:0 DMA free:2160kB min:92kB low:112kB high:136kB active:9264kB inactive:0kB ...
hid_parse_report() is doing kmalloc(128 kbytes). We cannot sanely support that and the code should be rewritten to not do that. A simple though somewhat lame fix would be to switch to vmalloc(). It's been this way for some time, so it's odd that the failures have just popped up now. -
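The "order:5" in the failure report and the 128 kbytes figure are the same thing seen from two sides: an order-n allocation is 2^n physically contiguous pages, so with 4 KB pages (assumed here, as on i386) order 5 is 32 pages or 128 KB. A small user-space sketch of that relation, where `order_for_size()` is an illustrative name and not the kernel's own helper:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) >= size, i.e. how many
 * times a single page must be doubled to cover the request.
 */
unsigned int order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(size > 0 && page_size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Any request larger than 64 KB and up to 128 KB lands on order 5 under these assumptions, which is why fragmentation makes it so much harder to satisfy than a single-page allocation.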
Hi,
I have just queued the patch below to HID tree for the next upstream
merge. Mariusz, I guess it solves your issue, right?
I talked with Vojtech some time ago about rewriting the
hid parser so that it would use less memory (but probably slightly
more CPU), which would be a good thing to do, and it's been sitting in my TODO
list for quite some time already. It's really not a straightforward
rewrite, so I am inclined to use the vmalloc() solution until the parser
code has been rewritten. The hid_parser structure in question lives
for a very short time anyway, so it shouldn't be that big an issue.
Thanks.
From: Jiri Kosina <jkosina@suse.cz>
Subject: [PATCH] HID: allocate hid_parser through vmalloc()
hid_parser is a non-trivially large structure, so it should be allocated
using vmalloc() to avoid unsuccessful allocations when memory fragmentation
is too high.
This structure has a very short life; it's destroyed as soon as the report
descriptor has been completely parsed.
This should be considered a temporary solution, until the hid_parser is
rewritten to consume less memory during report descriptor parsing.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---
drivers/hid/hid-core.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/hid/hid-core.c b/drivers/hid/hid-core.c
index f4ee1af..e5894a7 100644
--- a/drivers/hid/hid-core.c
+++ b/drivers/hid/hid-core.c
@@ -26,6 +26,7 @@
#include <asm/byteorder.h>
#include <linux/input.h>
#include <linux/wait.h>
+#include <linux/vmalloc.h>
#include <linux/hid.h>
#include <linux/hiddev.h>
@@ -654,12 +655,13 @@ struct hid_device *hid_parse_report(__u8 *start, unsigned size)
memcpy(device->rdesc, start, size);
device->rsize = size;
- if (!(parser = kzalloc(sizeof(struct hid_parser), GFP_KERNEL))) {
+ if (!(parser = vmalloc(sizeof(struct hid_parser)))) {
kfree(device->rdesc);
kfree(device->collection);
kfree(device);
return ...
Right. Can't be 100% sure, but without the patch it would probably have failed by now, so I guess the patch is OK. Not sure how to make the USB mouse plugging/unplugging process automatic ;-) Thanks, Mariusz Kozlowski -
echo FOO >/sys/bus/usb/drivers/usbhid/unbind to simulate an unplug (actually, to do an unbind), and echo FOO >/sys/bus/usb/drivers/usbhid/bind to do a bind, where FOO is the name of the USB mouse device link present in the /sys/bus/usb/drivers/usbhid directory. Alan Stern -
# ls -al /sys/bus/usb/drivers/usbhid total 0 drwxr-xr-x 2 root root 0 Mar 10 17:30 . drwxr-xr-x 8 root root 0 Mar 10 17:14 .. lrwxrwxrwx 1 root root 0 Mar 10 17:30 2-2:1.0 -> ../../../../devices/pci0000:00/0000:00:0c.0/usb2/2-2/2-2:1.0 --w------- 1 root root 4096 Mar 10 17:17 bind lrwxrwxrwx 1 root root 0 Mar 10 17:17 module -> ../../../../module/usbhid --w------- 1 root root 4096 Mar 10 17:17 new_id --w------- 1 root root 0 Mar 10 17:22 unbind # echo "2-2:1.0" > /sys/bus/usb/drivers/usbhid/unbind bash: echo: write error: No such device Any thoughts? Regards, Mariusz Kozlowski -
Another mistake on my part. The correct command is echo -n '2-2:1.0' >/sys/bus/usb/drivers/usbhid/unbind Without the "-n", the system thinks that the newline character at the end of the line written by "echo" is part of the filename. Alan Stern -
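The same point applies when the unbind/bind cycle is driven from a program instead of a shell: the bytes written must be exactly the device name, with no trailing newline. A minimal sketch in C, assuming POSIX open/write/close; `write_attr_exact()` is a hypothetical helper name, and the sysfs paths and the "2-2:1.0" name are the ones shown earlier in the thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `name` to the attribute file at `path` with no trailing newline.
 * Returns 0 if exactly strlen(name) bytes were written, -1 otherwise.
 */
int write_attr_exact(const char *path, const char *name)
{
	size_t len = strlen(name);
	ssize_t n;
	int fd;

	assert(len > 0);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	n = write(fd, name, len);	/* the name only, no '\n' appended */
	close(fd);

	return (n == (ssize_t)len) ? 0 : -1;
}
```

As root, calling `write_attr_exact("/sys/bus/usb/drivers/usbhid/unbind", "2-2:1.0")` followed by the same call on `.../usbhid/bind` would correspond to one simulated unplug/replug; looping over that pair is one way to automate the test Mariusz asked about.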
Nice tip. Thanks. I've run some tests and as expected -> no failure so far. Regards, Mariusz Kozlowski -
Thanks for testing. The patch fixing this already went to Linus in today's HID/USB HID update (which has not yet been merged). Thanks, -- Jiri Kosina -
Hello, iMac G3 build fails: CC drivers/macintosh/adbhid.o drivers/macintosh/adbhid.c: In function 'adbhid_init': drivers/macintosh/adbhid.c:1275: error: too many arguments to function 'register_sysctl_table' make[2]: *** [drivers/macintosh/adbhid.o] Blad 1 make[1]: *** [drivers/macintosh] Blad 2 make: *** [drivers] Blad 2 processor : 0 cpu : 740/750 temperature : 43-45 C (uncalibrated) clock : 400MHz revision : 2.2 (pvr 0008 0202) bogomips : 796.67 machine : PowerMac2,1 motherboard : PowerMac2,1 MacRISC2 MacRISC Power Macintosh detected as : 66 (iMac FireWire) pmac flags : 00000005 L2 cache : 512K unified memory : 256MB pmac-generation : NewWorld Gnu C 4.1.2 Gnu make 3.81 binutils 2.17 util-linux 2.12r mount 2.12r module-init-tools 3.3-pre2 e2fsprogs 1.40-WIP Linux C Library 2.3.6 Dynamic linker (ldd) 2.3.6 Procps 3.2.7 Net-tools 1.60 Console-tools 0.2.3 Sh-utils 5.97 Modules Loaded ipv6 af_packet tsdev eth1394 tulip crc32 ohci1394 ieee1394 uninorth_agp agpgart dm_snapshot dm_mirror dm_mod snd_powermac snd_pcm_oss snd_pcm snd_page_alloc snd_mixer_oss snd_seq_oss snd_seq_device snd_seq_midi_event snd_seq snd_timer snd soundcore ide_cd cdrom joydev evdev ext3 jbd mbcache usbhid uhci_hcd ohci_hcd usbcore ide_disk unix
And broken stuff too :-) The nanoseconds patch is broken on x86_64 - makes mtimes from the future: e.g. year 2431. I suspect an endianness issue. x86 works fine according to my sources. The files themselves have correct mtimes, as booting previous kernel or one w/o the nanoseconds patch works fine. -
Hello,
Today after +- 24h of uptime I found some more page allocation
failures ('eth1: Can't allocate skb for Rx'). You'll find more here:
http://tuxland.pl/misc/2.6.21-rc3-mm1-page-allocation-failure.txt
System wasn't doing anything unusual, as usual ;-) X, some p2p
software, firefox+flash playing music.
Regards,
Mariusz Kozlowski
Do other kernels do this, or is 2.6.21-rc3-mm1 worse? It is of course a non-fatal problem and will inevitably happen sometimes, but we would like the VM to be able to minimise the occurrence of this problem. I think we were rather hoping that Mel's anti-fragmentation work would improve things. -
I'm looking at this now. In the vanilla allocator, min_free_kbytes has the effect of keeping largest blocks free for as long as possible. This means that if high-order allocations are rare, they'll tend to succeed up to a point. In the case of the earlier report on HID failing the order-5 allocation, it's because we didn't reclaim for long enough because the order was too high. It would need to reclaim for quite a long time before it would get an order-5 block free. That is what Andy's lumpy-reclaim patches address - reclaiming I'm still optimistic it will. This is the widest it's been tested so far so there were going to be bases In this case, the order is high but GFP_ATOMIC. These allocations get grouped together but with a low min_free_kbytes, as is the case on this machine, the high-order atomic blocks eventually get used by others. I had assumed that if high-order atomic allocations were important (e.g. e1000) that the machine would be configured with min_free_kbytes of at least 16384 i.e. 1 MAX_ORDER_NR_PAGES per MIGRATE_TYPE. Mariusz, I would be interested in finding out if this problem still occurs when you set min_free_kbytes to 16384 via /proc/sys/vm/min_free_kbytes. I understand that the problem is not easily reproduced and requiring configuration changes is far from ideal but it'd allow me to find out if options 2 or 3 below make sense in advance. I'm looking at three ways of addressing this; 1. Anti-fragmentation currently favours breaking larger blocks to group-by-mobility as opposed to the vanilla allocator which always favours using the smallest possible block no matter where it is. I'm looking at preserving the behaviour of the vanilla allocator to keeping larger blocks free as much as possible to see how adverse an effect that subsequently has on fragmentation. 2. Set min_free_kbytes higher automatically at boot time when CONFIG_PAGE_GROUP_BY_MOBILITY is set. 3. Disable PAGE_GROUP_BY_MOBILITY when min_free_kbytes is set too ...
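To put the 16384 figure in perspective, it helps to express min_free_kbytes in units of MAX_ORDER-sized blocks. The sketch below assumes 4 KB pages and MAX_ORDER of 11 (the x86 value quoted elsewhere in this thread), so one block is 1024 pages or 4096 KB; `max_order_blocks()` is an illustrative name, not a kernel function.

```c
#include <assert.h>

/*
 * Number of whole MAX_ORDER-sized blocks that a given min_free_kbytes
 * setting could cover. One such block holds 2^(max_order - 1) pages.
 */
unsigned long max_order_blocks(unsigned long min_free_kbytes,
			       unsigned long page_kb,
			       unsigned int max_order)
{
	unsigned long block_kb;

	assert(page_kb > 0 && max_order > 0);
	block_kb = page_kb << (max_order - 1);
	return min_free_kbytes / block_kb;
}
```

Under those assumptions 16384 KB corresponds to four whole blocks, while a setting in the low thousands of KB is smaller than a single block.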
After a few hours I can confirm that this happens with $ cat /proc/sys/vm/min_free_kbytes 16384 as well. See the syslog output below. Feel free to mail me to do some more tests. Regards, Mariusz Kozlowski Mar 15 12:00:47: echo 16384 > /proc/sys/vm/min_free_kbytes Mar 15 12:38:33: eth1: MAC controller error (WTERR). Ignoring. Mar 15 13:29:31: eth1: MAC controller error (WTERR). Ignoring. Mar 15 13:34:34: eth1: MAC controller error (WTERR). Ignoring. Mar 15 13:52:58: swapper: page allocation failure. order:1, mode:0x80020 Mar 15 13:52:58: [<c0104664>] show_trace_log_lvl+0x1a/0x30 Mar 15 13:52:58: [<c0104e00>] show_trace+0x12/0x14 Mar 15 13:52:58: [<c0104eb7>] dump_stack+0x16/0x18 Mar 15 13:52:58: [<c0150e8e>] __alloc_pages+0x2e6/0x2fd Mar 15 13:52:58: [<c0167556>] cache_alloc_refill+0x350/0x65d Mar 15 13:52:58: [<c0167957>] __kmalloc_track_caller+0xf4/0xf9 Mar 15 13:52:58: [<c0396213>] __alloc_skb+0x6e/0x122 Mar 15 13:52:58: [<ded154a9>] orinoco_interrupt+0x986/0x10a4 [orinoco] Mar 15 13:52:58: [<c0144d43>] handle_IRQ_event+0x28/0x59 Mar 15 13:52:58: [<c01462f7>] handle_level_irq+0x6e/0xe7 Mar 15 13:52:58: [<c0105e7e>] do_IRQ+0x3d/0x7f Mar 15 13:52:58: [<c01041d6>] common_interrupt+0x2e/0x34 Mar 15 13:52:58: [<c0102352>] cpu_idle+0x46/0x74 Mar 15 13:52:58: [<c0101131>] rest_init+0x37/0x46 Mar 15 13:52:58: [<c0546bc1>] start_kernel+0x33c/0x3cb Mar 15 13:52:58: [<00000000>] 0x0 Mar 15 13:52:58: ======================= Mar 15 13:52:58: Mem-info: Mar 15 13:52:58: DMA per-cpu: Mar 15 13:52:58: CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 Mar 15 13:52:58: Normal per-cpu: Mar 15 13:52:58: CPU 0: Hot: hi: 186, btch: 31 usd: 171 Cold: hi: 62, btch: 15 usd: 53 Mar 15 13:52:58: Active:44723 inactive:54877 dirty:1323 writeback:0 unstable:0 Mar 15 13:52:58: free:5602 slab:11245 mapped:7463 pagetables:284 bounce:0 Mar 15 13:52:58: DMA free:2728kB min:544kB low:680kB high:816kB active:952kB inactive:3900kB ...
Ok, great. Well, not great because it's broken, but I know what's going on. I was able to reproduce the problem based on your report on my desktop and put together a fix for it. Full regression tests are still running but it should be in good enough state for you to test. Without this patch, I got allocation failures within 15 minutes by stressing the machine. With the patch below, it's been up an hour and 15 minutes and I'm seeing no problems so far. Will keep the machine running a few days to see what happens. For people watching, this patch is potentially better than MIGRATE_HIGHALLOC for preserving areas for atomic allocations - particularly if the size of the reserve is based on min_free_kbytes instead of MIGRATE_TYPES*MAX_ORDER_NR_PAGES. If high-order allocation reports disappear altogether, I'll put together a patch that gets rid of MIGRATE_HIGHALLOC altogether and see if anyone reacts. That will bring the number of free lists down and reduce the number of bits required in pageblock flags again. Mariusz, please try the following patch. It should not be necessary to adjust your min_free_kbytes again but if you see a failure, please try with min_free_kbytes set to 16384. Thanks a lot. ===== Candidate fix as follows ===== The standard buddy allocator always favours the smallest block of pages. The effect of this is that the min_free_kbytes reserved tends to be preserved at the same location of memory for a very long time and often as a contiguous block. When an administrator sets the reserve at 16384, it tends to be the same MAX_ORDER blocks that remain free. This allows the occasional high atomic allocation to succeed. In practice, it is difficult to split these blocks with any load but when they do split, the benefit of having min_free_kbytes for contiguous blocks disappears. On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large blocks when there are no free pages of the appropriate type available. 
A side-effect of this is that all blocks in ...
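The contrast drawn in this description, smallest-block-first versus splitting large blocks, can be shown with a toy model. This is a sketch under simplifying assumptions, not the kernel's allocator: free blocks are reduced to a per-order count, and `pick_smallest_block()` and `pick_largest_block()` are names invented for the illustration.

```c
#include <assert.h>

#define TOY_MAX_ORDER 11

/*
 * nr_free[o] is the number of free blocks of order o.
 * Return the smallest order >= `order` that has a free block, or -1.
 * This models the "always favours the smallest block" behaviour.
 */
int pick_smallest_block(const unsigned int nr_free[TOY_MAX_ORDER], int order)
{
	assert(order >= 0 && order < TOY_MAX_ORDER);
	for (int o = order; o < TOY_MAX_ORDER; o++)
		if (nr_free[o] > 0)
			return o;
	return -1;
}

/*
 * Return the largest order >= `order` that has a free block, or -1.
 * This models a policy that prefers to split large blocks.
 */
int pick_largest_block(const unsigned int nr_free[TOY_MAX_ORDER], int order)
{
	assert(order >= 0 && order < TOY_MAX_ORDER);
	for (int o = TOY_MAX_ORDER - 1; o >= order; o--)
		if (nr_free[o] > 0)
			return o;
	return -1;
}
```

With free blocks at orders 2 and 10 and a request for order 1, the first policy takes the order-2 block and leaves the order-10 block intact, while the second splits the order-10 block. Repeated over time, the second policy is the one that uses up the large contiguous blocks left over from boot.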
Works for me. min_free_kbytes was left at default 2791. I left the laptop with X + aMule + azureus + firefox&flash (playing music) + kernel compilation so the box was pushed a bit. Uptime close to 9 hours and no page allocation failures. I leave it running some more. If anything pops out you'll know it :-) Thanks, Mariusz Kozlowski -
Excellent news. This patch has a few flaws in it and it failed regression tests on sparsemem so it's far from ready but I know the basic idea appears sound now. I'll work on bringing the patch up to scratch and hopefully send on another version later today. Thanks a million for testing. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
Begorrah - patches from Irish people on Paddy's day, who would have thought it. Clearly, this patch is full of lucky charms. On a more serious note, this patch appears to address the page allocation problem reported by Mariusz Kozlowski.

Changelog since v1
o Select the number of blocks to mark MIGRATE_RESERVE intelligently
o Take bootmem into account when placing MIGRATE_RESERVE blocks

The standard buddy allocator always favours the smallest block of pages. The effect of this is that the pages free to satisfy min_free_kbytes tend to be preserved since boot time at the same location of memory for a very long time and as a contiguous block. When an administrator sets the reserve at 16384 at boot time, it tends to be the same MAX_ORDER blocks that remain free. This allows the occasional high atomic allocation to succeed up until the point the blocks are split. In practice, it is difficult to split these blocks but when they do split, the benefit of having min_free_kbytes for contiguous blocks disappears. Additionally, increasing min_free_kbytes once the system has been running for some time has no guarantee of creating contiguous blocks.

On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large blocks when there are no free pages of the appropriate type available. A side-effect of this is that all blocks in memory tend to be used up and the contiguous free blocks from boot time are not preserved like in the vanilla allocator. This can cause a problem if a new caller is unwilling to reclaim or does not reclaim for long enough.

A failure scenario was found for a wireless network device allocating order-1 atomic allocations but the allocations were not intense or frequent enough for a whole block of pages to be preserved for MIGRATE_HIGHALLOC. This was reproduced on a desktop by booting with mem=256mb, forcing the driver to allocate at order-1, running a bittorrent client (downloading a debian ISO) and building a kernel with -j2.
This patch addresses the ...
I still haven't got onto reviewing all your mm patches :( Nor has anyone else, afaik. Why does this config item exist? It's not good to have some mysterious knob which affects mm behaviour at compile time. We need to make up our minds and stick with it. -
They have been reviewed at various points in their development and most of the feedback was fed back in. Marcelo Tosatti commented on version 19 for example. Joel Schopp commented heavily on earlier versions around the v15-v19 mark as well as Dave Hansen around the same time. Christoph Lameter has commented on recent versions but I'm not sure how detailed his review was or if the comments were based on the patch description. There have been comments from various other people as well, mainly around the v19-v20 mark. Andy Whitcroft has reviewed most versions of the patches including the most recent ones. I suspect there may not have been detailed review recently because so many

The configuration item exists because there were concerns over the memory footprint and cache line footprint. It was introduced to address that concern and also so that it would be possible to compare the performance behavior of anti-fragmentation. Your comment rang a bell though so I searched the archives to see this comment from Andi Kleen;

===
If anything this should be a boot time option or perhaps sysctl, not a config. In general CONFIGs that change runtime behaviour are evil - just makes changing the option more painful, causes problems for distribution users, doesn't make much sense, etc.etc. Also #ifdef as a documentation device is a really really scary concept. Yuck.
===

A sysctl would avoid any cache line footprint but not the memory overhead, because the freelists in struct zone would still exist. I could make the option depend on CONFIG_EMBEDDED for the zone overhead. Would that make sense or would it be preferable to ditch the option altogether? I'll start looking at doing a sysctl so it can be disabled at runtime if necessary. I strongly suspect that it cannot be enabled again once disabled but I don't see that as a problem as such. -- Mel Gorman Part-time Phd Student Linux Technology Center University of ...
How much additional memory consumption are we expecting here? Whether it's runtime or compile-time, the optionality is not good. -
Short answer, about 1.5KB on a 1GB system of which 1.3KB is statically defined in the 3 struct zones on a 1 node x86 system. Longer answer that I hopefully have not made any mistakes in - There is the zone overhead which is statically sized and a runtime overhead which depends on the amount of memory in the system. The additional zone overhead is the overhead for additional freelists (larger struct free_area) and is as follows; (MIGRATE_TYPES-1) * sizeof(list_head) * (MAX_ORDER-1) so, on 32 bit in general, that's 4 * 8 * 10 = 320 bytes per zone (would be 240 bytes if MIGRATE_RESERVE is sufficient for higher order allocations instead of MIGRATE_HIGHALLOC) on x86 with DMA, Normal and HighMem, that's 1280 bytes. On a NUMA system, it's 1280 bytes per node. On 64 bit, it would be double because of the larger pointer size. At worst, I guess you are looking at 3KB per node. The size of the bitmap for the flags depends on the amount of memory. On FLATMEM and DISCONTIG, it's 3 bits per MAX_ORDER_NR_PAGES spanned by the zones (note spanned, not present). On a 1GiB system without holes on an x86 with MAX_ORDER of 11, that would be 96 bytes. The calculation is more complex for sparsemem but will be at least 4 bytes per active section for a pointer and another 4 bytes for most bitmaps. I have a patch that uses the pointer itself when the bitmap is small enough to be contained in 4 bytes (which it usually is) so it could be just 4 bytes per active section. The main situation where I think it would be desirable to disable anti-fragmentation is for zones that are smaller than MIGRATE_TYPES*MAX_ORDER_NR_PAGES and this is for performance reasons to avoid regularly entering __rmqueue_fallback(). However, that situation could be automatically detected and handled by having allocflags_to_migratetype() always return MIGRATE_UNMOVABLE for small zones and similar for get_pageblock_migratetype(). There would be no need for a sysctl. If a sysctl was ...
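The per-zone freelist formula quoted above is easy to check numerically. The sketch below is plain user-space C with an invented function name, `extra_freelist_bytes()`. It takes the pointer size as a parameter because a struct list_head is two pointers, and MIGRATE_TYPES = 5 is inferred from the "4 * 8 * 10" working rather than stated outright in the mail.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Extra per-zone bytes for the additional free lists:
 *   (MIGRATE_TYPES - 1) * sizeof(struct list_head) * (MAX_ORDER - 1)
 * where sizeof(struct list_head) is two pointers.
 */
size_t extra_freelist_bytes(unsigned int migrate_types,
			    size_t pointer_size,
			    unsigned int max_order)
{
	assert(migrate_types > 0 && max_order > 0);
	return (size_t)(migrate_types - 1) * (2 * pointer_size) * (max_order - 1);
}
```

With 5 migrate types, 4-byte pointers and MAX_ORDER 11 this gives the 320 bytes per zone quoted above; dropping one migrate type gives the 240-byte variant, and 8-byte pointers double the 320 to 640.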
That's a very modest overhead - not worth the config option, IMO. The runtime overhead might be a concern - is it possible to quantify it? -
Do you mean performance wise or memory wise?
Memory-wise, something like
===
FLATMEM Case
bits = 0;
for_each_zone(zone) {
        bits += (zone->spanned_pages >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
}
bytes_consumed = bits / 8;
===
SPARSEMEM Case, a rough approximation is
((vm_total_pages * PAGE_SIZE) >> SECTION_SIZE_BITS) * 8
The consumption could be stored in a zone variable similar to
zone->present_pages and visible through /proc/zoneinfo. Would that be
useful?
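For anyone wanting to plug numbers into the two estimates above, here is a
minimal userspace sketch in C. It is not kernel code: struct ex_zone and the
ex_ functions are hypothetical stand-ins, MAX_ORDER of 11 and 3 bits per
block are the values used earlier in the thread, and the page shift and
section size are passed in because they vary by architecture.

```c
#include <stddef.h>

#define EX_MAX_ORDER          11  /* assumed, as in the x86 example */
#define EX_NR_PAGEBLOCK_BITS  3   /* assumed, 3 bits per block */

/* Hypothetical stand-in for the one struct zone field used here. */
struct ex_zone {
    unsigned long spanned_pages;
};

/* FLATMEM estimate: 3 bits per MAX_ORDER_NR_PAGES block spanned. */
static unsigned long ex_flatmem_bitmap_bytes(const struct ex_zone *zones,
                                             size_t nr_zones)
{
    unsigned long bits = 0;
    size_t i;

    for (i = 0; i < nr_zones; i++)
        bits += (zones[i].spanned_pages >> (EX_MAX_ORDER - 1)) *
                EX_NR_PAGEBLOCK_BITS;
    return bits / 8;
}

/* SPARSEMEM rough approximation: 8 bytes per section of memory. */
static unsigned long long ex_sparsemem_bytes(unsigned long long total_pages,
                                             unsigned int page_shift,
                                             unsigned int section_size_bits)
{
    return ((total_pages << page_shift) >> section_size_bits) * 8;
}
```

With a 1GiB machine split into zones of 4096, 225280 and 32768 pages of 4KiB
each (an example layout, not a measured one), the FLATMEM estimate comes to
the 96 bytes mentioned earlier. The SPARSEMEM result depends entirely on the
section size chosen, so treat any value fed in there as an assumption.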
Performance wise is harder to quantify. There are three places where
issues can show up. The first is with allocation fallbacks where
__rmqueue_fallback() is called. Fallbacks are expensive but fallbacks are
rare except when the zone is too small which is why I probably should be
catching that case explicitly. I used to have a counters patch for
fallbacks. I could bring it up to date to use __count_vm_events() to
quantify fallbacks if you think it would be useful?
The second hotpoint is where the per-cpu lists are searched for a page of
the suitable migrate type. When I looked at this, an instruction-level
profile on x86 showed about 2-4% of the time spent in
get_page_from_freelist() was searching the per-cpu lists for a page of a
suitable type. IIRC, something like 85% of the time there was clearing the
pages although I'd need to double check this to be 100% sure.
The last potential performance hotpoint is where the pageblock flags are
read on every free in get_pageblock_flags_group(). There is probably room
for optimisation there. I haven't an exact quantification available at the
moment but I remember seeing it far down the list of functions where time
was spent when I was last looking at this.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
CPU load. From your earlier email I'd decided memory consumption was a hm, well. It'd be good to drill down, quantify and, where needed, fix these things. Because the existence of that config option is quite undesirable. -
I figured that was the case but thought I would try pin it down more and
offer storing the overhead in a counter just in case there is a situation
After my last mail, I turned on my main desktop (Intel(R) Pentium(R) D on
32 bit) and set it going with 2.6.21-rc3-mm2 with the fix for Mariusz
applied. I built a kernel while music was playing and I went off watching
someone else make my dinner - a scientific test to be sure.
4.2% of the time in get_page_from_freelist() is spent in this
list_for_each_entry;
/* Find a page of the appropriate migrate type */
list_for_each_entry(page, &pcp->list, lru) {
Something like 3% is spent on one instruction
84768 0.0334 :c014c535: lea 0x0(%esi),%esi
Maybe I can avoid some of this by optimistically checking if the first
entry is suitable before entering into a loop, prefetching data and the
like.
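The "check the first entry before looping" idea could look roughly like the
userspace sketch below. This is not the per-cpu list code from the patch:
struct ex_page, its singly linked next pointer and ex_find_page_of_type()
are hypothetical, chosen only to show the shape of the fast path. The
prefetching part of the idea is not shown.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page sitting on a per-cpu list. */
struct ex_page {
    int migratetype;
    struct ex_page *next;
};

/* Return the first page of the wanted type, or NULL if there is none. */
static struct ex_page *ex_find_page_of_type(struct ex_page *head,
                                            int migratetype)
{
    struct ex_page *page;

    /* Fast path: optimistically test the head before walking the list. */
    if (head != NULL && head->migratetype == migratetype)
        return head;

    /* Slow path: search the remainder of the list. */
    for (page = head ? head->next : NULL; page != NULL; page = page->next) {
        if (page->migratetype == migratetype)
            return page;
    }
    return NULL;
}
```

Whether this saves anything in practice depends on how often the head of the
list already has the right type, which would need to be measured.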
To put the loop into perspective though, 82% of the time was spent on one
instruction within __constant_c_and_count_memset() called from
prep_new_page() here;
2059075 0.8117 :c014c6d8: rep stos %eax,%es:(%edi)
On architectures with a cheaper prep_new_page(), the list search may be
more noticeable. I'll see can I check what this looks like on ppc64 during
the week because I believe ppc64 is able to zero pages faster.
__rmqueue_fallback didn't even appear in readprofile or oprofile even
though it has to have been executed during boot time. I guess it just
wasn't sampled enough. The overhead should be visible on ia64 with low
memory machines because MAX_ORDER_NR_PAGES is so ridiculously large there.
I have access to an IA64 machine with 1GB so I'll experiment with forcing
allocflags_to_migratetype() to return MIGRATE_UNMOVABLE when
vm_total_pages < (MAX_ORDER_NR_PAGES * MIGRATE_TYPES).
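A minimal sketch of that small-memory check, written as plain userspace C,
might look like the following. The enum values, their ordering and
ex_pick_migratetype() are hypothetical; only the comparison against
MAX_ORDER_NR_PAGES * MIGRATE_TYPES comes from the text above, and the real
allocflags_to_migratetype() takes allocation flags rather than these
arguments.

```c
/* Hypothetical migrate types; the names and order are illustrative only. */
enum ex_migratetype {
    EX_MIGRATE_UNMOVABLE,
    EX_MIGRATE_RECLAIMABLE,
    EX_MIGRATE_MOVABLE,
    EX_MIGRATE_RESERVE,
    EX_MIGRATE_HIGHALLOC,
    EX_MIGRATE_TYPES
};

/*
 * Group everything as unmovable when there is too little memory for each
 * migrate type to get at least one MAX_ORDER_NR_PAGES block; otherwise
 * keep the type that was requested.
 */
static enum ex_migratetype ex_pick_migratetype(unsigned long total_pages,
                                               unsigned long max_order_nr_pages,
                                               enum ex_migratetype requested)
{
    if (total_pages < max_order_nr_pages * EX_MIGRATE_TYPES)
        return EX_MIGRATE_UNMOVABLE;
    return requested;
}
```

With a block size of 1024 pages the cut-off is 5120 pages, so a 4096-page
system would always get EX_MIGRATE_UNMOVABLE while a 262144-page system
keeps the requested type.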
get_pageblock_flags_group() was the 40th most executed function according
to oprofile. 70% of the time in that function was spent here;
for (; start_bitidx <= end_bitidx; ...