ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5...
- This is a bugfixed version of 2.6.26-rc5-mm2, which was a bugfixed
version of 2.6.26-rc5-mm1. None of the git trees were repulled for -mm3
(and nor were they repulled for -mm2).The aim here is to get all the stupid bugs out of the way so that some
serious MM testing can be performed.- Please perform some serious MM testing.
Boilerplate:
- See the `hot-fixes' directory for any important updates to this patchset.
- To fetch an -mm tree using git, use (for example)
git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1- -mm kernel commit activity can be reviewed by subscribing to the
mm-commits mailing list.echo "subscribe mm-commits" | mail majordomo@vger.kernel.org
- If you hit a bug in -mm and it is not obvious which patch caused it, it is
most valuable if you can perform a bisection search to identify which patch
introduced the bug. Instructions for this process are athttp://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt
But beware that this process takes some time (around ten rebuilds and
reboots), so consider reporting the bug first and if we cannot immediately
identify the faulty patch, then perform the bisection search.- When reporting bugs, please try to Cc: the relevant maintainer and mailing
list on any email.- When reporting bugs in this kernel via email, please also rewrite the
email Subject: in some manner to reflect the nature of the bug. Some
developers filter by Subject: when looking for messages to read.- Occasional snapshots of the -mm lineup are uploaded to
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on
the mm-commits list. These probably are at least compilable.- More-than-daily -mm snapshots may be found at
[ message continues ]
After running some of the libhugetlbfs tests the value for
/proc/meminfo/HugePages_Rsvd becomes really large. It looks like it has
wrapped backwards from zero.
Below is the sequence I used to run one of the tests that causes this;
the tests passes for what it is intended to test but leaves a large
value for reserved pages and that seemed strange to me.
test run on ppc64 with 16M huge pagescat /proc/meminfo
....
HugePages_Total: 25
HugePages_Free: 25
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 16384 kBmount -t hugetlbfs hugetlbfs /mnt
tundro4:~/libhugetlbfs-dev-20080516/tests # HUGETLBFS_VERBOSE=99 HUGETLBFS_DEBUG=y PATH="obj64:$PATH" LD_LIBRARY_PATH="$LD_LIBRARY_PATH:../obj64:obj64" truncate_above_4GB
Starting testcase "truncate_above_4GB", pid 3145
Mapping 3 hpages at offset 0x100000000...mapped at 0x3fffd000000
Replacing map at 0x3ffff000000 with map from offset 0x1000000...done
Truncating at 0x100000000...done
PASScat /proc/meminfo
....
HugePages_Total: 25
HugePages_Free: 25
HugePages_Rsvd: 18446744073709551614
HugePages_Surp: 0
Hugepagesize: 16384 kBI put in some printks and see that the rsvd value goes mad in
'return_unused_surplus_pages'.Debug output:
tundro4 kernel: mm/hugetlb.c:gather_surplus_pages:527; resv_huge_pages=0 delta=3
tundro4 kernel: Call Trace:
tundro4 kernel: [c000000287dff9a0] [c000000000010978] .show_stack+0x7c/0x1c4 (unreliable)
tundro4 kernel: [c000000287dffa50] [c0000000000d7c8c] .hugetlb_acct_memory+0xa4/0x448
tundro4 kernel: [c000000287dffb20] [c0000000000d85ec] .hugetlb_reserve_pages+0xec/0x16c
tundro4 kernel: [c000000287dffbc0] [c0000000001be7fc] .hugetlbfs_file_mmap+0xe0/0x154
tundro4 kernel: [c000000287dffc70] [c0000000000cbc78] .mmap_region+0x280/0x52c
tundro4 kernel: [c000000287dffd80] [c00000000000bfa0] .sys_mmap+0xa8/0x108
tundro4 kernel: [c000000287dffe30] [c0000000000086ac] syscall_exit+0x0/0x40
tundro4 kernel: mm/hugetlb.c:gather_surplus_pages:530; resv_huge_pages=3 delt...
As reported by Adam Litke and Jon Tollefson one of the libhugetlbfs
regression tests triggers a negative overall reservation count. When
this occurs where there is no dynamic pool enabled tests will fail.Following this email are two patches to address this issue:
hugetlb reservations: move region tracking earlier -- simply moves the
region tracking code earlier so we do not have to supply prototypes, andhugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma
splits -- which moves us to tracking the consumed reservation so that
we can correctly calculate the remaining reservations at vma close time.This stack is against the top of v2.6.25-rc6-mm3, should this solution
prove acceptable it would need slipping underneath Nick's multiple hugepage
size patches and those updated. I have a modified stack prepared for that.This version incorporates Mel's feedback (both cosmetic, and an allocation
under spinlock issue) and has an improved layout.Changes in V2:
- commentry updates
- pull allocations out from under hugetlb_lock
- refactor to match shared code layout
- reinstate BUG_ON'sJon could you have a test on this and see if it works out for you.
-apw
--
Version two works for me too. I am not seeing the reserve value become
negative when running the libhuge tests.Jon
--
When a hugetlb mapping with a reservation is split, a new VMA is cloned
from the original. This new VMA is a direct copy of the original
including the reservation count. When this pair of VMAs are unmapped
we will incorrect double account the unused reservation and the overall
reservation count will be incorrect, in extreme cases it will wrap.The problem occurs when we split an existing VMA say to unmap a page in
the middle. split_vma() will create a new VMA copying all fields from
the original. As we are storing our reservation count in vm_private_data
this is also copies, endowing the new VMA with a duplicate of the original
VMA's reservation. Neither of the new VMAs can exhaust these reservations
as they are too small, but when we unmap and close these VMAs we will
incorrect credit the remainder twice and resv_huge_pages will become
out of sync. This can lead to allocation failures on mappings with
reservations and even to resv_huge_pages wrapping which prevents all
subsequent hugepage allocations.The simple fix would be to correctly apportion the remaining reservation
count when the split is made. However the only hook we have vm_ops->open
only has the new VMA we do not know the identity of the preceeding VMA.
Also even if we did have that VMA to hand we do not know how much of the
reservation was consumed each side of the split.This patch therefore takes a different tack. We know that the whole of any
private mapping (which has a reservation) has a reservation over its whole
size. Any present pages represent consumed reservation. Therefore if
we track the instantiated pages we can calculate the remaining reservation.This patch reuses the existing regions code to track the regions for which
we have consumed reservation (ie. the instantiated pages), as each page
is faulted in we record the consumption of reservation for the new page.
When we need to return unused reservations at unmap time we simply count
the consumed reservation region subtracting that from...
Nice explanation. Testing on i386 with qemu, this patch allows some
small tests to pass without corruption of the rsvd counters.
libhugetlbfs tests also passed. I do not see anything new to complain
about in the code. Thanks.--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
Move the region tracking code much earlier so we can use it for page
presence tracking later on. No code is changed, just its location.Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
mm/hugetlb.c | 246 +++++++++++++++++++++++++++++----------------------------
1 files changed, 125 insertions(+), 121 deletions(-)diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0f76ed1..d701e39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -47,6 +47,131 @@ static unsigned long __initdata default_hstate_size;
static DEFINE_SPINLOCK(hugetlb_lock);/*
+ * Region tracking -- allows tracking of reservations and instantiated pages
+ * across the pages in a mapping.
+ */
+struct file_region {
+ struct list_head link;
+ long from;
+ long to;
+};
+
+static long region_add(struct list_head *head, long f, long t)
+{
+ struct file_region *rg, *nrg, *trg;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+
+ /* Check for and consume any regions we now overlap with. */
+ nrg = rg;
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ break;
+
+ /* If this area reaches higher then extend our area to
+ * include it completely. If this is not the first area
+ * which we intend to reuse, free it. */
+ if (rg->to > t)
+ t = rg->to;
+ if (rg != nrg) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ }
+ nrg->from = f;
+ nrg->to = t;
+ return 0;
+}
+
+static long region_chg(struct list_head *head, long f, long t)
+{
+ struct file_region *rg, *nrg;
+ long chg = 0;
+
+ /* Locate the region we are before or in. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* If we are below the current region then a new...
Straight-forward code-move.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
As reported by Adam Litke and Jon Tollefson one of the libhugetlbfs
regression tests triggers a negative overall reservation count. When
this occurs where there is no dynamic pool enabled tests will fail.Following this email are two patches to fix this issue:
hugetlb reservations: move region tracking earlier -- simply moves the
region tracking code earlier so we do not have to supply prototypes, andhugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma
splits -- which moves us to tracking the consumed reservation so that
we can correctly calculate the remaining reservations at vma close time.This stack is against the top of v2.6.25-rc6-mm3, should this solution
prove acceptable it would probabally need porting below Nicks multiple
hugepage size patches and those updated; if so I would be happy to do
that too.Jon could you have a test on this and see if it works out for you.
-apw
--
Looking good so far.
I am not seeing any of the tests push the reservation number negative -
with this patch set appliedJon
--
Move the region tracking code much earlier so we can use it for page
presence tracking later on. No code is changed, just its location.Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
mm/hugetlb.c | 246 +++++++++++++++++++++++++++++----------------------------
1 files changed, 125 insertions(+), 121 deletions(-)diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0f76ed1..d701e39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -47,6 +47,131 @@ static unsigned long __initdata default_hstate_size;
static DEFINE_SPINLOCK(hugetlb_lock);/*
+ * Region tracking -- allows tracking of reservations and instantiated pages
+ * across the pages in a mapping.
+ */
+struct file_region {
+ struct list_head link;
+ long from;
+ long to;
+};
+
+static long region_add(struct list_head *head, long f, long t)
+{
+ struct file_region *rg, *nrg, *trg;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+
+ /* Check for and consume any regions we now overlap with. */
+ nrg = rg;
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ break;
+
+ /* If this area reaches higher then extend our area to
+ * include it completely. If this is not the first area
+ * which we intend to reuse, free it. */
+ if (rg->to > t)
+ t = rg->to;
+ if (rg != nrg) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ }
+ nrg->from = f;
+ nrg->to = t;
+ return 0;
+}
+
+static long region_chg(struct list_head *head, long f, long t)
+{
+ struct file_region *rg, *nrg;
+ long chg = 0;
+
+ /* Locate the region we are before or in. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* If we are below the current region then a new...
When a hugetlb mapping with a reservation is split, a new VMA is cloned
from the original. This new VMA is a direct copy of the original
including the reservation count. When this pair of VMAs are unmapped
we will incorrect double account the unused reservation and the overall
reservation count will be incorrect, in extreme cases it will wrap.The problem occurs when we split an existing VMA say to unmap a page in
the middle. split_vma() will create a new VMA copying all fields from
the original. As we are storing our reservation count in vm_private_data
this is also copies, endowing the new VMA with a duplicate of the original
VMA's reservation. Neither of the new VMAs can exhaust these reservations
as they are too small, but when we unmap and close these VMAs we will
incorrect credit the remainder twice and resv_huge_pages will become
out of sync. This can lead to allocation failures on mappings with
reservations and even to resv_huge_pages wrapping which prevents all
subsequent hugepage allocations.The simple fix would be to correctly apportion the remaining reservation
count when the split is made. However the only hook we have vm_ops->open
only has the new VMA we do not know the identity of the preceeding VMA.
Also even if we did have that VMA to hand we do not know how much of the
reservation was consumed each side of the split.This patch therefore takes a different tack. We know that the whole of any
private mapping (which has a reservation) has a reservation over its whole
size. Any present pages represent consumed reservation. Therefore if
we track the instantiated pages we can calculate the remaining reservation.This patch reuses the existing regions code to track the regions for which
we have consumed reservation (ie. the instantiated pages), as each page
is faulted in we record the consumption of reservation for the new page.
When we need to return unused reservations at unmap time we simply count
the consumed reservation region subtracting that from...
decrement_hugepage_resv_vma() is called with hugetlb_lock held and region_chg
calls kmalloc(GFP_KERNEL). Hence it's possible we would sleep with that
spinlock held which is a bit uncool. The allocation needs to happen outside--
Mel Gorman
--
Yes, good spot. Luckily this pair of calls can be separated, as the
first is a prepare and the second a commit. So I can trivially pull
the allocation outside the lock.Had a quick go at this and it looks like I can move both out of the lock
to a much more logical spot and clean the patch up significantly. Will
fold in your other comments and post up a V2 once it has been tested.Thanks.
-apw
--
D'oh. It's not even that extreme, it's fairly straight-forward
to trigger as it turns out as this crappy application shows
http://www.csn.ul.ie/~mel/postings/apw-20080622/hugetlbfs-unmap-private-...
. This runs on x86 and can wrap the rsvd counters. I believe the other testsYeah, that does sound as if it would occur all right and running the
Clever. The additional nice thing is that it makes private mappings less of
a special case in comparison to shared mappings. My impression right now is
that with the path, shared mappings track reservations based on the underlying
file and the private mappings are generally tracked per-mapping and only shareThis looks sensible and applying the patches and running the test program
means that the reserve counter does not wrap when the program exists which
is very nice. I also tested a parent-child scenario where the pool is of
insufficient size and the child gets killed as expected. Thanks a million
for cleaning this up.Ok, seems straight forward. The tuples track pages that already exist so
by counting the overlaps in a given range, you know how many hugepages
have been faulted. The size of the VMA minus the overlap is theThe bits move here but for good reason. private_data is now a pointer and
Otherwise, looks right. The region_truncate() looked a bit odd but you
The VM_BUG_ON checks are removed here. Is that intentional? They still
The comment needs an update here to explain what the return value means.
There is an incredibly remote possibility that a fault would fail for a
mapping that had reserved huge pages because the kmalloc() in region_chg
failed. The system would have to be in terrible shape though. Should a
KERN_WARNING be printed here if this failure path is entered? Otherwise itThis comment is a tad misleading. The open call is also called at fork()
time. However, in the case of fork, the private_data will be cleared.
Maybe something like;====
The open vm_op is called when new VMAs are created but ...
Yes Adam reported that here yesterday, he found it in his hugetlfs testing.
I have done some investigation on it and it is being triggered by a bug in
the private reservation tracking patches. It is triggered by the hugetlb
test which causes some complex vma splits to occur on a private mapping.I believe I have the underlying problem nailed and do have some nearly
complete patches for this and they should be in a postable state by
tommorrow.-apw
--
Jon
--
When a process loads a kernel module, __stop_machine_run() is called, and
it calls sched_setscheduler() to give newly created kernel threads highest
priority. However, the process can have no CAP_SYS_NICE which required
for sched_setscheduler() to increase the priority. For example, SystemTap
loads its module with only CAP_SYS_MODULE. In this case,
sched_setscheduler() returns -EPERM, then BUG() is called.Failure of sched_setscheduler() wouldn't be a real problem, so this
patch just ignores it.
Or, should we give the CAP_SYS_NICE capability temporarily?Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
kernel/stop_machine.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)Index: linux-2.6.26-rc5-mm3/kernel/stop_machine.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/kernel/stop_machine.c
+++ linux-2.6.26-rc5-mm3/kernel/stop_machine.c
@@ -143,8 +143,7 @@ int __stop_machine_run(int (*fn)(void *)
kthread_bind(threads[i], i);/* Make it highest prio. */
- if (sched_setscheduler(threads[i], SCHED_FIFO, ¶m) != 0)
- BUG();
+ sched_setscheduler(threads[i], SCHED_FIFO, ¶m);
}/* We've created all the threads. Wake them all: hold this CPU so one
--
Hi Hidehiro,
Nice catch. This can happen in the current code, it just doesn't
Well, it can mean that the stop_machine blocks indefinitely. Better
I don't think so. It can be seen from another thread, and in theory
that should not see something random. Worse, they can change it from
another thread.How's this?
sched_setscheduler: add a flag to control access checks
Hidehiro Kawai noticed that sched_setscheduler() can fail in
stop_machine: it calls sched_setscheduler() from insmod, which can
have CAP_SYS_MODULE without CAP_SYS_NICE.This simply introduces a flag to allow us to disable the capability
checks for internal callers (this is simpler than splitting the
sched_setscheduler() function, since it loops checking permissions).The flag is only "false" (ie. no check) for the following cases, where
it shouldn't matter:
drivers/input/touchscreen/ucb1400_ts.c:ucb1400_ts_thread()
- it's a kthread
drivers/mmc/core/sdio_irq.c:sdio_irq_thread()
- also a kthread
kernel/kthread.c:create_kthread()
- making a kthread (from kthreadd)
kernel/softlockup.c:watchdog()
- also a kthreadAnd these cases could have failed before:
kernel/softirq.c:cpu_callback()
- CPU hotplug callback
kernel/stop_machine.c:__stop_machine_run()
- Called from various places, including modprobe()Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
diff -r 509f0724da6b drivers/input/touchscreen/ucb1400_ts.c
--- a/drivers/input/touchscreen/ucb1400_ts.c Thu Jun 19 17:06:30 2008 +1000
+++ b/drivers/input/touchscreen/ucb1400_ts.c Thu Jun 19 19:36:40 2008 +1000
@@ -287,7 +287,7 @@ static int ucb1400_ts_thread(void *_ucb)
int valid = 0;
struct sched_param param = { .sched_priority = 1 };- sched_setscheduler(tsk, SCHED_FIFO, ¶m);
+ sched_setscheduler(tsk, SCHED_FIFO, ¶m, false);set_freezable();
while (!kthread_should_stop()) {
diff -r 509f0724da6b drivers/mmc/core/sdio_irq.c
--- a/drivers/mmc/core/sdio_irq.c Thu Jun 19 17:06:30...
What about?
int sched_setscheduler(struct task_struct *p, int policy,
struct sched_param *param)
{
return __sched_setscheduler(p, policy, param, true);
}int sched_setscheduler_nocheck(struct task_struct *p, int policy,
struct sched_param *param)
{
return __sched_setscheduler(p, policy, param, false);
}(With the appropriate transformation of sched_setscheduler -> __)
Better than scattering stray true/falses around the code.
J
--
agreed - it would also be less intrusive on the API change side.
i've created a new tip/sched/new-API-sched_setscheduler topic for this
to track it, but it would be nice to have a v2 of this patch that
introduces the new API the way suggested by Jeremy. (Hence the new topic
is auto-merged into tip/master but not into linux-next yet.) Thanks,Ingo
--
Yes, here's the patch. I've put it in my tree for testing, too.
sched_setscheduler_nocheck: add a flag to control access checks
Hidehiro Kawai noticed that sched_setscheduler() can fail in
stop_machine: it calls sched_setscheduler() from insmod, which can
have CAP_SYS_MODULE without CAP_SYS_NICE.Two cases could have failed, so are changed to sched_setscheduler_nocheck:
kernel/softirq.c:cpu_callback()
- CPU hotplug callback
kernel/stop_machine.c:__stop_machine_run()
- Called from various places, including modprobe()Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
diff -r 91c45b8d7775 include/linux/sched.h
--- a/include/linux/sched.h Mon Jun 23 13:49:26 2008 +1000
+++ b/include/linux/sched.h Mon Jun 23 13:54:55 2008 +1000
@@ -1655,6 +1655,8 @@ extern int task_curr(const struct task_s
extern int task_curr(const struct task_struct *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
+extern int sched_setscheduler_nocheck(struct task_struct *, int,
+ struct sched_param *);
extern struct task_struct *idle_task(int cpu);
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
diff -r 91c45b8d7775 kernel/sched.c
--- a/kernel/sched.c Mon Jun 23 13:49:26 2008 +1000
+++ b/kernel/sched.c Mon Jun 23 13:54:55 2008 +1000
@@ -4744,16 +4744,8 @@ __setscheduler(struct rq *rq, struct tas
set_load_weight(p);
}-/**
- * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
- * @p: the task in question.
- * @policy: new policy.
- * @param: structure containing the new RT priority.
- *
- * NOTE that the task may be already dead.
- */
-int sched_setscheduler(struct task_struct *p, int policy,
- struct sched_param *param)
+static int __sched_setscheduler(struct task_struct *p, int policy,
+ struct sched_param *param, bool user)
{
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned ...
applied to tip/sched/new-API-sched_setscheduler, thanks Rusty. Also
added it to auto-sched-next so that it shows up in linux-next.as it suffered from line-warp damage.
Ingo
--
Hi.
I got this bug while migrating pages only a few times
via memory_migrate of cpuset.Unfortunately, even if this patch is applied,
I got bad_page problem after hundreds times of page migration
(I'll report it in another mail).
But I believe something like this patch is needed anyway.------------[ cut here ]------------
kernel BUG at mm/migrate.c:719!
invalid opcode: 0000 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index1/shared_cpu_map
CPU 0
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_log dm_multipath dm_mod sbs sbshc button battery acpi_memhotplug ac parport_pc lp parport floppy serio_raw rtc_cmos rtc_core rtc_lib 8139too pcspkr 8139cp mii ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
Pid: 3096, comm: switch.sh Not tainted 2.6.26-rc5-mm3 #1
RIP: 0010:[<ffffffff8029bb85>] [<ffffffff8029bb85>] migrate_pages+0x33e/0x49f
RSP: 0018:ffff81002f463bb8 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffe20000c17500 RCX: 0000000000000034
RDX: ffffe20000c17500 RSI: ffffe200010003c0 RDI: ffffe20000c17528
RBP: ffffe200010003c0 R08: 8000000000000000 R09: 304605894800282f
R10: 282f87058b480028 R11: 0028304005894800 R12: ffff81003f90a5d8
R13: 0000000000000000 R14: ffffe20000bf4cc0 R15: ffff81002f463c88
FS: 00007ff9386576f0(0000) GS:ffffffff8061d800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ff938669000 CR3: 000000002f458000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process switch.sh (pid: 3096, threadinfo ffff81002f462000, task ffff81003e99cf10)
Stack: 0000000000000001 ffffffff80290777 0000000000000000 0000000000000000
ffff81002f463c88 ffff81000000ea18 ffff81002f463c88 000000000000000c
ffff81002f463ca8 00007ffffffff000 00007fff649f6000 0000000000000004
Call Trace:
[<ffffffff80290777>] ? new_node_page+0x0/0x2f
...
On Tue, 17 Jun 2008 16:35:01 +0900
Ok, let's fix one by one. please add your Signed-off-by if ok.This is a fix for page migration under speculative page lookup protocol.
-Kame
==
In speculative page cache lookup protocol, page_count(page) is set to 0
while radix-tree midification is going on, truncation, migration, etc...While page migration, a page fault to page under migration should wait
unlock_page() and migration_entry_wait() waits for the page from its
pte entry. It does get_page() -> wait_on_page_locked() -> put_page() now.In page migration, page_freeze_refs() -> page_unfreeze_refs() is called.
Here, page_unfreeze_refs() expects page_count(page) == 0 and panics
if page_count(page) != 0. To avoid this, we shouldn't touch page_count()
if it is zero. This patch uses page_cache_get_speculative() to avoid
the panic.From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)Index: test-2.6.26-rc5-mm3/mm/migrate.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/migrate.c
+++ test-2.6.26-rc5-mm3/mm/migrate.c
@@ -243,7 +243,8 @@ void migration_entry_wait(struct mm_strupage = migration_entry_to_page(entry);
- get_page(page);
+ if (!page_cache_get_speculative())
+ goto out;
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
put_page(page);--
On Wed, 18 Jun 2008 10:13:49 +0900
This is obviously buggy....sorry..quilt refresh miss..==
In speculative page cache lookup protocol, page_count(page) is set to 0
while radix-tree modification is going on, truncation, migration, etc...While page migration, a page fault to page under migration should wait
unlock_page() and migration_entry_wait() waits for the page from its
pte entry. It does get_page() -> wait_on_page_locked() -> put_page() now.In page migration, page_freeze_refs() -> page_unfreeze_refs() is called.
Here, page_unfreeze_refs() expects page_count(page) == 0 and panics
if page_count(page) != 0. To avoid this, we shouldn't touch page_count()
if it is zero. This patch uses page_cache_get_speculative() to avoid
the panic.From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)Index: test-2.6.26-rc5-mm3/mm/migrate.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/migrate.c
+++ test-2.6.26-rc5-mm3/mm/migrate.c
@@ -243,7 +243,8 @@ void migration_entry_wait(struct mm_strupage = migration_entry_to_page(entry);
- get_page(page);
+ if (!page_cache_get_speculative(page))
+ goto out;
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
put_page(page);--
These tend to all happen while the page is locked, and in particular
while the page does not have any references other than the current
code path and the pagecache. So no page tables should point to it.At any rate, page_cache_get_speculative() should not be used for this
purpose, but for when we _really_ don't have any references to a page.
--
On Wed, 18 Jun 2008 15:35:57 +1000
Then, I got NAK. what should I do ?
(This fix is not related to lock_page() problem.)If I read your advice correctly, we shouldn't use lock_page() here.
Before speculative page cache, page_table_entry of a page under migration
has a pte entry which encodes pfn as special pte entry. and wait for the
end of page migration by lock_page().Maybe we just go back to user-land and makes it to do page-fault again
is better ?Thanks,
-Kame--
Well, not nack as such as just wanting to find out a bit more about
What I don't think I understand, is how we can have a page in the
page tables (and with the ptl held) but with a zero refcount... Oh,
it's not actually a page but a migration entry! I'm not quite so
familiar with that code.Hmm, so we might possibly see a page there that has a zero refcount
due to page_freeze_refs? In which case, I think the direction of you
fix is good. Sorry for my misunderstanding the problem, and thank
you for fixing up my code!I would ask you to use get_page_unless_zero rather than
page_cache_get_speculative(), because it's not exactly a speculative
reference -- a speculative reference is one where we elevate _count
and then must recheck that the page we have is correct.Also, please add a comment. It would really be nicer to hide this
transiently-frozen state away from migration_entry_wait, but I can't
see any lock that would easily solve it.Thanks,
Nick
--
On Wed, 18 Jun 2008 16:42:37 +1000
ok, will adds comments.Thanks,
-Kame--
In speculative page cache look up protocol, page_count(page) is set to 0
while radix-tree modification is going on, truncation, migration, etc...While page migration, a page fault to page under migration does
- look up page table
- find it is migration_entry_pte
- decode pfn from migration_entry_pte and get page of pfn_page(pfn)
- wait until page is unlockedIt does get_page() -> wait_on_page_locked() -> put_page() now.
In page migration's radix-tree replacement, page_freeze_refs() ->
page_unfreeze_refs() is called. And page_count(page) turns to be zero
and must be kept to be zero while radix-tree replacement.If get_page() is called against a page under radix-tree replacement,
the kernel panics(). To avoid this, we shouldn't increment page_count()
if it is zero. This patch uses get_page_unless_zero().Even if get_page_unless_zero() fails, the caller just retries.
But will be a bit busier.Change log v1->v2:
- rewrote the patch description and added comments.From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)Index: test-2.6.26-rc5-mm3/mm/migrate.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/migrate.c
+++ test-2.6.26-rc5-mm3/mm/migrate.c
@@ -242,8 +242,15 @@ void migration_entry_wait(struct mm_stru
goto out;page = migration_entry_to_page(entry);
-
- get_page(page);
+ /*
+ * Once radix-tree replacement of page migration started, page_count
+ * *must* be zero. And, we don't want to call wait_on_page_locked()
+ * against a page without get_page().
+ * So, we use get_page_unless_zero(), here. Even failed, page fault
+ * will occur again.
+ */
+ if (!get_page_unless_zero(page))
+ goto out;
pte_unmap_unlock(ptep, ptl);
wait_on_page_lock...
Thanks
Acked-by: Nick Piggin <npiggin@suse.de>
--
Great!
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>--
sorry, so late responce.
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
--
Agree. It should be fixed independently.
--
I think that at least part of your patch, below, should fix this
problem. See comments there.Now I wonder if the assertion that newpage count == 1 could be violated?
I don't see how. We've just allocated and filled it and haven'tI'm not sure about this part. If it IS needed, I think it would be
needed independently of the unevictable/putback_lru_page() changes, as
this race must have already existed.However, unmap_and_move() replaced the migration entries with bona fide
pte's referencing the new page before freeing the old page, so I think
I agree with this part. I came to the same conclusion looking at the
code. If we just changed the if() and VM_BUG_ON() to:if (!page->mapping && page_count(page) == 1) { ...
we'd be doing exactly what putback_lru_page() is doing. So, this code
as always unnecessary, duplicate code [that I was trying to avoid :(].
So, just let putback_lru_page() handle this condition and conditionally
unlock_page().I'm testing with my stress load with the 2nd part of the patch above and
it's holding up OK. Of course, I didn't hit the problem before. I'll
try your duplicator script and see what happens.Regards,
Lee--
Without this part, I can easily get VM_BUG_ON in get_page,
even when processes in cpusets are only bash.---
kernel BUG at include/linux/mm.h:297!
:
Call Trace:
[<ffffffff80280d82>] ? handle_mm_fault+0x3e5/0x782
[<ffffffff8048c8bf>] ? do_page_fault+0x3d0/0x7a7
[<ffffffff80263ed0>] ? audit_syscall_exit+0x2e4/0x303
[<ffffffff8048a989>] ? error_exit+0x0/0x51
Code: b8 00 00 00 00 00 e2 ff ff 48 8d 1c 02 48 8b 13 f6 c
2 01 75 04 0f 0b eb fe 80 e6 40 48 89 d8 74 04 48 8b 43 10 83 78 08 00 75 04 <0f> 0b eb fe
f0 ff 40 08 fe 45 00 f6 03 01 74 0a 31 f6 48 89 df
RIP [<ffffffff8029c309>] migration_entry_wait+0xcb/0xfa
RSP <ffff81062cc6fe58>
---I agree that this part should be fixed independently, and
Kamezawa-san has already posted a patch for this.Thanks,
Daisuke Nishimura.
--
Disagree: IIRC, excellent example of the kind of assumption
that becomes invalid with Nick's speculative page references.Someone interested in the previous use of the page may have
incremented the refcount, and in due course will find that
it's got reused for something else, and will then back off.Hugh
--
Yeah. Kosaki-san mentioned that we'd need some rework for the
speculative page cache work. Looks like we'll need to drop the
VM_BUG_ON().I need to go read up on the new invariants we can trust with the
speculative page cache.Thanks,
Lee--
I don't know if I've added a summary, which is something I should
do.The best thing to do is never use page_count, but just use get
and put to refcount it. If you really must use it:- If there are X references to a page, page_count will return >= X.
- If page_count returns Y, there are no more than Y references to the page.--
this part is really necessary?
I tryed to remove it, but any problem doesn't happend.Of cource, another part is definitly necessary for specurative pagecache :)
--
I made this part first, and added a fix for migration_entry_wait later.
So, I haven't test without this part, and I think it will cause
VM_BUG_ON() here without this part.--
I got this VM_BUG_ON() as expected only by doing:
# echo $$ >/cgroup/cpuset/02/tasks
So, I beleive that both fixes for migration_entry_wait and
unmap_and_move (and, of course, removal VM_BUG_ON from
putback_lru_page) are needed.Thanks,
Daisuke Nishimura.
--
OK, I confirmed this part.
Andrew, please pick.
==================================================
Against: 2.6.26-rc5-mm3
remove redundant mapping check.
we'd be doing exactly what putback_lru_page() is doing. So, this code
as always unnecessary, duplicate code.
So, just let putback_lru_page() handle this condition and conditionally
unlock_page().Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>---
mm/migrate.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)Index: b/mm/migrate.c
===================================================================
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -716,13 +716,7 @@ unlock:
* restored.
*/
list_del(&page->lru);
- if (!page->mapping) {
- VM_BUG_ON(page_count(page) != 1);
- unlock_page(page);
- put_page(page); /* just free the old page */
- goto end_migration;
- } else
- unlock = putback_lru_page(page);
+ unlock = putback_lru_page(page);
}if (unlock)
--
On Wed, 18 Jun 2008 13:41:28 +0900
yes, but I'm now trying to rewrite putback_lru_page(). For avoid more complication.Thanks,
-Kame--
I got bad_page after hundreds times of page migration.
It seems that a locked page is being freed.Bad page state in process 'switch.sh'
page:ffffe20001ee8f40 flags:0x0500000000080019 mapping:0000000000000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 23283, comm: switch.sh Not tainted 2.6.26-rc5-mm3-test6-lee #1Call Trace:
[<ffffffff802747b0>] bad_page+0x97/0x131
[<ffffffff80275ae6>] free_hot_cold_page+0xd4/0x19c
[<ffffffff8027a5c3>] putback_lru_page+0xf4/0xfb
[<ffffffff8029b210>] putback_lru_pages+0x46/0x74
[<ffffffff8029bc5b>] migrate_pages+0x3f4/0x468
[<ffffffff80290797>] new_node_page+0x0/0x2f
[<ffffffff80291631>] do_migrate_pages+0x19b/0x1e7
[<ffffffff8025c827>] cpuset_migrate_mm+0x58/0x8f
[<ffffffff8025d0fd>] cpuset_attach+0x8b/0x9e
[<ffffffff8032ffdc>] sscanf+0x49/0x51
[<ffffffff8025a3e1>] cgroup_attach_task+0x3a3/0x3f5
[<ffffffff80489a90>] __mutex_lock_slowpath+0x64/0x93
[<ffffffff8025af06>] cgroup_common_file_write+0x150/0x1dd
[<ffffffff8025aaf4>] cgroup_file_write+0x54/0x150
[<ffffffff8029f855>] vfs_write+0xad/0x136
[<ffffffff8029fd92>] sys_write+0x45/0x6e
[<ffffffff8020bef2>] tracesys+0xd5/0xdaHexdump:
000: 28 00 08 00 00 00 00 05 02 00 00 00 01 00 00 00
010: 00 00 00 00 00 00 00 00 41 3b 41 2f 00 81 ff ff
020: 46 01 00 00 00 00 00 00 e8 17 e6 01 00 e2 ff ff
030: e8 4b e6 01 00 e2 ff ff 00 00 00 00 00 00 00 00
040: 19 00 08 00 00 00 00 05 00 00 00 00 ff ff ff ff
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: ba 06 00 00 00 00 00 00 00 01 10 00 00 c1 ff ff
070: 00 02 20 00 00 c1 ff ff 00 00 00 00 00 00 00 00
080: 28 00 08 00 00 00 00 05 01 00 00 00 00 00 00 00
090: 00 00 00 00 00 00 00 00 01 3d 41 2f 00 81 ff ff
0a0: bb c3 55 f7 07 00 00 00 68 c4 f0 01 00 e2 ff ff
0b0: e8 8f ee 01 00 e2 ff ff 00 00 00 00 00 00 00 00
------------[ cut here ]------------
kernel BUG at mm/filemap.c:575!
inva...
Lee-san, how about this ?
Tested on x86-64 and tried Nisimura-san's test at el. works good now.
-Kame
==
putback_lru_page()/unevictable page handling rework.Now, putback_lru_page() requires that the page is locked.
And in some special case, implicitly unlock it.This patch tries to make putback_lru_pages() to be lock_page() free.
(Of course, some callers must take the lock.)The main reason that putback_lru_page() assumes that page is locked
is to avoid the change in page's status among Mlocked/Not-Mlocked.Once it is added to unevictable list, the page is removed from
unevictable list only when page is munlocked. (there are other special
case. but we ignore the special case.)
So, status change during putback_lru_page() is fatal and page should
be locked.putback_lru_page() in this patch has a new concepts.
When it adds page to unevictable list, it checks the status is
changed or not again. if changed, retry to putback.This patche changes also caller side and cleaning up lock/unlock_page().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroy@jp.fujitsu.com>
---
mm/internal.h | 2 -
mm/migrate.c | 23 +++----------
mm/mlock.c | 24 +++++++-------
mm/vmscan.c | 96 +++++++++++++++++++++++++---------------------------------
4 files changed, 61 insertions(+), 84 deletions(-)Index: test-2.6.26-rc5-mm3/mm/vmscan.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/vmscan.c
+++ test-2.6.26-rc5-mm3/mm/vmscan.c
@@ -486,73 +486,63 @@ int remove_mapping(struct address_space
* Page may still be unevictable for other reasons.
*
* lru_lock must not be held, interrupts must be enabled.
- * Must be called with page locked.
- *
- * return 1 if page still locked [not truncated], else 0
*/
-int putback_lru_page(struct page *page)
+#ifdef CONFIG_UNEVICTABLE_LRU
+void putback_lru_page(struct page *page)
{
int lru;
- int ret = 1;
int was_unevictable;- VM_BUG_ON(!PageLocked(page));
...
I have been testing with my work load on both ia64 and x86_64 and it
Given that the race that would activate this retry is likely quite rare,
A couple of minor comments below, but:
Hmmm. Still thinking about this. No need to protect against in flight
I don't understand what you're trying to say here. That is, what the
--
On Wed, 18 Jun 2008 14:21:06 -0400
thank you.
We access the page-table without taking pte_lock. But this vm is MLOCKED
and migration-race is handled. So we don't need to be too nervous to access
the pte. I'll consider more meaningful words.Thanks,
-Kame--
Update:
On x86_64 [32GB, 4xdual-core Opteron], my work load has run for ~20:40
hours. Still running.On ia64 [32G, 16cpu, 4 node], the system started going into softlockup
after ~7 hours. Stack trace [below] indicates zone-lru lock in
__page_cache_release() called from put_page(). Either heavy contention
or failure to unlock. Note that previous run, with patches to
putback_lru_page() and unmap_and_move(), the same load ran for ~18 hours
before I shut it down to try these patches.I'm going to try again with the collected patches posted by Kosaki-san
[for which, Thanks!]. If it occurs again, I'll deconfig the unevictable
lru feature and see if I can reproduce it there. It may be unrelated toOK, so you just want to note that we're accessing the pte w/o locking
and that this is safe because the vma has been VM_LOCKED and all pages
should be mlocked?I'll note that the vma is NOT VM_LOCKED during the pte walk.
munlock_vma_pages_range() resets it so that try_to_unlock(), called from
munlock_vma_page(), won't try to re-mlock the page. However, we hold
the mmap sem for write, so faults are held off--no need to worry about a
COW fault occurring between when the VM_LOCKED was cleared and before
the page is munlocked. If that could occur, it could open a window
where a non-mlocked page is mapped in this vma, and page reclaim could
potentially unmap the page. Shouldn't be an issue as long as we never
downgrade the semaphore to read during munlock.Lee
----------
softlockup stack trace for "usex" workload on ia64:BUG: soft lockup - CPU#13 stuck for 61s! [usex:124359]
Modules linked in: ipv6 sunrpc dm_mirror dm_log dm_multipath scsi_dh dm_mod pci_slot fan dock thermal sg sr_mod processor button container ehci_hcd ohci_hcd uhci_hcd usbcorePid: 124359, CPU 13, comm: usex
psr : 00001010085a6010 ifs : 8000000000000000 ip : [<a00000010000a1a0>] Tainted: G D (2.6.26-rc5-mm3-kame-rework+mcl_inherit)
ip is at ia64_spinlock_contention+0x2...
Lee-san, this is an additonal one..
Not-tested-yet, just by review.Fixing page_lock() <-> zone->lock nesting of bad-behavior.
Before:
lock_page()(TestSetPageLocked())
spin_lock(zone->lock)
unlock_page()
spin_unlock(zone->lock)
After:
spin_lock(zone->lock)
spin_unlock(zone->lock)Including nit-pick fix. (I'll ask Kosaki-san to merge this to his 5/5)
Hmm...
---
mm/vmscan.c | 25 +++++--------------------
1 file changed, 5 insertions(+), 20 deletions(-)Index: test-2.6.26-rc5-mm3/mm/vmscan.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/vmscan.c
+++ test-2.6.26-rc5-mm3/mm/vmscan.c
@@ -1106,7 +1106,7 @@ static unsigned long shrink_inactive_lis
if (nr_taken == 0)
goto done;- spin_lock(&zone->lru_lock);
+ spin_lock_irq(&zone->lru_lock);
/*
* Put back any unfreeable pages.
*/
@@ -1136,9 +1136,8 @@ static unsigned long shrink_inactive_lis
}
}
} while (nr_scanned < max_scan);
- spin_unlock(&zone->lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
done:
- local_irq_enable();
pagevec_release(&pvec);
return nr_reclaimed;
}
@@ -2438,7 +2437,7 @@ static void show_page_path(struct page *
*/
static void check_move_unevictable_page(struct page *page, struct zone *zone)
{
-
+retry:
ClearPageUnevictable(page); /* for page_evictable() */
if (page_evictable(page, NULL)) {
enum lru_list l = LRU_INACTIVE_ANON + page_is_file_cache(page);
@@ -2455,6 +2454,8 @@ static void check_move_unevictable_page(
*/
SetPageUnevictable(page);
list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+ if (page_evictable(page, NULL))
+ goto retry;
}
}@@ -2494,16 +2495,6 @@ void scan_mapping_unevictable_pages(stru
next = page_index;
next++;- if (TestSetPageLocked(page)) {
- /*
- * OK, let's do it the hard way...
- */
- if ...
No.
shrink_inactive_list lock usage is
local_irq_disable()
spin_lock(&zone->lru_lock);
while(){
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
spin_lock_irq(&zone->lru_lock);
}
}
spin_unlock(&zone->lru_lock);
local_irq_enable();this keep below lock rule.
Right.
--
OK, I'll test this on my x86_64 platform, which doesn't seem to hit the
Couple of comments:
* I believe that the locks are acquired in the right order--at least as
documented in the comments in mm/rmap.c.
* The unlocking appears out of order because this function attempts to
hold the zone lock across a few pages in the pagevec, but must switch to
a different zone lru lock when it finds a page on a different zone from
the zone whose lock it is holding--like in the pagevec drainingRight. With your reworked check_move_unevictable_page() [with retry],
we don't need to lock the page here, any more. That means we can revert
all of the changes to pass the mapping back to sys_shmctl() and move the
call to scan_mapping_unevictable_pages() back to shmem_lock() after
clearing the address_space's unevictable flag. We only did that to
avoid sleeping while holding the shmem_inode_info lock and the
shmid_kernel's ipc_perm spinlock.1) It appears that the spin_lock() [no '_irq'] was there because irqs
are disabled a few lines above so that we could use non-atomic
__count[_zone]_vm_events().
2) I think this predates the split lru or unevictable lru patches, so
We can remove this comment ^^^^^^^^^^^^^^^^^^^^^^^^^^I'll let you know how it goes.
Later,
Lee--
Yes.
I'll remove it.--
Quick update:
With this patch applied, at ~ 1.5 hours into the test, my system panic'd
[panic_on_oops set] with a BUG in __find_get_block() -- looks like the
BUG_ON() in check_irqs_on() called from bh_lru_install() inlined by
__find_get_block(). Before the panic occurred, I saw warnings from
native_smp_call_function_mask() [arch/x86/kernel/smp.c]--also because
irqs_disabled().I'll back out the changes [spin_[un]lock() => spin_[un]lock_irq()] to
shrink_inactive_list() and try again. Just a hunch.Lee
--
Yup.
Kamezawa-san's patch remove local_irq_enable(), but don't remove
local_irq_disable().--
On Sat, 21 Jun 2008 17:56:17 +0900
Sorry,
--
On Thu, 19 Jun 2008 10:45:22 -0400
On ia64, ia64_spinlock_contention() enables irq and soft-lockup detection
by irq works well. On x86-64, irq is not enabled during spin-wait, and
soft-lockup detection irq cannot be handled until irq is enabled.
Then, it seems there is someone who drops into infinite-loop within
spin_lock_irqsave(&zone->lock, flags)....Then, "A" cpu doesn't report soft-lockup while others report ?
Hmm..
--
Thank you for clarification. (so..will check Kosaki-san's one's comment later.
I think I have never seen this kind of dead-lock related to zone->lock.
(maybe it's because zone->lock is used in clear way historically)
I'll check around zone->lock. thanks.Regards,
-Kame--
Another update--with the collected patches:
Again, the x86_64 ran for > 22 hours w/o error before I shut it down.
And, again, the ia64 went into soft lockup--same stack traces. This
time after > 17 hours of running. It is possible that a BUG started
this, but it has long scrolled out of my terminal buffer by the time I
see the system.I'm now trying the ia64 platform with 26-rc5-mm3 + collected patches
with UNEVICTABLE_LRU de-configured. I'll start that up today and let it
run over the weekend [with panic_on_oops set] if it hasn't hit the
problem before I leave.Regards,
Lee--
Hi, Kamezawa-san.
I like this idea.
I'll test it tomorrow.
Thanks,
Daisuke Nishimura.
--
it seems good idea :)
I think this comment is useful.
No.
originally move_to_lru() called in unmap_and_move().
unevictable infrastructure patch move to this point for
calling putback_lru_page() under page locked.So, your patch remove page locked dependency.
move to unmap_and_move() again is better.--
On Wed, 18 Jun 2008 20:36:52 +0900
I caught panci here ;)
maybe
==
if (lru == LRU_UNEVICTABLE && page_evictable(page, NULL))
==
ok, will look into again.Thanks,
--
I agree with Kosaki-san.
And VM_BUG_ON(page_count(newpage) != 1) in unmap_and_move()
is not correct again, IMHO.
I got this BUG actually when testing this patch(with
migratin_entry_wait fix).unmap_and_move()
move_to_new_page()
migrate_page()
remove_migration_ptes()
putback_lru_page() (*1)
:
if (!newpage->mapping) (*2)
VM_BUG_ON(page_count(newpage) != 1)If a anonymous page(without mapping) is migrated successfully,
this page is moved back to lru by putback_lru_page()(*1),
and the page count becomes 1(pte only).At the same time(between *1 and *2), if the process
that owns this page are freeing this page, the page count
becomes 0 and ->mapping becomes NULL by free_hot_cold_page(),
so this BUG is caused.I've not seen this BUG on real HW yet(seen twice on fake-numa
hvm guest of Xen), but I think it can happen theoretically.Thanks,
Daisuke Nishimura.
--
On Thu, 19 Jun 2008 17:00:59 +0900
That's (maybe) because page->mapping is not cleared when it's removed
from rmap. (and there is pagevec to dealy freeing....)But ok, I see your point. KOSAKI-san is now writing patch set to
fix the whole. please see it.Thanks,
-Kame--
I can't reproduce this bad page.
I'll try again tomorrow ;)--
OK. I'll report on my test more precisely.
- Environment
HW: 4CPU(x86_64), 2node NUMA
kernel: 2.6.26-rc5-mm3 + Lee's two fixes about double unlock_page
+ my patch. config is attached.- mount cpuset and make settings
# mount -t cgroup -o cpuset cpuset /cgroup/cpuset# mkdir /cgroup/cpuset/01
# echo 0-1 >/cgroup/cpuset/01/cpuset.cpus
# echo 0 >/cgroup/cpuset/01/cpuset.mems
# echo 1 >/cgroup/cpuset/01/cpuset.memory_migrate# mkdir /cgroup/cpuset/02
# echo 2-3 >/cgroup/cpuset/02/cpuset.cpus
# echo 1 >/cgroup/cpuset/02/cpuset.mems
# echo 1 >/cgroup/cpuset/02/cpuset.memory_migrate- register processes in cpusets
# echo $$ >/cgroup/cpuset/01/tasksI'm using LTP's page01 test, and running two instances infinitely.
# while true; do (somewhere)/page01 4194304 1; done &
# while true; do (somewhere)/page01 4194304 1; done &The same thing should be done about 02 directory.
- echo pids to another directory
Run simple script like below.---
#!/bin/bashG1=$1
G2=$2move_task()
{
for pid in $1
do
echo $pid >$2/tasks 2>/dev/null
done
}G1_TASK=`cat ${G1}/tasks`
G2_TASK=`cat ${G2}/tasks`move_task "${G1_TASK}" ${G2} &
move_task "${G2_TASK}" ${G1} &wait
---Please let me know if you need other information.
I'm also digging this problem.Thanks,
Daisuke Nishimura.
Thank you verbose explain.
I ran its testcase >3H today.
but unfortunately, I couldn't reproduce it.Hmm...
--
On Tue, 17 Jun 2008 16:47:09 +0900
Good catch, and I think your investigation in the last e-mail was correct.
I'd like to dig this...but it seems some kind of big fix is necessary.
Did this happen under page-migraion by cpuset-task-move test ?Thanks,
--
Yes.
I made 2 cpuset directories, run some processes in each cpusets,
and run a script like below infinitely to move tasks and migrate pages.---
#!/bin/bashG1=$1
G2=$2move_task()
{
for pid in $1
do
echo $pid >$2/tasks 2>/dev/null
done
}G1_TASK=`cat ${G1}/tasks`
G2_TASK=`cat ${G2}/tasks`move_task "${G1_TASK}" ${G2} &
move_task "${G2_TASK}" ${G1} &wait
---I got this bad_page after running this script for about 600 times.
Thanks,
Daisuke Nishimura.--
I'm seeing *mlocked* pages [PG_mlocked] being freed now with my stress
load, with just the "if(!page->mapping) { } clause removed, as proposed
in your rfc patch in previous mail. Need to investigate this...I'm not seeing *locked* pages [PG_lock], tho'. From your stack trace,
it appears that migrate_page() left locked pages on the list of pages to
be putback. The pages get locked and unlocked in unmap_and_move(). I
haven't found a path [yet] where the page can be returned still locked.--
Please see the mail I've just sended to Kosaki-san :)
Thanks,
Daisuke Nishimura.
--
was: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix
kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)<snip>
This [freeing of mlocked pages] also occurs in unpatched 26-rc5-mm3.
Fixed by the following:
PATCH: fix munlock page table walk - now requires 'mm'
Against 2.6.26-rc5-mm3.
Incremental fix for: mlock-mlocked-pages-are-unevictable-fix.patch
Initialize the 'mm' member of the mm_walk structure, else the
page table walk doesn't occur, and mlocked pages will not be
munlocked. This is visible in the vmstats:noreclaim_pgs_munlocked - should equal noreclaim_pgs_mlocked
less (nr_mlock + noreclaim_pgs_cleared), but is always zero
[munlock_vma_page() never called]noreclaim_pgs_mlockfreed - should be zero [for debug only],
but == noreclaim_pgs_mlocked - (nr_mlock + noreclaim_pgs_cleared)Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mlock.c | 2 ++
1 file changed, 2 insertions(+)Index: linux-2.6.26-rc5-mm3/mm/mlock.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/mlock.c 2008-06-17 15:20:57.000000000 -0400
+++ linux-2.6.26-rc5-mm3/mm/mlock.c 2008-06-17 15:23:17.000000000 -0400
@@ -318,6 +318,8 @@ static void __munlock_vma_pages_range(st
VM_BUG_ON(start < vma->vm_start);
VM_BUG_ON(end > vma->vm_end);+ munlock_page_walk.mm = mm;
+
lru_add_drain_all(); /* push cached pages to LRU */
walk_page_range(start, end, &munlock_page_walk);
lru_add_drain_all(); /* to update stats */--
Yup, Dave Hansen changed page_walk interface recently.
thus, his and ours patch is conflicted ;)below patch is just nit cleanups.
===========================================
From: Lee Schermerhorn <lee.schermerhorn@hp.com>This [freeing of mlocked pages] also occurs in unpatched 26-rc5-mm3.
Fixed by the following:
PATCH: fix munlock page table walk - now requires 'mm'
Against 2.6.26-rc5-mm3.
Incremental fix for: mlock-mlocked-pages-are-unevictable-fix.patch
Initialize the 'mm' member of the mm_walk structure, else the
page table walk doesn't occur, and mlocked pages will not be
munlocked. This is visible in the vmstats:noreclaim_pgs_munlocked - should equal noreclaim_pgs_mlocked
less (nr_mlock + noreclaim_pgs_cleared), but is always zero
[munlock_vma_page() never called]noreclaim_pgs_mlockfreed - should be zero [for debug only],
but == noreclaim_pgs_mlocked - (nr_mlock + noreclaim_pgs_cleared)Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>mm/mlock.c | 1 +
1 file changed, 1 insertion(+)Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -310,6 +310,7 @@ static void __munlock_vma_pages_range(st
.pmd_entry = __munlock_pmd_handler,
.pte_entry = __munlock_pte_handler,
.private = &mpw,
+ .mm = mm,
};VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
--
Indeed!
I guess lee's unevictable infrastructure and nick's specurative pagecache
is conflicted.
I'm investigating deeply now.--
Looks like x86 and ARM both fail to boot if PROFILE_LIKELY, FTRACE and
DYNAMIC_FTRACE are selected. If any one of those three are disabled it
boots (or fails in some other way which I'm looking at now). The serial
console output from both machines when they fail to boot is below, let me
know if there is any other information I can provide.ARM (Marvell Orion 5x):
<5>Linux version 2.6.26-rc5-mm3-dirty (bb3081@gamma) (gcc version 4.2.0 20070413 (prerelease) (CodeSourcery Sourcery G++ Lite 2007q1-21)) #24 PREEMPT Thu Jun 12 23:39:12 BST 2008
CPU: Feroceon [41069260] revision 0 (ARMv5TEJ), cr=a0053177
Machine: QNAP TS-109/TS-209
<4>Clearing invalid memory bank 0KB@0xffffffff
<4>Clearing invalid memory bank 0KB@0xffffffff
<4>Clearing invalid memory bank 0KB@0xffffffff
<4>Ignoring unrecognised tag 0x00000000
<4>Ignoring unrecognised tag 0x00000000
<4>Ignoring unrecognised tag 0x00000000
<4>Ignoring unrecognised tag 0x41000403
Memory policy: ECC disabled, Data cache writeback
<7>On node 0 totalpages: 32768
<7>Node 0 memmap at 0xc05df000 size 1048576 first pfn 0xc05df000
<7>free_area_init_node: node 0, pgdat c0529680, node_mem_map c05df000
<7> DMA zone: 32512 pages, LIFO batch:7
CPU0: D VIVT write-back cache
CPU0: I cache: 32768 bytes, associativity 1, 32 byte lines, 1024 sets
CPU0: D cache: 32768 bytes, associativity 1, 32 byte lines, 1024 sets
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32512
<5>Kernel command line: console=ttyS0,115200n8 root=/dev/nfs nfsroot=192.168.2.53:/stuff/debian ip=dhcp
PID hash table entries: 512 (order: 9, 2048 bytes)
Console: colour dummy device 80x30
<6>Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
<6>Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
<6>Memory: 128MB = 128MB total
<5>Memory: 123776KB available (5016K code, 799K data, 160K init)
<6>SLUB: Genslabs=12, HWalign=32, Order=0-3, MinOb...
I was able to reproduce a hang on x86 with those options. The patch
below is a potential fix. I think we don't want to trace
do_check_likely(), since the ftrace internals might use likely/unlikely
macro's which will just cause recursion back to do_check_likely()..Signed-off-by: Daniel Walker <dwalker@mvista.com>
---
lib/likely_prof.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)Index: linux-2.6.25/lib/likely_prof.c
===================================================================
--- linux-2.6.25.orig/lib/likely_prof.c
+++ linux-2.6.25/lib/likely_prof.c
@@ -22,7 +22,7 @@static struct likeliness *likeliness_head;
-int do_check_likely(struct likeliness *likeliness, unsigned int ret)
+int notrace do_check_likely(struct likeliness *likeliness, unsigned int ret)
{
static unsigned long likely_lock;--
the better fix would be to add likely_prof.o to this list of exceptions
in lib/Makefile:ifdef CONFIG_FTRACE
# Do not profile string.o, since it may be used in early boot or vdso
CFLAGS_REMOVE_string.o = -pg
# Also do not profile any debug utilities
CFLAGS_REMOVE_spinlock_debug.o = -pg
CFLAGS_REMOVE_list_debug.o = -pg
CFLAGS_REMOVE_debugobjects.o = -pg
endifinstead of adding notrace to the source.
Ingo
--
Here's the fix mentioned above.
--
Remove tracing from likely profiling since it could cause recursion if
ftrace uses likely/unlikely macro's internally.Signed-off-by: Daniel Walker <dwalker@mvista.com>
---
lib/Makefile | 2 ++
1 file changed, 2 insertions(+)Index: linux-2.6.25/lib/Makefile
===================================================================
--- linux-2.6.25.orig/lib/Makefile
+++ linux-2.6.25/lib/Makefile
@@ -15,6 +15,8 @@ CFLAGS_REMOVE_string.o = -pg
CFLAGS_REMOVE_spinlock_debug.o = -pg
CFLAGS_REMOVE_list_debug.o = -pg
CFLAGS_REMOVE_debugobjects.o = -pg
+# likely profiling can cause recursion in ftrace, so don't trace it.
+CFLAGS_REMOVE_likely_prof.o = -pg
endiflib-$(CONFIG_MMU) += ioremap.o
--
Did you happen to check PROFILE_LIKELY and FTRACE alone?
Daniel
--
Yes, without DYNAMIC_FTRACE the arm box gets all the way to userspace and
the x86 box panics while registering a driver so most likely unrelated to
this problem.--
Byron Bradley
--
Hi Andrew,
2.6.26-rc5-mm3 kernel panics while booting up on the x86_64
machine. Sorry the console is bit overwritten for the first few lines.------------[ cut here ]------------
ot fs
no fstab.kernel BUG at mm/filemap.c:575!
sys, mounting ininvalid opcode: 0000 [1] ternal defaultsSMP
Switching to ne
w root and runnilast sysfs file: /sys/block/dm-3/removable
ng init.
unmounCPU 3 ting old /dev
u
nmounting old /pModules linked in:roc
unmounting
old /sys
Pid: 1, comm: init Not tainted 2.6.26-rc5-mm3-autotest #1
RIP: 0010:[<ffffffff80268155>] [<ffffffff80268155>] unlock_page+0xf/0x26
RSP: 0018:ffff81003f9e1dc8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffe20000f63080 RCX: 0000000000000036
RDX: 0000000000000000 RSI: ffffe20000f63080 RDI: ffffe20000f63080
RBP: 0000000000000000 R08: ffff81003f9a5727 R09: ffffc10000200200
R10: ffffc10000100100 R11: 000000000000000e R12: 0000000000000000
R13: 0000000000000000 R14: ffff81003f47aed8 R15: 0000000000000000
FS: 000000000066d870(0063) GS:ffff81003f99fa80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000065afa0 CR3: 000000003d580000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process init (pid: 1, threadinfo ffff81003f9e0000, task ffff81003f9d8000)
Stack: ffffe20000f63080 ffffffff80270d9c 0000000000000000 ffffffffffffffff
000000000000000e 0000000000000000 ffffe20000f63080 ffffe20000f630c0
ffffe20000f63100 ffffe20000f63140 ffffe20000f63180 ffffe20000f631c0
Call Trace:
[<ffffffff80270d9c>] truncate_inode_pages_range+0xc5/0x305
[<ffffffff802a7177>] generic_delete_inode+0xc9/0x133
[<ffffffff8029e3cd>] do_unlinkat+0xf0/0x160
[<ffffffff8020bd0b>] system_call_after_swapgs+0x7b/0x80Code: 00 00 48 85 c0 74 0b 48 8b 40 10 48 85 c0 74 02 ff d0 e8 75 ec 32 00 41 5b 31 c0 c3 53 48 89 fb f0 0f ba 37 00 19 c0 85 c0 75 04 <0f> 0b eb fe e8 56 ...
For whatever it's worth, I'm seeing the same thing on my x86_64 laptop.
-rc5-mm2 works OK, I'm going to try to bisect it tonight.% diff -u /usr/src/linux-2.6.26-rc5-mm[23]/.config
--- /usr/src/linux-2.6.26-rc5-mm2/.config 2008-06-10 22:21:13.000000000 -0400
+++ /usr/src/linux-2.6.26-rc5-mm3/.config 2008-06-12 22:20:25.000000000 -0400
@@ -1,7 +1,7 @@
#
# Automatically generated make config: don't edit
-# Linux kernel version: 2.6.26-rc5-mm2
-# Tue Jun 10 22:21:13 2008
+# Linux kernel version: 2.6.26-rc5-mm3
+# Thu Jun 12 22:20:25 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
@@ -275,7 +275,7 @@
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
-# CONFIG_NORECLAIM_LRU is not set
+CONFIG_UNEVICTABLE_LRU=y
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0Not much changed there...
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5... is said to "fix" it.
--
Another unlock of an unlocked page. Presumably when reclaim hadn't
done anything yet.Don't know, sorry. Strange.
--
Looks like something lockless pagecache *could* be connected with, but
I have never seen such a bug.Hmm...
@@ -104,6 +105,7 @@ truncate_complete_page(struct address_sp
cancel_dirty_page(page, PAGE_CACHE_SIZE);remove_from_page_cache(page);
+ clear_page_mlock(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */...
+static inline void clear_page_mlock(struct page *page)
+{
+ if (unlikely(TestClearPageMlocked(page)))
+ __clear_page_mlock(page);
+}...
+void __clear_page_mlock(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page)); /* for LRU isolate/putback */
+
+ dec_zone_page_state(page, NR_MLOCK);
+ count_vm_event(NORECL_PGCLEARED);
+ if (!isolate_lru_page(page)) {
+ putback_lru_page(page);
+ } else {
+ /*
+ * Page not on the LRU yet. Flush all pagevecs and retry.
+ */
+ lru_add_drain_all();
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+ else if (PageUnevictable(page))
+ count_vm_event(NORECL_PGSTRANDED);
+ }
+}...
+int putback_lru_page(struct page *page)
+{
+ int lru;
+ int ret = 1;
+ int was_unevictable;
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ lru = !!TestClearPageActive(page);
+ was_unevictable = TestClearPageUnevictable(page); /* for
page_evictable() */
+
+ if (unlikely(!page->mapping)) {
+ /*
+ * page truncated. drop lock as put_page() will
+ * free the page.
+ */
+ VM_BUG_ON(page_count(page) != 1);
+ unlock_page(page);
^^^^^^^^^^^^^^^^^^This is a rather wild thing to be doing. It's a really bad idea
to drop a lock that's taken sev...
On Thu, 12 Jun 2008 21:38:59 +1000
I agree and strongly hope this unlock should be removed.
The caller can do unlock by itself, I think.Thanks,
-Kame--
On Thu, 12 Jun 2008 01:57:46 -0700
at first look,==
truncate_inode_pages_range()
-> TestSetPageLocked() //
-> truncate_complete_page()
-> remove_from_page_cache() // makes page->mapping to be NULL.
-> clear_page_mlock()
-> __clear_page_mlock()
-> putback_lru_page()
-> unlock_page() // page->mapping is NULL
-> unlock_page() //BUG
==It seems truncate_complete_page() is bad.
==
static void
truncate_complete_page(struct address_space *mapping, struct page *page)
{
if (page->mapping != mapping)
return;if (PagePrivate(page))
do_invalidatepage(page, 0);cancel_dirty_page(page, PAGE_CACHE_SIZE);
remove_from_page_cache(page); -----------------(A)
clear_page_mlock(page); -----------------(B)
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */
}
==(B) should be called before (A) as invalidate_complete_page() does.
Thanks,
-Kame--
This is reproducer of panic. "quick fix" is attached.
But I think putback_lru_page() should be re-designed.==
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <errno.h>int main(int argc, char *argv[])
{
int fd;
char *filename = argv[1];
char buffer[4096];
char *addr;
int len;fd = open(filename, O_CREAT | O_EXCL | O_RDWR, S_IRWXU);
if (fd < 0) {
perror("open");
exit(1);
}
len = write(fd, buffer, sizeof(buffer));if (len < 0) {
perror("write");
exit(1);
}addr = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED|MAP_LOCKED, fd, 0);
if (addr == MAP_FAILED) {
perror("mmap");
exit(1);
}
munmap(addr, 4096);
close(fd);unlink(filename);
}
==
you'll see panic.Fix is here
==quick fix for double unlock_page();
Signed-off-by: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
Index: linux-2.6.26-rc5-mm3/mm/truncate.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/truncate.c
+++ linux-2.6.26-rc5-mm3/mm/truncate.c
@@ -104,8 +104,8 @@ truncate_complete_page(struct address_spcancel_dirty_page(page, PAGE_CACHE_SIZE);
- remove_from_page_cache(page);
clear_page_mlock(page);
+ remove_from_page_cache(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */--
Hi Kame,
Thanks, The patch fixes the kernel panic.
--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--
Confirming this quick fix works on my laptop that was hitting this crash -
am now up and running on -rc5-mm3.
Thanks - I put that in
Yes, it sounds that way.
--
Here's a proposed replacement patch that reworks putback_lru_page()
slightly and cleans up the call sites. I still want to balance the
get_page() in isolate_lru_page() with a put_page() in putback_lru_page()
for the primary users--vmscan and page migration. So, I need to drop
the lock before the put_page() when handed a page with null mapping and
a single reference count as the page will be freed on put_page() and a
locked page would bug out in free_pages_check()/bad_page().Lee
PATCH fix page unlocking protocol for putback_lru_page()
Against: 2.6.26-rc5-mm3
Replaces Kame-san's hotfix:
fix-double-unlock_page-in-2626-rc5-mm3-kernel-bug-at-mm-filemapc-575.patchApplies at end of vmscan/unevictable/mlock series to avoid patch conflicts.
1) modified putback_lru_page() to drop page lock only if both page_mapping()
NULL and page_count() == 1 [rather than VM_BUG_ON(page_count(page) != 1].
I want to balance the put_page() from isolate_lru_page() here for vmscan
and, e.g., page migration rather than requiring explicit checks of the
page_mapping() and explicit put_page() in these areas. However, the page
could be truncated while one of these subsystems holds it isolated from
the LRU. So, need to handle this case. Callers of putback_lru_page()
need to be aware of this and only call it with a page with NULL
page_mapping() when they will no longer reference the page afterwards.
This is the case for vmscan and page migration.2) m[un]lock_vma_page() already will not be called for page with NULL
mapping. Added VM_BUG_ON() to assert this.3) modified clear_page_lock() to skip the isolate/putback shuffle for
pages with NULL mapping, as they are being truncated/freed. Thus,
any future callers of clear_page_lock() need not be concerned about
the putback_lru_page() semantics for truncated pages.Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mlock.c | 29 +++++++++++++++++++----------
mm/vmsca...
On Fri, 13 Jun 2008 11:30:46 -0400
I'm sorry that I cannot catch the whole changes..
I cannot convice that this implicit behavior won't cause lock-up in future, again.
Even if there are enough comments...Why the page should be locked when it is put back to LRU ?
I think this restriction is added by RvR patch set, right ?
I'm sorry that I cannot catch the whole changes..Anyway, IMHO, lock <-> unlock should be visible as a pair as much as possible.
Thanks,
--
Kame-san: The restriction to put the page back to the LRU via
putback_lru_page() with the page locked does come from the unevictable
page infrastructure. Both page migration and vmscan can hold the page
isolated from the LRU, but unlocked, for quite some time. During this
time, a page can become nonreclaimable [or unevictable] or a
nonreclaimable page can become reclaimable. It's OK if an unevictable
pages gets on on the regular LRU lists, because we'll detect it and
"cull" it if/when vmscan attempts to reclaim it. However, if a
reclaimable page gets onto the unevictable LRU list, we may never get it
off, except via manual scan. Rik doesn't think we need the manual scan,
so we've been very careful to avoid conditions where we could "leak" a
reclaimable page permantently onto the unevictable list. Kosaki-san
found several scenarios where this could happen unless we check, under
page lock, the unevictable conditions when putting these pages back onI've considered modifying putback_lru_page() not to unlock/put the page
when mapping == NULL and count == 1. Then all of the callers would have
to remember this state, drop the lock and call put page themselves. I
think this would duplicate code and look ugly, but if we need to do
that, I guess we'll do it.Regards,
--
Below is a fix to the "proposed replacement patch" posted on Friday.
Incorrect test for page->mapping().Lee
Against: 2.6.26-rc5-mm3
Incremental fix to my proposed patch to "fix double unlock_page() in
2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575"."page_mapping(page)" should be "page->mapping" in VM_BUG_ON()s
introduced to m[un]lock_vma_page().Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mlock.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)Index: linux-2.6.26-rc5-mm3/mm/mlock.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/mlock.c 2008-06-16 09:47:28.000000000 -0400
+++ linux-2.6.26-rc5-mm3/mm/mlock.c 2008-06-16 09:48:27.000000000 -0400
@@ -80,12 +80,12 @@ void __clear_page_mlock(struct page *pag
* Mark page as mlocked if not already.
* If page on LRU, isolate and putback to move to unevictable list.
*
- * Called with page locked and page_mapping() != NULL.
+ * Called with page locked and page->mapping != NULL.
*/
void mlock_vma_page(struct page *page)
{
BUG_ON(!PageLocked(page));
- VM_BUG_ON(!page_mapping(page));
+ VM_BUG_ON(!page->mapping);if (!TestSetPageMlocked(page)) {
inc_zone_page_state(page, NR_MLOCK);
@@ -98,7 +98,7 @@ void mlock_vma_page(struct page *page)
/*
* called from munlock()/munmap() path with page supposedly on the LRU.
*
- * Called with page locked and page_mapping() != NULL.
+ * Called with page locked and page->mapping != NULL.
*
* Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
* [in try_to_munlock()] and then attempt to isolate the page. We must
@@ -118,7 +118,7 @@ void mlock_vma_page(struct page *page)
static void munlock_vma_page(struct page *page)
{
BUG_ON(!PageLocked(page));
- VM_BUG_ON(!page_mapping(page));
+ VM_BUG_ON(!page->mapping);if (TestClearPageMlocked(page)) {
dec_zone_page_state(page, NR_MLOCK);--
Hi Lee,
Thanks, After applying the patch, the kernel does not panic's while
bootup.--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--
[ 254.217776] ------------[ cut here ]------------
[ 254.217776] kernel BUG at mm/vmscan.c:510!
[ 254.217776] invalid opcode: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
[ 254.217776] last sysfs file: /sys/kernel/uevent_seqnum
[ 254.217776] CPU 1
[ 254.217776] Modules linked in: ext2 nf_conntrack_irc xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables usblp ehci_hcd uhci_hcd usbcore sr_mod cdrom
[ 254.217776] Pid: 12044, comm: madvise02 Not tainted 2.6.26-rc5-mm3 #4
[ 254.217776] RIP: 0010:[<ffffffff802729b2>] [<ffffffff802729b2>] putback_lru_page+0x152/0x160
[ 254.217776] RSP: 0018:ffff81012edd1cd8 EFLAGS: 00010202
[ 254.217776] RAX: ffffe20003f344b8 RBX: 0000000000000000 RCX: 0000000000000001
[ 254.217776] RDX: 0000000000005d5c RSI: 0000000000000000 RDI: ffffe20003f344b8
[ 254.217776] RBP: ffff81012edd1cf8 R08: 0000000000000000 R09: 0000000000000000
[ 254.217776] R10: ffffffff80275152 R11: 0000000000000001 R12: ffffe20003f344b8
[ 254.217776] R13: 00000000ffffffff R14: ffff810124801080 R15: ffffffffffffffff
[ 254.217776] FS: 00007fb3ad83c6f0(0000) GS:ffff81017f845320(0000) knlGS:0000000000000000
[ 254.217776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 254.217776] CR2: 00007fffb5846d38 CR3: 0000000117de9000 CR4: 00000000000006e0
[ 254.217776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 254.217776] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 254.217776] Process madvise02 (pid: 12044, threadinfo ffff81012edd0000, task ffff81017db6b3c0)
[ 254.217776] Stack: ffffe20003f344b8 ffffe20003f344b8 ffffffff80629300 0000000000000001
[ 254.217776] ffff81012edd1d18 ffffffff8027d268 ffffe20003f344b8 0000000000000000
[ 254.217776] ffff81012edd1d38 ffffffff80271783 0000000000000246 ffffe20003f344b8
[ 254.217776] Call Trace:
[ 254.217776] [<ffffffff8027d268>] __clear_page_mlock+0xe8/0x100
[ 254.217776] [<ffffffff80271783>] truncate_complete_pag...
int putback_lru_page(struct page *page)
{
int lru;
int ret = 1;
int was_unevictable;VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageLRU(page));lru = !!TestClearPageActive(page);
was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */if (unlikely(!page->mapping)) {
/*
* page truncated. drop lock as put_page() will
* free the page.
*/
VM_BUG_ON(page_count(page) != 1);added by unevictable-lru-infrastructure.patch.
How does one reproduce this? Looks like LTP madvise2.
--
Yep, totally reproducible here:
sudo ./testcases/bin/madvise02
--
| Greg KH | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg Kroah-Hartman | [PATCH 005/196] Chinese: add translation of SubmittingDrivers |
| Adrian Bunk | [1/6] 2.6.21-rc2: known regressions |
| Paul Jackson | Re: cpuset-remove-sched-domain-hooks-from-cpusets |
git: | |
| Linus Torvalds | Re: [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Andrew Morton | Re: [BUG] New Kernel Bugs |
