Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

Previous thread: [PATCH] fs-writeback : check sync bit earlier in inode_wait_for_writeback by Richard Kennedy on Friday, April 16, 2010 - 8:33 am. (1 message)

Next thread: [PATCH] drm/radeon/kms: fix rs600 tlb flush by Jerome Glisse on Friday, April 16, 2010 - 9:46 am. (1 message)

Hello all,

I'm having an annoying kernel bug regarding huge pages in Fedora 12:

https://bugzilla.redhat.com/show_bug.cgi?id=552257

Basically I want to use huge pages in a multithreaded number crunching
program, which happens to use process-shared semaphores (because fftw
does it).  The futex for the semaphore ends up lying on a huge page, and
I then get an endless loop in get_futex_key(), apparently because the
anonymous huge page containing the futex does not have a page->mapping.
A test case is provided in the above link.

I reported the bug to Fedora bugzilla months ago, but haven't received
any feedback yet.  The Fedora kernel is based on 2.6.32.11, and a
cursory glance at the 2.6.34-rc3 source does not yield any relevant
change.

So, could anyone tell me if the current mainline kernel might act better
in this respect, before I get around to compiling it?

Thank you very much.

Please CC me as I'm not subscribed.

r6144

--


No, it works much better if you simply mail LKML and CC people who work

Right, so I had a quick chat with Mel, and it appears MAP_PRIVATE
hugetlb pages don't have their page->mapping set.

I guess something like the below might work, but I'd really rather not
add hugetlb knowledge to futex.c. Does anybody else have a better idea?
Maybe create something similar to an anon_vma for hugetlb pages?

---
 kernel/futex.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e7a35f1..b0f1b2d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -252,7 +252,7 @@ again:
 
 	page = compound_head(page);
 	lock_page(page);
-	if (!page->mapping) {
+	if (!page->mapping && !PageHuge(page)) {
 		unlock_page(page);
 		put_page(page);
 		goto again;
@@ -265,7 +265,7 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page)) {
+	if (PageAnon(page) || (PageHuge(page) && !page->mapping)) {
 		key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;


--


anon_vma for hugetlb pages sounds overkill, what would it gain? In this
context, futex only appears to distinguish between whether the
references are private or shared.

Looking at the hugetlbfs code, I can't see a place where it actually cares
about the mapping as such. It's used to find shared pages in the page cache
(but not in the LRU) that are backed by the hugetlbfs file. For hugetlbfs
though, the mapping is mostly kept in page->private for reservation accounting
purposes.

I can't think of other parts of the VM that touch the mapping if the
page is managed by hugetlbfs so the following patch should also work but
without futex having hugetlbfs-awareness. What do you think? Maybe for
safety, it would be better to make the mapping some obvious poison bytes
or'd with PAGE_MAPPING_ANON so an oops will be more obvious?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6034dc9..57a5faa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)
 
 	mapping = (struct address_space *) page_private(page);
 	set_page_private(page, 0);
+	page->mapping = NULL;
 	BUG_ON(page_count(page));
 	INIT_LIST_HEAD(&page->lru);
 
@@ -2447,8 +2448,10 @@ retry:
 			spin_lock(&inode->i_lock);
 			inode->i_blocks += blocks_per_huge_page(h);
 			spin_unlock(&inode->i_lock);
-		} else
+		} else {
 			lock_page(page);
+			page->mapping = (struct address_space *)PAGE_MAPPING_ANON;
+		}
 	}
 
 	/*
--


Yes, this seems perfectly adequate to me, that poison idea might be
worthwhile too :-)




--


Are you ok with an Ack to this slightly-difference patch too? If so,
I'll forward it on to Andrew.

==== CUT HERE ====

Fix infinite loop in get_futex_key when backed by huge pages

If a futex key happens to be located within a huge page mapped MAP_PRIVATE,
get_futex_key() can go into an infinite loop waiting for a page->mapping
that will never exist. This was reported and documented in an external
bugzilla at

https://bugzilla.redhat.com/show_bug.cgi?id=552257

This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON
mapped MAP_PRIVATE.  This is enough for futex to continue but because
of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by
futex. No other part of the VM should be dereferencing the page->mapping of
a hugetlbfs page as its page cache is not on the LRU.

This patch fixes the problem with the test case described in the bugzilla.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/poison.h |   10 ++++++++++
 mm/hugetlb.c           |    6 +++++-
 2 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81..0f7b5ac 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -48,6 +48,16 @@
 #define POISON_FREE	0x6b	/* for use-after-free poisoning */
 #define	POISON_END	0xa5	/* end-byte of poisoning */
 
+/********** mm/hugetlb.c **********/
+/*
+ * Private mappings of hugetlb pages use this poisoned value for
+ * page->mapping. The core VM should not be doing anything with this mapping
+ * but futex requires the existance of some page->mapping value even if it
+ * is unused. If the core VM does deference the mapping, it'll look like a
+ * suspiciously high null-pointer offset starting from 0x2e5
+ */
+#define HUGETLB_PRIVATE_MAPPING	(0x2e4 | PAGE_MAPPING_ANON)
+
 /********** arch/$ARCH/mm/init.c **********/
 #define POISON_FREE_INITMEM	0xcc
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6034dc9..487e3c2 100644
--- ...
From: Andrea Arcangeli
Date: Monday, April 19, 2010 - 8:48 am

Cute indeed.

BTW, just in case I tested this on transparent hugepage and it works
fine (it uses no cpu and can be killed with C^c). I had to hack it
like below to allocate the semaphore on hugepages without
khugepaged. I verified 1 hugepage is allocated (thanks to memory
compaction there's an huge excess of hugepages compared to what my
regular apps can eat ;). Furthermore I found gcc bypasses malloc so a
small patch to gcc should move it all into hugepages. Then maybe we
can build the kernel faster and definitely translate.o will build 8%
faster with the default khugepaged scan_sleep_millisecs settings
(waiting to be confirmed but I exclude bad surprises, whatever runs
fast with khugepaged has will run even faster without it if something,
or equal in the worst case).

Thanks,
Andrea

--- process-shared-sem-hugepage.c.orig	2010-04-19 17:43:47.278964888 +0200
+++ process-shared-sem-hugepage.c	2010-04-19 17:44:01.100032774 +0200
@@ -30,6 +30,7 @@ int main(void)
 
   g_thread_init(NULL);
   workers = g_new(GThread *, NWORKER); work_sem = g_new(sem_t, 1);
+  posix_memalign(&work_sem, 2*1024*1024, 2*1024*1024);
   result = sem_init(work_sem, TRUE, 0); g_assert(result == 0);
 
   for (i = 0; i < NWORKER; i++) {

--


Wouldn't a longer poison be more recognisable? Also, shouldn't this use
POISON_POINTER_DELTA?

Something like:

#define HUGETBL_POISON	((void *) 0x00300300 + POISON_POINTER_DELTA)


--

From: Andrea Arcangeli
Date: Monday, April 19, 2010 - 9:11 am

The default at kernel config time sets only 4k as unmapped (I think
it's a very bad default for 64bit archs), so above 4k userland can map
it and you can have actual derefs with 0x00300300 but not with Mel's
preferred <0x1000 address. So the address must be <0x1000.
--


Well, most poison values have that problem and still we have them. Also
on 64bit machines you can use POISON_POINTER_DELTA to map it outside the
virtual address range.


--

From: Andrea Arcangeli
Date: Monday, April 19, 2010 - 9:32 am

That would better be fixed too to stay <4096 for higher chance of

We've thousands of magic values there, I don't see much benefit from
POISON_POINTER_DELTA other than being able to call it
0xdeadbeef+POISON_POINTER_DELTA ;). We always look the assembly to
find the actual real raw pointer value (without the field offset) so I
think using a range between 0xaaa and 0xbbb for the error pointers, is
functional enough, but it's up to you as long as it is a address range
that can't be used by userland it's surely ok ;).
--


I was looking for an address < 0x1000 because it would only be valid in very
rare cases. I wasn't so sure about any other pointer value and only x86-64

So have I, but it couldn't be too near the page boundary either and pretty
much any address can be valid.  Still, architectures aren't stopped from
defining the delta and it is something we appear to rely on for the list
poisoning. I'd prefer a value below 0x1000 but matching list poisoning should
work for the most part.

How about?

==== CUT HERE ====
Fix infinite loop in get_futex_key when backed by huge pages

If a futex key happens to be located within a huge page mapped MAP_PRIVATE,
get_futex_key() can go into an infinite loop waiting for a page->mapping that
will never exist. This was reported and documented in an external bugzilla at

https://bugzilla.redhat.com/show_bug.cgi?id=552257

This patch makes page->mapping a poisoned value that includes
PAGE_MAPPING_ANON mapped MAP_PRIVATE.  This is enough for futex to continue
but because of PAGE_MAPPING_ANON, the poisoned value is not dereferenced
or used by futex. No other part of the VM should be dereferencing the
page->mapping of a hugetlbfs page as its page cache is not on the LRU.

This patch fixes the problem with the test case described in the bugzilla.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/poison.h |    9 +++++++++
 mm/hugetlb.c           |    5 ++++-
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81..bab71f3 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -48,6 +48,15 @@
 #define POISON_FREE	0x6b	/* for use-after-free poisoning */
 #define	POISON_END	0xa5	/* end-byte of poisoning */
 
+/********** mm/hugetlb.c **********/
+/*
+ * Private mappings of hugetlb pages use this poisoned value for
+ * page->mapping. The core VM should not be doing anything with this mapping
+ * but futex requires the existance of some page->mapping value ...

Thanks Mel, I'm also keen to keep hugetlb awareness out of futex.c, it's 
crazy enough in there already!

Acked-by: Darren Hart <darren@dvhart.com>

r6144, would you consider taking a look at the futextest test suite and 
letting me know where you think it might need improvement to catch what 
you ran into in your tests. I'm thinking some options to futex_wait and 
possibly other tests to tell it where to map the futex pages.

http://git.kernel.org/?p=linux/kernel/git/dvhart/futextest.git;a=summary

Thanks,



-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

--

Previous thread: [PATCH] fs-writeback : check sync bit earlier in inode_wait_for_writeback by Richard Kennedy on Friday, April 16, 2010 - 8:33 am. (1 message)

Next thread: [PATCH] drm/radeon/kms: fix rs600 tlb flush by Jerome Glisse on Friday, April 16, 2010 - 9:46 am. (1 message)