Re: [PATCH] mm: remove global locks from mm/highmem.c

Previous thread: Re: git.kernel.org move (finally)... estimated week of Feb 5 by Oleg Verych on Sunday, January 28, 2007 - 9:37 am. (1 message)

Next thread: [Fwd: [PATCH 2/7] lock_list: a fine grain locked double linked list] by Peter Zijlstra on Sunday, January 28, 2007 - 11:53 am. (4 messages)
To: <linux-kernel@...>, <linux-mm@...>
Cc: Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>
Date: Sunday, January 28, 2007 - 10:11 am

Eradicate global locks.

- kmap_lock is removed by extensive use of atomic_t, a new flush
scheme and modifying set_page_address to only allow NULL<->virt
transitions.

A count of 0 is an exclusive state acting as an entry lock. This is done
using inc_not_zero and cmpxchg. The restriction on changing the virtual
address closes the gap with concurrent additions of the same entry.

- pool_lock is removed by using the pkmap index for the
page_address_maps.

By using the pkmap index for the hash entries it is no longer needed to
keep a free list.

This patch has been in -rt for a while but should also help regular
highmem machines with multiple cores/cpus.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm.h | 32 ++-
mm/highmem.c | 433 ++++++++++++++++++++++++++++++-----------------------
2 files changed, 276 insertions(+), 189 deletions(-)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -543,23 +543,39 @@ static __always_inline void *lowmem_page
#endif

#if defined(WANT_PAGE_VIRTUAL)
-#define page_address(page) ((page)->virtual)
-#define set_page_address(page, address) \
- do { \
- (page)->virtual = (address); \
- } while(0)
-#define page_address_init() do { } while(0)
+/*
+ * wrap page->virtual so it is safe to set/read locklessly
+ */
+#define page_address(page) \
+ ({ typeof((page)->virtual) v = (page)->virtual; \
+ smp_read_barrier_depends(); \
+ v; })
+
+static inline int set_page_address(struct page *page, void *address)
+{
+ if (address)
+ return cmpxchg(&page->virtual, NULL, address) == NULL;
+ else {
+ /*
+ * cmpxchg is a bit abused because it is not guaranteed
+ * safe wrt direct assignment on all platforms.
+ */
+ void *virt = page->virtual;
+ return cmpxchg(&page->vitrual, virt, NULL) == virt;
+ }
+}
+void page_ad...

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Sunday, January 28, 2007 - 6:29 pm

On Sun, 28 Jan 2007 15:11:34 +0100

I really don't recall any performance problems being reported out of that
code in recent years.

As Christoph says, it's very much preferred that code be migrated over to
kmap_atomic(). Partly because kmap() is deadlockable in situations where a
large number of threads are trying to take two kmaps at the same time and
we run out. This happened in the past, but incidences have gone away,

Have you verified that all architectures which can implement
WANT_PAGE_VIRTUAL also implement cmpxchg?

Have you verified that sufficient headers are included for this to compile
correctly on all WANT_PAGE_VIRTUAL-enabling architectures on all configs?
I don't see asm/system.h being included in mm.h and if I get yet another

This looks wrong. What happens if everyone else does their unmap between
the __set_current_state() and the add_wait_queue()?

-

To: Andrew Morton <akpm@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, January 29, 2007 - 3:08 pm

well, almost nobody profiles 32-bit boxes. I personally always knew that
kmap() sucks on certain 32-bit SMP workloads (and -rt's scheduling model
makes such bottlenecks even more apparent) - but many people acted in
the belief that 64-bit is all that matters and 32-bit scalability is
obsolete. Here are the numbers that i think changes the picture:

http://www.fedoraproject.org/awstats/stats/updates-released-fc6-i386.total
http://www.fedoraproject.org/awstats/stats/updates-released-fc6-x86_64.t...

For every 64-bit Fedora box there's more than seven 32-bit boxes. I
think 32-bit is going to live with us far longer than many thought, so
we might as well make it work better. Both HIGHMEM and HIGHPTE is the
default on many distro kernels, which pushes the kmap infrastructure

the problem is that everything that was easy to migrate was migrated off
kmap() already - and it's exactly those hard cases that cannot be
converted (like the pagecache use) which is the most frequent kmap()
users.

While "it would be nice" to eliminate kmap(), but reality is that it's
here and the patches from Peter to make it (quite a bit) more scalable
are here as well.

plus, with these fixes kmap() is actually faster than kmap_atomic().
(because kunmap_atomic() necessiates an INVLPG instruction which is
quite slow.)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, January 29, 2007 - 10:02 pm

I don't think anybody would argue against numbers, but just that there
are not many big 32-bit SMPs anymore. And if Bill Irwin didn't fix the
kmap problem back then, it would be interesting to see a system and
workload where it actually is a bottleneck.

Not that I'm against any patch to improve scalability, if it doesn't

Which pagecache use? file_read_actor()?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
-

To: Andrew Morton <akpm@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, January 29, 2007 - 4:06 pm

i forgot to explain them:

these are updated daily i think. The counters started late October 2006,
when FC6 was released.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, January 29, 2007 - 3:19 pm

But HIGHPTE uses kmap_atomic (in mainline: does -rt use kmap there?)

Hugh
-

To: Hugh Dickins <hugh@...>
Cc: Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, January 29, 2007 - 3:53 pm

The contention i saw was on mainline and in the pagecache uses of
kmap(). With HIGHPTE i only meant that typically every available highmem
option is enabled on 32-bit distro kernel rpms, to make it work on as
wide selection of hardware as possible. Sometimes PAE is split into a
separate rpm, but mostly there's just one 32-bit kernel.

Ingo
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Monday, January 29, 2007 - 5:44 am

CONFIG_HIGHPTE code in -rt was horrid. I'll do some measurements on

It might have been my mistaken in understanding the latest cmpxchg
thread. My understanding was that since LL/SC is not exposable as a low
level primitive all platforms should implement a cmpxchg where some
would not be save against direct assignment.

I thought GCC would automagically use masking when presented with a

Eek, you are quite right.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Monday, January 29, 2007 - 9:31 pm

CONFIG_HIGHPTE is always horrid -we've known that for years.
Don't use it.

If that's all we're fixing here, I'd be highly suspect ...

M.

-

To: Martin J. Bligh <mbligh@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Monday, January 29, 2007 - 9:41 pm

On Mon, 29 Jan 2007 17:31:20 -0800

highpte uses atomic kmaps - it is unrelated to this work.
-

To: Andrew Morton <akpm@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Monday, January 29, 2007 - 9:49 pm

To: Martin J. Bligh <mbligh@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>, David Chinner <dgc@...>
Date: Monday, January 29, 2007 - 10:15 pm

On Mon, 29 Jan 2007 17:49:14 -0800

2% overhead for a pte-intensive workload for unknown reasons four years
ago. Sort of a mini-horrid, no?

We still don't know what is the source of kmap() activity which
necessitated this patch btw. AFAIK the busiest source is ext2 directories,
but perhaps NFS under certain conditions?

<looks at xfs_iozero>

->prepare_write no longer requires that the caller kmap the page.
-

To: Andrew Morton <akpm@...>
Cc: Martin J. Bligh <mbligh@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>, David Chinner <dgc@...>
Date: Tuesday, January 30, 2007 - 8:44 pm

Agreed, but don't we (xfs_iozero) have to map it first to zero it?

I think what you are saying here, Andrew, is that we can
do something like:

page = grab_cache_page
->prepare_write(page)
kaddr = kmap_atomic(page, KM_USER0)
memset(kaddr+offset, 0, bytes)
flush_dcache_page(page)
kunmap_atomic(kaddr, KM_USER0)
->commit_write(page)

to avoid using kmap() altogether?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

To: David Chinner <dgc@...>
Cc: Martin J. Bligh <mbligh@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Tuesday, January 30, 2007 - 9:11 pm

On Wed, 31 Jan 2007 11:44:36 +1100

Yup. Even better, use clear_highpage().
-

To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, Martin J. Bligh <mbligh@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Tuesday, January 30, 2007 - 11:22 pm

For even more goodness, clearmem_highpage_flush() does exactly
the right thing for partial page zeroing ;)

Thanks, Andrew, I've added a patch to my QA tree with this mod.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

To: David Chinner <dgc@...>
Cc: Andrew Morton <akpm@...>, Martin J. Bligh <mbligh@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Friday, February 2, 2007 - 8:05 am

Note that there are tons of places in buffer.c that could use
clearmem_highpage_flush(). See the so far untested patch below:

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2007-02-02 12:53:51.000000000 +0100
+++ linux-2.6/fs/buffer.c 2007-02-02 12:59:42.000000000 +0100
@@ -1858,13 +1858,8 @@
if (block_start >= to)
break;
if (buffer_new(bh)) {
- void *kaddr;
-
clear_buffer_new(bh);
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr+block_start, 0, bh->b_size);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, block_start, bh->b_size);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
}
@@ -1952,10 +1947,8 @@
SetPageError(page);
}
if (!buffer_mapped(bh)) {
- void *kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + i * blocksize, 0, blocksize);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, i * blocksize,
+ blocksize);
if (!err)
set_buffer_uptodate(bh);
continue;
@@ -2098,7 +2091,6 @@
long status;
unsigned zerofrom;
unsigned blocksize = 1 << inode->i_blkbits;
- void *kaddr;

while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
status = -ENOMEM;
@@ -2120,10 +2112,8 @@
PAGE_CACHE_SIZE, get_block);
if (status)
goto out_unmap;
- kaddr = kmap_atomic(new_page, KM_USER0);
- memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom);
- flush_dcache_page(new_page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, zerofrom,
+ PAGE_CACHE_SIZE - zerofrom);
generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
unlock_page(new_page);
page_cache_release(new_page);
@@ -2150,10 +2140,7 @@
if (status)
goto out1;
if (zerofrom < offset) {
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr+zerofrom, ...

To: Christoph Hellwig <hch@...>, David Chinner <dgc@...>, Andrew Morton <akpm@...>, Martin J. Bligh <mbligh@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Friday, February 2, 2007 - 7:14 pm

Runs through XFSQA just fine. Looks good to me, Christoph.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

To: Christoph Hellwig <hch@...>, David Chinner <dgc@...>, Andrew Morton <akpm@...>, Martin J. Bligh <mbligh@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Friday, February 2, 2007 - 3:24 pm

--
Mathematics is the supreme nostalgia of our time.
-

To: Matt Mackall <mpm@...>
Cc: Christoph Hellwig <hch@...>, David Chinner <dgc@...>, Andrew Morton <akpm@...>, Martin J. Bligh <mbligh@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Friday, February 2, 2007 - 7:16 pm

Not needed - as usual, the code is right and the comments
are wrong. ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

To: Andrew Morton <akpm@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Ingo Molnar <mingo@...>
Date: Sunday, January 28, 2007 - 10:52 pm

Simple: we should not implement cmpxchg in generic code. We should
be able to use atomic_long_cmpxchg for this -- it will work perfectly
regardless of what anybody else tells you.

cmpxchg is only required for when that memory location may get modified
by some other means than under your control (eg. userspace, in the case
of drm, or hardware MMU in the case of Christoph's old page fault
scalability patches).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>
Date: Sunday, January 28, 2007 - 10:49 am

What's the point for this? Extensive atomic_t use is usually much worse
than spinlocks. A spinlock region is just a single atomic instruction,
as soon as you do more than one atomic_t you tend to make scalability
worse. Not to mention that atomic_t are much worse when you try to
profile scalability issues.

What benchmark shows a problem with the current locking, and from what
---end quoted text---
-

To: Christoph Hellwig <hch@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Andrew Morton <akpm@...>
Date: Sunday, January 28, 2007 - 11:17 am

scalability. I did lock profiling on the -rt kernel, which exposes such
things nicely. Half of the lock contention events during kernel compile
were due to kmap(). (The system had 2 GB of RAM, so 40% lowmem, 60%

the pagecache ones cannot be converted to kmap_atomic, because we can
block while holding them. Plus kmap_atomic is quite a bit slower than
this scalable version of kmap().

Ingo

ps. please fix your mailer to not emit Mail-Followup-To headers. In Mutt
you can do this via "set followup_to=no" in your .muttrc.
-

To: Ingo Molnar <mingo@...>
Cc: Christoph Hellwig <hch@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Andrew Morton <akpm@...>
Date: Sunday, January 28, 2007 - 11:28 am

I have told you last time that this is absolutely intentional and I won't
change it.
-

To: Christoph Hellwig <hch@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Sunday, January 28, 2007 - 11:48 am

i'm sorry, but do you realize that files_lock is a global lock,
triggered by /every single/ file close?

" files_lock is a global lock and we touch it for every single
sys_close() system call that the system does. "

You really dont need to be a rocket scientist to see that it's a
globally bouncing cacheline that has a major effect on certain
VFS-intense workloads. Peter has worked hard to eliminate its effects
without having to couple this to an intrusive rewrite of the TTY layer.

( really, i personally find your dismissive style apalling and i think
such a reception of a nice patchset must be humiliating to Peter. I
certainly try to avoid to be involved with any VFS internals, due to
this unwelcoming tone of discussion. Had you been around when i
started contributing to the Linux kernel i'd probably not be hacking
the kernel today. You are a good hacker but the simultaneous
collateral damage you are causing is significantly reducing the net

( You are messing up the reply headers, everyone is listed in the 'To:'
field for any reply to your mail, instead of being added to the Cc:
list. )

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Christoph Hellwig <hch@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Sunday, January 28, 2007 - 11:54 am

Please check which thread you're in before you start such lengthy rants.

-

To: Christoph Hellwig <hch@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Sunday, January 28, 2007 - 2:19 pm

my reply applies to the other thread too, you made a similar comment
there too:

Ingo
-

Previous thread: Re: git.kernel.org move (finally)... estimated week of Feb 5 by Oleg Verych on Sunday, January 28, 2007 - 9:37 am. (1 message)

Next thread: [Fwd: [PATCH 2/7] lock_list: a fine grain locked double linked list] by Peter Zijlstra on Sunday, January 28, 2007 - 11:53 am. (4 messages)