Re: [rfc] increase struct page size?!

Previous thread: [PATCH] timerfd/eventfd context lock doesn't protect against poll_wait by Davi Arnaut on Thursday, May 17, 2007 - 8:45 pm. (4 messages)

Next thread: [PATCH -rt] futex_performance_hack sysctl build fix by Paul Mundt on Thursday, May 17, 2007 - 9:45 pm. (1 message)
From: Nick Piggin
Date: Thursday, May 17, 2007 - 9:08 pm

Hi,

I'd like to be the first to propose an increase to the size of struct page
just for the sake of increasing it!

If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
which is quite a nice number for cache purposes.

However we don't have to let those 8 bytes go to waste: we can use them
to store the virtual address of the page, which kind of makes sense for
64-bit, because they can likely to use complicated memory models.

I'd say all up this is going to decrease overall cache footprint in 
fastpaths, both by reducing text and data footprint of page_address and
related operations, and by reducing cacheline footprint of most batched
operations on struct pages.

Flame away :)

--

Many batch operations on struct page are completely random, and as such, I
think it is better if each struct page fits completely into a single
cacheline even if it means being slightly larger.

Don't let this space go to waste though, we can use page->virtual in order
to optimise page_address operations.

Interestingly, the irony of 32-bit architectures setting WANT_PAGE_VIRTUAL
because they have slow multiplications is that without WANT_PAGE_VIRTUAL, the
struct is 32-bytes and so page_address can usually be calculated with a shift.
So WANT_PAGE_VIRTUAL just bloats up the size of struct page for those guys!


Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -9,6 +9,14 @@
 struct address_space;
 
 /*
+ * WANT_PAGE_VIRTUAL on 64-bit machines gives a nice 64 byte alignment,
+ * so a struct page will fit entirely into a cacheline on modern CPUs.
+ */
+#if BITS_PER_LONG == 64
+# define WANT_PAGE_VIRTUAL
+#endif
+
+/*
  * Each physical page in the system has a struct page associated with
  * it to keep track of whatever it is we are using the page for at the
  * moment. Note that we have no way to track which tasks are ...
From: David Miller
Date: Thursday, May 17, 2007 - 9:47 pm

From: Nick Piggin <npiggin@suse.de>

I've toyed with this several times on sparc64, and in my experience
the extra memory reference on page->virtual costs on average about the
same as the non-power-of-2 pointer arithmetic.

The decision is absolutely arbitrary performance wise, but if you
consider the memory wastage on enormous systems going without
page->virtual I think is clearly better.
-

From: Nick Piggin
Date: Thursday, May 17, 2007 - 10:12 pm

Of course it is likely to be in the same cacheline, however if your L1
cache latency or throughput simply isn't up to it, FLATMEM systems
definitely could just not use ->virtual, but still add the extra padding
in the struct page for performance reasons... then they get to use
power-of-2 arithmetic to boot.

The page->virtual thing is just a bonus (although have you seen what
sort of hoops SPARSEMEM has to go through to find page_address?! It

0.2% of memory, or 2MB per GB. But considering we already use 14MB per
GB for the page structures, it isn't like I'm introducing an order of
magnitude problem.

The real benefit I see comes from cache footprint reduction of operations
on struct page. Consider with a 64 byte cacheline and 56 byte struct page,
then every 8 struct pages, 6 span 2 cachelines (75%). If you consider an
operation like reclaim that uses first and last fields, and assume that
pages are going to end up being random, then you're going to touch 75%
more cachelines.  Ditto for allocating and freeing pages.

-

From: David Miller
Date: Thursday, May 17, 2007 - 10:22 pm

From: Nick Piggin <npiggin@suse.de>

If you set the bit ranges in asm/sparsemem.h properly, as I
have currently on sparc64, it isn't bad at all.  It's a
single extra dereference from a table that sits in the main
kernel image and thus is in a locked TLB entry.

SPARSEMEM_EXTREME is pretty much unnecessary and with the
virtual mem-map stuff the sparsemem overhead goes away entirely
and we're back to "page - mem_map" type simple calculations

All these little things add up, let's not suck like some other
OSs by having that kind of mentality.

-

From: Nick Piggin
Date: Thursday, May 17, 2007 - 10:31 pm

Sure, but you'd still like to save several KB of icache by doing

They all do add up, but this isn't just wasting memory for no reason,
it is to make much better use of CPU caches. Back when PCs had only
a couple of MB of memory, size-speed optimisations were all the rage
because you had enough memory to throw around on big lookup tables and
such... that's only gone away because the cache cost hurts.

But this is one such size/speed tradeoff that actually should make
better use of the cache. Obviously extensive benchmarks are needed,
but I don't think it should be dismissed.

If you have a big problem with struct page overhead, cutting 8 bytes
off it isn't going to make you much happier -- you need to increase
PAGE_SIZE to get some real order-of-magnitude savings.

-

From: Christoph Lameter
Date: Friday, May 18, 2007 - 11:15 am

That is on the way out. See the discussion on virtual memmap support in 
sparseme.
-

From: Andrew Morton
Date: Friday, May 18, 2007 - 12:19 am

But they shouldn't be: we should aim to place physically contiguous pages
into logically contiguous pagecache slots, for all the reasons we
discussed.

If/when that happens, there will be a *lot* of locality of reference
against the pageframes in a lot of important codepaths.

-

From: Nick Piggin
Date: Friday, May 18, 2007 - 12:32 am

For big IO batch operations, pagecache would be more likely to be
physically contiguous, as would LRU, I suppose.

I'm more thinking of operations where things get reclaimed over time,
touched or dirtied in slightly different orderings, interleaved with

And when it doesn't happen, we eat 75% more cache misses. And for that
matter we eat 75% more cache misses for non-batch operations like
allocating or freeing a page by slab, for example.

-

From: Andrew Morton
Date: Friday, May 18, 2007 - 12:43 am

Yes, that can happen.  But in such cases we by definition aren't touching
the pageframes very often.  I'd assert that when the kernel is really
hitting those pageframes hard, it is commonly doing this in ascending

"measure twice, cut once"
-

From: Nick Piggin
Date: Friday, May 18, 2007 - 12:59 am

Of course, but if you're doing them on random-ish ranges, or multiple
files, or continually on the same file while parts of it get reclaimed

I'm not sure that I would always agree. Sure if there is random *IO*
involved, then it is going to be slow and we won't be hitting page frames
so hard. But if there is random pagecache access, it will hit them almost

Definitely agree there.

-

From: David Howells
Date: Friday, May 18, 2007 - 2:42 am

Whilst that's true, if you have to deal with a run of contiguous page structs
(eg: the page allocator, perhaps) it's actually less efficient because it
takes more cache to do it.  But, hey, it's a compromise whatever.

In the scheme of things, if we're mostly dealing with individual page structs
(as I think we are), then yes, I think it's probably a good thing to do -

That's a good idea, one that's implemented on some platforms anyway.  It'll be

kmap, filling in scatter/gather lists, crypto stuff.  I like it.

Can you do this just by turning on WANT_PAGE_VIRTUAL on all 64-bit platforms?

David
-

From: Nick Piggin
Date: Friday, May 18, 2007 - 6:30 pm

Yeah, we would end up eating about 12.5% more cachelines for contiguous
runs of pages... but that only kicks in after we've touched 8 of them I
think, and by that point the accesses should be very prefetchable.

I think the average of 75% more cachelines touched for random accesses
is going to outweigh the contiguous batch savings, but that's just a
guess at this point.
-

From: Andi Kleen
Date: Friday, May 18, 2007 - 5:06 am

I suspect the cache line footprint is not the main problem here (talking about
only one other cache line), but the potential latency of fetching the other 
half. One possible alternative instead of increasing struct page would be to 
identify places that commonly touch a page first (e.g. using oprofile) and 
then always add a prefetch()  there to fetch the other half of the page 
early. 

prefetch on something that is already in cache should be cheap,
so for the structs that don't straddle cachelines it shouldn't be a big
overhead.

I don't think doing the ->virtual addition will buy very much,
because at least the 64bit architectures will probably move
towards vmemmap where pfn->virt is quite cheap.

Of course the real long term fix for struct page cache overhead
would be larger soft page size.

-Andi
-

From: Hugh Dickins
Date: Friday, May 18, 2007 - 8:42 am

Sooner rather than later, don't we need those 8 bytes to expand from
atomic_t to atomic64_t _count and _mapcount?  Not that we really need
all 64 bits of both, but I don't know how to work atomically with less.

(Why do I have this sneaking feeling that you're actually wanting
to stick something into the lower bits of page->virtual?)

Hugh
-

From: Nick Piggin
Date: Friday, May 18, 2007 - 6:22 pm

No, I just thought I would get even more flamed if I was just adding
padding that wasn't even doing _anything_ :) I remembered Ken had a
benchmark where page_address was slow, and the rest is history...
-

From: William Lee Irwin III
Date: Saturday, May 19, 2007 - 10:53 am

I wonder how close we get to overflow on ->_mapcount and ->_count.
(untested/uncompiled).


-- wli


Index: mm-2.6.21/include/linux/mm.h
===================================================================
--- mm-2.6.21.orig/include/linux/mm.h	2007-05-19 10:17:17.682653270 -0700
+++ mm-2.6.21/include/linux/mm.h	2007-05-19 10:38:52.376433663 -0700
@@ -248,6 +248,24 @@
  * routine so they can be sure the page doesn't go away from under them.
  */
 
+static inline struct page *compound_head(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		return page->first_page;
+	return page;
+}
+
+static inline int page_count(struct page *page)
+{
+	return atomic_read(&compound_head(page)->_count);
+}
+
+#ifdef CONFIG_ATOMIC_HIWATER
+unsigned long count_hiwater(void);
+int put_page_testzero(struct page *);
+int get_page_unless_zero(struct page *);
+void get_page(struct page *);
+#else /* !CONFIG_ATOMIC_HIWATER */
 /*
  * Drop a ref, return true if the refcount fell to zero (the page has no users)
  */
@@ -267,24 +285,13 @@
 	return atomic_inc_not_zero(&page->_count);
 }
 
-static inline struct page *compound_head(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return page->first_page;
-	return page;
-}
-
-static inline int page_count(struct page *page)
-{
-	return atomic_read(&compound_head(page)->_count);
-}
-
 static inline void get_page(struct page *page)
 {
 	page = compound_head(page);
 	VM_BUG_ON(atomic_read(&page->_count) == 0);
 	atomic_inc(&page->_count);
 }
+#endif /* !CONFIG_ATOMIC_HIWATER */
 
 static inline struct page *virt_to_head_page(const void *x)
 {
Index: mm-2.6.21/mm/Makefile
===================================================================
--- mm-2.6.21.orig/mm/Makefile	2007-05-18 09:58:43.851524250 -0700
+++ mm-2.6.21/mm/Makefile	2007-05-18 09:58:59.484415118 -0700
@@ -31,4 +31,4 @@
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += ...
From: Matthew Wilcox
Date: Sunday, May 20, 2007 - 3:50 pm

I think the problem is that an attacker can deliberately overflow
->_count, not that it can happen innocuously.  By mmaping, say, the page
of libc that contains memcpy() several million times, and forking
enough, can't you make ->_mapcount hit 0?  I'm not a VM guy, I just
vaguely remember people talking about this before.
-

From: Christoph Lameter
Date: Friday, May 18, 2007 - 11:14 am

That is not a valid consideration anymore. There is virtual memmap update 


page->virtual is a benefit if the page is cache hot. Otherwise it may 
cause a useless lookup.

I wonder if there are other uses for the free space?

-

From: Luck, Tony
Date: Friday, May 18, 2007 - 1:37 pm

> I wonder if there are other uses for the free space?

	unsigned long moreflags;

Nick and Hugh were just sparring over adding a couple (or perhaps 8)
flag bits.  This would supply 64 new bits ... maybe that would keep
them happy for a few more years.

-Tony
-

From: KAMEZAWA Hiroyuki
Date: Sunday, May 20, 2007 - 11:28 pm

On Fri, 18 May 2007 13:37:09 -0700
- page->zone 
  free some flags bits and makes page_zone() simple.
  and software (fake) zone for memory control can be added ?
or 

-page->some_memory_controler ?
(I don't know whether resource controller people want this or not.)

-Kame

-

From: Nick Piggin
Date: Friday, May 18, 2007 - 6:25 pm

It isn't the calculations I'm worried about, although they'll get simpler

It would be very rare for the page not to be in L1 cache at this point,
because we've likely taken a reference on it and/or locked it or moved it

Hugh points out that we should make _count and _mapcount atomic_long_t's,
which would probably be a better use of the space once your vmemmap goes
in.
-

From: Christoph Lameter
Date: Friday, May 18, 2007 - 7:03 pm

Well Andy was going to merge it:

http://marc.info/?l=linux-kernel&m=117620162415620&w=2

Andy when are we going to get the vmemmap patches into sparsemem?

-

From: Andy Whitcroft
Date: Saturday, May 19, 2007 - 8:43 am

Sorry this has been backed up with all the too-ing and fro-ing on other
things.  I am just cleaning up the next round which includes feedback
from wli and first stab at PPC64 support.  Should be out monday for review.

-apw
-

From: William Lee Irwin III
Date: Saturday, May 19, 2007 - 11:15 am

The cache cost argument is specious. Even misaligned, smaller is
smaller. The cache footprint reduction is merely amortized,
probabilistic, etc.




I'm not so sure about that. I doubt we have issues with that. I say
if there's to be padding to 64B to use the of the whole additional
space for additional flag bits. I'm sure fs's could make good use of
64 spare flag bits, or whatever's left over after the VM has its fill.
Perhaps so many spare flag bits could be used in lieu of buffer_heads.

page->virtual is the same old mistake as it was when it was removed.
The virtual mem_map code should be used to resolve the computational
expense. Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is
44 bits or more, about as much as is possible, and one reference per
page per page is not even feasible. Full-length atomic_t's are just
not necessary.

However, there are numerous optimizations and features made possible
with flag bits, which might as could be made cheap by padding struct
page up to the next highest power of 2 bytes with space for flag bits.


-- wli
-

From: Christoph Lameter
Date: Saturday, May 19, 2007 - 11:25 am

Well the last time I tried to get this by Andi we became a bit concerned 
when we realized that the memory map would grow by 14% in size. Given 
that 4k page size challenged platforms have a huge amount of page structs 
that growth is significant. I think it would be fine to do it for IA64 
with 16k page size but not for x86_64.


-

From: Eric Dumazet
Date: Saturday, May 19, 2007 - 9:10 pm

This reminds me Andi attempted in the past to convert 'flags' to a 32 bits field :

http://marc.info/?l=linux-kernel&m=107903527523739&w=2

I wonder why this idea was not taken, saving 2MB per GB of memory is nice :)

-

From: Andi Kleen
Date: Sunday, May 20, 2007 - 5:56 am

It made sense in 2.4, but in 2.6 it doesn't actually save any memory because
there is no field to put into the freed padding.

Besides with the scarcity of pageflags it might make sense to do "64 bit only"
flags at some point.

-Andi
 
-

From: Christoph Lameter
Date: Monday, May 21, 2007 - 10:08 am

There is no scarcity of page flags. There is

1. Hoarding by Andrew

2. Waste by Sparsemem (section flags no longer necessary with
   virtual memmap)

2 will hopefully be addressed soon and with that 1 will go away.

-

From: KAMEZAWA Hiroyuki
Date: Monday, May 21, 2007 - 5:30 pm

On Mon, 21 May 2007 10:08:06 -0700 (PDT)

For i386(32bit arch), there is not enough space for vmemmap.
For 64bit arch, page flags are not exhausted yet.

-Kame


-

From: Christoph Lameter
Date: Monday, May 21, 2007 - 5:38 pm

I thought 32 bit would use flatmem? Is memory really sparse on 32 

Right.

-

From: KAMEZAWA Hiroyuki
Date: Monday, May 21, 2007 - 5:58 pm

On Mon, 21 May 2007 17:38:58 -0700 (PDT)

Of course, i386 can use flatmem.

I am just afraid that memory hotplug is just for sprasemem.
But I also think we can add memory-hotplug for flatmem if necessary.
(I myself have no plan now. I wonder memory power-save-mode may be supported
 by chipsets.)

-Kame

-

From: Geert Uytterhoeven
Date: Tuesday, May 22, 2007 - 2:44 am

Throwing in more crazy comments: many m68k boxes have really sparse memory, due
to lack of memory and large address space.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds
-

From: Andrew Morton
Date: Saturday, May 19, 2007 - 3:09 pm

You can overflow a page's refcount by mapping it 4G times.  That requires
32GB of pagetable memory.  It's quite feasible with remap_file_pages().

-

From: William Lee Irwin III
Date: Sunday, May 20, 2007 - 12:26 am

Oh dear, worst-case app behavior. I'm just wrong.


-- wli
-

From: Helge Hafting
Date: Monday, May 21, 2007 - 2:12 am

But do anybody ever need to do that?
Such an attack is easily thwarted by refusing to map it more
than, say 3G times? 

Helge Hafting
-

From: Nick Piggin
Date: Monday, May 21, 2007 - 2:45 am

That still allows you to DoS the page.

-

From: Nick Piggin
Date: Saturday, May 19, 2007 - 10:22 pm

Of course smaller is smaller ;) Why would that make the cache cost

I don't really know what you mean by this, or what part of my cache cost
argument you disagree with...

I think it is that you could construct mem_map access patterns, without
specifically looking at alignment, where a 56 byte struct page would suffer
about 75% more cache misses than a 64 byte aligned one (and you could also
get about 12% fewer cache misses with other access patterns).

I also think the kernel's mem_map access patterns would be more on the
random side, so overall would result in significantly fewer cache misses
with 64 byte aligned pages.


The issue is that userspace can DOS or crash the kernel by deliberately

Really? 64-bit architectures can already use about maybe 16 or 32 more
page flag bits than 32-bit architectures, and I definitely do not want
to increase the size of 32-bit struct page, so I think this wouldn't

Don't get too hung up on the page->virtual thing. I'll send another

I don't know what your 32 + PAGE_SHIFT calculation is for, but yes you
can wrap these counters from userspace on 64-bit architectures.

-

From: William Lee Irwin III
Date: Sunday, May 20, 2007 - 1:46 am

It's not possible to ignore aggregation. For instance, for a subset
of mem_map whose size ignoring alignment would otherwise fit in the
cache to completely avoid sharing any cachelines between page
structures requires page structures to be separated by at least one
mem_map index. This is highly unlikely in uniform distributions.




The lack of consideration of the average case. I'll see what I can smoke
out there.




This was a flat out error.




Actually they can't use most of those flag bits on account of
portability to the 32-bit case. A 32-bit flags on 64-bit is rather
plausible due to such.




That's fine.




That's just an error.


-- wli
-

From: Nick Piggin
Date: Sunday, May 20, 2007 - 2:25 am

But that wasn't my argument. I _know_ there are cases where the smaller
struct would be better, and I'm sure they would even arise in a running

I _am_ considering the average case, and I consider the aligned structure
is likely to win on average :) I just don't have numbers for it yet.


-

From: William Lee Irwin III
Date: Monday, May 21, 2007 - 1:08 am

Choosing k distinct integers (mem_map array indices) from the interval
[0,n-1] results in k(n-k+1)/n non-adjacent intervals of contiguous
array indices on average. The average interval length is
(n+1)/(n-k+1) - 1/C(n,k). Alignment considerations make going much
further somewhat hairy, but it should be clear that contiguity arising
from random choice is non-negligible.

In any event, I don't have all that much of an objection to what's
actually proposed, just this particular cache footprint argument.
One can motivate increases in sizeof(struct page), but not this way.

Now that I've been informed of the ->_count and ->_mapcount issues,
I'd say that they're grave and should be corrected even at the cost
of sizeof(struct page).


-- wli

Many thanks to int-e on EfNet #math for help with the calculations
(perhaps better described as doing them outright).
Heavily-edited IRC log (using Knuth's conventions for M, N, and k as
	the number of runs):

<int-e:#math> wli: oh maybe this can be solved exactly after all. The number
+of configurations of N numbers out of M with exactly k runs is C(N-1, k-1) *
+C(M-N+1, k). When there are k runs, the average run length is N/k, obviously.
<int-e:#math> wli: assume there are k runs. add an empty dummy element at the
+end and at the front - then you have (k+1) empty runs between the k runs.
+Every run has positive length. the empty runs correspond to a partition of
+M-N+2 into k+1 positive numbers, and the occupied runs correspond to one of N
+into k positive numbers, which gives that formula.
<int-e:#math> wli: So the average is 1/C(M,N) * sum[k=1 to N] N/k C(N-1,k-1)
+C(M-N+1,k) = 1/C(M,N) * sum[k=1 to N] C(N,k) C(M-N+1,k) = 1/C(M,N) * (C(M+1,
+N) - 1) = (M+1)/(M-N+1) - 1/C(M,N).
-

From: Nick Piggin
Date: Monday, May 21, 2007 - 2:27 am

Realise that you have to have a run of I think at least 7 or 8 contiguous
pages and temporally close references in order to save a single cacheline.

Then also that if the page being touched is not partially in cache from
an earlier access, then it is statistically going to cost more lines to
touch it (up to 75% if you touch the first and the last field, obviously 0%
if you only touch a single field, but that's unlikely given that you
usually take a reference then do at least something else like check flags).

I think the problem with the cache footprint argument is just whether

... yeah, something like that would bypass 
-

From: William Lee Irwin III
Date: Monday, May 21, 2007 - 4:26 am

It doesn't need to. If what's in the cache is uniformly distributed,
you get that result for spatial locality. From there, it's counting
cachelines.




The average interval ("run") length is (n+1)/(n-k+1) - 1/C(n,k), so for
that to be >= 8 you need (n+1)/(n-k+1) - 1/C(n,k) >= 8 which also happens
when (n+1)/(n-k+1) >= 9 or when n >= (9/8)*k - 1 or k <= (8/9)*(n+1).
Clearly a lower bound on k is required, but not obviously derivable.
k >= 8 is obvious, but the least k where (n+1)/(n-k+1) - 1/C(n,k) >= 8
is not entirely obvious. Numerically solving for the least such k finds
that k actually needs to be relatively close to (8/9)*n. A lower bound
of something like 0.87*n + O(1) probably holds.




Did you get cut off here?


-- wli
-

From: Nick Piggin
Date: Monday, May 21, 2007 - 5:52 pm

OK, so your 'k' is the number of struct pages that are in cache? Then
that's fine.

I'm not sure how many that is going to be, but I would be surprised if
it were a significant proportion of mem_map, even on not-so-large

Ah, you worked it out... yeah I'd guess this is going to be pretty difficult
a condition to satisfy (given that it isn't possible for a 4GB system, even

Must have. I was going to say it would bypass the whole speed/size
discussion anyway :P
-

From: Matt Mackall
Date: Monday, May 21, 2007 - 3:43 pm

As long as we're throwing out crazy unpopular ideas, try this one:

Divide struct page in two such that all the most commonly used
elements are in one piece that's nicely sized and the rest are in
another. Have two parallel arrays containing these pieces and accessor
functions around the unpopular bits.

Whether a sensible divide between popular and unpopular bits isn't
clear to me. But hey, I said it was crazy.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Nick Piggin
Date: Monday, May 21, 2007 - 6:08 pm

That would be unpopular with pagecache, because that uses pretty well
all fields.

-

From: Christoph Lameter
Date: Monday, May 21, 2007 - 6:13 pm

SLUB also uses all fields....

-

From: William Lee Irwin III
Date: Monday, May 21, 2007 - 6:39 pm

I have a crazier and even less popular idea. Eliminate struct page
entirely as an accounting structure (and, of course, mem_map with it).
Filesystems can keep the per-page metadata they need in their own
accounting structures, slab mutatis mutandis, etc. The brilliant bit
here is that devolving the accounting structures this way allows the
fs and/or subsystem to arrange for strong cache locality, file offset
adjacency to imply memory adjacency of the page accounting fields,
etc., where grabbing random structures out of some array is a real
cache thrasher.

The page allocation and page replacement algorithms would have to be
adjusted, and things would have to allocate their own refcounts,
supposing they want/need refcounts, but it's not so far out. Refer to
filesystem pages by <mapping, index> pairs, refer to slab pages by
address (virtual and physical are trivially inter-convertible), mock
up something akin to what filesystems do for anonymous pages, etc.

The real objection everyone's going to have is that driver writers
will stain their shorts when faced with the rules for handling such
things. The thing is, I'm not entirely sure who these driver writers
that would have such trouble are, since the driver writers I know
personally are sophisticates rather than walking disaster areas as such
would imply. I suppose they may not be representative of the whole.


-- wli

P.S. This idea is not plucked out of the air; it has precedents. A
number of microkernels do this, and IIRC k42 does so also.
-

From: Nick Piggin
Date: Monday, May 21, 2007 - 6:57 pm

BTW. I think the filesystem APIs (at least the VM-side ones) should be
doing this anyway (not even index, but offset). Passing things like
lists of pages around is just horrible. See my write_begin/write_end
and perform_write aops for (what I think is) a step in the right

That's not the objection I would have. I would say that firstly, I
don't think the mem_map overhead is very significant (at any rate,
an allocated-on-demand metadata is not going to be any smaller if
you fill up on pagecache...). Secondly, I think there is merit to
having the same page metadata used by the major subsystems, because
it helps for locality of reference.

But I haven't explored the idea enough myself to know whether there
would be any really killer benefits to this. Delayed metadata freeing
via RCU without holding up the freeing of the actual page would have
been something, however I can do similar with speculative references
now (or whenever the code gets merged), which doesn't even require the

Psst, just say "kernels" when you mention this to Linus ;)
-

From: William Lee Irwin III
Date: Monday, May 21, 2007 - 10:04 pm

The size isn't the advantage being cited; I'd actually expect the net
result to be larger. It's the control over the layout of the metadata
for cache locality and even things like having enough flags, folding
buffer_head -like affairs into the per-page metadata for filesystems
and so reaping cache locality benefits even there (assuming it works
out in other respects), and so on.

Passing pages between subsystems doesn't seem very significant to me.
There isn't going to be much locality of reference, or even any
guarantee that the subsystem gets fed a cache hot page structure. The
subsystem being passed the page will have its own cache hot accounting
structures to stick the information about the memory into.



I'm not entirely sure what you're on about there, but it sounds
interesting.


-- wli
-

From: Eric Dumazet
Date: Monday, May 21, 2007 - 2:31 am

On Mon, 21 May 2007 01:08:13 -0700


As long we handle 4 KB pages, adding 64 bits per page means 0.2 % of overhead. Ouch...

We currently have an overhead of 1.36 % for mem_map

Maybe we can still use 32 bits counters, and make sure non root users cannot
make these counters exceed 2^30. (I believe high order bit has already a meaning, 
check page_mapped() definition)

We could use a special atomic_inc_if_not_huge() function, that could revert to
 normal atomic_inc() on machines with less than 32 GB (using alternative_() variant)

On small setups (or 32 bits arches), atomic_inc_if_not_huge() would unconditionnally 
increment the counter.

#if !defined(BIG_MACHINES)
static int inline atomic_inc_if_not_huge(atomic_t *v)
{
atomic_inc(v);
return 1;
}
#else
extern int atomic_inc_if_not_huge(atomic_t *v);

#endif


/* in a .c file */
/* could be patched at boot time if available memory < 32GB (or other limit) */
#if defined(BIG_MACHINES)
#define MAP_LIMIT_COUNT (2<<30)
int atomic_inc_if_not_huge(atomic_t *v);
{
/* lazy test, we dont care enough to do a real atomic read-modify-write */
if (unlikely(atomic_read(v) >= MAP_LIMIT_COUNT)) {
	if (non_root_user())
		return 0;
	}
atomic_inc(v);
return 1;
}
#endif
-

From: Christoph Lameter
Date: Monday, May 21, 2007 - 10:06 am

I'd be glad too if you could get some numbers. I did some benchmarking a 
few weeks ago on x86_64 and I found only a very minimal performance drop 
if the calculation was simplified. 

Note also that a smaller structure means that more page structs can be 
covered by a certain amount of cachelines. Doing the alignment may cause 
more cacheline misses.

-

From: Andrea Arcangeli
Date: Sunday, May 20, 2007 - 10:13 am

We had those hardware alignment for many data structures where they
were only wasting memory (i.e. vmas).

There are few places where the hardware alignment matters, page struct

IIRC the math is faster for any x86. Overall I doubt the change is
measurable.

Even if this would be a microoptimization barely measurable in some
microbenchmark, I don't think this one is worth doing. mem_map is such
a bloat that it really has to be as small as it can unless we can

If you want to drop it you can, there's nothing fundamental that
prevents you to drop the 'virtual' completely from page struct, by
just making the vaddr per-process and storing it on the stack like
with the atomic kmaps, but passing it up the stack may require heavy
changes to various apis, which is why we've taken the few-changes lazy
way back then. If it wasn't worth back then, I doubt it worth now for
just pae36.
-

Previous thread: [PATCH] timerfd/eventfd context lock doesn't protect against poll_wait by Davi Arnaut on Thursday, May 17, 2007 - 8:45 pm. (4 messages)

Next thread: [PATCH -rt] futex_performance_hack sysctl build fix by Paul Mundt on Thursday, May 17, 2007 - 9:45 pm. (1 message)