Hi, I'd like to be the first to propose an increase to the size of struct page just for the sake of increasing it! If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes, which is quite a nice number for cache purposes. However we don't have to let those 8 bytes go to waste: we can use them to store the virtual address of the page, which kind of makes sense for 64-bit, because they can likely to use complicated memory models. I'd say all up this is going to decrease overall cache footprint in fastpaths, both by reducing text and data footprint of page_address and related operations, and by reducing cacheline footprint of most batched operations on struct pages. Flame away :) -- Many batch operations on struct page are completely random, and as such, I think it is better if each struct page fits completely into a single cacheline even if it means being slightly larger. Don't let this space go to waste though, we can use page->virtual in order to optimise page_address operations. Interestingly, the irony of 32-bit architectures setting WANT_PAGE_VIRTUAL because they have slow multiplications is that without WANT_PAGE_VIRTUAL, the struct is 32-bytes and so page_address can usually be calculated with a shift. So WANT_PAGE_VIRTUAL just bloats up the size of struct page for those guys! Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h +++ linux-2.6/include/linux/mm_types.h @@ -9,6 +9,14 @@ struct address_space; /* + * WANT_PAGE_VIRTUAL on 64-bit machines gives a nice 64 byte alignment, + * so a struct page will fit entirely into a cacheline on modern CPUs. + */ +#if BITS_PER_LONG == 64 +# define WANT_PAGE_VIRTUAL +#endif + +/* * Each physical page in the system has a struct page associated with * it to keep track of whatever it is we are using the page for at the * moment. Note that we have no way to track which tasks are ...
From: Nick Piggin <npiggin@suse.de> I've toyed with this several times on sparc64, and in my experience the extra memory reference on page->virtual costs on average about the same as the non-power-of-2 pointer arithmetic. The decision is absolutely arbitrary performance wise, but if you consider the memory wastage on enormous systems going without page->virtual I think is clearly better. -
Of course it is likely to be in the same cacheline, however if your L1 cache latency or throughput simply isn't up to it, FLATMEM systems definitely could just not use ->virtual, but still add the extra padding in the struct page for performance reasons... then they get to use power-of-2 arithmetic to boot. The page->virtual thing is just a bonus (although have you seen what sort of hoops SPARSEMEM has to go through to find page_address?! It 0.2% of memory, or 2MB per GB. But considering we already use 14MB per GB for the page structures, it isn't like I'm introducing an order of magnitude problem. The real benefit I see comes from cache footprint reduction of operations on struct page. Consider with a 64 byte cacheline and 56 byte struct page, then every 8 struct pages, 6 span 2 cachelines (75%). If you consider an operation like reclaim that uses first and last fields, and assume that pages are going to end up being random, then you're going to touch 75% more cachelines. Ditto for allocating and freeing pages. -
From: Nick Piggin <npiggin@suse.de> If you set the bit ranges in asm/sparsemem.h properly, as I have currently on sparc64, it isn't bad at all. It's a single extra dereference from a table that sits in the main kernel image and thus is in a locked TLB entry. SPARSEMEM_EXTREME is pretty much unnecessary and with the virtual mem-map stuff the sparsemem overhead goes away entirely and we're back to "page - mem_map" type simple calculations All these little things add up, let's not suck like some other OSs by having that kind of mentality. -
Sure, but you'd still like to save several KB of icache by doing They all do add up, but this isn't just wasting memory for no reason, it is to make much better use of CPU caches. Back when PCs had only a couple of MB of memory, size-speed optimisations were all the rage because you had enough memory to throw around on big lookup tables and such... that's only gone away because the cache cost hurts. But this is one such size/speed tradeoff that actually should make better use of the cache. Obviously extensive benchmarks are needed, but I don't think it should be dismissed. If you have a big problem with struct page overhead, cutting 8 bytes off it isn't going to make you much happier -- you need to increase PAGE_SIZE to get some real order-of-magnitude savings. -
That is on the way out. See the discussion on virtual memmap support in sparseme. -
But they shouldn't be: we should aim to place physically contiguous pages into logically contiguous pagecache slots, for all the reasons we discussed. If/when that happens, there will be a *lot* of locality of reference against the pageframes in a lot of important codepaths. -
For big IO batch operations, pagecache would be more likely to be physically contiguous, as would LRU, I suppose. I'm more thinking of operations where things get reclaimed over time, touched or dirtied in slightly different orderings, interleaved with And when it doesn't happen, we eat 75% more cache misses. And for that matter we eat 75% more cache misses for non-batch operations like allocating or freeing a page by slab, for example. -
Yes, that can happen. But in such cases we by definition aren't touching the pageframes very often. I'd assert that when the kernel is really hitting those pageframes hard, it is commonly doing this in ascending "measure twice, cut once" -
Of course, but if you're doing them on random-ish ranges, or multiple files, or continually on the same file while parts of it get reclaimed I'm not sure that I would always agree. Sure if there is random *IO* involved, then it is going to be slow and we won't be hitting page frames so hard. But if there is random pagecache access, it will hit them almost Definitely agree there. -
Whilst that's true, if you have to deal with a run of contiguous page structs (eg: the page allocator, perhaps) it's actually less efficient because it takes more cache to do it. But, hey, it's a compromise whatever. In the scheme of things, if we're mostly dealing with individual page structs (as I think we are), then yes, I think it's probably a good thing to do - That's a good idea, one that's implemented on some platforms anyway. It'll be kmap, filling in scatter/gather lists, crypto stuff. I like it. Can you do this just by turning on WANT_PAGE_VIRTUAL on all 64-bit platforms? David -
Yeah, we would end up eating about 12.5% more cachelines for contiguous runs of pages... but that only kicks in after we've touched 8 of them I think, and by that point the accesses should be very prefetchable. I think the average of 75% more cachelines touched for random accesses is going to outweigh the contiguous batch savings, but that's just a guess at this point. -
I suspect the cache line footprint is not the main problem here (talking about only one other cache line), but the potential latency of fetching the other half. One possible alternative instead of increasing struct page would be to identify places that commonly touch a page first (e.g. using oprofile) and then always add a prefetch() there to fetch the other half of the page early. prefetch on something that is already in cache should be cheap, so for the structs that don't straddle cachelines it shouldn't be a big overhead. I don't think doing the ->virtual addition will buy very much, because at least the 64bit architectures will probably move towards vmemmap where pfn->virt is quite cheap. Of course the real long term fix for struct page cache overhead would be larger soft page size. -Andi -
Sooner rather than later, don't we need those 8 bytes to expand from atomic_t to atomic64_t _count and _mapcount? Not that we really need all 64 bits of both, but I don't know how to work atomically with less. (Why do I have this sneaking feeling that you're actually wanting to stick something into the lower bits of page->virtual?) Hugh -
No, I just thought I would get even more flamed if I was just adding padding that wasn't even doing _anything_ :) I remembered Ken had a benchmark where page_address was slow, and the rest is history... -
I wonder how close we get to overflow on ->_mapcount and ->_count.
(untested/uncompiled).
-- wli
Index: mm-2.6.21/include/linux/mm.h
===================================================================
--- mm-2.6.21.orig/include/linux/mm.h 2007-05-19 10:17:17.682653270 -0700
+++ mm-2.6.21/include/linux/mm.h 2007-05-19 10:38:52.376433663 -0700
@@ -248,6 +248,24 @@
* routine so they can be sure the page doesn't go away from under them.
*/
+static inline struct page *compound_head(struct page *page)
+{
+ if (unlikely(PageTail(page)))
+ return page->first_page;
+ return page;
+}
+
+static inline int page_count(struct page *page)
+{
+ return atomic_read(&compound_head(page)->_count);
+}
+
+#ifdef CONFIG_ATOMIC_HIWATER
+unsigned long count_hiwater(void);
+int put_page_testzero(struct page *);
+int get_page_unless_zero(struct page *);
+void get_page(struct page *);
+#else /* !CONFIG_ATOMIC_HIWATER */
/*
* Drop a ref, return true if the refcount fell to zero (the page has no users)
*/
@@ -267,24 +285,13 @@
return atomic_inc_not_zero(&page->_count);
}
-static inline struct page *compound_head(struct page *page)
-{
- if (unlikely(PageTail(page)))
- return page->first_page;
- return page;
-}
-
-static inline int page_count(struct page *page)
-{
- return atomic_read(&compound_head(page)->_count);
-}
-
static inline void get_page(struct page *page)
{
page = compound_head(page);
VM_BUG_ON(atomic_read(&page->_count) == 0);
atomic_inc(&page->_count);
}
+#endif /* !CONFIG_ATOMIC_HIWATER */
static inline struct page *virt_to_head_page(const void *x)
{
Index: mm-2.6.21/mm/Makefile
===================================================================
--- mm-2.6.21.orig/mm/Makefile 2007-05-18 09:58:43.851524250 -0700
+++ mm-2.6.21/mm/Makefile 2007-05-18 09:58:59.484415118 -0700
@@ -31,4 +31,4 @@
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += ...I think the problem is that an attacker can deliberately overflow ->_count, not that it can happen innocuously. By mmaping, say, the page of libc that contains memcpy() several million times, and forking enough, can't you make ->_mapcount hit 0? I'm not a VM guy, I just vaguely remember people talking about this before. -
That is not a valid consideration anymore. There is virtual memmap update page->virtual is a benefit if the page is cache hot. Otherwise it may cause a useless lookup. I wonder if there are other uses for the free space? -
> I wonder if there are other uses for the free space? unsigned long moreflags; Nick and Hugh were just sparring over adding a couple (or perhaps 8) flag bits. This would supply 64 new bits ... maybe that would keep them happy for a few more years. -Tony -
On Fri, 18 May 2007 13:37:09 -0700 - page->zone free some flags bits and makes page_zone() simple. and software (fake) zone for memory control can be added ? or -page->some_memory_controler ? (I don't know whether resource controller people want this or not.) -Kame -
It isn't the calculations I'm worried about, although they'll get simpler It would be very rare for the page not to be in L1 cache at this point, because we've likely taken a reference on it and/or locked it or moved it Hugh points out that we should make _count and _mapcount atomic_long_t's, which would probably be a better use of the space once your vmemmap goes in. -
Well Andy was going to merge it: http://marc.info/?l=linux-kernel&m=117620162415620&w=2 Andy when are we going to get the vmemmap patches into sparsemem? -
Sorry this has been backed up with all the too-ing and fro-ing on other things. I am just cleaning up the next round which includes feedback from wli and first stab at PPC64 support. Should be out monday for review. -apw -
The cache cost argument is specious. Even misaligned, smaller is smaller. The cache footprint reduction is merely amortized, probabilistic, etc. I'm not so sure about that. I doubt we have issues with that. I say if there's to be padding to 64B to use the of the whole additional space for additional flag bits. I'm sure fs's could make good use of 64 spare flag bits, or whatever's left over after the VM has its fill. Perhaps so many spare flag bits could be used in lieu of buffer_heads. page->virtual is the same old mistake as it was when it was removed. The virtual mem_map code should be used to resolve the computational expense. Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is 44 bits or more, about as much as is possible, and one reference per page per page is not even feasible. Full-length atomic_t's are just not necessary. However, there are numerous optimizations and features made possible with flag bits, which might as could be made cheap by padding struct page up to the next highest power of 2 bytes with space for flag bits. -- wli -
Well the last time I tried to get this by Andi we became a bit concerned when we realized that the memory map would grow by 14% in size. Given that 4k page size challenged platforms have a huge amount of page structs that growth is significant. I think it would be fine to do it for IA64 with 16k page size but not for x86_64. -
This reminds me Andi attempted in the past to convert 'flags' to a 32 bits field : http://marc.info/?l=linux-kernel&m=107903527523739&w=2 I wonder why this idea was not taken, saving 2MB per GB of memory is nice :) -
It made sense in 2.4, but in 2.6 it doesn't actually save any memory because there is no field to put into the freed padding. Besides with the scarcity of pageflags it might make sense to do "64 bit only" flags at some point. -Andi -
There is no scarcity of page flags. There is 1. Hoarding by Andrew 2. Waste by Sparsemem (section flags no longer necessary with virtual memmap) 2 will hopefully be addressed soon and with that 1 will go away. -
On Mon, 21 May 2007 10:08:06 -0700 (PDT) For i386(32bit arch), there is not enough space for vmemmap. For 64bit arch, page flags are not exhausted yet. -Kame -
I thought 32 bit would use flatmem? Is memory really sparse on 32 Right. -
On Mon, 21 May 2007 17:38:58 -0700 (PDT) Of course, i386 can use flatmem. I am just afraid that memory hotplug is just for sprasemem. But I also think we can add memory-hotplug for flatmem if necessary. (I myself have no plan now. I wonder memory power-save-mode may be supported by chipsets.) -Kame -
Throwing in more crazy comments: many m68k boxes have really sparse memory, due
to lack of memory and large address space.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
-
You can overflow a page's refcount by mapping it 4G times. That requires 32GB of pagetable memory. It's quite feasible with remap_file_pages(). -
Oh dear, worst-case app behavior. I'm just wrong. -- wli -
But do anybody ever need to do that? Such an attack is easily thwarted by refusing to map it more than, say 3G times? Helge Hafting -
Of course smaller is smaller ;) Why would that make the cache cost I don't really know what you mean by this, or what part of my cache cost argument you disagree with... I think it is that you could construct mem_map access patterns, without specifically looking at alignment, where a 56 byte struct page would suffer about 75% more cache misses than a 64 byte aligned one (and you could also get about 12% fewer cache misses with other access patterns). I also think the kernel's mem_map access patterns would be more on the random side, so overall would result in significantly fewer cache misses with 64 byte aligned pages. The issue is that userspace can DOS or crash the kernel by deliberately Really? 64-bit architectures can already use about maybe 16 or 32 more page flag bits than 32-bit architectures, and I definitely do not want to increase the size of 32-bit struct page, so I think this wouldn't Don't get too hung up on the page->virtual thing. I'll send another I don't know what your 32 + PAGE_SHIFT calculation is for, but yes you can wrap these counters from userspace on 64-bit architectures. -
It's not possible to ignore aggregation. For instance, for a subset of mem_map whose size ignoring alignment would otherwise fit in the cache to completely avoid sharing any cachelines between page structures requires page structures to be separated by at least one mem_map index. This is highly unlikely in uniform distributions. The lack of consideration of the average case. I'll see what I can smoke out there. This was a flat out error. Actually they can't use most of those flag bits on account of portability to the 32-bit case. A 32-bit flags on 64-bit is rather plausible due to such. That's fine. That's just an error. -- wli -
But that wasn't my argument. I _know_ there are cases where the smaller struct would be better, and I'm sure they would even arise in a running I _am_ considering the average case, and I consider the aligned structure is likely to win on average :) I just don't have numbers for it yet. -
Choosing k distinct integers (mem_map array indices) from the interval [0,n-1] results in k(n-k+1)/n non-adjacent intervals of contiguous array indices on average. The average interval length is (n+1)/(n-k+1) - 1/C(n,k). Alignment considerations make going much further somewhat hairy, but it should be clear that contiguity arising from random choice is non-negligible. In any event, I don't have all that much of an objection to what's actually proposed, just this particular cache footprint argument. One can motivate increases in sizeof(struct page), but not this way. Now that I've been informed of the ->_count and ->_mapcount issues, I'd say that they're grave and should be corrected even at the cost of sizeof(struct page). -- wli Many thanks to int-e on EfNet #math for help with the calculations (perhaps better described as doing them outright). Heavily-edited IRC log (using Knuth's conventions for M, N, and k as the number of runs): <int-e:#math> wli: oh maybe this can be solved exactly after all. The number +of configurations of N numbers out of M with exactly k runs is C(N-1, k-1) * +C(M-N+1, k). When there are k runs, the average run length is N/k, obviously. <int-e:#math> wli: assume there are k runs. add an empty dummy element at the +end and at the front - then you have (k+1) empty runs between the k runs. +Every run has positive length. the empty runs correspond to a partition of +M-N+2 into k+1 positive numbers, and the occupied runs correspond to one of N +into k positive numbers, which gives that formula. <int-e:#math> wli: So the average is 1/C(M,N) * sum[k=1 to N] N/k C(N-1,k-1) +C(M-N+1,k) = 1/C(M,N) * sum[k=1 to N] C(N,k) C(M-N+1,k) = 1/C(M,N) * (C(M+1, +N) - 1) = (M+1)/(M-N+1) - 1/C(M,N). -
Realise that you have to have a run of I think at least 7 or 8 contiguous pages and temporally close references in order to save a single cacheline. Then also that if the page being touched is not partially in cache from an earlier access, then it is statistically going to cost more lines to touch it (up to 75% if you touch the first and the last field, obviously 0% if you only touch a single field, but that's unlikely given that you usually take a reference then do at least something else like check flags). I think the problem with the cache footprint argument is just whether ... yeah, something like that would bypass -
It doesn't need to. If what's in the cache is uniformly distributed, you get that result for spatial locality. From there, it's counting cachelines. The average interval ("run") length is (n+1)/(n-k+1) - 1/C(n,k), so for that to be >= 8 you need (n+1)/(n-k+1) - 1/C(n,k) >= 8 which also happens when (n+1)/(n-k+1) >= 9 or when n >= (9/8)*k - 1 or k <= (8/9)*(n+1). Clearly a lower bound on k is required, but not obviously derivable. k >= 8 is obvious, but the least k where (n+1)/(n-k+1) - 1/C(n,k) >= 8 is not entirely obvious. Numerically solving for the least such k finds that k actually needs to be relatively close to (8/9)*n. A lower bound of something like 0.87*n + O(1) probably holds. Did you get cut off here? -- wli -
OK, so your 'k' is the number of struct pages that are in cache? Then that's fine. I'm not sure how many that is going to be, but I would be surprised if it were a significant proportion of mem_map, even on not-so-large Ah, you worked it out... yeah I'd guess this is going to be pretty difficult a condition to satisfy (given that it isn't possible for a 4GB system, even Must have. I was going to say it would bypass the whole speed/size discussion anyway :P -
As long as we're throwing out crazy unpopular ideas, try this one: Divide struct page in two such that all the most commonly used elements are in one piece that's nicely sized and the rest are in another. Have two parallel arrays containing these pieces and accessor functions around the unpopular bits. Whether a sensible divide between popular and unpopular bits isn't clear to me. But hey, I said it was crazy. -- Mathematics is the supreme nostalgia of our time. -
That would be unpopular with pagecache, because that uses pretty well all fields. -
I have a crazier and even less popular idea. Eliminate struct page entirely as an accounting structure (and, of course, mem_map with it). Filesystems can keep the per-page metadata they need in their own accounting structures, slab mutatis mutandis, etc. The brilliant bit here is that devolving the accounting structures this way allows the fs and/or subsystem to arrange for strong cache locality, file offset adjacency to imply memory adjacency of the page accounting fields, etc., where grabbing random structures out of some array is a real cache thrasher. The page allocation and page replacement algorithms would have to be adjusted, and things would have to allocate their own refcounts, supposing they want/need refcounts, but it's not so far out. Refer to filesystem pages by <mapping, index> pairs, refer to slab pages by address (virtual and physical are trivially inter-convertible), mock up something akin to what filesystems do for anonymous pages, etc. The real objection everyone's going to have is that driver writers will stain their shorts when faced with the rules for handling such things. The thing is, I'm not entirely sure who these driver writers that would have such trouble are, since the driver writers I know personally are sophisticates rather than walking disaster areas as such would imply. I suppose they may not be representative of the whole. -- wli P.S. This idea is not plucked out of the air; it has precedents. A number of microkernels do this, and IIRC k42 does so also. -
BTW. I think the filesystem APIs (at least the VM-side ones) should be doing this anyway (not even index, but offset). Passing things like lists of pages around is just horrible. See my write_begin/write_end and perform_write aops for (what I think is) a step in the right That's not the objection I would have. I would say that firstly, I don't think the mem_map overhead is very significant (at any rate, an allocated-on-demand metadata is not going to be any smaller if you fill up on pagecache...). Secondly, I think there is merit to having the same page metadata used by the major subsystems, because it helps for locality of reference. But I haven't explored the idea enough myself to know whether there would be any really killer benefits to this. Delayed metadata freeing via RCU without holding up the freeing of the actual page would have been something, however I can do similar with speculative references now (or whenever the code gets merged), which doesn't even require the Psst, just say "kernels" when you mention this to Linus ;) -
The size isn't the advantage being cited; I'd actually expect the net result to be larger. It's the control over the layout of the metadata for cache locality and even things like having enough flags, folding buffer_head -like affairs into the per-page metadata for filesystems and so reaping cache locality benefits even there (assuming it works out in other respects), and so on. Passing pages between subsystems doesn't seem very significant to me. There isn't going to be much locality of reference, or even any guarantee that the subsystem gets fed a cache hot page structure. The subsystem being passed the page will have its own cache hot accounting structures to stick the information about the memory into. I'm not entirely sure what you're on about there, but it sounds interesting. -- wli -
On Mon, 21 May 2007 01:08:13 -0700
As long we handle 4 KB pages, adding 64 bits per page means 0.2 % of overhead. Ouch...
We currently have an overhead of 1.36 % for mem_map
Maybe we can still use 32 bits counters, and make sure non root users cannot
make these counters exceed 2^30. (I believe high order bit has already a meaning,
check page_mapped() definition)
We could use a special atomic_inc_if_not_huge() function, that could revert to
normal atomic_inc() on machines with less than 32 GB (using alternative_() variant)
On small setups (or 32 bits arches), atomic_inc_if_not_huge() would unconditionnally
increment the counter.
#if !defined(BIG_MACHINES)
static int inline atomic_inc_if_not_huge(atomic_t *v)
{
atomic_inc(v);
return 1;
}
#else
extern int atomic_inc_if_not_huge(atomic_t *v);
#endif
/* in a .c file */
/* could be patched at boot time if available memory < 32GB (or other limit) */
#if defined(BIG_MACHINES)
#define MAP_LIMIT_COUNT (2<<30)
int atomic_inc_if_not_huge(atomic_t *v);
{
/* lazy test, we dont care enough to do a real atomic read-modify-write */
if (unlikely(atomic_read(v) >= MAP_LIMIT_COUNT)) {
if (non_root_user())
return 0;
}
atomic_inc(v);
return 1;
}
#endif
-
I'd be glad too if you could get some numbers. I did some benchmarking a few weeks ago on x86_64 and I found only a very minimal performance drop if the calculation was simplified. Note also that a smaller structure means that more page structs can be covered by a certain amount of cachelines. Doing the alignment may cause more cacheline misses. -
We had those hardware alignment for many data structures where they were only wasting memory (i.e. vmas). There are few places where the hardware alignment matters, page struct IIRC the math is faster for any x86. Overall I doubt the change is measurable. Even if this would be a microoptimization barely measurable in some microbenchmark, I don't think this one is worth doing. mem_map is such a bloat that it really has to be as small as it can unless we can If you want to drop it you can, there's nothing fundamental that prevents you to drop the 'virtual' completely from page struct, by just making the vaddr per-process and storing it on the stack like with the atomic kmaps, but passing it up the stack may require heavy changes to various apis, which is why we've taken the few-changes lazy way back then. If it wasn't worth back then, I doubt it worth now for just pae36. -
