Re: [00/41] Large Blocksize Support V7 (adds memmap support)

To: <torvalds@...>
Cc: <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>, <joern@...>
Date: Tuesday, September 11, 2007 - 2:03 am

This patchset modifies the Linux kernel so that larger block sizes than
page size can be supported. Larger block sizes are handled by using
compound pages of an arbitrary order for the page cache instead of
single pages with order 0.

- Support is added in a way that limits the changes to existing code.
  As a result filesystems can support larger page I/O with minimal changes.

- The page cache functions are mostly unchanged. Instead of a page struct
  representing a single page they take a head page struct (which looks
  the same as a regular page struct apart from the compound flags) and
  operate on those. Most page cache functions can stay as they are.

- No locking protocols are added or modified.

- The support is also fully transparent at the level of the OS. No
  specialized heuristics are added to switch to larger pages. Large
  page support is enabled by filesystems or device drivers when a device
  or volume is mounted. Larger block sizes are usually set during volume
  creation although the patchset supports setting these sizes per file.
  The formatted partition will then always be accessed with the
  configured blocksize.

- Large blocks also do not mean that the 4k mmap semantics need to be abandoned.
  The included mmap support will happily map 4k chunks of large blocks so that
  user space sees no changes.

Some of the changes are:

- Replace the use of PAGE_CACHE_XXX constants to calculate offsets into
  pages with functions that do the same and allow the constants to
  be parameterized.

- Extend the capabilities of compound pages so that they can be
  put onto the LRU and reclaimed.

- Allow setting a larger blocksize via set_blocksize()

Rationales:
-----------

1. The ability to handle memory of an arbitrarily large size using
   a single page struct "handle" is essential for scaling memory handling
   and reducing overhead in multiple kernel subsystems. This patchset
   is a strategic move that allows performance gains throughout the
   kern...
To: Christoph Lameter <clameter@...>, <andrea@...>
Cc: <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>, <joern@...>
Date: Monday, September 10, 2007 - 2:52 pm

There is a limitation in the VM. Fragmentation. You keep saying this
is a solved issue and just assuming you'll be able to fix any cases
that come up as they happen.

I still don't get the feeling you realise that there is a fundamental
fragmentation issue that is unsolvable with Mel's approach.

The idea that there even _is_ a bug to fail when higher order pages
cannot be allocated was also brushed aside by some people at the
vm/fs summit. I don't know if those people had gone through the
math about this, but it goes somewhat like this: if you use a 64K
page size, you can "run out of memory" with 93% of your pages free.
If you use a 2MB page size, you can fail with 99.8% of your pages
still free. That's 64GB of memory used on a 32TB Altix.
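
As a rough check of those figures, here is a minimal sketch. It assumes a 4K
base page and the worst case in which every aligned large-page-sized block
holds exactly one in-use base page; the function name is made up for
illustration.

```python
def worst_case_free_fraction(large_page_bytes, base_page_bytes=4096):
    """Fraction of base pages that can be free while no large page is.

    Worst case: exactly one base page is in use inside every aligned
    large-page-sized block, so no block can be handed out whole.
    """
    pages_per_block = large_page_bytes // base_page_bytes
    return (pages_per_block - 1) / pages_per_block


KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40

for size, label in ((64 * KIB, "64K"), (2 * MIB, "2MB")):
    print(f"{label}: {worst_case_free_fraction(size):.2%} of pages free")

# Memory in use in that worst case on a 32TB machine with 2MB large pages:
# one 4K page inside each 2MB block.
pinned_bytes = (32 * TIB // (2 * MIB)) * 4096
print(pinned_bytes // GIB, "GB in use")
```

With 16 base pages per 64K block, 15 of 16 can be free; with 512 per 2MB
block, 511 of 512 can be free.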

If you don't consider that is a problem because you don't care about
theoretical issues or nobody has reported it from running -mm
kernels, then I simply can't argue against that on a technical basis.
But I'm totally against introducing known big fundamental problems to
the VM at this stage of the kernel. God knows how long it takes to ever
fix them in future after they have become pervasive throughout the 
kernel.

IMO the only thing that higher order pagecache is good for is a quick
hack for filesystems to support larger block sizes. And after seeing it
is fairly ugly to support mmap, I'm not even really happy for it to do
that.

If VM scalability is a problem, then it needs to be addressed in other
areas anyway for order-0 pages, and if contiguous pages helps IO
scalability or crappy hardware, then there is nothing stopping us from
*attempting* to get contiguous memory in the current scheme.

Basically, if you're placing your hopes for VM and IO scalability on this,
then I think that's a totally broken thing to do and will end up making
the kernel worse in the years to come (except maybe on some poor
configurations of bad hardware).
-
To: Nick Piggin <nickpiggin@...>
Cc: <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>, <joern@...>
Date: Tuesday, September 11, 2007 - 4:01 pm

Well my problem first of all is that you did not read the full message. It 
discusses that later and provides page pools to address the issue.

Secondly you keep FUDding people with lots of theoretical concerns 
assuming Mel's approaches must fail. If there is an issue (I guess there 
must be right?) then please give us a concrete case of a failure that we 

Allocations can currently fail and all code has the requirement to handle 
failure cases in one form or another.

Currently we can only handle up to order 3 allocs it seems. 2M pages (and 
in particular pagesizes > MAX_ORDER) will have to be handled by a separate 
large page pool facility discussed in the earlier message.
-
To: Christoph Lameter <clameter@...>
Cc: <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>, <joern@...>
Date: Tuesday, September 11, 2007 - 1:17 am

And BTW, before you accuse me of FUD, I'm actually talking about the
fragmentation issues on which Mel I think mostly agrees with me at this
point.

Also, do you really have a rational reason why we should just up and accept
all these big changes just because, while there are lots of theoretical
issues, the person pointing them out to you hasn't happened to give you a
concrete failure case? Oh, and the actual performance benefit is not really
even quantified yet, crappy hardware notwithstanding, and neither has there
been a proper evaluation of the alternatives.

So... would you drive over a bridge if the engineer had this mindset?
-
To: Nick Piggin <nickpiggin@...>
Cc: Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>, <joern@...>
Date: Tuesday, September 11, 2007 - 5:27 pm

I'm half way between you two on this one. I agree with Christoph in that
it's currently very difficult to trigger a failure scenario and today we
don't have a way of dealing with it. I agree with Nick in that conceivably a
failure scenario does exist somewhere and the careful person (or paranoid if
you prefer) would deal with it pre-emptively. The fact is that no one knows
what a large block workload is going to look like to the allocator so we're
all hand-waving.

Right now, I can't trigger the worst failure scenarios that cannot be
dealt with for fragmentation but that might change with large blocks. The
worst situation I can think of is a process that continuously dirties large
amounts of data on a large block filesystem while another set of processes
works with large amounts of anonymous data without any swap space configured
with slub_min_order set somewhere between order-0 and the large block size.
Fragmentation wise, that's just a kick in the pants and might produce
the failure scenario being looked for.

If it does fail, I don't think it should be used to beat Christoph with as
such because it was meant to be a #2 solution. What hits it is if the mmap()

Performance figures would be nice. dbench is flaky as hell but can
comparison figures be generated on one filesystem with 4K blocks and one
with 64K? I guess we can do it ourselves too because this should work on

If I had this bus that couldn't go below 50MPH, right...... never mind.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To: Christoph Lameter <clameter@...>
Cc: <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>, <joern@...>
Date: Tuesday, September 11, 2007 - 12:43 am

On the other hand, you ignore the potential failure cases, and ignore
the alternatives that do not have such cases.
-
To: Nick Piggin <nickpiggin@...>
Cc: Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 11, 2007 - 8:12 am

While I agree with your concern, those numbers are quite silly.  The
chances of 99.8% of pages being free and the remaining 0.2% being
perfectly spread across all 2MB large_pages are lower than those of SHA1
creating a collision.  I don't see anyone abandoning git or rsync, so
your extreme example clearly is the wrong one.
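
The comparison can be put into numbers with a small sketch. It assumes the
in-use pages are placed uniformly at random, which is what a comparison with
a hash collision implies; a deliberately crafted allocation pattern is not
random and is not covered by it. The function name and the 2GB example size
are illustrative assumptions.

```python
import math


def log2_prob_one_per_block(blocks, pages_per_block=512):
    """log2 of the chance that `blocks` randomly placed in-use pages land
    exactly one in each block of `pages_per_block` base pages."""
    total = blocks * pages_per_block
    # ln C(total, blocks): ways to pick which pages are in use.
    ln_choose = (math.lgamma(total + 1)
                 - math.lgamma(blocks + 1)
                 - math.lgamma(total - blocks + 1))
    # ln pages_per_block**blocks: picks with one in-use page per block.
    ln_favourable = blocks * math.log(pages_per_block)
    return (ln_favourable - ln_choose) / math.log(2)


# 1024 blocks of 2MB is only 2GB of RAM; compare the exponent with 160,
# the SHA-1 digest length in bits.
print(log2_prob_one_per_block(1024))
```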

Again, I agree with your concern, even though your example makes it look
silly.

Jörn

-- 
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike
-
To: <joern@...>
Cc: Nick Piggin <nickpiggin@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 11, 2007 - 4:07 pm

You may want to consider Mel's antifrag approaches which certainly
decrease the chance of this occurring. Reclaim can open up the needed
linear memory hole in an intentional way. The memory compaction approach
can even move pages to open up these 2M holes. The more pages we make
movable (see f.e. the targeted slab reclaim patchset that makes slab
pages movable) the more reliable higher order allocations become.
To: Christoph Lameter <clameter@...>
Cc: <joern@...>, Nick Piggin <nickpiggin@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 11, 2007 - 4:29 pm

I absolutely agree with your slab reclaim patchset.  No argument here.

What I'm starting to wonder about is where your approach has advantages
over Andrea's.  The chances of triggering something vaguely similar to
Nick's worst case scenario are certainly higher for your solution.  So
unless there are other upsides it is just the second-best solution.

Jörn

-- 
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein
-
To: <joern@...>
Cc: Nick Piggin <nickpiggin@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 11, 2007 - 4:41 pm

Nick's worst case scenario is already at least partially addressed by the
Lumpy Reclaim code in 2.6.23. His examples assume 2.6.22 or earlier
code.

The advantage of this approach over Andrea's is basically that the 4k
filesystems still can be used as is. 4k is useful for binaries and for
text processing like used for compiles. Large Page sizes are useful for
file systems that contain large datasets (scientific data, multimedia
stuff, databases) for applications that depend on high I/O throughput.
To: Christoph Lameter <clameter@...>
Cc: <joern@...>, Nick Piggin <nickpiggin@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 11, 2007 - 7:26 pm

If you mean that with my approach you can't use a 4k filesystem as is,
that's not correct. I even run the (admittedly premature but
promising) benchmarks on my patch on a 4k blocksized
filesystem... Guess what, you can even still mount a 1k fs on a 2.6
kernel.

The main advantage I can see in your patch is that distributions won't
need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be
slower).
-
To: Andrea Arcangeli <andrea@...>
Cc: <joern@...>, Nick Piggin <nickpiggin@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 11, 2007 - 8:04 pm

Right you can use a 4k filesystem. The 4k blocks are buffers in a larger 

I would think that your approach would be slower since you always have to 
populate 1 << N ptes when mmapping a file? Plus there is a lot of wastage 
of memory because even a file with one character needs an order N page? So 
there are fewer pages available for the same workload.
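
To make the two costs named here concrete, a minimal sketch, assuming a 4K
hardware page and a cache that only ever holds whole order-N pages. The
helper names are hypothetical and not taken from either patchset.

```python
BASE_PAGE = 4096  # assumed hardware page size in bytes


def cache_footprint(file_size, order):
    """Bytes of page cache used by a file when every cache page is order N."""
    page_bytes = BASE_PAGE << order
    pages = -(-file_size // page_bytes)  # ceiling division
    return pages * page_bytes


def ptes_per_cache_page(order):
    """PTEs needed to map one whole order-N cache page with 4K mappings."""
    return 1 << order


for order in (0, 4):  # order 4 is 64K with a 4K base page
    used = cache_footprint(1, order)
    print(f"order {order}: 1-byte file uses {used} bytes of cache, "
          f"{used - 1} wasted, {ptes_per_cache_page(order)} ptes per page")
```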

Then you are breaking mmap assumptions of applications because the order 
N kernel will no longer be able to map 4k pages.  You likely need a new 
binary format that has pages correctly aligned. I know that we would need 
one on IA64 if we go beyond the established page sizes.
-
To: Christoph Lameter <clameter@...>
Cc: <joern@...>, Nick Piggin <nickpiggin@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Wednesday, September 12, 2007 - 4:20 am

I don't have to populate them, I could just map one at time. The only
reason I want to populate every possible pte that could map that page
(by checking vma ranges) is to _improve_ performance by decreasing the
number of page faults by an order of magnitude. Then with the 62nd bit
after NX giving me a 64k tlb, I could decrease the frequency of the

This is a known issue. The same is true for ppc64 64k. If that really

No you misunderstood the whole design. My patch will be 100% backwards
compatible in all respects. If I could break backwards compatibility
70% of the complexity would go away...
-
To: Jörn <joern@...>
Cc: Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Saturday, September 15, 2007 - 4:44 am

Actually it'd be pretty easy to craft an application which allocates seven
pages for pagecache, then one for <something>, then seven for pagecache, then
one for <something>, etc.

I've had test apps which do that sort of thing accidentally.  The result
wasn't pretty.
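
A toy model of that pattern, as a sketch only: it assumes the allocator hands
out pages in address order and that the <something> pages can be neither
freed nor moved. Neither assumption is stated in the message, and the
function name is made up.

```python
def free_after_dropping_pagecache(total_pages, order):
    """Fill memory as (2**order - 1) pagecache pages followed by one pinned
    page, repeatedly; then free all pagecache and report what is left."""
    group = 1 << order
    # True marks a page still in use once all pagecache has been freed.
    in_use = [pfn % group == group - 1 for pfn in range(total_pages)]
    free_pages = in_use.count(False)
    # An order-N allocation needs a fully free, aligned group of pages.
    free_groups = sum(
        not any(in_use[start:start + group])
        for start in range(0, total_pages, group)
    )
    return free_pages, free_groups


pages, groups = free_after_dropping_pagecache(total_pages=1024, order=3)
print(f"{pages} of 1024 pages free, {groups} free order-3 blocks")
```

In this model 7/8 of memory is free, yet no 8-page allocation can succeed.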
-
To: Andrew Morton <akpm@...>
Cc: Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Saturday, September 15, 2007 - 8:14 am

Except that the application's 7 pages are movable and the <something>
would have to be unmovable. And then they should not share the same
memory region. At least they should never be allowed to interleave in
such a pattern on a larger scale.

The only way a fragmentation catastrophe can be (provably) avoided is
by having so few unmovable objects that size + max waste << ram
size. The smaller the better. Allowing movable and unmovable objects
to mix means that max waste goes way up. In your example waste would
be 7*size. With 2MB upper order limit it would be 511*size.
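
The "size + max waste" bound can be written out as a short sketch, assuming
4K base pages and that each unmovable page may end up alone in its own group.
The function name and the 64MB example figure are illustrative assumptions.

```python
BASE_PAGE = 4096  # assumed base page size in bytes


def worst_case_footprint(unmovable_bytes, group_bytes):
    """Memory that unmovable data can make unusable for whole-group
    allocations when each unmovable page sits alone in its own group."""
    pages_per_group = group_bytes // BASE_PAGE
    max_waste = (pages_per_group - 1) * unmovable_bytes
    return unmovable_bytes + max_waste


MIB, GIB = 2**20, 2**30
# 8-page groups give waste of 7*size; 2MB groups give 511*size.
print(worst_case_footprint(64 * MIB, 8 * BASE_PAGE) // MIB, "MB")
print(worst_case_footprint(64 * MIB, 2 * MIB) // GIB, "GB")
```

So 64MB of badly placed unmovable data could in principle tie up 32GB worth
of 2MB groups, which is why the bound only helps if size is tiny relative to
RAM.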

I keep coming back to the fact that movable objects should be moved
out of the way for unmovable ones. Anything else just allows
fragmentation to build up.

MfG
        Goswin
-
To: Goswin von Brederlow <brederlo@...>
Cc: Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 2:13 pm

It is actually really easy to force regions to never share. At the
moment, there is a fallback list that determines a preference for what
block to mix.

The reason why this isn't enforced is the cost of moving. On x86 and
x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
those pages to prevent any mixing would be bad enough. On PowerPC, it's
potentially 16MB. On IA64, it's 1GB.

As this was fragmentation avoidance, not guarantees, the decision was
made to not strictly enforce the types of pages within a block as the
cost cannot be made back unless the system was making aggressive use of

This is easily achieved, just really really expensive because of the
amount of copying that would have to take place. It would also compel
that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
a lot of free memory to keep around which is why fragmentation avoidance
doesn't do it.
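
To give a feel for the sizes involved, a sketch of that reserve for the block
sizes mentioned above. MIGRATE_TYPES = 4 is an assumption, taken from the
four types named later in the thread (movable, unmovable, reclaimable,
reserve); the helper name is made up.

```python
MIGRATE_TYPES = 4  # assumption: movable, unmovable, reclaimable, reserve


def reserve_kbytes(pageblock_bytes, blocks=1):
    """min_free_kbytes needed to keep `blocks` whole pageblocks free."""
    return blocks * pageblock_bytes // 1024


MIB = 2**20
for name, size in (("x86 2MB", 2 * MIB), ("x86 4MB", 4 * MIB),
                   ("PowerPC 16MB", 16 * MIB), ("IA64 1GB", 1024 * MIB)):
    print(f"{name}: {reserve_kbytes(size)} kB for one block, "
          f"{reserve_kbytes(size, MIGRATE_TYPES)} kB for one per type")
```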

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To: Mel Gorman <mel@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 5:58 pm

I don't say the group should never be mixed. The movable objects could
be moved out on demand. If 64k get allocated then up to 64k get
moved. That would reduce the impact as the kernel does not hang while
it moves 2MB or even 1GB. It also allows objects to be freed and the
space reused in the unmovable and mixed groups. There could also be a
certain number or percentage of mixed groups allowed to further
increase the chance of movable objects freeing themselves from mixed
groups.

But when you already have say 10% of the ram in mixed groups then it
is a sign that external fragmentation is happening and some time should be

In your sample graphics you had 1152 groups. Reserving a few of those
doesn't sound too bad. And how many migrate types are we talking about? So
far we only had movable and unmovable. I would split unmovable into
short term (caches, I/O pages) and long term (task structures,
dentries). Reserving 6 groups for short term unmovable and long term
unmovable would be 1% of ram in your situation.

Maybe instead of reserving one could say that you can have up to 6
groups of space not used by unmovable objects before aggressive moving
starts. I don't quite see why you NEED reserving as long as there is
enough space free altogether in case something needs moving. 1 group
worth of space free might be plenty to move stuff to. Note that all
the virtual pages can be stuffed in every little free space there is
and reassembled by the MMU. There is no space lost there.


But until one tries one can't say.

MfG
        Goswin

PS: How do allocations pick groups? Could one use the oldest group
dedicated to each MIGRATE_TYPE? Or lowest address for unmovable and
highest address for movable? Something to better keep the two out of
each other's way.
-
To: Goswin von Brederlow <brederlo@...>
Cc: Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Monday, September 17, 2007 - 6:03 am

This type of action makes sense in the context of Andrea's approach and
large blocks. I don't think it makes sense today to do it in the general

I'll play around with it on the side and see what sort of results I get.
I won't be pushing anything any time soon in relation to this though.
For now, I don't intend to fiddle more with grouping pages by mobility
for something that may or may not be of benefit to a feature that hasn't

No, which on those systems, I would suggest setting min_free_kbytes to a

Movable, unmovable, reclaimable and reserve in the current incarnation

Mostly done as you suggest already. Dentries are considered reclaimable, not

More groups = more cost although very easy to add them. A mixed type
used to exist but was removed again because it couldn't be proved to be

And if the groups are 1GB in size? I tried something like this already.


What you suggest sounds similar to having a type MIGRATE_MIXED where you
allocate from when the preferred lists are full. It became a sizing



We bias the location of unmovable and reclaimable allocations already. It's
not done for movable because it wasn't necessary (as they are easily
reclaimed or moved anyway).

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To: Mel Gorman <mel@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 23, 2007 - 2:22 am

I watched the videos you posted. A nice and quite clear improvement
with and without your logic. Kudos.

When you play around with it may I suggest a change to the display of
the memory information. I think it would be valuable to use a Hilbert
Curve to arrange the pages into pixels. Like this:

# #  0  3
# #
###  1  2

### ###  0 1 E F
  # #
### ###  3 2 D C
#     #
# ### #  4 7 8 B
# # # #
### ###  5 6 9 A
                +-----------+-----------+
# ##### ##### # |00 03 04 05|3A 3B 3C 3F|
# #   # #   # # |           |           |
### ### ### ### |01 02 07 06|39 38 3D 3E|
    #     #     |           |           |
### ### ### ### |0E 0D 08 09|36 37 32 31|
# #   # #   # # |           |           |
# ##### ##### # |0F 0C 0B 0A|35 34 33 30|
#             # +-----+-----+           |
### ####### ### |10 11|1E 1F|20 21 2E 2F|
  # #     # #   |     |     |           |
### ### ### ### |13 12|1D 1C|23 22 2D 2C|
#     # #     # |     +-----+           |
# ### # # ### # |14 17|18 1B|24 27 28 2B|
# # # # # # # # |     |     |           |
### ### ### ### |15 16|19 1A|25 26 29 2A|
                +-----+-----+-----------+

I've drawn in allocations for 16, 8, 4, 5, 32 pages in that order in
the last one. The idea is to get near pages visually near in the
output and into an area instead of lines. Easier on the eye. It also
manages to always draw aligned order(x) blocks as squares or rectangles
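
A minimal sketch of the page-number-to-pixel mapping, using the standard
iterative Hilbert curve index-to-coordinate conversion. The function names
are made up for illustration; with x as the column and y as the row, the
output reproduces the 2x2 and 4x4 numberings drawn above.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order grid."""
    side = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the sub-square before the swap below.
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x  # rotate by swapping the axes
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_grid(order):
    """Return rows of page indices laid out along the curve."""
    side = 1 << order
    grid = [[0] * side for _ in range(side)]
    for d in range(side * side):
        x, y = hilbert_d2xy(order, d)
        grid[y][x] = d
    return grid


for row in hilbert_grid(2):
    print(" ".join(f"{d:X}" for d in row))
```

Each aligned run of four consecutive indices (0-3, 4-7, 8-B, C-F) lands in
its own 2x2 square, which is the property that makes aligned blocks easy to
spot in the picture.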

You adjust group size with the number of groups total. You would not
use 1GB Huge Pages on a 2GB ram system. You could try 2MB groups. I
think for most current systems we are lucky there. 2MB groups fit
hardware support and give a large but not too large number of groups
to work with.

But you only need to stick to hardware suitable group sizes for huge
tlb support right? For better I/O and such you could have 512Kb groups

Which is different from reserving a full group as it does not count

Not really. I'm saying we should actively defragment mixed groups
during allocation and always a...
To: Goswin von Brederlow <brederlo@...>
Cc: Mel Gorman <mel@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Monday, September 24, 2007 - 8:32 am

Here's an excellent example of an 0-255 numbered hilbert curve used  
to enumerate the various top-level allocations of IPv4 space:
http://xkcd.com/195/

Cheers,
Kyle Moffett

-
To: Mel Gorman <mel@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Christoph Lameter <clameter@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 5:03 am

I don't know how it would prevent fragmentation from building up
anyway. It's commonly the case that potentially unmovable objects
are allowed to fill up all of ram (dentries, inodes, etc).

And of course,  if you craft your exploit nicely with help from higher
ordered unmovable memory (eg. mm structs or unix sockets), then
you don't even need to fill all memory with unmovables before you
can have them take over all groups.
-
To: Nick Piggin <nickpiggin@...>
Cc: Mel Gorman <mel@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Monday, September 17, 2007 - 6:00 pm

Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from 
ZONE_MOVABLE and thus the memory that can be allocated for them is 
limited.



-
To: Christoph Lameter <clameter@...>
Cc: Mel Gorman <mel@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Monday, September 17, 2007 - 8:11 pm

Why would ZONE_MOVABLE require that "movable objects should be moved
out of the way for unmovable ones"? It never _has_ any unmovable objects in
it. Quite obviously we were not talking about reserve zones.
-
To: Nick Piggin <nickpiggin@...>
Cc: Mel Gorman <mel@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 18, 2007 - 4:36 pm

This was a response to your statement all of memory could be filled up by unmovable 
objects. Which cannot occur if the memory for unmovable objects is 
limited. Not sure what you mean by reserves? Mel's reserves? The reserves 
for unmovable objects established by ZONE_MOVABLE?


-
To: Christoph Lameter <clameter@...>
Cc: Nick Piggin <nickpiggin@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 18, 2007 - 6:00 am

As Nick points out, having to configure something makes it a #2
solution. However, I at least am ok with that. ZONE_MOVABLE is a get-out
clause to be able to control fragmentation no matter what the workload is
as it gives hard guarantees. Even when ZONE_MOVABLE is replaced by some
mechanism in grouping pages by mobility to force a number of blocks to be
MIGRATE_MOVABLE_ONLY, the emergency option will exist.

We still lack data on what sort of workloads really benefit from large
blocks (assuming there are any that cannot also be solved by improving
order-0). With Christoph's approach + grouping pages by mobility +
ZONE_MOVABLE-if-it-screws-up, people can start collecting that data over the
course of the next few months while we're waiting for fsblock or software
pagesize to mature.

Do we really need to keep discussing this as no new point has been made in a
while? Can we at least take out the non-contentious parts of Christoph's
patches such as the page cache macros and do something with them?

-- 
Mel "tired of typing" Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To: Mel Gorman <mel@...>
Cc: Christoph Lameter <clameter@...>, Nick Piggin <nickpiggin@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 18, 2007 - 8:31 am

No we don't. All workloads benefit from larger block sizes when
you've got a btree tracking 20 million inodes and a create has to
search that tree for a free inode.  The tree gets much wider and
hence we take fewer disk seeks to traverse the tree. Same for large
directories, btrees tracking free space, etc - everything goes
faster with a larger filesystem block size because we spend less
time doing metadata I/O.
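
A minimal sketch of the tree-width argument, assuming a hypothetical
fixed 16-byte index entry (real XFS on-disk record sizes differ) and a
tree whose nodes each fill exactly one filesystem block:

```python
def btree_levels(n_records, block_size, entry_size=16):
    """Levels needed to index n_records when every node is one block.

    entry_size is an assumed value chosen for illustration only.
    """
    fanout = block_size // entry_size
    levels = 1
    capacity = fanout
    while capacity < n_records:
        capacity *= fanout
        levels += 1
    return levels


for block_size in (4096, 16384):
    print(block_size, btree_levels(20_000_000, block_size))
```

With these assumed numbers, 20 million records need four levels at 4k
blocks and three at 16k, so a cold lookup reads one block fewer.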

And the other advantage is that sequential I/O speeds also tend to
increase with larger block sizes. e.g. XFS on an Altix (16k pages)
using 16k block size is about 20-25% faster on writes than 4k block
size. See the graphs at the top of page 12:

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

The benefits are really about scalability and with terabyte sized
disks on the market.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To: Mel Gorman <mel@...>
Cc: Christoph Lameter <clameter@...>, Nick Piggin <nickpiggin@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, <andrea@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 18, 2007 - 6:49 am

Compressing filesystems like jffs2 and logfs gain better compression
ratio with larger blocks.  Going from 4KiB to 64KiB gave somewhere
around 10% benefit iirc.  Testdata was a 128MiB qemu root filesystem.
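
A rough way to see the effect, not the jffs2 or logfs code paths:
compress the same data in independent 4KiB and 64KiB blocks with zlib
and compare the totals. The synthetic word-soup input is an assumption
standing in for a real root filesystem image, so the size of the gap
will not match the figure quoted above.

```python
import random
import zlib


def compressed_size(data, block_size):
    """Total size when each block is compressed on its own."""
    return sum(
        len(zlib.compress(data[off:off + block_size]))
        for off in range(0, len(data), block_size)
    )


# Deterministic synthetic input: random words drawn from a small vocabulary.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
vocab = [
    "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
    for _ in range(500)
]
data = " ".join(rng.choice(vocab) for _ in range(60_000)).encode()

small = compressed_size(data, 4 * 1024)
large = compressed_size(data, 64 * 1024)
print(f"4KiB blocks: {small} bytes, 64KiB blocks: {large} bytes")
```

Each block starts with an empty dictionary, so smaller blocks pay the
warm-up cost more often.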

Granted, the same could be achieved by adding some extra code and a few
bounce buffers to the filesystem.  How such a hack would perform I'd
prefer not to find out, though. :)

Jörn

-- 
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug MacIlroy
-
To: Goswin von Brederlow <brederlo@...>
Cc: Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Saturday, September 15, 2007 - 11:51 am

That's incidentally exactly what the slab does, no need to reinvent
the wheel for that, it's an old problem and there's room for
optimization in the slab partial-reuse logic too. Just boost the order
0 page size and use the slab to get the 4k chunks. The sgi/defrag
design is backwards.
-
To: Andrea Arcangeli <andrea@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Saturday, September 15, 2007 - 4:14 pm

How does that help? Will slabs move objects around to combine two
partially filled slabs into a nearly full one? If not, consider this:

- You create a slab for 4k objects based on 64k compound pages.
  (first of all that wastes you a page already for the meta infos)
- Something movable allocates 14 4k pages in there making the slab
  partially filled.
- Something unmovable allocates a 4k page making the slab mixed and
  full.
- Repeat until out of memory.

OR

- Userspace allocates a lot of memory in those slabs.
- Userspace frees one in every 15 4k chunks.
- Userspace forks 1000 times causing an unmovable task structure to
  appear in 1000 slabs. 
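
The first scenario can be counted directly. This is a toy model, not
the slab allocator: it takes the message's own assumptions (16 slots of
4k per 64k compound page, one slot lost to metadata, 14 movable objects
and one unmovable object per slab) and asks how many slabs end up
holding something unmovable.

```python
SLOTS_PER_SLAB = 16  # 64k compound page split into 4k objects
META_SLOTS = 1       # assumption from the message: one slot for metadata


def fill(n_slabs):
    """Build n_slabs slabs, each with 14 movable and 1 unmovable object."""
    slabs = []
    for _ in range(n_slabs):
        slab = ["meta"] * META_SLOTS + ["movable"] * 14 + ["unmovable"]
        slabs.append(slab)
    return slabs


def pinned_slabs(slabs):
    """Slabs that cannot be freed as a whole while unmovable data lives."""
    return sum(1 for slab in slabs if "unmovable" in slab)


slabs = fill(1000)
total_slots = len(slabs) * SLOTS_PER_SLAB
unmovable_objects = sum(slab.count("unmovable") for slab in slabs)
print(unmovable_objects / total_slots, pinned_slabs(slabs) / len(slabs))
```

In this model 6.25% of the memory is unmovable, yet every slab is
pinned by it.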

MfG
        Goswin
-
To: Goswin von Brederlow <brederlo@...>
Cc: Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Saturday, September 15, 2007 - 6:30 pm

1. It helps provide a few guarantees: when you run "/usr/bin/free"
you won't get a random number, but a strong _guarantee_. That ram will
be available no matter what.

With variable order page size you may run oom by mlocking some half
free ram in pagecache backed by largepages. "free" becomes a fake
number provided by a weak design.

Apps and admin need to know for sure the ram that is available to be
able to fine-tune the workload to avoid running into swap but while

2. yes, slab can indeed be freed to release an excessive number of 64k
   pages pinned by an insignificant number of small objects. I already
   told Mel even at the VM summit that the slab defrag can pay off
   regardless, and this is nothing new, since it will pay off even

There's not just 1 4k object in the system... The whole point is to
make sure all those 4k objects go into the same 64k page. This way
for you to be able to reproduce Nick's worst case scenario you have to

Movable? I rather assume all slab allocations aren't movable. Then
slab defrag can try to tackle on users like dcache and inodes. Keep in
mind that with the exception of updatedb, those inodes/dentries will
be pinned and you won't move them, which is why I prefer to consider

The entire slab being full is a perfect scenario. It means zero memory


If with slabs you mean slab/slub, I can't follow, there has never been
a single byte of userland memory allocated there since ever the slab

I guess you're confusing the config-page-shift design with the sgi
design where userland memory gets mixed with slab entries in the same
64k page... Also with config-page-shift the userland pages will all be
64k.

Things will get more complicated if we later decide to allow
kmalloc(4k) pagecache to be mapped in userland instead of only being
available for reads. But then we can restrict that to a slab and to
make it relocatable by following the ptes. That will complicate things
a lot.

But the whole point is that you don't need all that compl...
To: Andrea Arcangeli <andrea@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 9:54 am

This and other comments in your reply show me that you completely
misunderstood what I was talking about.

Look at
http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg

The red dots (pinned) are dentries, page tables, kernel stacks,
whatever kernel stuff, right?

The green dots (movable) are mostly userspace pages being mapped
there, right?

What I was referring to is that because movable objects (green dots)
aren't moved out of a mixed group (the boxes) when some unmovable
object needs space all the groups become mixed over time. That means
the unmovable objects are spread out over all the ram and the buddy
system can't recombine regions when unmovable objects free them. There
will nearly always be some movable objects in the other buddy. The
system of having unmovable and movable groups breaks down and becomes
useless.
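
A minimal sketch of why spreading hurts, under simplifying assumptions:
memory is 1024 pages, every movable page has already been freed, and a
buddy block of order o is free only if none of its 2**o aligned pages
is pinned. The function name and the two layouts are illustrative only.

```python
def largest_free_order(n_pages, pinned):
    """Largest order with a fully unpinned, aligned block (-1 if none).

    n_pages is assumed to be a power of two.
    """
    order = n_pages.bit_length() - 1
    while order >= 0:
        size = 1 << order
        for start in range(0, n_pages, size):
            if not any(p in pinned for p in range(start, start + size)):
                return order
        order -= 1
    return -1


N_PAGES = 1024
grouped = set(range(16))              # 16 pinned pages kept together
spread = {i * 64 for i in range(16)}  # same 16 pages, one per 64-page group

print(largest_free_order(N_PAGES, grouped))
print(largest_free_order(N_PAGES, spread))
```

The same 16 pinned pages leave an order-9 block free when grouped but
only order-5 blocks when spread across all groups.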


I'm assuming here that we want the possibility of larger order pages
for unmovable objects (large continuous regions for DMA for example)
than the smallest order user space gets (or any movable object). If
mmap() still works on 4k page boundaries then those will fragment all
regions into 4k chunks in the worst case.

Obviously if userspace has a minimum order of 64k chunks then it will
never break any region smaller than 64k chunks and will never cause a
fragmentation catastrophe. I know that is very roughly your approach
(make order 0 bigger), and I like it, but it has some limits as to how
big you can make it. I don't think my system with 1GB ram would work
so well with 2MB order 0 pages. But I wasn't referring to that but to
the picture.

MfG
        Goswin
-
To: Goswin von Brederlow <brederlo@...>
Cc: Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 11:08 am

What does the large square represent here? A "largepage"? If yes,

If the largepage is the square, there can't be red pixels mixed with
green pixels with the config-page-shift design, this is the whole
difference...

Zooming in I see red pixels all over the squares mixed with green
pixels in the same square. This is exactly what happens with the
variable order page cache and that's why it provides zero guarantees

If I understood correctly, here you agree that mixing movable and
unmovable objects in the same largepage is a bad thing, and that's
incidentally what config-page-shift prevents. It avoids it instead of
undoing the mixture later with defrag when it's far too late for

With config-page-shift mmap works on 4k chunks but it's always backed
by 64k or any other largesize that you chose at compile time. And if
the virtual alignment of mmap matches the physical alignment of the
physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
could use the 62nd bit of the pte to use a 64k tlb (if future cpus
will allow that). Nick also suggested to still set all ptes equal to

Yep, exactly this is what happens, it avoids that trouble. But as far
as fragmentation guarantees goes, it's really about keeping the
unmovable out of our way (instead of spreading the unmovable all over
the buddy randomly, or with ugly
boot-time-fixed-numbers-memory-reservations) than to map largepages in
userland. In fact, as I said, we could map kmalloced 4k entries in
userland to save memory if we would really want to hurt the fast paths
to make a generic kernel to use on smaller systems, but that would be
very complex. Since those 4k entries would be 100% movable (not like
the rest of the slab, like dentries and inodes etc..) that wouldn't
make the design less reliable, it'd still be 100% reliable and
performance would be ok because that memory is userland memory, we've

Sure! 2M is sure way excessive for a 1G system, 64k most certainly
too, of course unless you're running a db or a multim...
To: Andrea Arcangeli <andrea@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 5:08 pm

hmm, it was a long time ago. This one looks like 4MB large pages so

Yes. I can enforce a similar situation but didn't because the evacuation
costs could not be justified for hugepage allocations. Patches to do such
a thing were prototyped a long time ago and abandoned based on cost. For

This picture is not grouping pages by mobility so that is hardly a
surprise. This picture is not running grouping pages by mobility. This is
what the normal kernel looks like. Look at the videos in
http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how list-based
compares to vanilla. These are from February when there was less control
over mixing blocks than there is today.

In the current version mixing occurs in the lower blocks as much as possible
not the upper ones. So there are a number of mixed blocks but the number is
kept to a minimum.

The number of mixed blocks could have been enforced as 0, but I felt it was
better in the general case to fragment rather than regress performance.
That may be different for large blocks where you will want to take the

We don't take defrag steps at the moment at all. There are memory
compaction patches but I'm not pushing them until we can prove they are

As I said elsewhere, you can try this easily on top of grouping pages by
mobility. They are not mutually exclusive and you'll have a comparison

Ok, get it implemented so and we'll try it out because we're just hand-waving
here and not actually producing anything to compare. It'll be interesting
to see how it works out for large blocks and hugepages (although I expect
the latter to fail unless grouping pages by mobility is in place).  Ideally,
they'll complement each other nicely but only ever having mixing take place
at the 64KB boundary. I have the testing setup necessary for checking
out hugepages at least and I hope to put together something that tests
large blocks as well. Minimally, running the hugepage allocation tests

-- 
Mel Gorman
Part-time Phd Student                          L...
To: Mel Gorman <mel@...>
Cc: Andrea Arcangeli <andrea@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 6:48 pm

I agree that 0 is a bad value. But so is infinity. There should be
some mixing but not a lot. You say "kept to a minimum". Is that
actively done or already happens by itself. Hopefully the latter which

But would mapping a random 4K page out of a file then consume 64k?
That sounds like an awful lot of internal fragmentation. I hope the
unaligned bits and pieces get put into a slab or something as you

It is too bad that existing amd64 CPUs only allow such large physical
pages. But it kind of makes sense to cut away a full level or page

rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the
occasional mplayer.

I would mostly be concerned how rtorrent's totally random access of
mmapped files negatively impacts such a 64k page system.
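
The worry can be put into numbers with a small sketch. It assumes a
hypothetical access pattern (one 4k-sized spot touched per MiB of a
mapped file) and a cache that can only hold whole pages:

```python
def resident_bytes(touched_offsets, page_size):
    """Page cache needed to hold every touched offset, in whole pages."""
    pages = {off // page_size for off in touched_offsets}
    return len(pages) * page_size


# Assumed pattern: 100 scattered touches, one per MiB of the file.
touched = [i * 1024 * 1024 for i in range(100)]

small = resident_bytes(touched, 4 * 1024)
large = resident_bytes(touched, 64 * 1024)
print(small, large, large // small)
```

For fully scattered access the resident size grows by the ratio of the
page sizes, here a factor of 16. Access with more locality would see
less amplification.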

MfG
     Goswin
-
To: Goswin von Brederlow <brederlo@...>
Cc: Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Monday, September 17, 2007 - 5:30 am

Happens by itself due to biasing mixing blocks at lower PFNs. The exact



For what it's worth, the last allocation failure that occurred with
grouping pages by mobility was order-1 atomic failures for a wireless
network card when bittorrent was running. You're likely right in that
torrents will be an interesting workload in terms of fragmentation.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To: Andrea Arcangeli <andrea@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 1:46 pm

I have been toying with the idea of having separate caches for pinned
and movable dentries.  Downside of such a patch would be the number of
memcpy() operations when moving dentries from one cache to the other.
Upside is that a fair amount of slab cache can be made movable.
memcpy() is still faster than reading an object from disk.

Most likely the current reaction to such a patch would be to shoot it
down due to overhead, so I didn't pursue it.  All I have is an old patch
to separate never-cached from possibly-cached dentries.  It will
increase the odds of freeing a slab, but provide no guarantee.

But the point here is: dentries/inodes can be made movable if there are
clear advantages to it.  Maybe they should?

Jörn

-- 
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html
-
To: Jörn Engel <joern@...>
Cc: Andrea Arcangeli <andrea@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 2:15 pm

Totally inappropriate.

I bet 99% of all "dentry_lookup()" calls involve turning the last dentry 
from having a count of zero ("movable") to having a count of 1 ("pinned").

So such an approach would fundamentally be broken. It would slow down all 
normal dentry lookups, since the *common* case for leaf dentries is that 
they have a zero count.

So it's much better to do it on a "directory/file" basis, on the 
assumption that files are *mostly* movable (or just freeable). The fact 
that they aren't always (ie while kept open etc), is likely statistically 
not all that important.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Jörn <joern@...>, Andrea Arcangeli <andrea@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 2:21 pm

My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
which are pinned for their entire lifetime and another for regular
files/inodes.  One could take a three-way approach and have
always-pinned, often-pinned and rarely-pinned.

We won't get never-pinned that way.

Jörn

-- 
The wise man seeks everything in himself; the ignorant man tries to get
everything from somebody else.
-- unknown
-
To: Jörn Engel <joern@...>
Cc: Andrea Arcangeli <andrea@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 2:44 pm

That sounds pretty good. The problem, of course, is that most of the time, 
the actual dentry allocation itself is done before you really know which 
case the dentry will be in, and the natural place for actually giving the 
dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate" 
it with d_add() or d_instantiate().

But it turns out that most of the filesystems we care about already use a 
special case of "d_add()" that *already* replaces the dentry with another 
one in some cases: "d_splice_alias()".

So I bet that if we just taught "d_splice_alias()" to look at the inode, 
and based on the inode just re-allocate the dentry to some other slab 
cache, we'd already handle a lot of the cases!

And yes, you'd end up with the reallocation overhead quite often, but at 
least it would now happen only when filling in a dentry, not in the 
(*much* more critical) cached lookup path.
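
A toy model of the idea, not kernel code: the classes, the cache labels
and the splice() helper are hypothetical stand-ins used only to show
where the reallocation cost lands. A dentry starts in a generic cache,
is moved once when its inode becomes known, and is never touched by the
cached lookup path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inode:
    is_dir: bool


@dataclass
class Dentry:
    name: str
    cache: str = "generic"  # which slab cache the dentry lives in
    inode: Optional[Inode] = None


def splice(dentry, inode, stats):
    """Fill-in step: choose a cache from the inode, reallocate if needed."""
    target = "dir_cache" if inode.is_dir else "file_cache"
    if dentry.cache != target:
        dentry = Dentry(dentry.name, cache=target)  # models alloc + copy
        stats["reallocs"] += 1
    dentry.inode = inode
    return dentry


def cached_lookup(table, name, stats):
    """Hot path: no cache decision and no copy."""
    stats["lookups"] += 1
    return table[name]


stats = {"reallocs": 0, "lookups": 0}
table = {}
for name, is_dir in (("etc", True), ("passwd", False), ("hosts", False)):
    table[name] = splice(Dentry(name), Inode(is_dir), stats)

for _ in range(100):
    for name in table:
        cached_lookup(table, name, stats)

print(stats)
```

In this model the copy happens once per dentry at fill-in time, however
many cached lookups follow.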

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Jörn <joern@...>, Andrea Arcangeli <andrea@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 23, 2007 - 1:44 pm

There may be another approach.  We could create a never-pinned cache,
without trying hard to keep it full.  Instead of moving a hot dentry at
dput() time, we move a cold one from the end of lru.  And if the lru
list is short, we just chicken out.

Our definition of "short lru list" can either be based on a ratio of
pinned to unpinned dentries or on a metric of cache hits vs. cache
misses.  I tend to dislike the cache hit metric, because updatedb would
cause tons of misses and result in the same mess we have right now.

With this double cache, we have a source of slabs to cheaply reap under
memory pressure, but still have a performance advantage (memcpy beats
disk io by orders of magnitude).
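
A minimal sketch of the double-cache policy, assuming the ratio-based
definition of a "short" LRU. The function name, the max_ratio threshold
and the batch size are hypothetical; the OrderedDict keeps the coldest
entry first.

```python
from collections import OrderedDict


def migrate_cold(lru, never_pinned, pinned_count, max_ratio=1.0, batch=2):
    """Move up to `batch` cold dentries into the never-pinned cache.

    Returns the number moved; 0 means the LRU was too short to bother.
    """
    if not lru or pinned_count / len(lru) > max_ratio:
        return 0  # short LRU list: chicken out
    moved = 0
    while lru and moved < batch:
        name, dentry = lru.popitem(last=False)  # coldest entry first
        never_pinned[name] = dentry             # models the memcpy
        moved += 1
    return moved


lru = OrderedDict((f"d{i}", object()) for i in range(10))
never_pinned = {}
moved = migrate_cold(lru, never_pinned, pinned_count=5)

short_lru = OrderedDict((f"s{i}", object()) for i in range(2))
skipped = migrate_cold(short_lru, {}, pinned_count=5)
print(moved, skipped)
```

Slabs backing never_pinned hold only unpinned entries in this model, so
they can be reaped without touching a pinned dentry.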

Jörn

-- 
The story so far:
In the beginning the Universe was created.  This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams
-
To: Linus Torvalds <torvalds@...>
Cc: Joern Engel <joern@...>, Andrea Arcangeli <andrea@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 6:51 pm

You would only get it for dentries that live long (or your prediction
is awfully wrong) and then the reallocation amortizes over time if you
will. :)

MfG
        Goswin
-
To: Joern Engel <joern@...>
Cc: Andrea Arcangeli <andrea@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 6:06 pm

How probable is it that the dentry is needed again? If you copy it and
it is not needed then you wasted time. If you throw it out and it is
needed then you wasted time too. Depending on the probability one of
the two is cheaper overall. Ideally I would throw away dentries that
haven't been accessed recently and copy recently used ones.
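
The trade-off can be written as an expected-cost comparison. This is a
sketch under assumed costs: copying always costs copy_cost, while
dropping costs reload_cost only when the dentry is needed again. The
example figures (1 unit per copy, 5000 units per disk read) are
illustrative, not measurements.

```python
def cheaper_to_copy(p_reuse, copy_cost, reload_cost):
    """True when copying beats dropping, given reuse probability p_reuse."""
    return copy_cost < p_reuse * reload_cost


def break_even(copy_cost, reload_cost):
    """Reuse probability above which copying is the cheaper choice."""
    return copy_cost / reload_cost


print(break_even(1.0, 5000.0))
print(cheaper_to_copy(0.01, 1.0, 5000.0))
print(cheaper_to_copy(0.0001, 1.0, 5000.0))
```

With costs this lopsided, copying wins unless a dentry is almost never
needed again, which is the case for entries that have not been accessed
recently.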

How much of a system's ram is spent on dentries? How much on task
structures? Does anyone have some stats on that? If it is <10% of the
total ram combined then I don't see much point in moving them. Just
keep them out of the way of users memory so the buddy system can work

MfG
        Goswin
-
To: Goswin von Brederlow <brederlo@...>
Cc: Joern Engel <joern@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 6:40 pm

As usual, the answer is "it depends".  I've had up to 600MB in dentry
and inode slabs on a 1GiB machine after updatedb.  This machine
currently has 13MB in dentries, which seems to be reasonable for my
purposes.

Jörn

-- 
Audacity augments courage; hesitation, fear.
-- Publilius Syrus
-
To: Andrea Arcangeli <andrea@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 2:15 pm

Except now as I've repeatedly pointed out, you have internal fragmentation
problems. If we went with the SLAB, we would need 16MB slabs on PowerPC for
example to get the same sort of results and a lot of copying and moving when

Nothing stops you altering the PAGE_SIZE so that large blocks work in
the way you envision and keep grouping pages by mobility for huge page
sizes.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To: Mel Gorman <mel@...>
Cc: Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 2:50 pm

Well not sure about the 16MB number, since I'm unsure what the size of
the ram was. But clearly I agree there are fragmentation issues in the
slab too, there have always been, except they're much less severe, and
the slab is meant to deal with that regardless of the PAGE_SIZE. That
is not a new problem, you are introducing a new problem instead.

We can do a lot better than slab currently does without requiring any
defrag move-or-shrink at all.

slab is trying to defrag memory for small objects at nearly zero cost,
by not giving pages away randomly. I thought you agreed that solving
the slab fragmentation was going to provide better guarantees when in
another email you suggested that you could start allocating order > 0
pages in the slab to reduce the fragmentation (to achieve most of the
guarantee provided by config-page-shift, but while still keeping the

You ignore one other bit, when "/usr/bin/free" says 1G is free, with
config-page-shift it's free no matter what, same goes for not mlocked
cache. With variable order page cache, /usr/bin/free becomes mostly a
lie as long as there's no 4k fallback (like fsblock).

And most important you're only tackling on the pagecache and I/O
performance with the inefficient I/O devices, the whole kernel has no
chance to get a speedup, in fact you're making the fast paths slower,
just the opposite of config-page-shift and original Hugh's large
PAGE_SIZE ;).
-
To: Andrea Arcangeli <andrea@...>
Cc: Mel Gorman <mel@...>, Goswin von Brederlow <brederlo@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 16, 2007 - 6:56 pm

% free
             total       used       free     shared    buffers     cached
Mem:       1398784    1372956      25828          0     225224     321504
-/+ buffers/cache:     826228     572556
Swap:      1048568         20    1048548

When has free ever given any useful "free" number? I can perfectly
fine allocate another gigabyte of memory despite free saying 25MB. But
that is because I know that the buffer/cached are not locked in.
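
The "-/+ buffers/cache" row in the output above is plain arithmetic on
the "Mem:" row, as this sketch shows. The values are copied from that
output (in KiB); the variable names are mine.

```python
# Values from the "Mem:" row of the free output above, in KiB.
total, used, free = 1398784, 1372956, 25828
buffers, cached = 225224, 321504

# What applications hold once reclaimable buffers and cache are excluded.
used_by_apps = used - buffers - cached

# What could be handed out if buffers and cache were dropped.
available = free + buffers + cached

print(used_by_apps, available)
```

The second figure treats all buffers and cache as reclaimable, which is
the assumption being argued about in this thread.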

On the other hand 1GB can instantly vanish when I start a xen domain
and anything relying on the free value would lose.


The only sensible thing for an application concerned with swapping is
to watch the swapping and then shrink itself, not to watch the amount
free. Although I wish there were some kernel interface to get a
pressure value of how valuable free pages would be right now. I would
like that for fuse so a userspace filesystem can do caching without
crippling the kernel.
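
A rough sketch of what "watch the swapping" could look like from
userspace, assuming the cumulative pswpin/pswpout page counters in
/proc/vmstat are an acceptable signal. The helper names and the sample
values are made up for illustration; a real program would sample
read_vmstat() periodically and pick its own threshold.

```python
def parse_swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swapped_since(before, after):
    """Pages swapped in plus pages swapped out between two samples."""
    return sum(after.get(k, 0) - before.get(k, 0) for k in ("pswpin", "pswpout"))


def read_vmstat(path="/proc/vmstat"):
    """Read the live counters (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical samples standing in for two reads of /proc/vmstat.
before = parse_swap_counters("nr_free_pages 6457\npswpin 5\npswpout 20\n")
after = parse_swap_counters("nr_free_pages 6001\npswpin 5\npswpout 148\n")

if swapped_since(before, after) > 0:
    print("swap activity detected, shrink the cache")
```

This only reacts after swapping has started, which is exactly why a
pressure value from the kernel would be nicer.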

MfG
        Goswin
-
To: Goswin von Brederlow <brederlo@...>
Cc: Mel Gorman <mel@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Tuesday, September 18, 2007 - 3:31 pm

Well, as you said, you know that buffers/cached are not locked in. If
/proc/meminfo were rubbish, as you seem to imply in the first
line, why would we ever bother to export that information and even


Repeated drop caches + free can help.
-
To: Andrea Arcangeli <andrea@...>
Cc: Goswin von Brederlow <brederlo@...>, Mel Gorman <mel@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Sunday, September 23, 2007 - 2:56 am

As a user I know it because I didn't put a kernel source into /tmp. A

Xen has its own memory pool and can quite aggressively reclaim memory
from dom0 when needed. I just meant to say that the number in
/proc/meminfo can change in a second so it is not much use knowing

I would kill any program that does that to find out how much free RAM
the system has.

MfG
        Goswin
-
To: Goswin von Brederlow <brederlo@...>
Cc: Mel Gorman <mel@...>, Andrew Morton <akpm@...>, Joern Engel <joern@...>, Nick Piggin <nickpiggin@...>, Christoph Lameter <clameter@...>, <torvalds@...>, <linux-fsdevel@...>, <linux-kernel@...>, Christoph Hellwig <hch@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>, Fengguang Wu <fengguang.wu@...>, swin wang <wangswin@...>, <totty.lu@...>, <hugh@...>
Date: Monday, September 24, 2007 - 11:39 am

Various apps require you (admin/user) to tune the size of their

The whole point is if there's not enough ram of course... this is why

The numbers will change depending on what's running on your
system. It's up to you to know; plus, I normally keep vmstat monitored
in the background to see how the cache/free levels change over

The admin should do that if he's unsure, not a program of course!
-
To: Andrea Arcangeli <andrea@...>