Re: [00/17] Large Blocksize Support V3

To: <linux-kernel@...>
Cc: Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Tuesday, April 24, 2007 - 6:21 pm

V2->V3
- More restructuring
- It actually works!
- Add XFS support
- Fix up UP support
- Work out the direct I/O issues
- Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
  back to constants. Disabled for 32bit and HIGHMEM configurations.
  This also allows a gradual migration to the new page cache
  inline functions. LARGE_BLOCKSIZE capabilities can be
  added gradually and if there is a problem then we can disable
  a subsystem.

V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.

This patchset modifies the Linux kernel so that larger block sizes than
page size can be supported. Larger block sizes are handled by using
compound pages of an arbitrary order for the page cache instead of
single pages with order 0.
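
To make the idea concrete, here is a minimal userspace sketch (not kernel code) in which the page-cache unit size is a per-mapping function of the compound page order instead of a compile-time constant. The struct, the helper names, and the 4k base page are assumptions for illustration; they are not the names used by the patchset.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SHIFT 12u  /* assumed 4k base page, as on x86_64 */

/* Hypothetical stand-in for a mapping that records its page order. */
struct mapping_sketch {
    unsigned int order;      /* 0 = one base page, 2 = four base pages, ... */
};

static inline unsigned int cache_shift(const struct mapping_sketch *m)
{
    assert(m->order < 20);   /* keep the shift well inside 64 bits */
    return BASE_PAGE_SHIFT + m->order;
}

static inline uint64_t cache_size(const struct mapping_sketch *m)
{
    return UINT64_C(1) << cache_shift(m);
}

/* Index of the page-cache unit that holds byte offset pos. */
static inline uint64_t cache_index(const struct mapping_sketch *m, uint64_t pos)
{
    return pos >> cache_shift(m);
}

/* Byte offset of pos inside its page-cache unit. */
static inline uint64_t cache_offset(const struct mapping_sketch *m, uint64_t pos)
{
    return pos & (cache_size(m) - 1);
}
```

With order 0 these helpers reduce to the familiar fixed 4k arithmetic, which is roughly what "the inlines revert back to constants" would mean when the option is disabled.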

Rationales:

1. We have problems supporting devices with a higher blocksize than
   page size. This is important, for example, to support CDs and DVDs that
   can only read and write 32k or 64k blocks. We currently have a shim
   layer in there to deal with this situation which limits the speed
   of I/O. The developers are currently looking for ways to completely
   bypass the page cache because of this deficiency.

2. 32/64k blocksize is also used in flash devices. Same issues.

3. Future harddisks will support bigger block sizes that Linux cannot
   support since we are limited to PAGE_SIZE. Ok the on board cache
   may buffer this for us but what is the point of handling smaller
   page sizes than what the drive supports?

4. Reduce fsck times. Larger block sizes mean faster file system checking.

5. Performance. If we look at IA64 vs. x86_64 then it seems that the
   faster interrupt handling on x86_64 compensates for the speed loss due to
   a smaller page size (4k vs 16k on IA64). Supporting larger block
   sizes on all architectures allows a significant reduction in I/O overhead and increases
   the size of I/O that can be performed by hardware in a single r...
To: <clameter@...>
Cc: <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>
Date: Saturday, April 28, 2007 - 12:39 pm

Hi. 

	I have a few questions about that patchset:

1) Is it possible for a block device to assume that it will always get big
requests (aligned to the big blocksize)?

2) Does metadata reading/writing also occur using the same big blocksize?

3) If so, how are __bread/__getblk affected? Does the returned buffer_head point
to the whole block?

And what do you think about my design?
I want to link the parts of a compound page through buffer_heads,
so the head page's bh points to the second page's (tail page's) bh, and from this
bh it is possible to reference the page itself, and so on.
(This would allow a compound page to be physically fragmented.)


Best regards, 
	Maxim Levitsky

PS:

I ask these questions because this patchset matters to me. I would really like to see
this <= 4K limit lifted (all software limits are bad)
and finally get good packet writing... I miss DirectCD much...
Although >4K blocksizes are really only the first step.
To make really fast packet writing, the UDF filesystem should be rewritten as
well.
-
To: Maxim Levitsky <maximlevitsky@...>
Cc: <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>
Date: Monday, April 30, 2007 - 1:23 am

That is one of the key problems. We hope that Mel Gorman's antifrag work 


Correct.

-
To: <clameter@...>
Cc: <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Thursday, April 26, 2007 - 10:04 pm

Something I was looking for but couldn't find: suppose an application takes
a pagefault against the third 4k page of an order-2 pagecache "page".  We
need to instantiate a pte against find_get_page(offset/4)+3.  But these
patches don't touch mm/memory.c at all and filemap_nopage() appears to
return the zeroth 4k page all the time in that case.

So.. what am I missing, and how does that part work?
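
The arithmetic behind that question can be sketched as follows: a 4k-granular file page offset splits into the index of the compound page in the page cache and the 4k page inside it that the pte would have to point at. The struct and function names are hypothetical, and this is plain userspace C, not the fault path itself.

```c
#include <assert.h>
#include <stdint.h>

/* Result of splitting a 4k-granular file page offset for an order-N cache. */
struct subpage_ref {
    uint64_t head_index;   /* index of the compound page in the page cache */
    unsigned int tail;     /* which 4k page inside it (0 = head page) */
};

static struct subpage_ref locate_subpage(uint64_t pgoff_4k, unsigned int order)
{
    struct subpage_ref r;

    assert(order < 32);
    r.head_index = pgoff_4k >> order;
    r.tail = (unsigned int)(pgoff_4k & ((UINT64_C(1) << order) - 1));
    return r;
}
```

For an order-2 cache, 4k page offset 11 lands in compound page 2 at tail 3, which corresponds to the find_get_page(offset/4)+3 expression above.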



Also, afaict your important requirements would be met by retaining
PAGE_CACHE_SIZE=4k and simply ensuring that pagecache is populated by
physically contiguous pages - so instead of allocating and adding one 4k
page, we allocate an order-2 page and sprinkle all four page*'s into the
radix tree in one hit.  That should be fairly straightforward to do, and
could be made indistinguishably fast from doing a single 16k page for some
common pagecache operations (gang-insert, gang-lookup).

The BIO and block layers will do-the-right-thing with that pagecache and
you end up with four times more data in the SG lists, worst-case.
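
A rough userspace sketch of that alternative: allocate one physically contiguous run of 2^order pages and insert each page pointer at consecutive cache indexes in one pass. A flat array stands in for the radix tree, and every name here is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 64   /* toy cache: a flat array stands in for the radix tree */

struct page_sketch { unsigned long pfn; };   /* hypothetical page descriptor */

/*
 * Insert all 2^order pages of one physically contiguous allocation at
 * consecutive cache indexes. Returns 0 on success, or -1 if the range is
 * out of bounds or any slot is already populated (nothing is inserted).
 */
static int gang_insert(struct page_sketch *slots[CACHE_SLOTS],
                       struct page_sketch *pages, unsigned int order,
                       size_t first_index)
{
    size_t n = (size_t)1 << order;

    assert(order < 6);
    if (first_index + n > CACHE_SLOTS)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (slots[first_index + i] != NULL)
            return -1;
    for (size_t i = 0; i < n; i++)
        slots[first_index + i] = &pages[i];
    return 0;
}
```

Each slot still refers to an ordinary 4k page, so lookups by 4k index keep working unchanged; only the allocation and insertion are batched.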
-
To: Andrew Morton <akpm@...>
Cc: <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Thursday, April 26, 2007 - 10:27 pm

Sure, that addresses the larger I/O side of things, but it doesn't address
the large filesystem blocksize issues that can only be solved with some kind
of page aggregation abstraction. Compound pages and high order page cache
indexing solves this extremely neatly, regardless of whether the compound
page is contiguous or not.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To: David Chinner <dgc@...>
Cc: <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Thursday, April 26, 2007 - 10:53 pm

a) That wasn't a part of Christoph's original rationale list, so forgive
   me for thinking it is not so important and got snuck in post-facto when
   things got tough.

b) I don't immediately see why a filesystem cannot implement larger
   blocksizes via this scheme - instantiate and lock four pages and go for

We cannot say anything about neatness until we've seen mmap.
-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 12:20 am

I've been pushing Christoph to do something like this for more than a year
purely so we can support large block sizes in XFS. He's got other reasons
for wanting to do this, but that doesn't mean that the large filesystem

So now how do you get block aligned writeback? Or make sure that truncate
doesn't race on a partial *block* truncate? You basically have to
jump through nasty, nasty hoops, to handle corner cases that are introduced
because the generic code can no longer reliably lock out access to a
filesystem block.

Eventually you end up with something like fs/xfs/linux-2.6/xfs_buf.c and
doing everything inside the filesystem because it's the only sane
way to serialise access to these aggregated structures. This is
the way XFS used to work in its data path, and we all know how long
and loud people complained about that.....

A filesystem specific aggregation mechanism is not a palatable solution
here because it drives filesystems away from being able to use generic
code. 

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To: David Chinner <dgc@...>
Cc: <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 1:15 am

in writeback and pageout:

	if (page->index & mapping->block_size_mask)


I would expect we could (should) implement this in generic code by
modifying the existing stuff.
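
One possible reading of the check above, sketched in userspace C: if block_size_mask is taken to be pages-per-block minus one (an assumption, since the surrounding text is cut off), then a nonzero result means the page is not the first page of its filesystem block, and the generic code can find the block's first page by masking. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* pages_per_block must be a power of two; the mask is pages_per_block - 1. */
static uint64_t block_mask(uint64_t pages_per_block)
{
    assert(pages_per_block != 0 &&
           (pages_per_block & (pages_per_block - 1)) == 0);
    return pages_per_block - 1;
}

/* True when the page is not the first page of its filesystem block. */
static bool is_block_tail(uint64_t page_index, uint64_t mask)
{
    return (page_index & mask) != 0;
}

/* Index of the first page of the block containing page_index. */
static uint64_t block_first_page(uint64_t page_index, uint64_t mask)
{
    return page_index & ~mask;
}
```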

I'm not saying it's especially simple, nor fast.  But it has the advantage
that we're not forced to use larger pages with _their_ attendant performance
problems.

And it will benefit all filesystems immediately.

And it doesn't introduce a rather nasty hack of pretending (in some places)
that pages are larger than they really are.

And it has the very significant advantage that it doesn't introduce brand
new concepts and some complexity into core MM.

And make no mistake: the latter disadvantage is huge.  Because if we do the
PAGE_CACHE_SIZE hack (sorry, but it _is_), we have to do it *for ever*. 
Maintaining and enhancing core MM and VFS becomes harder and more costly
and slower and more buggy *for ever*.  The ramp for people to become
competent on core MM becomes longer.  Our developer pool becomes smaller, and
proportionally less skilled.

And hardware gets better.  If Intel & AMD come out with a 16k pagesize
option in a couple of years we'll look pretty dumb.  If the problems which
you're presently having with that controller get sorted out in the next
generation of the hardware, we'll also look pretty dumb.

As always, there are tradeoffs.  We can see the cons, and they are very
significant.  We don't yet know the pros.  Perhaps they will be similarly
significant.  But I don't believe that the larger PAGE_CACHE_SIZE hack
(sorry) is the only way in which they can be realised.

-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 12:55 pm

Unfortunately, this isn't a problem with hardware getting better, but
a willingness to break backwards compatibility.

x86_64 uses a 4k page size to avoid breaking 32-bit applications.  And
unfortunately, iirc, even 64-bit applications are continuing to depend
on 4k page alignments for things like the text and bss segments.  If
the userspace ELF and other compiler/linker specifications were
appropriately written so they could handle 16k pagesizes, maybe 5 years
from now we could move to a 16k pagesize.  But this is going to
require some coordination between the userspace binutils folks and
AMD/Intel in order to plan such a migration.

							- Ted
-
To: Theodore Tso <tytso@...>
Cc: Andrew Morton <akpm@...>, David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 1:32 pm

The AMD64 psABI requires binaries to work with any page size up to 64k.

Whether that's true in practice is another matter entirely, of course.

-- 
Nicholas Miell <nmiell@comcast.net>

-
To: Nicholas Miell <nmiell@...>
Cc: Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 2:12 pm

64-bit applications are a non-issue. The ABI requires them to handle it.
It's 32-bit applications on x86-64 that are the concern. ABI emulation
for them is more involved when 4K 32-bit ABIs are to be emulated on
kernels compiled for larger native pagesizes. In practice, so many use
getpagesize() it may not be much of an issue, but consider WINE and
other sorts of non-Linux-native 32-bit apps, among other issues.
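
As a small illustration of the getpagesize() point, an application that queries the page size at run time, rather than hard-coding 4096, keeps working if the kernel's page size changes. This sketch uses the POSIX sysconf(_SC_PAGESIZE) call; the helper names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Query the page size at run time instead of assuming 4096. */
static size_t runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);

    assert(sz > 0);
    return (size_t)sz;
}

/* Round len up to a whole number of pages of the given power-of-two size. */
static size_t round_up_to_page(size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (len + page_size - 1) & ~(page_size - 1);
}
```

Code written this way computes mapping lengths and alignments from whatever the kernel reports, which is the behaviour the 64-bit ABI discussion above relies on.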


-- wli
-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 2:09 am

So we might do writeback on one page in N - how do we
make sure none of the other pages are reclaimed while we are doing
writeback on this block?

IOWs, we have to lock every page in the block, mark them all as
writeback, etc. Instead of doing something once, we have
to repeat it for every page in the block. This is better than a compound

And the locking order? How do you enforce *kernel wide* the
same locking order for all pages in the same block so that we
don't get ABBA deadlocks on page locks within a block?
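
One conventional answer to the ordering question, sketched here in userspace C under assumed names: every path locks the pages of a block in ascending index order starting from the block's first page, regardless of which page it started from. Two paths that enter the same block through different pages then request the locks in the same sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill out[] with the page indexes of the block containing page_index,
 * in the order they should be locked: always ascending from the first
 * page of the block, no matter which page the caller started from.
 * Returns the number of entries written.
 */
static size_t block_lock_order(uint64_t page_index, size_t pages_per_block,
                               uint64_t *out, size_t out_len)
{
    assert(pages_per_block != 0 &&
           (pages_per_block & (pages_per_block - 1)) == 0);
    assert(out_len >= pages_per_block);

    uint64_t first = page_index & ~(uint64_t)(pages_per_block - 1);
    for (size_t i = 0; i < pages_per_block; i++)
        out[i] = first + i;
    return pages_per_block;
}
```

The sketch only shows the ordering rule; enforcing it kernel-wide, which is the concern raised above, is a separate and harder problem.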



So you're suggesting that we reintroduce a buffer-oriented filesystem

So you'll take slow, inefficient and complex rather than use a
non-intrusive and /optional/ interface to large pages?

Words fail me......

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To: David Chinner <dgc@...>
Cc: <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 3:04 am

By marking them all dirty when one is marked dirty.

David, you're perfectly capable of working all this out yourself.  But

I already said it'd be a bit slower.  But given that those four pageframes




The optionality is useful - it at least means that we can easily remove it
all if/when it becomes obsolete.

-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 4:03 am

I've looked at all this but I'm trying to work out if anyone
else has looked at the impact of doing this. I have direct experience
with this form of block aggregation - this is pretty much what is
done in irix - and it's full of nasty, ugly corner cases.

I've got several year-old Irix bugs assigned that are hit every so
often where one page in the aggregated set has the wrong state, and
it's simply not possible to either reproduce the problem or work out
how it happened. The code has grown too complex and convoluted, and
by the time the problem is noticed (either by hang, panic or bug
check) the cause of it is long gone.

I don't want to go back to having to deal with this sort of problem
- I'd much prefer to have a design that does not make the same

No, I'm speaking from years of experience working on a
page/buffer/chunk cache capable of using both large pages and
aggregating multiple pages. It has, at times, almost driven me
insane and I don't want to go back there.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To: David Chinner <dgc@...>
Cc: Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, May 4, 2007 - 9:31 am

So the practical question is: was it a high-level design problem, or
was it simply an implementation issue?

Until we code review an implementation that does page aggregation for
Linux we can't say how nasty it would be.

Of course what gets confusing is when you refer to the
previous implementation as a buffer cache, because that isn't at all
what Linux had for a buffer cache.  The Linux buffer cache was the
same as the current page cache except it was indexed by block number and

The suggestion seems to be to always aggregate pages (to handle
PAGE_SIZE &lt; block size), and not to even worry about the fact
that it happens that the pages you are aggregating are physically
contiguous.  The memory allocator and the block layer can worry
about that.  It isn't something the page cache or filesystems
need to pay attention to.

I suspect the implementation in linux would be sufficiently different
that it would not be prone to the same problems.  Among other things
we already do most things on a range of page addresses, so we
would seem to have most of the infrastructure already.

It looks like if we extend the current batching a little more
so it covers all of the interesting cases. (read)

Ensure the dirty bit on all pages in the group when we set it on
one page.

Add re-read when we dirty the group if we don't have it all present.

Round the range we operate on up so we cleanly hit the beginning
and end of the group size.

Only issue the mapping operations on the first page in the group.

That is about what we would have to do to handle multiple pages in one
block in the page cache.  There are clearly more details, but as
a first approximation I don't see this being fundamentally more
complex than what we are currently doing.  Just taking into account a
few more details.
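
Two of the steps listed above, rounding the operated range out to group boundaries and dirtying every page of a group together, can be sketched in userspace C as follows. The 64-page bitmap and the function names are assumptions for illustration, not existing mm/filemap.c interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Round [*start, *end) outward to whole groups of pages_per_group pages. */
static void round_to_group(uint64_t *start, uint64_t *end,
                           uint64_t pages_per_group)
{
    assert(pages_per_group != 0 &&
           (pages_per_group & (pages_per_group - 1)) == 0);

    uint64_t mask = pages_per_group - 1;
    *start &= ~mask;
    *end = (*end + mask) & ~mask;
}

/*
 * Toy dirty tracking for up to 64 pages: bit i set means page i is dirty.
 * Dirtying one page sets the bit for every page in its group.
 */
static uint64_t dirty_group(uint64_t dirty_bits, unsigned int page,
                            unsigned int pages_per_group)
{
    assert(page < 64 && pages_per_group != 0 && pages_per_group <= 64 &&
           (pages_per_group & (pages_per_group - 1)) == 0);

    unsigned int first = page & ~(pages_per_group - 1);
    uint64_t group = (pages_per_group == 64)
        ? ~UINT64_C(0)
        : ((UINT64_C(1) << pages_per_group) - 1) << first;
    return dirty_bits | group;
}
```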



The whole physical continuity thing seems to come cleanly out of a
speculative page allocator, and that would seem to work and provide
improvements on smaller block sizes filesyste...
To: Eric W. Biederman <ebiederm@...>
Cc: David Chinner <dgc@...>, Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 12:58 am

Both. Too many things can happen asynchronously to a page, which
makes it just about impossible to predict all the potential race
conditions that are involved. Complexity arose from trying to fix

We already have an implementation - I've pointed it out several times
now: see fs/xfs/linux-2.6/xfs_buf.[ch].


performance problems in using discontiguous pages and needing to

Filesystems don't typically do this - they work on blocks and assume

Hmmm - we're not talking about using 64k block size filesystems to
store lots of little files or using them on small, slow disks.
We're looking at optimising for multi-petabyte filesystems with
multi-terabyte sized files sustaining throughput of tens to hundreds
of GB/s to/from hundreds to thousands of disks.

I certainly don't consider 64k block size filesystems as something
suitable for desktop use - maybe PVRs would benefit, but this
is not something you'd use for your kernel build environment on a
single disk in a desktop system....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To: David Chinner <dgc@...>
Cc: Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 2:56 am

Yes, but it isn't a generic implementation in mm/filemap.c,
it is a compatibility layer.  It lives with the current deficiencies
Always?

Ugh.  I just realized looking at the xfs code that it doesn't

But that is how mm/filemap.c works.  The calls into the filesystem
can be per multi-page group as they are currently per page.  The point
is that the existing in-kernel abstraction is already larger than a

Yes.  You are talking about only fixing the kernel for your giant
64K block filesystems that are only interesting on peta-byte arrays.

I am pointing out that the other fixes that have been discussed,
optimistic contiguous page allocation and a larger Linux scatter
gather list, are interesting on much smaller filesystem and machine
sizes where small files are still interesting, making them generally
better improvements for Linux.

If you only improve the giant peta-byte raid cases 99% of linux users
simply don't care, and so the code isn't very interesting.

Eric
-
To: Eric W. Biederman <ebiederm@...>, David Chinner <dgc@...>
Cc: Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 11:17 am

Just to stick my two cents in here:

The definition of what is meant by "large" filesystems has to change
with the advances in disk drive technology.  In the not too distant
past, a "large" single filesystem was 100 GB. There are now consumer
grade disks on the market with 1 TB available in a single unit.  I don't
know about you guys, but that scares the crap out of me, in terms of
dealing with that much space on a desktop machine.  Efficiently dealing
with transferring that much data on a desktop (never mind server) means
re-thinking the limitations of the I/O subsystems.  What was once the
realm of the data center is now the realm of the living room.  Large
data sets are becoming more commonplace (HD Movies, audio files, etc)
with each passing day, and there is no end in sight in the progression.
In addition, with the release of specs recently about larger sector
sizes for disk drives (2048 bytes, or larger), this is going to become a
pressing need for the general case, not just the extremely large
servers, or HPC machines and clusters.  Already there is no efficient
way to back up that much space, in a reasonable time, except to have
another disk of a similar or larger size to back up to.  Anything we can
do to make disk I/O *Faster* is a win.

I recognize that there is a huge issue in dealing with sub block size
files.  The trade off of small files VS large blocks is now a non
trivial problem.  Once disk sector sizes increase, the problems will
have to be dealt with in a more intelligent manner, possibly dividing
sectors into smaller logical blocks for small files? Maybe filesystems
that can understand multiple block sizes? 

Well, we do live in interesting times; we just have to make the most of
it.

Dan Weigert

To: Eric W. Biederman <ebiederm@...>
Cc: David Chinner <dgc@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, May 4, 2007 - 12:11 pm

And the 50% gain in the benchmarks means what? That the device 
manufacturers have to redesign their chips?

-
To: David Chinner <dgc@...>
Cc: <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 4:48 am

We're talking about two separate things here - let us not conflate them.

1: The arguably-crippled HBA which wants bigger SG lists.

2: The late-breaking large-blocksizes-in-the-fs thing.


None of this multiple-page-locking stuff we're discussing here is relevant
to the HBA performance problem.  It's pretty simple (I think) for us to
ensure that, for the great majority of the time, contiguous pages in a file
are also physically contiguous.  Problem solved, HBA go nice and quick,
move on.
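
The effect on scatter-gather lists can be illustrated with a toy count: given the page frame numbers of a file's pages in file order, physically adjacent pages merge into one segment. This userspace sketch ignores real constraints such as maximum segment size, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count scatter-gather segments needed for n pages given their page frame
 * numbers in file order: physically adjacent pages merge into one segment.
 */
static size_t count_sg_segments(const unsigned long *pfns, size_t n)
{
    assert(pfns != NULL || n == 0);
    if (n == 0)
        return 0;

    size_t segments = 1;
    for (size_t i = 1; i < n; i++)
        if (pfns[i] != pfns[i - 1] + 1)
            segments++;
    return segments;
}
```

Four contiguous 4k pages need one entry where four scattered pages need four, which is the gap the HBA in question is sensitive to.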




Now, we have the second and completely unrelated requirement:
supporting fs-blocksize > PAGE_SIZE.  One way to address this is via the
mangle-multiple-pages-into-one approach.  And it's obviously the best way
to do it, if mangle-multiple-pages is already available.

But I don't know how important requirement 2 is.  XFS already has
presumably-working private code to do it, and there is simplification and
perhaps modest performance gain in the block allocator to be had here.

And other filesystems (ie: ext4) _might_ use it.  But ext4 is extent-based,
so perhaps it's not worth churning the on-disk format to get a bit of a
boost in the block allocator.

So I _think_ what this boils down to is offering some simplifications in
XFS, by adding complexications to core VFS and MM.  I dunno if that's a
good deal.


So...  tell us why you want feature 2?
-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, May 4, 2007 - 8:57 am

Well from other parts of the conversation there is a third issue.
  3: large-sectorsize-on-disk.

There are a handful of devices in the kernel that could benefit
and be cleaned up a great deal if they could assume they always
received data in their sg lists that were full sectors.  Nothing
needs to be physically contiguous to handle that case though.

If we support large sector sizes for raw block devices we would
still have an issue of what to do with filesystems that want

I suspect we will still need Jens' > 128 page linux scatter gather list


Agreed.

When we are doing things optimistically and absolutely require large pages
this approach seems pretty sane.   When we start requiring large 64k

A good question.

Eric
-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 12:45 pm

Well, ext3 could definitely use it; there are people using 8k and 16k
blocksizes on ia64 systems today.  Those filesystems can't be mounted
on x86 or x86_64 systems because our pagesize is 4k, though.

And I imagine that ext4 might want to use a large blocksize too ---
after all, XFS is extent based as well, and not _all_ of the
advantages of using a larger blocksize are related to brain-damaged
storage subsystems with short SG list support.  Whether the advantages
offset the internal fragmentation overhead or the complexity of adding
fragments support is a different question, of course.

So while the jury is out about how many other filesystems might use
it, I suspect it's more than you might think.  At the very least,
there may be some IA64 users who might be trying to transition their
way to x86_64, and have existing filesystems using an 8k or 16k
blocksize.  :-)

						- Ted
-
To: Theodore Tso <tytso@...>
Cc: Andrew Morton <akpm@...>, David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, May 4, 2007 - 9:33 am

How much of a problem would it be if those blocks were not necessarily
contiguous in RAM, but placed in normal 4K pages in the page cache?

I expect meta data operations would have to be modified but that otherwise
you would not care.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 12:29 am

If you need to treat the block in a contiguous range, then you need to
vmap() the discontiguous pages. That has substantial overhead if you
have to do it regularly.

We do this in xfs_buf.c for > page size blocks - the overhead that
caused when operating on inode clusters resulted in us doing some
pointer fiddling and directly addressing the contents of each page

I think you might need to modify the copy-in and copy-out operations
substantially (e.g. prepare_/commit_write()) as they assume a buffer doesn't
span multiple pages.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To: David Chinner <dgc@...>
Cc: Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 12:48 am

Which is why I would prefer not to do it.  I think vmap is not really
compatible with the design of the linux page cache.

Although we can't even count on the pages being mapped into low
memory right now and have to call kmap if we want to access them
so things might not be that bad.  Even if it was a multipage kmap

But in a filesystem like ext2, except for zeroing some unused hunks
of the page, all that really happens is you set up for DMA straight out
of the page cache.  So this is primarily an issue for meta-data.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: David Chinner <dgc@...>, Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 1:27 am

Right - so how do we efficiently manipulate data inside a large
block that spans multiple discontiguous pages if we don't vmap

Except when your structures span page boundaries. Then you can't directly
reference the structure - it needs to be copied out elsewhere, modified
and copied back. That's messy and will require significant modification
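
The copy-out, modify, copy-back pattern described here can be sketched in userspace C. The block is backed by separate page buffers, so a structure that straddles a page boundary has to be gathered into a contiguous scratch buffer and scattered back afterwards. The names and the 4k page size are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed base page size */

/*
 * Copy len bytes starting at byte offset off of a logical block into dst.
 * The block is backed by separate, possibly discontiguous pages[].
 */
static void block_copy_out(void *dst, unsigned char *const *pages,
                           size_t npages, size_t off, size_t len)
{
    unsigned char *out = dst;

    assert(off + len <= npages * SKETCH_PAGE_SIZE);
    while (len > 0) {
        size_t in_page = off % SKETCH_PAGE_SIZE;
        size_t chunk = SKETCH_PAGE_SIZE - in_page;

        if (chunk > len)
            chunk = len;
        memcpy(out, pages[off / SKETCH_PAGE_SIZE] + in_page, chunk);
        out += chunk;
        off += chunk;
        len -= chunk;
    }
}

/* Copy len bytes from src back into the block at byte offset off. */
static void block_copy_back(unsigned char *const *pages, size_t npages,
                            size_t off, const void *src, size_t len)
{
    const unsigned char *in = src;

    assert(off + len <= npages * SKETCH_PAGE_SIZE);
    while (len > 0) {
        size_t in_page = off % SKETCH_PAGE_SIZE;
        size_t chunk = SKETCH_PAGE_SIZE - in_page;

        if (chunk > len)
            chunk = len;
        memcpy(pages[off / SKETCH_PAGE_SIZE] + in_page, in, chunk);
        in += chunk;
        off += chunk;
        len -= chunk;
    }
}
```

Every access to a boundary-crossing structure pays for two such copies, which is the cost a contiguous mapping or a contiguous compound page avoids.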

I'm not sure I follow you here - copyin/copyout is to userspace and
has to handle things like RMW cycles to a filesystem block. e.g. if
we get a partial block over-write, we need to read in all the bits
around it and that will span multiple discontiguous pages. Currently
these functions only handle RMW operations on something up to a
single page in size - to handle a RMW cycle on a block larger than a
page they are going to need substantial modification or entirely
new interfaces.

The high order page cache avoids the need to redesign interfaces
because it doesn't change the interfaces between the filesystem
and the page cache - everything still effectively operates
on single pages and the filesystem block size never exceeds the
size of a single page.....

Cheers,

Dave.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To: David Chinner <dgc@...>
Cc: Eric W. Biederman <ebiederm@...>, Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 2:43 am

You don't manipulate data except for copy_from_user, copy_to_user.
That is comparatively easy to deal with, and certainly doesn't
need vmap.

Meta-data may be trickier, but a lot of that depends on your

Potentially.  This is just a metadata problem, and possibly we
solve it with something like vmap.  Possibly the filesystem won't
cross those kinds of boundaries and we simply never care.

The fact that it is a meta-data problem suggests it isn't the fast
path and we can incur a little more cost.  Especially if we filesystems

Bleh.  It has been too many days since I have hacked that code; I forgot
which piece that was.  Yes. prepare_write is called before
we write to the page cache from the filesystem.

We already handle multiple page writes fairly well in that context.
prepare_write, commit_write may need a page cache but it may not.
All that really needs to happen is that all of the pages that
are part of the block get marked dirty in the page cache so one

Yes, instead of having to redesign the interface between the
fs and the page cache for those filesystems that handle large
blocks we instead need to redesign significant parts of the VM interface.
Shift the redesign work to another group of people and call it trivial.

That is hardly a gain when it looks like you can have the same effect
with some moderately simple changes to mm/filemap.c and the existing
interfaces.

Eric
-
To: Eric W. Biederman <ebiederm@...>
Cc: David Chinner <dgc@...>, Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 12:06 pm

To some extent that is true. But then there will also be additional
gain: We can likely get the VM to handle larger pages too which may get 
rid of hugetlb fs etc. The work is pretty straightforward: No locking 
changes f.e. So hardly a redesign. I think the crucial point is the 
antifrag/defrag issue if we want to generalize it.

I have an updated patch here that relies on page reservations. Adds 
something called page pools. On bootup you need to specify how many pages 
of each size you want. The page cache will then use those pages for 
filesystems that need larger blocksize. 

The interesting thing about that one is that it actually enables support 
for multiple blocksizes with a single larger pagesize. If f.e. we setup a
pool of 64k pages then the block layer can segment that into 16k pieces. 
So one can actually use 16k 32k and 64k block size with a single larger 
page size.
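
The segmentation arithmetic above can be sketched as follows. This is only an
illustration in user-space C; the function names are hypothetical and nothing
here is taken from the page pool patch itself.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of blocks of block_size bytes that fit in one pooled page of
 * page_size bytes, or 0 if that block size cannot be served from the pool
 * (block larger than the page, or not an even divisor of it).
 */
static size_t blocks_per_page(size_t page_size, size_t block_size)
{
	assert(page_size != 0);
	if (block_size == 0 || block_size > page_size ||
	    page_size % block_size != 0)
		return 0;
	return page_size / block_size;
}

/* Byte offset of block number blk inside the pooled page that holds it. */
static size_t block_offset_in_page(size_t page_size, size_t block_size,
				   size_t blk)
{
	size_t per_page = blocks_per_page(page_size, block_size);

	assert(per_page != 0);
	return (blk % per_page) * block_size;
}
```

For a 64k pool page this gives four 16k blocks, two 32k blocks or one 64k
block per page, which is the case described in the message.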

-
To: Christoph Lameter <clameter@...>
Cc: Eric W. Biederman <ebiederm@...>, David Chinner <dgc@...>, Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 1:29 pm

Sadly, a backward compatibility stub must be retained in perpetuity.
It should be able to be reduced to the point it doesn't need its own
dedicated source files or config options, but it'll need something to
deal with the arch code.


-- wli
-
To: Eric W. Biederman <ebiederm@...>
Cc: David Chinner <dgc@...>, Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 2:49 am

I wonder what happened to my pagearray patches.


-- wli
-
To: Eric W. Biederman <ebiederm@...>
Cc: David Chinner <dgc@...>, Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Monday, May 7, 2007 - 3:06 am

I never really got the thing working, but I had an idea for a sort of
library to do this. This is/was probably against something like 2.6.5
but I honestly have no idea. Maybe this makes it something of an API
proposal.


-- wli


Index: linux-2.6/include/linux/pagearray.h
===================================================================
--- linux-2.6.orig/include/linux/pagearray.h	2004-04-06 10:56:48.000000000 -0700
+++ linux-2.6/include/linux/pagearray.h	2005-04-22 06:06:02.677494584 -0700
@@ -0,0 +1,24 @@
+#ifndef _LINUX_PAGEARRAY_H
+#define _LINUX_PAGEARRAY_H
+
+struct scatterlist;
+struct vm_area_struct;
+struct page;
+
+struct pagearray {
+	struct page **pages;
+	int nr_pages;
+	size_t length;
+};
+
+int alloc_page_array(struct pagearray *, const int, const size_t);
+void free_page_array(struct pagearray *);
+void zero_page_array(struct pagearray *);
+struct page *nopage_page_array(const struct vm_area_struct *, unsigned long, unsigned long, int *, struct pagearray *);
+int mmap_page_array(const struct vm_area_struct *, struct pagearray *, const size_t, const size_t);
+int copy_page_array_to_user(struct pagearray *, void __user *, const size_t, const size_t);
+int copy_page_array_from_user(struct pagearray *, void __user *, const size_t, const size_t);
+struct scatterlist *pagearray_to_scatterlist(struct pagearray *, size_t, size_t, int *);
+void *vmap_pagearray(struct pagearray *);
+
+#endif /* _LINUX_PAGEARRAY_H */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2005-04-22 06:01:29.786980248 -0700
+++ linux-2.6/mm/Makefile	2005-04-22 06:06:02.677494584 -0700
@@ -10,7 +10,7 @@
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
 			   readahead.o slab.o swap.o truncate.o vmscan.o \
-			   prio_tree.o $(mmu-y)
+			   prio_tree.o pagearray.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thras...
To: Eric W. Biederman <ebiederm@...>
Cc: David Chinner <dgc@...>, Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, <clameter@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Tuesday, May 8, 2007 - 4:49 am

This should probably have memcpy to/from pagearrays. Whole-hog read
and write f_op implementations would be good, too, since ISTR some
drivers basically do little besides that on their internal buffers.
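
A rough user-space sketch of what memcpy to/from a pagearray could look like
follows. It is an assumption-laden illustration, not part of the posted patch:
the struct and the pa_* names are hypothetical, each "page" is a separately
allocated 4k buffer, and the copy simply walks page boundaries.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4k pages */

/* User-space stand-in: pages are independent, discontiguous buffers. */
struct pagearray_sketch {
	unsigned char **pages;
	size_t nr_pages;
	size_t length;		/* usable bytes in the array */
};

static void pa_free(struct pagearray_sketch *pa)
{
	size_t i;

	for (i = 0; i < pa->nr_pages; i++)
		free(pa->pages[i]);
	free(pa->pages);
	pa->pages = NULL;
	pa->nr_pages = 0;
	pa->length = 0;
}

static int pa_alloc(struct pagearray_sketch *pa, size_t length)
{
	size_t want = (length + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	pa->nr_pages = 0;
	pa->length = 0;
	pa->pages = calloc(want ? want : 1, sizeof(*pa->pages));
	if (!pa->pages)
		return -1;
	while (pa->nr_pages < want) {
		pa->pages[pa->nr_pages] = calloc(1, SKETCH_PAGE_SIZE);
		if (!pa->pages[pa->nr_pages]) {
			pa_free(pa);	/* releases the pages obtained so far */
			return -1;
		}
		pa->nr_pages++;
	}
	pa->length = length;
	return 0;
}

/* Copy len bytes from buf into the array, starting at byte offset off. */
static int pa_memcpy_to(struct pagearray_sketch *pa, size_t off,
			const void *buf, size_t len)
{
	const unsigned char *src = buf;

	assert(pa != NULL && buf != NULL);
	if (off > pa->length || len > pa->length - off)
		return -1;
	while (len) {
		size_t in_page = off % SKETCH_PAGE_SIZE;
		size_t chunk = SKETCH_PAGE_SIZE - in_page;

		if (chunk > len)
			chunk = len;
		memcpy(pa->pages[off / SKETCH_PAGE_SIZE] + in_page, src, chunk);
		src += chunk;
		off += chunk;
		len -= chunk;
	}
	return 0;
}

/* Copy len bytes out of the array, starting at byte offset off, into buf. */
static int pa_memcpy_from(const struct pagearray_sketch *pa, size_t off,
			  void *buf, size_t len)
{
	unsigned char *dst = buf;

	assert(pa != NULL && buf != NULL);
	if (off > pa->length || len > pa->length - off)
		return -1;
	while (len) {
		size_t in_page = off % SKETCH_PAGE_SIZE;
		size_t chunk = SKETCH_PAGE_SIZE - in_page;

		if (chunk > len)
			chunk = len;
		memcpy(dst, pa->pages[off / SKETCH_PAGE_SIZE] + in_page, chunk);
		dst += chunk;
		off += chunk;
		len -= chunk;
	}
	return 0;
}
```

A caller never sees the page boundaries: a copy that starts in one page and
ends two pages later is split into per-page chunks inside the helper.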

vmap_pagearray() should take flags, esp. VM_IOREMAP but perhaps also
protections besides PAGE_KERNEL in case uncachedness is desirable. I'm
not entirely sure what it'd be used for if discontiguity is so heavily
supported. My wild guess is drivers that do things that are just too
weird to support with the discontig API, since that's how I used it.
It should support vmap()'ing interior sub-ranges, too.

The pagearray mmap() support is schizophrenic as to whether it prefills
or faults and not all that complete as far as manipulating the mmap()
goes. Shooting down ptes, flipping pages, or whatever drivers actually
do with the things should have helpers arranged. Coherent sets of
helpers for faulting vs. mmap()'ing idioms would be good.

pagearray_to_scatterlist() should probably take the scatterlist as an
argument instead of allocating the scatterlist itself.

Something to construct bio's from pagearrays might help.

s/page_array/pagearray/g should probably be done. Prefixing with
pagearray_ instead of randomly positioning it within the name would
be good, too.

Some working API conversions on drivers sound like a good idea. I had
some large number of API conversions about, now lost, but they'd be
bitrotted anyway.

struct pagearray is better off as an opaque type so large pagearray
handling can be added in later via radix trees or some such, likewise
for expansion and contraction. Keeping drivers' hands off the internals
is just a good idea in general.

I'm somewhat less clear on what filesystems need to do here, or if it
would be useful for them to efficiently manipulate data inside a
large block that spans multiple discontiguous pages. I expect some
changes are needed at the very least to fill a pagearray with whatever
predetermined pages are needed. Filesystems probably nee...
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 1:49 am

You would only lock a single higher order block. Truncate works on that 
level.

If you have 4 separate pages then you need to take separate locks and you 
may not have contiguous memory which makes the filesystem run through all 

The patch is not about forcing to use large pages but about the option to 


The patchset would reduce complexity and make it easy to handle the page
cache. Gets rid of the hacks to support larger ones right now. Its 

No it becomes easier. Look at the patchset. It cleans up a huge mess.
What is hacky about it? It is consistently using larger pages for the page 

We are currently looking dumb and unable to deal with the hardware. Yes 
we can pressure the hardware vendors to produce hardware conforming to our 

It is the most consistent solution that avoids the proliferation of further
hacks to address the large blocksize.
-
To: Christoph Lameter <clameter@...>
Cc: David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 2:55 am

This is completely incorrect.

Of *course* they're contiguous.  That's the whole point.

It's not exactly hard to lock four pages which are contiguous in pagecache,


No it doesn't and please stop spinning.  x86 ptes map 4k pages and the core
MM needs changes to continue to work with this hack in place.

If x86 had larger pagesize we wouldn't be seeing any of this.  It is a workaround

Were any cleanups made which were not also applicable as standalone things


It pretends that pages are larger than they actually are, forcing the
pte-management code to also play along with the pretence.


That's spin as well.

Please address my point: if in five years time x86 has larger or variable
pagesize, this code will be a permanent millstone around our necks which we
*should not have merged*.

And if in five years time x86 does not have larger pagesize support then
the manufacturers would have decided that 4k pages are not a performance

You cannot say this.  I'm sitting here *watching* you refuse to seriously
consider alternatives.

And you've conspicuously failed to address my point regarding the
*permanent* additional maintenance cost.


Anyway.  Let's await those performance numbers.  If they're notably good,
and if we judge that this goodness will be realised on more than one
arguably-crippled present-day disk adapter then we can evaluate the
*various* options which we have for stuffing more data into that adapter.


-
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 9:44 am

So the verdict is wait 5 years, see if x86 did anything, and so on.
Has anyone else noticed our embedded arch installed base dwarfs our
x86 and "enterprise" installed bases combined? What are our priorities
that make designing the core around x86 meaningful again? Pipe dreams
of competing with Windows on the desktop? Optimizing for kernel
compiles on kernel hackers' workstations? Something should seriously
be reevaluated there at some point.

As x86 is now the priority regardless, maybe checking in with Intel and
AMD as far as what they'd like to see happen would be enlightening. It
may be that some things are deadlocked on OS use cases.

Also, is there something in particular that should be done for the case
of x86 acquiring a variable pagesize?


-- wli
-
To: William Lee Irwin III <wli@...>
Cc: Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 3:15 pm

You missed the bit about "evaluate alternatives".
-
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 10:21 pm

No worries. I'm used to being on the wrong side of things. I'll have
no trouble picking out the alternative least likely to be accepted. ;)


-- wli
-
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 7:05 am

Unfortunately, it's not really practical to increase the page size
very much on most systems, because you end up wasting a lot of space
in the page cache.  So there is a tension between wanting a small page
size so your page cache uses memory efficiently, and wanting a large
page size so the TLB covers more address space and your programs run
faster (not to mention other benefits such as the kernel having to
manage fewer pages, and I/O being done in bigger chunks).
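
The tension can be put into numbers with a small sketch. The helper names and
the 64-entry TLB used in the note below are illustrative assumptions only.

```c
#include <assert.h>
#include <stddef.h>

/* Page cache bytes needed to hold a whole file using pages of page_size. */
static size_t cached_bytes(size_t file_size, size_t page_size)
{
	assert(page_size != 0);
	return ((file_size + page_size - 1) / page_size) * page_size;
}

/* Page cache bytes that hold no file data (the tail of the last page). */
static size_t wasted_bytes(size_t file_size, size_t page_size)
{
	return cached_bytes(file_size, page_size) - file_size;
}

/* Address space covered by a TLB with the given number of entries. */
static size_t tlb_reach(size_t entries, size_t page_size)
{
	return entries * page_size;
}
```

A 1000-byte file wastes 3096 bytes of a 4k page but 64536 bytes of a 64k
page, while a hypothetical 64-entry TLB covers 256k with 4k pages and 4M
with 64k pages.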

Thus there is not really any single page size that suits all workloads
and machines.  With distros wanting to just have a single kernel per
architecture, and the fact that the page size is a compile-time
constant, we currently end up having to pick one size and just put up
with the fact that it will suck for some users.  We currently have
this situation on ppc64 now that POWER5+ and POWER6 machines have
hardware support for 64k pages as well as 4k pages.

So I can see a few different options:

(a) Keep things more or less as they are now and just wear the fact
that we will continue to show lower performance than certain
proprietary OSes, or

(b) Somehow manage to make the page size a variable rather than a
compile-time constant, and pick a suitable page size at boot time
based on how much memory the machine has, or something.  I looked at
implementing this at one point and recoiled in horror. :)

(c) Make the page cache able to use small pages for small files and
large pages for large files.  AIUI this is basically what Christoph is
proposing.

Option (a) isn't very palatable to me (nor I expect, Christoph :)
since it basically says that Linux is very much focussed on the
embedded and desktop end of things and isn't really suitable as a
high-performance OS for large SMP systems.  I don't want to believe
that. ;)

Option (b) would be a bit of an ugly hack.

Which leaves option (c) - unless you have a further option.  So I have
to say I support Christoph on this, at least as far as the general
principle is conc...
To: Paul Mackerras <paulus@...>
Cc: Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 7:58 am

We could approximate option (b) by setting a standard page size for the 
page cache and set the same page size in SLUB (slub is already boot time 
configurable in that respect). That will make large portions of the VM 
use the same page order. I can try to add similar controls to the page 
cache.

If the page size is set too high for a mount then we use the buffer 
head functionality to split the higher order page into pieces of the 
appropriate size. That could limit the number of page sizes that we need 
to support.

But I think we should first see how well Mel's antifrag work does.


-
To: Paul Mackerras <paulus@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 7:41 am

For the TLB issue, higher order pagecache doesn't help. If distros
ship with a 4K page size on powerpc, and use some larger pages in
the pagecache, some people are still going to get angry because
they wanted to use 64K pages... But I agree 64K pages is too big
for most things anyway, and 16K would be better as a default (which
hopefully x86-64 will get one day).

Anyway, for io performance, there are alternatives, despite what
some people seem to be saying. We can submit larger sglists to the
device for larger ios, which Jens is looking at (which could help
all types of workloads, not just those with sequential large file
IO).

After that, I'd find it amusing if HBAs worth thousands of $ have
trouble looking up sglists at the relatively glacial pace that IO
requires, and/or can't spare a few more K for reasonable sglist
sizes, but if that is really the case, then we could use iommus
and/or just attempt to put physically contiguous pages in pagecache,
rather than require it.

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 8:14 am

Oh?  Assuming your hardware is capable of supporting a variety of page
sizes, and of putting a page at any address that is a multiple of its
size, it should help, potentially a great deal, as far as I can see.
I'm thinking in particular of machines that have software-loaded
fully-associative TLBs and support a lot of page sizes, e.g.
4kB * 4^n for n = 0 up to 8 or so, like some embedded powerpc chips.

It's not as simple on 64-bit powerpc with the hash table of course,
because the page size is chosen at the segment (256MB) level,

Even 16k is going to bloat the page cache, and some people will
complain.  One way that x86-64 could do 16k pages is by still indexing
the PTE page in units of 4k, but then have an indicator in the PTE
that this is a 16k page.  Thus a 16k page would occupy 4 consecutive
PTEs, but once it was loaded into the TLB, a single TLB entry would
map the whole 16k.  That would give the expanded TLB reach and allow
4k and 16k pages to be intermixed freely.
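
A toy model of that scheme might look like the sketch below. It is not a
description of any real x86-64 page table format: the PTE_16K bit, the flat
PTE array and the function names are all hypothetical. The table stays indexed
in 4k units, a 16k page fills four consecutive entries, and the flag tells the
(imagined) TLB fill that one entry may cover the whole 16k.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_SHIFT	12u			/* table is indexed in 4k units */
#define PTE_PRESENT	(UINT64_C(1) << 0)
#define PTE_16K		(UINT64_C(1) << 1)	/* hypothetical indicator bit */
#define PTE_ADDR	(~UINT64_C(0xfff))

/* Map one 4k page. */
static void map_4k(uint64_t *ptes, uint64_t vaddr, uint64_t paddr)
{
	assert((vaddr & 0xfff) == 0 && (paddr & 0xfff) == 0);
	ptes[vaddr >> PTE_SHIFT] = paddr | PTE_PRESENT;
}

/* Map one 16k page: four consecutive PTEs, each carrying the 16k flag. */
static void map_16k(uint64_t *ptes, uint64_t vaddr, uint64_t paddr)
{
	uint64_t first = vaddr >> PTE_SHIFT;
	unsigned int i;

	assert((vaddr & 0x3fff) == 0 && (paddr & 0x3fff) == 0);
	for (i = 0; i < 4; i++)
		ptes[first + i] = (paddr + ((uint64_t)i << PTE_SHIFT)) |
				  PTE_PRESENT | PTE_16K;
}

/*
 * Translate vaddr. On success *paddr is the physical address and *reach is
 * the span a single TLB entry could cover for this mapping.
 */
static int translate(const uint64_t *ptes, uint64_t vaddr,
		     uint64_t *paddr, uint64_t *reach)
{
	uint64_t pte = ptes[vaddr >> PTE_SHIFT];

	if (!(pte & PTE_PRESENT))
		return -1;
	*paddr = (pte & PTE_ADDR) | (vaddr & 0xfff);
	*reach = (pte & PTE_16K) ? 16384 : 4096;
	return 0;
}
```

Because every entry still describes its own 4k slice, 4k and 16k mappings can
sit next to each other in the same table, as the message suggests.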

Paul.
-
To: Paul Mackerras <paulus@...>
Cc: Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 9:42 am

I think Christoph's variable order pagecache should be perfectly
fine on ppc64.  We're selecting the pagesize on a per-file basis,
and the page size selection would choose which segment this mmap
gets into.  Ben's get_unmapped_area changes are very helpful for that.

-
To: Paul Mackerras <paulus@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 8:36 am

That's a little bit more than just the higher order pagecache patch.
But I don't know if that would be impossible to do with the "attempt
to allocate contiguous pagecache" approach either. Or if it would be

I guess any page size bloats the pagecache relative to something
smaller :) But 4K doesn't seem to be proving too much problem for
x86 and I'm not talking about an actual implementation coming up,
but just a size that would make sense in future (and probably last
for a long time).

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 8:12 am

Powerpc supports multiple pagesizes. Maybe we could make mmap use those 
page sizes some day if we had a variable order page cache. Your stance on
the issue means that powerpc will be forever crippled and not be able to 

Right this could help but it is not addressing the basic requirement for
devices that need large contiguous chunks of memory for I/O.
-
To: Christoph Lameter <clameter@...>
Cc: Nick Piggin <nickpiggin@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 9:37 am

You can already mmap using different pagesize on ppc64.  A certain
infinibad adapter with interesting design choices requires 4k mappings
even on 64k kernels, and for some spu-related mappings on Cell we
want the same.  I'm no expert on the powerpc mmu, but I'd be surprised
if 64k pagecache mappings wouldn't work on 4k base page size kernels.

-
To: Christoph Lameter <clameter@...>
Cc: Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 8:25 am

Linus's favourite jokes about powerpc mmu being crippled forever, aside ;)

This seems like just speculation. I would not be against something which,
without, would "cripple" some relevant hardware, but you are just handwaving

Did you read the last paragraph? Or anything Andrew's been writing?

  "After that, I'd find it amusing if HBAs worth thousands of $ have
   trouble looking up sglists at the relatively glacial pace that IO
   requires, and/or can't spare a few more K for reasonable sglist
   sizes, but if that is really the case, then we could use iommus
   and/or just attempt to put physically contiguous pages in pagecache,
   rather than require it."

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 12:48 pm

This crippling also applies to IA64. The crippling is a concern of 
multiple arches and its effects have been well documented. Talk to Peter 
Chubb for example. The requirement is to be able to do I/O to large 
contiguous sections of memory. Your proposals are ignoring the 
requirements.
-
To: Nick Piggin <nickpiggin@...>
Cc: Christoph Lameter <clameter@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 9:39 am

Different mmu.  The desktop 32bit mmu Linus referred to has almost nothing

Real highend HBAs don't have that problem.  But for example aacraid
which is very common on mid-end servers is a _lot_ faster when it
gets contiguous memory.  Some benchmark was 10 or more percent faster
on windows due to this.
-
To: Christoph Hellwig <hch@...>
Cc: Christoph Lameter <clameter@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 10:27 pm

Well I wasn't trying to make a point there so it isn't a big deal... but
he has been known to say the 64-bit hash table is insane or broken. If he's

And that wasn't due to the 128 sg limit?

I guess 10% isn't a small amount. Though it would be nice to have
before/after numbers for Linux. And, like Andrew was saying, we could
just _attempt_ to put contiguous pages in pagecache rather than
_require_ it. Which is still robust under fragmentation, and benefits
everyone, not just files with a large pagecache size.

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Christoph Hellwig <hch@...>, Christoph Lameter <clameter@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Saturday, April 28, 2007 - 4:16 am

No, that was due to aacraid really liking sg lists as small as possible
where every entry covers areas as big as possible.  The driver really
liked physical merging once wli changed the page allocator to return

I'll try to find the old thread.

-
To: Nick Piggin <nickpiggin@...>
Cc: Christoph Hellwig <hch@...>, Christoph Lameter <clameter@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 10:39 pm

What sort of strategy do you intend to use to speculatively populate
the pagecache with contiguous pages?


-- wli
-
To: William Lee Irwin III <wli@...>
Cc: Christoph Hellwig <hch@...>, Christoph Lameter <clameter@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 10:50 pm

Andrew outlined it.

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Christoph Hellwig <hch@...>, Christoph Lameter <clameter@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 11:16 pm

I'd like to suggest a few straightforward additions to the proposal:

(1) the interface to the page allocator tries to allocate N pages where
	(a) N is a power of 2
	(b) some effort is made to get contiguity
	(c) some effort is made to fall back to lesser contiguity
	(d) some effort is made to get N pages even with no contiguity
(2) a corresponding group freeing interface to the page allocator
(3) Pass the pages around in a list or similar so that O(1) instead of
	O(pages) splice operations under the lock suffice for passing
	them around. Dissecting compound pages outside locks helps.
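
Points (1) and (2) can be sketched with a toy allocator. This is a user-space
model under loud assumptions: memory is a 64-frame boolean map, runs are
naturally aligned, and toy_mem, find_run, alloc_group and free_group are
hypothetical names. It does not model the list passing in (3) or any real
page allocator interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_FRAMES 64u	/* toy physical memory: 64 page frames */

struct toy_mem {
	bool used[NR_FRAMES];
};

/* First frame of a free, aligned run of (1 << order) frames, or -1. */
static int find_run(const struct toy_mem *mem, unsigned int order)
{
	size_t run = (size_t)1 << order, start, i;

	for (start = 0; start + run <= NR_FRAMES; start += run) {
		for (i = 0; i < run && !mem->used[start + i]; i++)
			;
		if (i == run)
			return (int)start;
	}
	return -1;
}

/*
 * Get n frames (n a power of two). Try one contiguous run first, then fall
 * back to smaller runs, down to single frames. Frame numbers are stored in
 * out[0..n-1]. Returns the number of separate runs used, or -1 if fewer
 * than n frames are free (in which case nothing stays allocated).
 */
static int alloc_group(struct toy_mem *mem, size_t n, int *out)
{
	unsigned int order = 0;
	size_t got = 0, i;
	int runs = 0;

	assert(n != 0 && (n & (n - 1)) == 0 && n <= NR_FRAMES);
	while (((size_t)1 << order) < n)
		order++;

	while (got < n) {
		int start;

		/* Never take a run larger than what is still missing. */
		while (((size_t)1 << order) > n - got)
			order--;
		start = find_run(mem, order);
		if (start < 0) {
			if (order == 0) {
				/* Out of frames: undo the partial allocation. */
				for (i = 0; i < got; i++)
					mem->used[out[i]] = false;
				return -1;
			}
			order--;	/* settle for less contiguity */
			continue;
		}
		for (i = 0; i < ((size_t)1 << order); i++) {
			mem->used[(size_t)start + i] = true;
			out[got++] = start + (int)i;
		}
		runs++;
	}
	return runs;
}

/* Group free: release every frame handed out by alloc_group(). */
static void free_group(struct toy_mem *mem, const int *frames, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		mem->used[frames[i]] = false;
}
```

On unfragmented memory the caller gets a single run; when every other frame
is taken, the same call still succeeds but returns four single frames.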


-- wli
-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 3:22 am

Ahh. I think I know what you mean. The current patchset is for performance 
testing against mainline. Let's first cover the bases and then see where
we go. It is not against mm. I will submit pieces to mm depending on the 
outcome of our discussions.
-
To: Christoph Lameter <clameter@...>
Cc: David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 3:29 am

Thanks.

There's a ludicrous amount of MM work pending in -mm.  It would probably be
less work at your end to see what ends up landing in 2.6.22-rc1.

<wanders off to do his mm-merge-plans email>
-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 3:35 am

I am aware of that and that's why I kept this against upstream. The need
right now is for justification and explanation. I had to go 
through a head spinning series of VM layers to get an idea how to do 
this in a clean way and then had to make additional passes to do minimal 
modifications to get this working so that it is testable.
 
Performance tests please...
-
To: Christoph Lameter <clameter@...>
Cc: David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 3:43 am

OK.

Don't get me wrong - I do think this is neat code and is a good way of
addressing the problem.  (I'm surprised that the mmap protopatch didn't
touch rmap.c).

But I don't think it's a slam dunk and I would like you to appreciate the
constraints which I believe we operate under.  And I don't think we've

On various HBAs, please ;)
-
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 3:19 am

If you can find them....
 

No, it's a fact. The patchset really allows one to switch large page

The page cache is different from pte mapping. One page struct controls 
them all. Look at the patches. There is no state information in the tail


The page cache functions require a mapping parameter. This is available 
in most place and a natural thing given that allocation etc is also bound


No they are 16k if the filesystem wants them to be 16k. The filesystem 
does not need to have the data mapped into an address space. And there is 

No this code will enable us to switch to this new page size in a very fast 
way. Because the pagecache already supports it it is easier to add the 

The manufacturers on x86 are already supporting 2M page sizes and cannot 
support intermediate sizes since they are married to the page table 
format for performance reasons. The patch could f.e. lead to 

And I am sitting here in disbelief about the series of weird alternatives 
running over my screen just to avoid the obvious solution. Then there is 
this weird idea that this would hinder us from supporting additional page 
sizes for mmap while the patch does exactly lead to enable support such 

Where? The page cache handling in the various layers is significantly 

One? Spin.... The majority you mean?

Dave, where are we with the performance tests?
-
To: Christoph Lameter <clameter@...>
Cc: David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel@...>, William Lee Irwin III <wli@...>, Jens Axboe <jens.axboe@...>, Badari Pulavarty <pbadari@...>, Maxim Levitsky <maximlevitsky@...>
Date: Friday, April 27, 2007 - 3:26 am

How on earth can the *addition* of variable pagecache size simplify the
existing code?

What cleanups are in this patchset which cannot be made *without* the

Well yes.

Do note that if the numbers are good, we also need to look at how generally
useful this work is.  For example, if it only benefits one particular
arguably-crippled present-generation adapter then that of course weakens the
case.
-
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, David Chinner <dgc@...>, <linux-kernel@...>, Mel Gorman <mel