Re: [patch] mm: pageable memory allocator (for DRM-GEM?)

Previous thread: [PATCH] CFS scheduler: documentation about scheduling policies by Martin Steigerwald on Tuesday, September 23, 2008 - 2:01 am. (9 messages)

Next thread: [Patch -tip 0/4] Creation of the initcall tracer by Frédéric Weisbecker on Tuesday, September 23, 2008 - 3:29 am. (6 messages)
From: Nick Piggin
Date: Tuesday, September 23, 2008 - 2:10 am

Hi,

So I promised I would look at this again, because I (and others) have some
issues with exporting shmem_file_setup for DRM-GEM to go off and do things
with.

The rationale for using shmem seems to be that pageable "objects" are needed,
and they can't be created by userspace because that would be ugly for some
reason, and/or they are required before userland is running.

I particularly don't like the idea of exposing these vfs objects to random
drivers because they're likely to get things wrong or become out of synch
or unreviewed if things change. I suggested a simple pageable object allocator
that could live in mm and hide the exact details of how shmem / pagecache
works. So I've coded that up quickly.

Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I
see something strange... it seems that userspace actually gets some kind of
descriptor, a descriptor to an object backed by this shmem file (let's call it
a "file descriptor"). Anyway, it turns out that userspace sometimes needs to
pread, pwrite, and mmap these objects, but unfortunately it has no direct way
to do that, due to not having open(2)ed the files directly. So what GEM does
is to add some ioctls which take the "file descriptor" things, and derives
the shmem file from them, and then calls into the vfs to perform the operation.

If my cursory reading is correct, then my allocator won't work so well as a
drop in replacement because one isn't allowed to know about the filp behind
the pageable object. It would also indicate some serious crack smoking by
anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison...

So please, nobody who worked on that code is allowed to use ugly as an
argument. Technical arguments are fine, so let's try to cover them.

BTW. without knowing much of either the GEM or the SPU subsystems, the
GEM problem seems similar to SPU. Did anyone look at that code? Was it ever
considered to make the object allocator be a filesystem? That way you ...
From: Thomas Hellström
Date: Tuesday, September 23, 2008 - 3:21 am

Nick,
 From my point of view, this is exactly what's needed, although there 
might be some different opinions among the
DRM developers. A question:

Sometimes it's desirable to indicate that a page / object is "cleaned", 
which would mean data has moved and is backed by device memory. In that 
case one could either free the object or indicate to it that it can 
release it's pages. Is freeing / recreating such an object an expensive 
operation? Would it, in that case, be possible to add an object / page 
"cleaned" function?

/Thomas
 








--

From: Jerome Glisse
Date: Tuesday, September 23, 2008 - 4:31 am

On Tue, 23 Sep 2008 12:21:26 +0200

Also what about a uncached page allocator ? As some drivers might need
them, there is no number but i think their was some concern that changing
PAT too often might be costly and that we would better have a poll of
such pages.

Cheers,
Jerome Glisse <glisse@freedesktop.org>
--

From: Christoph Lameter
Date: Tuesday, September 23, 2008 - 6:18 am

IA64 has an uncached allocator. See arch/ia64/include/asm/incached.h and
arch/ia64/kernel/uncached.c. Probably not exactly what you want but its a
starting point.


--

From: Nick Piggin
Date: Wednesday, September 24, 2008 - 5:18 pm

Ah, interesting... freeing/recreating isn't _too_ expensive, but it is
going to have to allocate a lot of pages (for a big object) and copy
a lot of memory. It's strange to say "cleaned", in a sense, because the
allocator itself doesn't know it is being used as a writeback cache ;)
(and it might get confusing with the shmem implementation because your
cleaned != shmem cleaned!).

I understand the operation you need, but it's tricky to make it work in
the existing shmem / vm infrastructure I think. Let's call it "dontneed",
and I'll add a hook in there we can play with later to see if it helps?

What I could imagine is to have a second backing store (not shmem), which
"dontneed" pages go onto, and they simply get discarded rather than swapped
out (eg. via the ->shrinker() memory pressure indicator). You could then
also register a callback to recreate these parts of memory if they have been
discarded then become used again. It wouldn't be terribly difficult come to
think of it... would that be useful?

--

From: Thomas Hellström
Date: Thursday, September 25, 2008 - 12:19 am

Well, the typical usage pattern is:

1) User creates a texture object, the data of which lives in a pageable 
object.
2) DRM decides it needs to go into V(ideo)RAM, and doesn't need a 
backing store. It indicates "dontneed" status on the object.
3) Data is evicted from VRAM. If it's not dirtied in VRAM and the 
"dontneed" pages are still around in the
old backing object, fine. We can and should reuse them. If data is 
dirtied in VRAM or the page(s) got discarded
 we need new pages and to set up a copy operation.

So yes, that would indeed be very useful,
I think one way is to have the callback happen on a per-page basis.
 DRM can then collect a list of what pages need to be copied from VRAM 
based on the callback and its knowledge of VRAM data status and set up a 
single DMA operation. So the callback shouldn't implicitly mark the 
newly allocated pages dirty.

Another useful thing I come to think of looking through the interface 
specification again is to have a pgprot_t argument to the pageable 
object vmap function, so that the VMAP mapping can be set to uncached / 
write-combined.
This of course would imply that the caller has pinned the pages and had 
the default kernel mapping caching status changed as well.

/Thomas





--

From: Keith Packard
Date: Thursday, September 25, 2008 - 7:38 am

Note that this can occur as a result of a suspend-to-memory transition
at which point *all* of the objects in VRAM will need to be preserved in
main memory, and so the pages aren't really 'freed', they just don't
need to have valid contents, but the system should be aware that the
space may be needed at some point in the future.

--=20
keith.packard@intel.com
From: Thomas Hellström
Date: Thursday, September 25, 2008 - 8:39 am

Actually, I think the pages must be allowed to be freed, and that we 
don't put a requirement on "pageable"  to keep
swap-space slots for these pages. If we hit an OOM-condition during 
suspend-to-memory that's bad, but let's say we
required "pageable" to keep swap space slots for us, the result would 
perhaps be that another device wasn't able to suspend, or a user-space 
program was killed due to lack of swap-space prior to suspend.

I'm not really sure what's the worst situation, but my feeling is that 
we should not require swap-space to be reserved for VRAM, and abort the 
suspend operation if we hit OOM. That would, in the worst case, mean 
that people with non-UMA laptops and a too small swap partition would 
see their battery run out much quicker than they expected...

/Thomas



--

From: Dave Airlie
Date: Thursday, September 25, 2008 - 3:41 pm

On Fri, Sep 26, 2008 at 1:39 AM, Thomas Hellström

You can't fail suspend, it just doesn't work like that. The use case
is close laptop
shove in bag, walk away. Having my bag heat up and the laptop inside
not suspended
isn't the answer ever.

So with that in mind, I think we either a) keep some backing pages
around, or b) make object
file backed so if the swap space fills up we can back out to the file objects.

--

From: Keith Packard
Date: Tuesday, September 23, 2008 - 8:50 am

Right, creating them from user space was just a mild inconvenience as
we'd have to come up with suitable names. The semantics don't match
exactly as most of the time we never need the filename later, but some
objects will be referenced later on so we'd need to be able to come up
with a persistent name at that point.

The real issue is that we need to create objects early in the kernel
initialization sequence to provide storage for the console frame buffer,
long before user space starts up. Lacking this, we wouldn't be able to

ation.

Sure, we've looked at using regular file descriptors for these objects
and it almost works, except for a few things:

 1) We create a lot of these objects. The X server itself may have tens
    of thousands of objects in use at any one time (my current session
    with gitk and firefox running is using 1565 objects). Right now, the
    maximum number of fds supported by 'normal' kernel configurations
    is somewhat smaller than this. Even when the kernel is fixed to
    support lifting this limit, we'll be at the mercy of existing user
    space configurations for normal applications.

 2) More annoyingly, applications which use these objects also use
    select(2) and depend on being able to represent the 'real' file
    descriptors in a compact space near zero. Sticking a few thousand
    of these new objects into the system would require some ability to
    relocate the descriptors up higher in fd space. This could also
    be done in user space using dup2, but that would require managing
    file descriptor allocation in user space.

 3) The pread/pwrite/mmap functions that we use need additional flags
    to indicate some level of application 'intent'. In particular, we
    need to know whether the data is being delivered only to the GPU
    or whether the CPU will need to look at it in the future. This
    drives the kind of memory access used within the kernel and has
    a significant performance impact.

If (when?) we can ...
From: Jerome Glisse
Date: Tuesday, September 23, 2008 - 11:29 am

On Tue, 23 Sep 2008 08:50:29 -0700

I am starting to ponder if driver specific ioctl for memory object is a
better plan. On intel you have your GTT mapping trick (whether you want
to access an object through GTT or directly map ram page iirc), on radeon
i can think of similar but bit different use case where we can ask to
map some vram with special properties on it so we can access some tiled
surface transparently.

Of course the underlying implementation will share quite bit of code.
I just think that each hw have its own specificity and that trying to
hamer out all this in a common userspace API is not the best thing to do.
I am pretty sure nvidia hw offer some nice trick that won't fit in any
common userspace interface.

So the point is that Nick proposal does make lot of sense and i think
we should let each driver design their own memory object API to fit their
need. We don't have the need for a common interface anymore in DRI2.

Cheers,
Jerome Glisse <glisse@freedesktop.org>
--

From: Nick Piggin
Date: Wednesday, September 24, 2008 - 5:30 pm

Pity. Anyway, I accept that, let's move on.


I guess so. A big problem of ioctls is just that they had been easier to
add so they got less thought and review ;) If your ioctls are stable,

Well, no not a seperate filesystem to do the pageable backing store, but
a filesystem to do your object management. If there was a need for pageable

You can map them to userspace if you just take a page at a time and insert
them into the page tables at fault time (or mmap time if you prefer).
Currently, this will mean that mmapped pages would not be swappable; is
that a problem?


--

From: Keith Packard
Date: Wednesday, September 24, 2008 - 6:20 pm

Well, the goal is to "eventually" get to use fds so that at least some
of our common operations can use regular sys calls. But, not having
those trapped in the shmem layer may end up being a feature as we'll get
to watch more closely, although dealing with the actual pread/pwrite

One does what one can. Of course, in this case, 'cross platform' is just
x86/x86_64 as we're talking Intel integrated graphics. When (if?) we
figure out how to create a common interface across multiple cards for
some of these operations, we'll probably discover mistakes. We have
.=20

Now that you've written one, we could go back and think about building a
file system and using fds for our operations. It would be a whole lot

Yes. We leave a lot of objects mapped to user space as mmap isn't
exactly cheap. We're trying to use pread/pwrite for as much bulk I/O as
we can, but at this point, we're still mapping most of the pages we
allocate into user space and leaving them. Things like textures and
render buffers will get mmapped if there are any software fallbacks.
Other objects, like vertex buffers, will almost always end up mapped.

One of our explicit design goals was to make sure user space couldn't
ever pin arbitrary amounts of memory; I'd hate to go back on that as it
seems like an important property for any subsystem designed to support
regular user applications in a general purpose desktop environment. I
don't want to trust user space to do the right thing, I want to enforce
that from kernel space.

--=20
keith.packard@intel.com
From: Nick Piggin
Date: Wednesday, September 24, 2008 - 7:30 pm

OK, that's all that can be asked I guess. Low level object / memory 


OK. I will have to add some facilities to allow mmaps that go back through
to tmpfs and be swappable... Thanks for the data point.
--

From: Keith Packard
Date: Wednesday, September 24, 2008 - 7:43 pm

It seems like once you've done that you might consider extracting the
page allocator from shmem so that drm, tmpfs and sysv IPC would share
the same underlying memory manager API.

--=20
keith.packard@intel.com
From: Nick Piggin
Date: Wednesday, September 24, 2008 - 8:07 pm

That might be the cleanest logical way to do it actually. But for the moment
I'm happy not to pull tmpfs apart :) Even if it seems like the wrong way
around, at least it is insulated to within mm/
 
--

From: Keith Packard
Date: Wednesday, September 24, 2008 - 11:16 pm

Sure; no sense changing that before we've gotten some experience with
the new API anyway. Would we consider modifying sysv IPC as well? It
currently uses the shmem_file_setup function although it lives a long
ways from mm...

--=20
keith.packard@intel.com
From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 1:45 am

On Tue, 23 Sep 2008 11:10:17 +0200

Some questions..

 - could you use GFP_HIGHUSER rather than GFP_HIGHUSER_MOVABLE ?
   I think setting mapping_gfp_mask() (address_space->flags) to appropriate
   value is enough.

 - Can we mlock pages while it's vmapped ? (or Reserve and remove from LRU)
   Then, new split-lru can ignore these pages while there are mapped. over-killing ?

 - Doesn't we need to increase page->mapcount ?

 - memory resource contorller should account these pages ?
   (Maybe this is question to myself....)

Thanks,
-Kame

--

From: Eric Anholt
Date: Monday, September 29, 2008 - 6:10 pm

Hiding the details of shmem and the pagecache sounds pretty good to me
(since we've got it wrong at least twice so far).  Hopefully the result
.

I think the explanation for this got covered in other parts of the

Yes, we definitely considered a filesystem (it would be nice for
debugging to be able to look at object contents from a debugger process
\easily).  However, once we realized that fds just wouldn't work (we're
allocating objects in a library, so we couldn't just dup2 them up high,
and we couldn't rely on being able to up the open file limit for the
process), shmem seemed to already be exactly what we wanted, and we
assumed that whatever future API changes in the couple of VFS and
pagecache calls we made would be easier to track than duplicating all of
shmem.c into our driver.

I'm porting our stuff to test on your API now, and everything looks
straightforward except for mmap.  For that I seem to have three options:

1) Implement a range allocator on the DRM device and have a hashtable of
ranges to objects, then have a GEM hook in the mmap handler of the drm
device when we find we're in one of those ranges.

This was the path that TTM took.  Since we have different paths to
mmapping objects (direct backing store access, or aperture access,
though the second isn't in my tree yet), it means we end up having
multiple offsets to represent different mmap types, or multiplying the
size of the range and having the top half of the range mean the other
mmap type.

2) Create a kernel-internal filesystem and get struct files for the
objects.

This is the method that seemed like the right thing to do in the linux
style, so I've been trying to figure that part out.  I've been assured
that libfs makes my job easy here, but as I look at it I'm less sure.
The sticking point to me is how the page list I get from your API ends
up getting used by simple_file_* and generic_file_*.  And, in the future
where the pageable memory allocator is actually pageable while mmapped,
what does ...
From: Jesse Barnes
Date: Thursday, October 2, 2008 - 10:15 am

I don't think anyone would argue that using normal system calls would be ugly, 
but there are several limitations with that approach, including the fact that 
some of our operations become slightly more difficult to do, along with the 
other limitations mentioned in drm_gem.c and in other threads.

At this point I think we should go ahead and include Eric's earlier patchset 
into drm-next, and continue to refine the internals along the lines of what 
you've posted here in the post-2.6.28 timeframe.  The ioctl based interfaces 
(there aren't too many) are something we can support going forward, so we 
should be able to rip up/clean up the implementation over time as the VM 
becomes more friendly to these sort of operations.

Any objections?

Dave, you can add my Acked-by (or S-o-b if Eric includes my GTT mapping stuff) 
to Eric's patchset; hope you can do that soon so we can get a libdrm with the 
new APIs released soon.

Thanks,
-- 
Jesse Barnes, Intel Open Source Technology Center
--

From: Keith Packard
Date: Thursday, October 2, 2008 - 10:17 pm

Nick, in case you missed the plea here, we're asking if you have any
objection to shipping the mm changes present in Eric's patch in 2.6.28.
When your new pageable allocator becomes available, we'll switch over to
using that instead and revert Eric's mm changes.

We're ready to promise to support the user-land DRM interface going
forward, and we've got lots of additional work queued up behind this
merge. We'd prefer to push stuff a bit at a time rather than shipping a
lot of new code in a single kernel release.=20

--=20
keith.packard@intel.com
From: Nick Piggin
Date: Thursday, October 2, 2008 - 11:40 pm

So long as we don't have to support the shmem exports for too long,
I'm OK with that. The pageable allocator probably is probably not a
2.6.28 merge candidate at this point, so I don't want to hold things

I would have liked to see more effort going towards building the user
API starting with a pseudo filesystem rather than ioctls, but that's
just my opinion after squinting at the problem from 100 metres away..
So I don't have a strong standing to demand a change here ;)

So, I'm OK with it for 2.6.28.

I think Christoph had some concerns with the patches, and I'd like to
hear that he's happy now. Christoph? Does the pageable allocator API
satisfy your concerns, or did you have other issues with it?

Thanks,
Nick
--

Previous thread: [PATCH] CFS scheduler: documentation about scheduling policies by Martin Steigerwald on Tuesday, September 23, 2008 - 2:01 am. (9 messages)

Next thread: [Patch -tip 0/4] Creation of the initcall tracer by Frédéric Weisbecker on Tuesday, September 23, 2008 - 3:29 am. (6 messages)