Hi, So I promised I would look at this again, because I (and others) have some issues with exporting shmem_file_setup for DRM-GEM to go off and do things with. The rationale for using shmem seems to be that pageable "objects" are needed, and they can't be created by userspace because that would be ugly for some reason, and/or they are required before userland is running. I particularly don't like the idea of exposing these vfs objects to random drivers because they're likely to get things wrong or become out of synch or unreviewed if things change. I suggested a simple pageable object allocator that could live in mm and hide the exact details of how shmem / pagecache works. So I've coded that up quickly. Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I see something strange... it seems that userspace actually gets some kind of descriptor, a descriptor to an object backed by this shmem file (let's call it a "file descriptor"). Anyway, it turns out that userspace sometimes needs to pread, pwrite, and mmap these objects, but unfortunately it has no direct way to do that, due to not having open(2)ed the files directly. So what GEM does is to add some ioctls which take the "file descriptor" things, and derives the shmem file from them, and then calls into the vfs to perform the operation. If my cursory reading is correct, then my allocator won't work so well as a drop in replacement because one isn't allowed to know about the filp behind the pageable object. It would also indicate some serious crack smoking by anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison... So please, nobody who worked on that code is allowed to use ugly as an argument. Technical arguments are fine, so let's try to cover them. BTW. without knowing much of either the GEM or the SPU subsystems, the GEM problem seems similar to SPU. Did anyone look at that code? Was it ever considered to make the object allocator be a filesystem? That way you ...
Nick, From my point of view, this is exactly what's needed, although there might be some different opinions among the DRM developers. A question: Sometimes it's desirable to indicate that a page / object is "cleaned", which would mean data has moved and is backed by device memory. In that case one could either free the object or indicate to it that it can release it's pages. Is freeing / recreating such an object an expensive operation? Would it, in that case, be possible to add an object / page "cleaned" function? /Thomas --
On Tue, 23 Sep 2008 12:21:26 +0200 Also what about a uncached page allocator ? As some drivers might need them, there is no number but i think their was some concern that changing PAT too often might be costly and that we would better have a poll of such pages. Cheers, Jerome Glisse <glisse@freedesktop.org> --
IA64 has an uncached allocator. See arch/ia64/include/asm/incached.h and arch/ia64/kernel/uncached.c. Probably not exactly what you want but its a starting point. --
Ah, interesting... freeing/recreating isn't _too_ expensive, but it is going to have to allocate a lot of pages (for a big object) and copy a lot of memory. It's strange to say "cleaned", in a sense, because the allocator itself doesn't know it is being used as a writeback cache ;) (and it might get confusing with the shmem implementation because your cleaned != shmem cleaned!). I understand the operation you need, but it's tricky to make it work in the existing shmem / vm infrastructure I think. Let's call it "dontneed", and I'll add a hook in there we can play with later to see if it helps? What I could imagine is to have a second backing store (not shmem), which "dontneed" pages go onto, and they simply get discarded rather than swapped out (eg. via the ->shrinker() memory pressure indicator). You could then also register a callback to recreate these parts of memory if they have been discarded then become used again. It wouldn't be terribly difficult come to think of it... would that be useful? --
Well, the typical usage pattern is: 1) User creates a texture object, the data of which lives in a pageable object. 2) DRM decides it needs to go into V(ideo)RAM, and doesn't need a backing store. It indicates "dontneed" status on the object. 3) Data is evicted from VRAM. If it's not dirtied in VRAM and the "dontneed" pages are still around in the old backing object, fine. We can and should reuse them. If data is dirtied in VRAM or the page(s) got discarded we need new pages and to set up a copy operation. So yes, that would indeed be very useful, I think one way is to have the callback happen on a per-page basis. DRM can then collect a list of what pages need to be copied from VRAM based on the callback and its knowledge of VRAM data status and set up a single DMA operation. So the callback shouldn't implicitly mark the newly allocated pages dirty. Another useful thing I come to think of looking through the interface specification again is to have a pgprot_t argument to the pageable object vmap function, so that the VMAP mapping can be set to uncached / write-combined. This of course would imply that the caller has pinned the pages and had the default kernel mapping caching status changed as well. /Thomas --
Note that this can occur as a result of a suspend-to-memory transition at which point *all* of the objects in VRAM will need to be preserved in main memory, and so the pages aren't really 'freed', they just don't need to have valid contents, but the system should be aware that the space may be needed at some point in the future. --=20 keith.packard@intel.com
Actually, I think the pages must be allowed to be freed, and that we don't put a requirement on "pageable" to keep swap-space slots for these pages. If we hit an OOM-condition during suspend-to-memory that's bad, but let's say we required "pageable" to keep swap space slots for us, the result would perhaps be that another device wasn't able to suspend, or a user-space program was killed due to lack of swap-space prior to suspend. I'm not really sure what's the worst situation, but my feeling is that we should not require swap-space to be reserved for VRAM, and abort the suspend operation if we hit OOM. That would, in the worst case, mean that people with non-UMA laptops and a too small swap partition would see their battery run out much quicker than they expected... /Thomas --
On Fri, Sep 26, 2008 at 1:39 AM, Thomas Hellström You can't fail suspend, it just doesn't work like that. The use case is close laptop shove in bag, walk away. Having my bag heat up and the laptop inside not suspended isn't the answer ever. So with that in mind, I think we either a) keep some backing pages around, or b) make object file backed so if the swap space fills up we can back out to the file objects. --
Right, creating them from user space was just a mild inconvenience as
we'd have to come up with suitable names. The semantics don't match
exactly as most of the time we never need the filename later, but some
objects will be referenced later on so we'd need to be able to come up
with a persistent name at that point.
The real issue is that we need to create objects early in the kernel
initialization sequence to provide storage for the console frame buffer,
long before user space starts up. Lacking this, we wouldn't be able to
ation.
Sure, we've looked at using regular file descriptors for these objects
and it almost works, except for a few things:
1) We create a lot of these objects. The X server itself may have tens
of thousands of objects in use at any one time (my current session
with gitk and firefox running is using 1565 objects). Right now, the
maximum number of fds supported by 'normal' kernel configurations
is somewhat smaller than this. Even when the kernel is fixed to
support lifting this limit, we'll be at the mercy of existing user
space configurations for normal applications.
2) More annoyingly, applications which use these objects also use
select(2) and depend on being able to represent the 'real' file
descriptors in a compact space near zero. Sticking a few thousand
of these new objects into the system would require some ability to
relocate the descriptors up higher in fd space. This could also
be done in user space using dup2, but that would require managing
file descriptor allocation in user space.
3) The pread/pwrite/mmap functions that we use need additional flags
to indicate some level of application 'intent'. In particular, we
need to know whether the data is being delivered only to the GPU
or whether the CPU will need to look at it in the future. This
drives the kind of memory access used within the kernel and has
a significant performance impact.
If (when?) we can ...On Tue, 23 Sep 2008 08:50:29 -0700 I am starting to ponder if driver specific ioctl for memory object is a better plan. On intel you have your GTT mapping trick (whether you want to access an object through GTT or directly map ram page iirc), on radeon i can think of similar but bit different use case where we can ask to map some vram with special properties on it so we can access some tiled surface transparently. Of course the underlying implementation will share quite bit of code. I just think that each hw have its own specificity and that trying to hamer out all this in a common userspace API is not the best thing to do. I am pretty sure nvidia hw offer some nice trick that won't fit in any common userspace interface. So the point is that Nick proposal does make lot of sense and i think we should let each driver design their own memory object API to fit their need. We don't have the need for a common interface anymore in DRI2. Cheers, Jerome Glisse <glisse@freedesktop.org> --
Pity. Anyway, I accept that, let's move on. I guess so. A big problem of ioctls is just that they had been easier to add so they got less thought and review ;) If your ioctls are stable, Well, no not a seperate filesystem to do the pageable backing store, but a filesystem to do your object management. If there was a need for pageable You can map them to userspace if you just take a page at a time and insert them into the page tables at fault time (or mmap time if you prefer). Currently, this will mean that mmapped pages would not be swappable; is that a problem? --
Well, the goal is to "eventually" get to use fds so that at least some of our common operations can use regular sys calls. But, not having those trapped in the shmem layer may end up being a feature as we'll get to watch more closely, although dealing with the actual pread/pwrite One does what one can. Of course, in this case, 'cross platform' is just x86/x86_64 as we're talking Intel integrated graphics. When (if?) we figure out how to create a common interface across multiple cards for some of these operations, we'll probably discover mistakes. We have .=20 Now that you've written one, we could go back and think about building a file system and using fds for our operations. It would be a whole lot Yes. We leave a lot of objects mapped to user space as mmap isn't exactly cheap. We're trying to use pread/pwrite for as much bulk I/O as we can, but at this point, we're still mapping most of the pages we allocate into user space and leaving them. Things like textures and render buffers will get mmapped if there are any software fallbacks. Other objects, like vertex buffers, will almost always end up mapped. One of our explicit design goals was to make sure user space couldn't ever pin arbitrary amounts of memory; I'd hate to go back on that as it seems like an important property for any subsystem designed to support regular user applications in a general purpose desktop environment. I don't want to trust user space to do the right thing, I want to enforce that from kernel space. --=20 keith.packard@intel.com
OK, that's all that can be asked I guess. Low level object / memory OK. I will have to add some facilities to allow mmaps that go back through to tmpfs and be swappable... Thanks for the data point. --
It seems like once you've done that you might consider extracting the page allocator from shmem so that drm, tmpfs and sysv IPC would share the same underlying memory manager API. --=20 keith.packard@intel.com
That might be the cleanest logical way to do it actually. But for the moment I'm happy not to pull tmpfs apart :) Even if it seems like the wrong way around, at least it is insulated to within mm/ --
Sure; no sense changing that before we've gotten some experience with the new API anyway. Would we consider modifying sysv IPC as well? It currently uses the shmem_file_setup function although it lives a long ways from mm... --=20 keith.packard@intel.com
On Tue, 23 Sep 2008 11:10:17 +0200 Some questions.. - could you use GFP_HIGHUSER rather than GFP_HIGHUSER_MOVABLE ? I think setting mapping_gfp_mask() (address_space->flags) to appropriate value is enough. - Can we mlock pages while it's vmapped ? (or Reserve and remove from LRU) Then, new split-lru can ignore these pages while there are mapped. over-killing ? - Doesn't we need to increase page->mapcount ? - memory resource contorller should account these pages ? (Maybe this is question to myself....) Thanks, -Kame --
Hiding the details of shmem and the pagecache sounds pretty good to me (since we've got it wrong at least twice so far). Hopefully the result . I think the explanation for this got covered in other parts of the Yes, we definitely considered a filesystem (it would be nice for debugging to be able to look at object contents from a debugger process \easily). However, once we realized that fds just wouldn't work (we're allocating objects in a library, so we couldn't just dup2 them up high, and we couldn't rely on being able to up the open file limit for the process), shmem seemed to already be exactly what we wanted, and we assumed that whatever future API changes in the couple of VFS and pagecache calls we made would be easier to track than duplicating all of shmem.c into our driver. I'm porting our stuff to test on your API now, and everything looks straightforward except for mmap. For that I seem to have three options: 1) Implement a range allocator on the DRM device and have a hashtable of ranges to objects, then have a GEM hook in the mmap handler of the drm device when we find we're in one of those ranges. This was the path that TTM took. Since we have different paths to mmapping objects (direct backing store access, or aperture access, though the second isn't in my tree yet), it means we end up having multiple offsets to represent different mmap types, or multiplying the size of the range and having the top half of the range mean the other mmap type. 2) Create a kernel-internal filesystem and get struct files for the objects. This is the method that seemed like the right thing to do in the linux style, so I've been trying to figure that part out. I've been assured that libfs makes my job easy here, but as I look at it I'm less sure. The sticking point to me is how the page list I get from your API ends up getting used by simple_file_* and generic_file_*. And, in the future where the pageable memory allocator is actually pageable while mmapped, what does ...
I don't think anyone would argue that using normal system calls would be ugly, but there are several limitations with that approach, including the fact that some of our operations become slightly more difficult to do, along with the other limitations mentioned in drm_gem.c and in other threads. At this point I think we should go ahead and include Eric's earlier patchset into drm-next, and continue to refine the internals along the lines of what you've posted here in the post-2.6.28 timeframe. The ioctl based interfaces (there aren't too many) are something we can support going forward, so we should be able to rip up/clean up the implementation over time as the VM becomes more friendly to these sort of operations. Any objections? Dave, you can add my Acked-by (or S-o-b if Eric includes my GTT mapping stuff) to Eric's patchset; hope you can do that soon so we can get a libdrm with the new APIs released soon. Thanks, -- Jesse Barnes, Intel Open Source Technology Center --
Nick, in case you missed the plea here, we're asking if you have any objection to shipping the mm changes present in Eric's patch in 2.6.28. When your new pageable allocator becomes available, we'll switch over to using that instead and revert Eric's mm changes. We're ready to promise to support the user-land DRM interface going forward, and we've got lots of additional work queued up behind this merge. We'd prefer to push stuff a bit at a time rather than shipping a lot of new code in a single kernel release.=20 --=20 keith.packard@intel.com
So long as we don't have to support the shmem exports for too long, I'm OK with that. The pageable allocator probably is probably not a 2.6.28 merge candidate at this point, so I don't want to hold things I would have liked to see more effort going towards building the user API starting with a pseudo filesystem rather than ioctls, but that's just my opinion after squinting at the problem from 100 metres away.. So I don't have a strong standing to demand a change here ;) So, I'm OK with it for 2.6.28. I think Christoph had some concerns with the patches, and I'd like to hear that he's happy now. Christoph? Does the pageable allocator API satisfy your concerns, or did you have other issues with it? Thanks, Nick --
