From: Matthew Wilcox <matthew@wil.cx>
Some devices have a BAR at a non-standard address. The pci_read_base()
API allows us to probe these BARs and fill in a resource for it as if
they were standard BARs.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
drivers/pci/probe.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
include/linux/pci.h | 10 ++++++++++
2 files changed, 50 insertions(+), 7 deletions(-)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 7098dfb..1518c4f 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -181,13 +181,6 @@ static u64 pci_size(u64 base, u64 maxbase, u64 mask)
return size;
}
-enum pci_bar_type {
- pci_bar_unknown, /* Standard PCI BAR probe */
- pci_bar_io, /* An io port BAR */
- pci_bar_mem32, /* A 32-bit memory BAR */
- pci_bar_mem64, /* A 64-bit memory BAR */
-};
-
static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar)
{
if ((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
@@ -300,6 +293,46 @@ static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
goto out;
}
+/**
+ * pci_read_base - Read a BAR from a specified location
+ * @dev: The PCI device to read
+ * @type: The type of BAR to read
+ * @res: A struct resource to be filled in
+ * @reg: The address in PCI config space to read the BAR from.
+ *
+ * Some devices have BARs in unusual places. This function lets a driver ask
+ * the PCI subsystem to read it and place it in the resource tree. If it is
+ * like a ROM BAR with an enable in bit 0, the caller should specify a @type
+ * of io, mem32 or mem64. If it's like a normal BAR with memory type in the
+ * low bits, specify unknown, even if the caller knows what kind of BAR it is.
+ *
+ * Returns -ENXIO if the BAR was not successfully read. If the BAR is read,
+ * but no suitable parent resource can be found for the BAR, this function
+ * returns -ENODEV. If the resource cannot be inserted into the resource tree,
+ ...From: Keith Packard <keithp@keithp.com>
GEM needs to create shmem files and get pages related to a shmem file, and
using this pair of functions is the easiest way to do that.
Signed-off-by: Eric Anholt <eric@anholt.net>
---
include/linux/mm.h | 11 +++++++++++
mm/shmem.c | 17 ++++-------------
2 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 866a3db..6e54210 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -715,6 +715,17 @@ static inline int shmem_lock(struct file *file, int lock,
#endif
struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags);
+/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+enum sgp_type {
+ SGP_READ, /* don't exceed i_size, don't allocate page */
+ SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
+ SGP_WRITE, /* may exceed i_size, may allocate page */
+};
+
+int shmem_getpage(struct inode *inode, unsigned long idx,
+ struct page **pagep, enum sgp_type sgp, int *type);
+
int shmem_zero_setup(struct vm_area_struct *);
#ifndef CONFIG_MMU
diff --git a/mm/shmem.c b/mm/shmem.c
index c1e5a3b..7166ff3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -77,14 +77,6 @@
/* Pretend that each entry is of this size in directory's i_size */
#define BOGO_DIRENT_SIZE 20
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
-enum sgp_type {
- SGP_READ, /* don't exceed i_size, don't allocate page */
- SGP_CACHE, /* don't exceed i_size, may allocate page */
- SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
- SGP_WRITE, /* may exceed i_size, may allocate page */
-};
-
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -97,9 +89,6 @@ static unsigned long shmem_default_max_inodes(void)
}
#endif
-static int shmem_getpage(struct inode *inode, unsigned long idx,
- struct page **pagep, enum ...GEM allows the creation of persistent buffer objects accessible by the graphics device through new ioctls for managing execution of commands on the device. The userland API is almost entirely driver-specific to ensure that any driver building on this model can easily map the interface to individual driver requirements. GEM is used by the 2d driver for managing its internal state allocations and will be used for pixmap storage to reduce memory consumption and enable zero-copy GLX_EXT_texture_from_pixmap, and in the 3d driver is used to enable GL_EXT_framebuffer_object and GL_ARB_pixel_buffer_object. Signed-off-by: Eric Anholt <eric@anholt.net> --- drivers/gpu/drm/Makefile | 5 +- drivers/gpu/drm/drm_agpsupport.c | 51 +- drivers/gpu/drm/drm_cache.c | 76 + drivers/gpu/drm/drm_drv.c | 4 + drivers/gpu/drm/drm_fops.c | 6 + drivers/gpu/drm/drm_gem.c | 420 ++++++ drivers/gpu/drm/drm_memory.c | 2 + drivers/gpu/drm/drm_mm.c | 5 +- drivers/gpu/drm/drm_proc.c | 135 ++- drivers/gpu/drm/drm_stub.c | 10 + drivers/gpu/drm/i915/Makefile | 6 +- drivers/gpu/drm/i915/i915_dma.c | 94 +- drivers/gpu/drm/i915/i915_drv.c | 8 +- drivers/gpu/drm/i915/i915_drv.h | 253 ++++- drivers/gpu/drm/i915/i915_gem.c | 2497 ++++++++++++++++++++++++++++++++ drivers/gpu/drm/i915/i915_gem_debug.c | 201 +++ drivers/gpu/drm/i915/i915_gem_proc.c | 292 ++++ drivers/gpu/drm/i915/i915_gem_tiling.c | 318 ++++ drivers/gpu/drm/i915/i915_irq.c | 8 +- include/drm/drm.h | 31 + include/drm/drmP.h | 151 ++ include/drm/i915_drm.h | 332 +++++ 22 files changed, 4861 insertions(+), 44 deletions(-) create mode 100644 drivers/gpu/drm/drm_cache.c create mode 100644 drivers/gpu/drm/drm_gem.c create mode 100644 drivers/gpu/drm/i915/i915_gem.c ...
"the 2d driver" ... "the 3d driver". In the kernel source tree, "/**" means "beginning of kernel-doc notation", so please don't use it when kernel-doc isn't being used. and the preferred function format is more like: DRM_AGP_MEM *drm_agp_bind_page(struct drm_device *dev, ... --- ~Randy Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA http://linuxplumbersconf.org/ --
We only need one X.org 2D and one Mesa 3D driver for each card, but DRI supports as many rendering APIs as you like, including XvMC which offers accelerated video decode on top of DRI as well. So, the answer is that yes, at this point, we have only the two Intel integrated graphics drivers using GEM, but also no, the interface can support as many drivers as you want. I hope that GEM will encourage more graphics API research as it allows multiple user-space APIs to talk to the same underlying hardware resources. --=20 keith.packard@intel.com
"The 2D driver" means xf86-video-intel xorg driver in userland. The codebase this is coming from uses doxygen. I hadn't checked to see if all the doxygen was converted to kernel-doc. --=20 Eric Anholt eric@anholt.net eric.anholt@intel.com
Looks like I need to spend more time practicing my git-send-email incantations. Here's the belated introduction to my patch series of 0001-PCI-Add-pci_read_base-API.patch 0002-Export-shmem_file_setup-and-shmem_getpage-for-DRM-GE.patch 0003-drm-Add-GEM-graphics-execution-manager-to-i915.patch This patch series brings a long-awaited kernel memory manager to the i915 driver. This will allow us to do correct composited OpenGL, speed up OpenGL-based compositing, and enable framebuffer objects and other "new" OpenGL extensions. This patchset is also being built on to enable kernel modesetting for a non-root, flicker-free X Server. This patch series relies on a series of cleanups and fixes that have alread= y been queued by airlied. That tree can be browsed at: http://git.kernel.org/?p=3Dlinux/kernel/git/airlied/drm-2.6.git;a=3Dshortlo= g;h=3Ddrm-next The main concern we expect to hear about the GEM API is the set of ioctls t= hat look suspiciously like file-related syscalls. We've got small integer hand= les to things that we can read/write/mmap, flink and open and close, etc. Our initial plan was to use fds, but we ran into two problems as noted in drm_g= em.c: - Process limits prevent more than 1024 or so being used at a time by default, and we're looking at tens of thousands of these objects used by a single client. - Inability to allocate high fds will aggravate the X Server's select() handling, and likely that of many GL client applications as well. Overall, even if we cooked up a patch to make our GEM files get fds allocat= ed above some lower limit, interfering with clients fd namespace for these driver-internal objects sounded like a recipe for more pain than we were re= ady to sign up for. There are also concerns that for other drivers following t= he rough GEM model, they'll need to supply extra hints with mmap/pread/pwrite = for acceptable performance ("access me through this aperture, please, because I know what operation I'm going to do ...
It's fine to be getting the GEM end working using this easiest way, and I don't think we'll have a problem with EXPORT_SYMBOL_GPL() on shmem_file_setup() or something very like it - it just happens that its current callers are not in loadable modules. But shmem_getpage() is very much an internal helper function for mm/shmem.c, and I'm pretty sure we should keep it and its sgp_types that way: I'll take a look at how you're using it later on, and see what makes sense to offer instead. And I don't think you should assume that all GEM users must have CONFIG_SHMEM=y: we should aim for an interface that mm/tiny_shmem.c (redirecting to ramfs) can also support. I can believe that GEM and TINY won't have much intersection, but let's not force that by using shmem_getpage(). --
Cool. Using shmem saved a huge amount of code in DRM and allows the GEM
objects to be paged to disk if necessary. We haven't added the shrinker
API calls to finish this work, but that shouldn't take much additional
We use it to get a struct page within a shmem object; we don't need the
sgp types exposed as we only ever use SGP_DIRTY. We could use a narrower
interface like:
int shmem_file_page(struct file *file, int idx, struct page **page)
{
struct inode *inode =3D file->f_path.dentry->d_inode;
return shmem_getpage(inode, idx, page, SGP_DIRTY, NULL);
}
Seems like the simpler interface should be supportable from ramfs:
int shmem_file_page(struct file *file, int idx, struct page **page)
{
struct inode *inode =3D file->f_path.dentry->d_inode;
struct address_space *mapping =3D inode->i_mapping;
struct page *filepage;
=20
filepage =3D find_lock_page(mapping, idx);
if (!filepage)
return -ENOSPC;
*page =3D filepage;
return 0;
}
I don't know; I'm running without swap on several netbooks these days,
and I can easily believe that TINY would be a reasonable option there.
Thanks for the suggestions.
--=20
keith.packard@intel.com
Nope. Let the userspace protion create a file in shmfs instead of adding fugly kernel interfaces hdiing this fact. --
I can't create that many files; I need thousands of them. --=20 keith.packard@intel.com
>>>>> "Keith" == Keith Packard <keithp@keithp.com> writes: Keith> I can't create that many files; I need thousands of them. Why? If you need thousands of files, won't the overhead of managing them dynamically start to be a big load as well? I assume (sorry for not looking at the code in depth) that you're trying to setup specific regions of memory with various attributes for the DRI/DRM/GEM/TTF access to Video cards? Just seeing your statement that you wanted to add ioctls made me shudder and try to visuallize a better way to do this. I realize you're trying to make the drivers generic so that the *BSD port is simple too, but... thousands of files (per X server? per xclient) just seems like the wrong level. Maybe you just need to open the *entire* card with one FD and then have your own VFS like abstraction in there, which doesn't require lots of filehandles? Dunno... just trying to figure out what you're try to mediate here with ioctl() and how it could be improved/simplified. John --
We're not having any trouble with upwards of 40000 shmem file objects at this point. Most of these are around 4 pages in size, although they range up to around 16MB. A typical 3D game will use around 100MB or so Not really; these are objects used in the rendering process, vertices, command buffers, constant buffers, textures, programs, rendering surfaces and various other state buffers. Lots of these are write-once (from the CPU) and read-many (from the GPU), like textures, programs and vertices. Some of these are effectively streamed from the CPU to the GPU. Each sequence of rendering commands requires that a set of these objects be pinned to physical memory and bound to the graphics translation table within the GPU. Oh, and I allocate enough that I need to make them pageable as well; under memory pressure, I can pull objects back out of the GTT binding I could use fds if the kernel supported applications needing many thousand of them, and also some way to keep these FDs from polluting the I'm not worried about BSD; I want to build the right API for Linux at this point. And, yes, thousands of objects per 3D application. Each X pixmap will end up in one of these objects as well, and so something like Firefox will likely allocate several hundred. To keep the rendering engine busy, we'll have a couple hundred command buffers allocated as well, of 16KB That's what I'm doing at present; the card is opened with a single fd and then these shmem objects are allocated within an ioctl from there. What I don't want to do is use a single linear address space for these Suggestions welcome; I just need a few thousand multi-page allocations, which the kernel deals with quite easily these days. --=20 keith.packard@intel.com
Hi Keith, You may just have identified a scaling problem here. I see no difference to many thousand TCP/IP connections in your description. What actions on many thousand fds are supported poorly or not at all? Are you only concerned about the memory requirements? Please elaborate or point me to a place where you did already :-) Best Regards Ingo Oeser --
I didn't notice that the change in the maximum number of fds per process from 1024 to 1024*1024 back in February. That makes it possible, although requiring root privs, to allocate enough fds for this to work. The other issue is that several important applications (including the X server) use select instead of poll, and they have a small maximum number of fds that they support. It seems like this could be worked around by I should have looked at fs/file.c. --=20 keith.packard@intel.com
That would work yes. Switching to poll() would probably be even smarter, or if you have a large number of fds being scanned take a look at epoll which is likely to be far more efficient but wouldn't be available on so many systems - poll is at least standard. --
On Mon, 4 Aug 2008 09:19:30 +0100 the hard part is that DRI is a library that gets linked into existing applications... which already are using select ;-( Fixing X is one thing and not impossible.. fixing all existing games in the field is quite another -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
X libraries provide the event loop so that isn't really as complex as it seems. Shuffling the handles around for the benefit of the odd legacy app isn't something I'd disagree with but DRI and X in the general case can determine the event loop system used by the application. Alan --
On Mon, 4 Aug 2008 15:11:46 +0100 they do? I could have sworn things like glib etc do their own... but I'm no desktop guy so I can totally be wrong as well --
Uh, no, X doesn't. Gtk+ and Qt provide event loops that applications may Yeah, it would be nice if we could just fix all of the existing applications. When you get that working, let me know and I'll take you for a ride in my flying car. --=20 keith.packard@intel.com
Ok, how many need support for GEM? How many of them would change their event loop, if they can get better performance? Maybe have a slow path for legacy behavior? Really, the sleeping part of of event loops is usually hidden in some libraries and the applications have a big switch statement That is never required as long as only performance suffers, not functionality. And the applications doing every syscall themselves are used to the pain :-) Best Regards Ingo Oeser --
GEM is under OpenGL libs so they don't request support for GEM. However if the underlying OpenGL implemenation suddenly starts doing something with a lot of fds it could potentially confuse the crap out of these exisiting applications. Dave. --
GEM will underlie the OpenGL implementation that applications use; we aren't planning on writing two OpenGL implementations to work around some file descriptor issues. And, even if we want to use fds for every GEM object, we have a fairly simple way of moving them out of the way of select -- dup2. Of course, it would be nice to have some way to get the That's not historically true in desktop applications. Yes, most modern open source applications are sensible and use a library-based event Alas, GEM offers a huge increase in functionality; performance is really just a modest side benefit. In reality, as I want to avoid problems caused by ulimit, I suspect I'd end up treating most of these objects as just a bag of pages and close the related fd after passing them to the driver, effectively turning the whole exercise into a mechanism for passing the struct file from shmem to GEM through user mode instead of directly across the kernel API. I'm not sure this is a win. --=20 keith.packard@intel.com
shmfs is perfectly happy to have thousands of files, there are workloads that have a lot more than that. Note that in other subthreads there is the assumption that all of them are open and in the fd table at the same time. I haven't really seen a good explanation for that - objects on shmfs are persistant until deleted, and when you keep your parent directory fd open an openat to get at the actual object is cheap. --
A typical GPU-intensive application will have a few thousand objects full of GPU-bound data containing GPU commands, vertices, textures, rendering buffers, constants and other misc GPU state. When rendering a typical frame, all of these objects will be referenced in turn -- each frame is much like the last, so the process of drawing the frame ends up using almost exactly the same data each time. It's an animation after all. Any kind of resource constraint places severe pressure on an eviction policy -- the obvious 'LRU' turns out to be almost exactly wrong as you always eject the objects which are next in line for re-use. Obviously, the right solution is to avoid artificial resource constraints, and minimize the impact of real constraints. Limiting the number of objects we can have in-use at a time (by limiting the number of open FDs) is an artificial resource constraint, forcing re-use of closed objects to pay an extra openat cost. That may be cheap, but it's not free. Plus, we don't want persistent objects here -- when the application exits, we want the objects to disappear automatically; otherwise you'll clutter shmem with thousands of spurious files when an application crashes. So, if we create shmem files, we'll want to unlink them immediately, leaving only the fd in place to refer to them. Our alternative is to open the shmem file and pass the fd into our driver to hold onto the file while the application closes the fd. I see that as simply a kludge to work around the absence of anonymous pageable file allocation -- it seems like an abuse of the current API. Perhaps we should construct a different API to allocate a pageable object? Having shmem expose these as 'files' doesn't make a lot of sense in this context, it just happens to be an easy way to plug them into the vm system, as is evidenced by ipc/shm using them as well. --=20 keith.packard@intel.com
shmem_file_setup is already public, just not exposed to kernel modules. I suppose we could have user space allocate the shmem file (either via tmpfs or sysv ipc). tmpfs suffers from the maxfd issue, while sysv ipc runs up against the SHMMAX value. The other interface we use, shmem_getpage, doesn't seem supported through existing interfaces though. I don't want to map the file into any address space, I just need a pointer to the physical pages which underlie it. I'll take those pages and hand them to the graphics engine. shmem_getpage performs precisely the operation needed here. --=20 keith.packard@intel.com
This is not an argument. shmem_file_setup is public because that's how ipc is implemented, not because it makes sense to be used anywhere else. Lots of other things in mm/ are public that should This is how I'd suggested it work as well. I think a little bit Mapping the file into an address space might be a way to make it work (using get_user_pages to get the struct page). splice might also work. read_mapping_page or similar could also be something to look at. But using shmem_getpage seems wrong because it circumvents the vfs API. If you genuinely have problems that can't be fit into existing APIs without significant modification, and that is specific just to your app, then we could always look at making special cases for you. But it would be nice if we generically solve problems you have with processes manipulating thousands of files. --
What I may be able to do is create a file, then hand it to my driver and It seems fairly ugly to map the object to user space just to get page pointers; the expense of constructing that mapping will be entirely wasted most of the time. Would it be imprudent to use pagecache_write_begin/pagecache_write_end here? For shmem, that appears to point at functions which will do what I need. Of course, it will cause extra page-outs as each page will be marked dirty, even if the GPU never writes them.=20 While shmem offers good semantics for graphics objects, it doesn't seem like it is unique in any way, and it seems like it should be possible to There are some unique aspects to this operation which don't really have parallels in other environments. I'm doing memory management for a co-processor which uses the same pages as the CPU. So, I need to allocate many pages that are just handed to the GPU and never used by the CPU at all. Most rendering buffers are of this form -- if you ever need to access them from the CPU, you've done something terribly wrong. Then there are textures which are constructed by the CPU (usually) and handed over to the GPU for the entire lifetime of the application. These are numerous enough that we need to be able to page them to disk; the kernel driver will fault them back in when the GPU needs them again. On the other hand, there are command and vertex buffers which are constructed in user space and passed to the GPU for execution. These operate just like any bulk-data transfer, and, in fact, I'm using the pwrite API to transmit this data. For these buffers, the entire key is to make sure you respect the various caches to keep them from getting trashed. --=20 keith.packard@intel.com
True. It would make it possible for the userspace program to pass in anonymous pages, but maybe not a big deal if you're using files and pagecache_write_begin/pagecache_write_end should be reasonable, but you have to be careful of the semantics of it. For example, you can't really read anything from the page inside the calls because the filesystem may not bring it up to date. Right, that's your specific implementation, but for some cases the memory management can map or be implemented using generic primitives. Using pagecache for your memory for example should work nicely. Making it shmem specific and using internal APIs seems like a negative step until you really have a good reason to. --
We considered using anonymous pages, but as the user-mapping is not a
feature, it seemed like it wasn't the right model. Plus, many of these
objects need to be shared across multiple processes, so anonymous pages
That does look a lot more like what I want, as it returns an unlocked
page. And, makes my code look cleaner to boot:
inode =3D obj->filp->f_path.dentry->d_inode;
mapping =3D inode->i_mapping;
for (i =3D 0; i < page_count; i++) {
page =3D read_mapping_page(mapping, i, NULL);
if (IS_ERR(page)) {
ret =3D PTR_ERR(page);
DRM_ERROR("read_mapping_page failed: %d\n", ret);
i915_gem_object_free_page_list(obj);
return ret;
}
obj_priv->page_list[i] =3D page;
}
Does this look like it conforms to the vfs api? It appears to work when
Yup, I'm liking the general file mechanism, I used shmem only because
that seemed like the obvious file system you'd want underneath these
objects.
--=20
keith.packard@intel.com
Yes, that's a good suggestion from Nick, should work with shmem/tmpfs (with *proviso below) and others, and big improvement to your generality. Whether such usage conforms to VFS API I'm not so sure: as I understand it, it's really for internal use by a filesystem - if it's going to be used beyond that, we ought to add a check that the filesystem it's used upon really has a ->readpage method (and I'd rather we add such a check than you do it at your end, in case we change the implementation later to use something other than a ->readpage method - Nick, you'll be nauseated to hear I was looking to see if ->fault with a pseudo-vma could do it). But if the layering police are happy with this, I am. The *proviso is that for tmpfs itself this actually isn't as convenient as directly using shmem_getpage: because this way passes a page in to shmem_getpage, when maybe the page wanted is already here but currently marked as swapcache rather than filecache. It's an awkward extension that allows tmpfs to support ->readpage at all. But that route is in use and well-tested, and only an inefficiency when swapping, so should not cause you any problems. (I have been agonizing over the way __read_cache_page, like your original i915_gem_object_get_page_list, uses find_get_page: whereas shmem_getpage uses find_lock_page. I remember the latter is important, but don't quite remember all the why. I'm believing that it's really only the ->fault usage where it becomes vital: things get confused, on 2.4 more than 2.6, if a swapcache tmpfs page is mapped into userspace.) You don't examine page->mapping at all: good, there's a race by which it could go to NULL after you acquired the page by read_mapping_page: not by file truncation, but by swapping out at the wrong instant. Don't worry, you have the right page, just don't rely on page->mapping. (Sorry if I'm rather talking aloud to myself here.) Hugh --
It seems like I should put a check into my code that is kernel version dependent so that I can't oops if someone tries to use a filesystem that Yeah, swapping performance isn't my primary concern; I looked through the read_mapping_page codepath and it looked exactly like my existing code in the fast path, which is why I was able to just delete all of that from my driver and just call read_mapping_page. So, when I release the pages from the page cache, I'm currently calling mark_page_accessed for all pages, and set_page_dirty for pages which may have been written by the GPU. Are those calls still needed? --=20 keith.packard@intel.com
Well, I guess put the check on ->readpage into your code for now, and by the time GEM gets into Linus's tree, we should have -EINVAL checks on NULL filler() in __read_cache_page() and read_cache_page_async(), so remove check at your end before final submission. (You could leave it there, and strictly we ought to update GEM if we make any change to our implementation; but it is the kind of detail that gets overlooked - witness the way I failed to grasp the readahead side-effects of adding ->readpage into tmpfs until recently. I'm just afraid we'd break you unwittingly: better not, though easily fixed.) I'm not sending the patch right now, waiting to see if this direction I think you should drop the mark_page_accessed(): that's done in read_cache_page_async() as part of the initial read_mapping_page(). But do it again when releasing if you think there's a good chance that object will be wanted again shortly. set_page_dirty() if modified by GPU, yes, that would still be needed. For how long are these objects' pages pinned in memory like this? I ask because Rik & Lee have patches in -mm, trying to avoid long scans of LRUs cluttered with unevictable pages. I've no idea whether you're adding a lot or a little to that problem. Hugh --
I'll just remove it then; pages pulled from the GTT are unlikely to be Forever? The only reason an object would get unpinned would be on eviction from the GTT, and objects will only be evicted if we fill the aperture with other stuff. When we add a register_shrinker callback, we'll also unpin pages at that point. A busy system should have the GTT entirely full, and that will be If the idea of 1GB of pinned memory doesn't scare you, then I don't see a problem ;-) --=20 keith.packard@intel.com
Okay, thanks for the warning. It sounds like the shrinker will be important, but we'll also need to mark those pages as unevictable while they're unshrunk. (Usually when drivers grab a large number of pages, they're not on any LRU to begin with: you're being nice by choosing swappable LRU pages - in the tmpfs case - but enough to upset the balance while they're not swappable.) Offhand I can't say what will be the appropriate way to do that, it's something we need to revisit later (from your point of view it should amount to one function call or flag set somewhere, not any grand redesign). Hugh --
That seems like a sensible optimization; otherwise the system could It's not a matter of 'being nice', of course, it's a matter of keeping the system running under heavy firefox load. Check out the memory usage by your X server with numerous tabs open to image-heavy pages someday. Not making those pageable would result in OOM visiting far too often. Right now, we make these images pageable by copying them in and out of a set of pages bound permanently to the GTT. The performance implications of doing that are fairly severe, enough so that in many cases it's Ok, cool. I haven't added the shrinker support yet in any case, although that doesn't look like a big job. --=20 keith.packard@intel.com
>>>>> "Hugh" == Hugh Dickins <hugh@veritas.com> writes: Hugh> Okay, thanks for the warning. It sounds like the shrinker will Hugh> be important, but we'll also need to mark those pages as Hugh> unevictable while they're unshrunk. What about the onboard memory of graphics cards? Isn't that where Textures and such are stored as well? So once something is loaded to the card, shouldn't you be able to free it in system memory? Or swap it out ahead of time? Hugh> (Usually when drivers grab a large number of pages, they're not Hugh> on any LRU to begin with: you're being nice by choosing Hugh> swappable LRU pages - in the tmpfs case - but enough to upset Hugh> the balance while they're not swappable.) Offhand I can't say Hugh> what will be the appropriate way to do that, it's something we Hugh> need to revisit later (from your point of view it should amount Hugh> to one function call or flag set somewhere, not any grand Hugh> redesign). I just wonder here, since GPUs are different from other drivers in that (I assume) they are written too much more often than they are read to, that they should hopefully be much more amenable to drop behind semantics for pages they've been sent, to free system resources. Now I admit I am most probably smoking something which is clouding my understanding, so feel free to ingore my comments. John --
Right now, I'm working only on Intel integrated graphics, which doesn't have any on-board memory. My thinking is that we'd best solve the easiest case first before attempting the harder discrete graphics driver. Plus, Intel pays me do do integrated graphics, so I have an incentive. Not that we haven't been thinking about how this plays with a discrete card; we'll need to allocate backing store for any objects which are placed in on-card memory in case we run out of on-card memory, and also as a place to store objects when the system suspends. From the perspective of shmem, it should be precisely the same; allocating anonymous pages and handing them to the DRM driver to store video card Yup, I think you understand the situation fairly well. I've got a big pile of pages sitting idle in the GTT which could be pulled out and given back to the system. I like to keep a bunch around as the cost of getting pages out of the CPU cache and bound to the GTT is non-trivial, but if there's any memory pressure, the cost to free them is quite low. --=20 keith.packard@intel.com
Right, but it sounds adventurous to merge this work upstream before it is generic enough to work on discrete cards. Right now it only works on intel integrated cards ; integrated cards are one business, discrete are a completely different world and the issues afoot completely different. Should this go upstream, the kernel guys have to keep in mind and accept the possibility that another memory manager might be needed at some point. Stephane --
On Wed, 6 Aug 2008 18:20:40 +0200 the ATI driver guys already have the interface working. In addition.. since when is "oh you must also make it work for THAT" a requirement? Traditionally, it's up to the second person to make generic code work for them (within reason of course, but I'll argue that the bar of reasonableness has been met here) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Right, but the current code will basically force the discrete card drivers to implement backing store for all allocations. Do we want this ? Also, for cards that can handle page-based memory allocations, there is no way to make use of this feature, do we want this too ? Stephane --
Aside from not really forcing discrete card drivers to to anything (they're welcome to use or not use this stuff as they prefer), I believe discrete cards will need backing store to support paging objects to disk I don't see how any of the current code directs how future drivers might work; the user-level interface is reasonably abstract and should allow all kinds of internal organizations. Instead of complaining that the current code might not support some abstract hardware, please build something that does work and we'll see how to merge that with code for other drivers. --=20 keith.packard@intel.com
Well, this "abstract hardware" is in stores, it's called G80 and R600. If intel can't do something on its hardware shouldn't prevent others from using it. And I seriously doubt even the current interfaces would be fit for this. Stephane --
Write a driver and we'll talk. Until then, I don't think your argument is well supported. If the result is that your driver requires a completely separate interface than the Intel driver, that's OK too -- we can talk about merging them together in some unified mechanism at that point. I'm trying to get from zero to one kernel driver for Intel graphics at this point; that seems a valuable goal in itself. I hope it leads to other people building kernel drivers for other graphics hardware. Don't let me stop you from building a better mousetrap if you think the code I've written is wrong somewhere. --=20 keith.packard@intel.com
Btw, I was wondering what prevented you (you being the intel people) from fixing TTM instead of writing GEM ? That's what happened, there was TTM first, intel came first, they should've fixed it... Stephane --
Well, we (mostly Eric, but a bit of work by me as well) spent about a year trying to fix it, with Thomas' help, but after that, we still hadn't managed to make the 965 run OpenGL reasonably well. Around April 1, we (Eric and I) evaluated how long it would take to built just what we needed, rather than trying to work out a general architecture that could serve every hardware need we could think of. We figured it would take about a month to get something working, and by the end of April, we had 3D working better on 965 than we had ever managed with TTM. Given the difficulty we had understanding the TTM internals, it's possible that this was just a failure on our part. But as we expect a wide range of people to maintain this code going forward, it seemed reasonable to want something simple enough that even we could figure out how it works. I think TTM is over designed, but that is colored by experience with very simple hardware. While I think something simple, like GEM, could work on most hardware, I really don't know and invite you to try whatever approach you want. Also, we're talking about moving code into the Linux kernel source tree to support Intel integrated graphics. There is no incumbent implementation there that we're trying to replace, this is new functionality for Linux. --=20 keith.packard@intel.com
Important difference between code and interfaces. GEM is adding new userspace interface, and it's current form rather ugly ones. We'd better think about these hard, and make sure they are useful for the whole usecase and not a limited subset. --
Yes, readpage is optional... A check will be required. I don't exactly know if this would be deemed the best way to do it, but it seems *relatively* filesystem agnostic (with the readpage test being the exception), and it seems like it is to address_space mappings what get_user_pages is to virtual memory mappings. And there is no shortage of drivers using get_user_pages. At any rate, it seems a bit nicer than using shmem_getpage directly. --
Using read_mapping_page is fine on any pagecache backed file. What is much more difficult is actually writing into pagecache on a sub-page level where we don't have proper APIs for anything but the filesystem itself. --
Well, I've spent a day thinking about using existing user-space APIs to get at shmem files. While it's nice that we've discovered a filesystem-independent mechanism to pin file pages, we haven't found anything similar for creating the files. In particular, what I want is: 1) Anonymous files backed by swap 2) Freed when the last process using them exits 3) That never appear in the file system 4) Do not consume a low FD (yeah, I know, rewrite the desktop) So, what I could do is char template[] =3D "/dev/shm/drm-XXXXXX"; int fd; fd =3D mkstemp (template); unlink (template); ftruncate (fd, size) object =3D drm_create_an_object_for_a_file (fd); close (fd); While I haven't written any code yet, this should work and will even be compatible with my current user-space API. I have to create a DRM object for the file in any case, and so I don't need to hold onto the fd. Releasing the fd also eliminates any ulimit issues.=20 The drm_create_an_object_for_a_file call could return another fd. But, note that the original shmem fd has no real value to the application in this case. I can imagine other cases where mapping non-shmem files would make sense though, in particular it's fairly easy to envision mapping an image file to the GTT and having the graphics process decode and display it without any additional copies. I think this demonstrates the potential utility of the general file mapping operation. But, I'd like to have you reconsider whether it makes sense for user space to go through the above dance to create anonymous shared objects when the kernel already supports precisely the desired semantics and even exposes them to the ipc/shm implementation. We'd offer two paths in DRM -- one that used an existing file and created an object using that as backing store, and a second one that created anonymous objects using shmem as backing store. Transient data would use anonymous objects while applications could directly map arbitrary file contents as ...
We also have a need to create in-kernel objects for things like fbcon. If we cannot create these without doing a major dance around the vfs then it'll be rather ugly. Dave. --
In my opinion, doing this little song and dance (which is a few lines of quite well defined APIs, btw) in userspace is preferable to adding a single line or exporting a single function in kernel space. Unless there is a better reason than eliminating a few lines of userspace code. I'm absolutely not against exporting a nice API for a swappable, object based memory allocator using ipc or shm to the wider kernel (with a nice API rather than just using shmem functions directly of course). But the fact that most or all of this seems to be able to be done in userspace just tells me that's where it should be prototyped first. It adds nothing to maintainence costs of the kernel code, and might actually be helpful to show some shortcomings of our user API definition or implementation. In the worst case it completely fails, the effort will still show much better how and why it needs to be done in kernel. --
Yeah, Dave Airlie just reminded me of the real reason for creating an API within the kernel. We need this to support kernel mode setting where we want to provide a frame buffer console and smoothly transition to an X desktop. Ideally, the console would start before user mode was running Sure, let's see if Dave can address the need to allocate these objects from within the kernel. Of course, that allocation could also be done using existing kernel APIs. --=20 keith.packard@intel.com
Yeah, I like this approach too, but to echo what Keith & Dave already mentioned, it would make the in-kernel aspect of things much more difficult (without abusing VFS calls from within the kernel anyway). It's not just the low level gfx libraries that need to create, map, operate on, and wait for objects. The kernel also needs to create objects and map them at the very least (hopefully we can avoid most of the other stuff inside the kernel). It seems like that's hard to do without duplicating big chunks of shmem.c. Any suggestions, Nick or Christoph? I'm trying to think of analogues in other kernel subsystems, but the best I can come up with is shmem. :) Some of the cluster fabric cards have similar issues (wanting to pin/map/unmap memory, and manage on-board IOMMUs) but I think gfx may be different in that it tends to use large chunks of memory for long periods of time; and even when a given GPU program is done with a piece of memory, the very next program run is likely to need it again. And I think the in-kernel requirements in this case are fairly unique. Jesse --
Now I also have to do the same song and dance for in-kernel allocated objects? still think this is acceptable... I'm not even sure what vfs calls I really need in-kernel to do that. If someone codes up the in-kernel path to do this using 4-5 API calls instead of one exported function that IPC/SHM already does then maybe we can gauge the uglyness vs that. --
Not exactly sure what you mean by this. But I would like to see an effort made to use existing userspace APIs in order to do this swappable object allocation over tmpfs scheme. As I said, I don't object to a nice kernel implementation, but we would be in a much better position to assess it if we had an existing userspace implementation to compare it with. --
We need to allocate objects from kernel mode to get the console running. I'd prefer to let that occur before user mode was available. Also, emulating the existing fbdev syscall interface will require that we allocate objects within the kernel. Hence, the question about how we should create objects from kernel mode. I think we can do this with a series of VFS function calls. Would that series of VFS calls be preferable to directly accessing the existing shmem API? Another alternative is to improve the existing shmem API to better capture what we're trying to do here. Both drm and sysv shm just want anonymous pages that are backed by swap. If we started from scratch, what API would we like to have here? Would we have it support both shmem and hugetlbfs? --=20 keith.packard@intel.com
Yeah there's no question we need early kernel space allocations of GEM objects. We need to setup the frame buffer, ring buffer, hardware status page, and potentially other state at module init time so that people can see Improving the shmem API makes sense to me. There's a flip side to using the ioctls as well; in doing the GTT mapping with the current code, I had to add a new ioctl and create a new core function that would allow me to do I/O remapping from something other than an ->mmap hook. I think we could have conceptually cleaner code and less hassle if we extended shmem a bit to allow for shmem "drivers" that could hook into its open/close/mmap/etc. routines (though in the mmap case in particular we'd either need to abuse an existing mmap flag or create a new one to recognize the backing store/GTT mapping distinction). What do you think, Nick? Adding this stuff to shmem would probably involve using the file->private_data inside shmem, and creating a new sub-driver registration function, then checking for sub-ops in the various important shmem operations structs... The advantage of all this is that we could probably use regular fds for the most part (assuming we get a "high fd" mapping function sometime soon), and all the regular file operations routines, making GEM look a like more like a regular file system driver. As for in-kernel stuff, as long as we keep the GEM shmem hooks separate from the actual bookkeeping (like we do now with i915_gem_create_ioctl() vs drm_gem_object_alloc() for example) we should be able to do the in-kernel stuff w/o jumping through too many VFS/VM hoops. That would also assume we don't care about swapping in the in-kernel case, which we don't; we want to pin the kernel allocated frame buffer and other memory anyway, so using the internal functions should be fine. Thanks, -- Jesse Barnes, Intel Open Source Technology Center --
On Tue, 19 Aug 2008 11:50:11 -0700 What about suspend to disk ? How do we save such buffers ? Btw i think that GTT looks a lot like IOMMU, i don't know the IOMMU kernel side API that much, but from memory i think that you have call to ask IOMMU mapping why not do somethings like that for GTT ? You get normal mapping of object diret but userspace can ask some kind of GTT mapping on a given object. Anyway new flag on fd sounds good enough too. Cheers, Jerome Glisse <glisse@freedesktop.org> --
We'll have to deal with it like other kernel memory, or in the case of cards with VRAM maybe have a special hook that copies out the front buffer. Depends on how Dave implemented the GEM stuff on a device with VRAM (I At a high level that's pretty much what we're doing. We already have to map things through the GTT in the kernel for pwrite/pread etc., so that code is done. It's just exposing it to userspace that's a bit of a pain, since we have an all-ioctl interface... -- Jesse Barnes, Intel Open Source Technology Center --
