Re: [PATCH] Export shmem_file_setup and shmem_getpage for DRM-GEM

Previous thread: none

Next thread: [PATCH] ath5k : ath5k_config_interface deadlock fix by Dave Young on Friday, August 1, 2008 - 12:46 am. (15 messages)
From: Eric Anholt
Date: Thursday, July 31, 2008 - 11:58 pm

From: Matthew Wilcox <matthew@wil.cx>

Some devices have a BAR at a non-standard address.  The pci_read_base()
API allows us to probe these BARs and fill in a resource for it as if
they were standard BARs.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 drivers/pci/probe.c |   47 ++++++++++++++++++++++++++++++++++++++++-------
 include/linux/pci.h |   10 ++++++++++
 2 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 7098dfb..1518c4f 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -181,13 +181,6 @@ static u64 pci_size(u64 base, u64 maxbase, u64 mask)
 	return size;
 }
 
-enum pci_bar_type {
-	pci_bar_unknown,	/* Standard PCI BAR probe */
-	pci_bar_io,		/* An io port BAR */
-	pci_bar_mem32,		/* A 32-bit memory BAR */
-	pci_bar_mem64,		/* A 64-bit memory BAR */
-};
-
 static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar)
 {
 	if ((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
@@ -300,6 +293,46 @@ static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
 	goto out;
 }
 
+/**
+ * pci_read_base - Read a BAR from a specified location
+ * @dev: The PCI device to read
+ * @type: The type of BAR to read
+ * @res: A struct resource to be filled in
+ * @reg: The address in PCI config space to read the BAR from.
+ *
+ * Some devices have BARs in unusual places.  This function lets a driver ask
+ * the PCI subsystem to read it and place it in the resource tree.  If it is
+ * like a ROM BAR with an enable in bit 0, the caller should specify a @type
+ * of io, mem32 or mem64.  If it's like a normal BAR with memory type in the
+ * low bits, specify unknown, even if the caller knows what kind of BAR it is.
+ *
+ * Returns -ENXIO if the BAR was not successfully read.  If the BAR is read,
+ * but no suitable parent resource can be found for the BAR, this function
+ * returns -ENODEV.  If the resource cannot be inserted into the resource tree,
+ ...
From: Eric Anholt
Date: Thursday, July 31, 2008 - 11:58 pm

From: Keith Packard <keithp@keithp.com>

GEM needs to create shmem files and get pages related to a shmem file, and
using this pair of functions is the easiest way to do that.

Signed-off-by: Eric Anholt <eric@anholt.net>
---
 include/linux/mm.h |   11 +++++++++++
 mm/shmem.c         |   17 ++++-------------
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 866a3db..6e54210 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -715,6 +715,17 @@ static inline int shmem_lock(struct file *file, int lock,
 #endif
 struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags);
 
+/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+enum sgp_type {
+	SGP_READ,	/* don't exceed i_size, don't allocate page */
+	SGP_CACHE,	/* don't exceed i_size, may allocate page */
+	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
+	SGP_WRITE,	/* may exceed i_size, may allocate page */
+};
+
+int shmem_getpage(struct inode *inode, unsigned long idx,
+		  struct page **pagep, enum sgp_type sgp, int *type);
+
 int shmem_zero_setup(struct vm_area_struct *);
 
 #ifndef CONFIG_MMU
diff --git a/mm/shmem.c b/mm/shmem.c
index c1e5a3b..7166ff3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -77,14 +77,6 @@
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
-enum sgp_type {
-	SGP_READ,	/* don't exceed i_size, don't allocate page */
-	SGP_CACHE,	/* don't exceed i_size, may allocate page */
-	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
-	SGP_WRITE,	/* may exceed i_size, may allocate page */
-};
-
 #ifdef CONFIG_TMPFS
 static unsigned long shmem_default_max_blocks(void)
 {
@@ -97,9 +89,6 @@ static unsigned long shmem_default_max_inodes(void)
 }
 #endif
 
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-			 struct page **pagep, enum ...
From: Eric Anholt
Date: Thursday, July 31, 2008 - 11:58 pm

GEM allows the creation of persistent buffer objects accessible by the
graphics device through new ioctls for managing execution of commands on the
device.  The userland API is almost entirely driver-specific to ensure that
any driver building on this model can easily map the interface to individual
driver requirements.

GEM is used by the 2d driver for managing its internal state allocations and
will be used for pixmap storage to reduce memory consumption and enable
zero-copy GLX_EXT_texture_from_pixmap, and in the 3d driver is used to enable
GL_EXT_framebuffer_object and GL_ARB_pixel_buffer_object.

Signed-off-by: Eric Anholt <eric@anholt.net>
---
 drivers/gpu/drm/Makefile               |    5 +-
 drivers/gpu/drm/drm_agpsupport.c       |   51 +-
 drivers/gpu/drm/drm_cache.c            |   76 +
 drivers/gpu/drm/drm_drv.c              |    4 +
 drivers/gpu/drm/drm_fops.c             |    6 +
 drivers/gpu/drm/drm_gem.c              |  420 ++++++
 drivers/gpu/drm/drm_memory.c           |    2 +
 drivers/gpu/drm/drm_mm.c               |    5 +-
 drivers/gpu/drm/drm_proc.c             |  135 ++-
 drivers/gpu/drm/drm_stub.c             |   10 +
 drivers/gpu/drm/i915/Makefile          |    6 +-
 drivers/gpu/drm/i915/i915_dma.c        |   94 +-
 drivers/gpu/drm/i915/i915_drv.c        |    8 +-
 drivers/gpu/drm/i915/i915_drv.h        |  253 ++++-
 drivers/gpu/drm/i915/i915_gem.c        | 2497 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_debug.c  |  201 +++
 drivers/gpu/drm/i915/i915_gem_proc.c   |  292 ++++
 drivers/gpu/drm/i915/i915_gem_tiling.c |  318 ++++
 drivers/gpu/drm/i915/i915_irq.c        |    8 +-
 include/drm/drm.h                      |   31 +
 include/drm/drmP.h                     |  151 ++
 include/drm/i915_drm.h                 |  332 +++++
 22 files changed, 4861 insertions(+), 44 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_cache.c
 create mode 100644 drivers/gpu/drm/drm_gem.c
 create mode 100644 drivers/gpu/drm/i915/i915_gem.c
 ...
From: Randy Dunlap
Date: Friday, August 1, 2008 - 8:44 am

"the 2d driver" ... "the 3d driver".

In the kernel source tree, "/**" means "beginning of kernel-doc notation",
so please don't use it when kernel-doc isn't being used.

and the preferred function format is more like:

DRM_AGP_MEM *drm_agp_bind_page(struct drm_device *dev,
...




---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/
--

From: Keith Packard
Date: Friday, August 1, 2008 - 11:11 am

We only need one X.org 2D and one Mesa 3D driver for each card, but DRI
supports as many rendering APIs as you like, including XvMC which offers
accelerated video decode on top of DRI as well.

So, the answer is that yes, at this point, we have only the two Intel
integrated graphics drivers using GEM, but also no, the interface can
support as many drivers as you want.

I hope that GEM will encourage more graphics API research as it allows
multiple user-space APIs to talk to the same underlying hardware
resources.

--=20
keith.packard@intel.com
From: Eric Anholt
Date: Tuesday, August 5, 2008 - 6:11 pm

"The 2D driver" means xf86-video-intel xorg driver in userland.

The codebase this is coming from uses doxygen.  I hadn't checked to see
if all the doxygen was converted to kernel-doc.

--=20
Eric Anholt
eric@anholt.net                         eric.anholt@intel.com


From: Eric Anholt
Date: Friday, August 1, 2008 - 12:10 am

Looks like I need to spend more time practicing my git-send-email
incantations.  Here's the belated introduction to my patch series of

0001-PCI-Add-pci_read_base-API.patch
0002-Export-shmem_file_setup-and-shmem_getpage-for-DRM-GE.patch
0003-drm-Add-GEM-graphics-execution-manager-to-i915.patch

This patch series brings a long-awaited kernel memory manager to the i915
driver.  This will allow us to do correct composited OpenGL, speed up
OpenGL-based compositing, and enable framebuffer objects and other "new"
OpenGL extensions.  This patchset is also being built on to enable kernel
modesetting for a non-root, flicker-free X Server.

This patch series relies on a series of cleanups and fixes that have alread=
y
been queued by airlied.  That tree can be browsed at:
http://git.kernel.org/?p=3Dlinux/kernel/git/airlied/drm-2.6.git;a=3Dshortlo=
g;h=3Ddrm-next

The main concern we expect to hear about the GEM API is the set of ioctls t=
hat
look suspiciously like file-related syscalls.  We've got small integer hand=
les
to things that we can read/write/mmap, flink and open and close, etc.  Our
initial plan was to use fds, but we ran into two problems as noted in drm_g=
em.c:

- Process limits prevent more than 1024 or so being used at a time by
  default, and we're looking at tens of thousands of these objects used by
  a single client.
- Inability to allocate high fds will aggravate the X Server's select()
  handling, and likely that of many GL client applications as well.

Overall, even if we cooked up a patch to make our GEM files get fds allocat=
ed
above some lower limit, interfering with clients fd namespace for these
driver-internal objects sounded like a recipe for more pain than we were re=
ady
to sign up for.  There are also concerns that for other drivers following t=
he
rough GEM model, they'll need to supply extra hints with mmap/pread/pwrite =
for
acceptable performance ("access me through this aperture, please, because I
know what operation I'm going to do ...
From: Hugh Dickins
Date: Friday, August 1, 2008 - 3:57 am

It's fine to be getting the GEM end working using this easiest way,
and I don't think we'll have a problem with EXPORT_SYMBOL_GPL() on
shmem_file_setup() or something very like it - it just happens that
its current callers are not in loadable modules.

But shmem_getpage() is very much an internal helper function for
mm/shmem.c, and I'm pretty sure we should keep it and its sgp_types
that way: I'll take a look at how you're using it later on, and see
what makes sense to offer instead.

And I don't think you should assume that all GEM users must have
CONFIG_SHMEM=y: we should aim for an interface that mm/tiny_shmem.c
(redirecting to ramfs) can also support.  I can believe that GEM and
TINY won't have much intersection, but let's not force that by using
shmem_getpage().

--

From: Keith Packard
Date: Friday, August 1, 2008 - 11:06 am

Cool. Using shmem saved a huge amount of code in DRM and allows the GEM
objects to be paged to disk if necessary. We haven't added the shrinker
API calls to finish this work, but that shouldn't take much additional

We use it to get a struct page within a shmem object; we don't need the
sgp types exposed as we only ever use SGP_DIRTY. We could use a narrower
interface like:

int shmem_file_page(struct file *file, int idx, struct page **page)
{
        struct inode *inode =3D file->f_path.dentry->d_inode;

        return shmem_getpage(inode, idx, page, SGP_DIRTY, NULL);
}

Seems like the simpler interface should be supportable from ramfs:

int shmem_file_page(struct file *file, int idx, struct page **page)
{
        struct inode *inode =3D file->f_path.dentry->d_inode;
        struct address_space *mapping =3D inode->i_mapping;
        struct page *filepage;
       =20
        filepage =3D find_lock_page(mapping, idx);
        if (!filepage)
                return -ENOSPC;
        *page =3D filepage;
        return 0;
}


I don't know; I'm running without swap on several netbooks these days,
and I can easily believe that TINY would be a reasonable option there.
Thanks for the suggestions.

--=20
keith.packard@intel.com
From: Christoph Hellwig
Date: Friday, August 1, 2008 - 1:50 pm

Nope.  Let the userspace protion create a file in shmfs instead of
adding fugly kernel interfaces hdiing this fact.

--

From: Keith Packard
Date: Friday, August 1, 2008 - 4:01 pm

I can't create that many files; I need thousands of them.

--=20
keith.packard@intel.com
From: John Stoffel
Date: Sunday, August 3, 2008 - 5:49 am

>>>>> "Keith" == Keith Packard <keithp@keithp.com> writes:


Keith> I can't create that many files; I need thousands of them.

Why?  If you need thousands of files, won't the overhead of managing
them dynamically start to be a big load as well?  

I assume (sorry for not looking at the code in depth) that you're
trying to setup specific regions of memory with various attributes for
the DRI/DRM/GEM/TTF access to Video cards?  

Just seeing your statement that you wanted to add ioctls made me
shudder and try to visuallize a better way to do this.  

I realize you're trying to make the drivers generic so that the *BSD
port is simple too, but... thousands of files (per X server?  per
xclient) just seems like the wrong level.

Maybe you just need to open the *entire* card with one FD and then
have your own VFS like abstraction in there, which doesn't require
lots of filehandles?  

Dunno... just trying to figure out what you're try to mediate here
with ioctl() and how it could be improved/simplified.


John

--

From: Keith Packard
Date: Sunday, August 3, 2008 - 10:52 am

We're not having any trouble with upwards of 40000 shmem file objects at
this point. Most of these are around 4 pages in size, although they
range up to around 16MB. A typical 3D game will use around 100MB or so

Not really; these are objects used in the rendering process, vertices,
command buffers, constant buffers, textures, programs, rendering
surfaces and various other state buffers.  Lots of these are write-once
(from the CPU) and read-many (from the GPU), like textures, programs and
vertices. Some of these are effectively streamed from the CPU to the
GPU.

Each sequence of rendering commands requires that a set of these objects
be pinned to physical memory and bound to the graphics translation table
within the GPU.

Oh, and I allocate enough that I need to make them pageable as well;
under memory pressure, I can pull objects back out of the GTT binding

I could use fds if the kernel supported applications needing many
thousand of them, and also some way to keep these FDs from polluting the

I'm not worried about BSD; I want to build the right API for Linux at
this point.

And, yes, thousands of objects per 3D application. Each X pixmap will
end up in one of these objects as well, and so something like Firefox
will likely allocate several hundred. To keep the rendering engine busy,
we'll have a couple hundred command buffers allocated as well, of 16KB

That's what I'm doing at present; the card is opened with a single fd
and then these shmem objects are allocated within an ioctl from there.
What I don't want to do is use a single linear address space for these

Suggestions welcome; I just need a few thousand multi-page allocations,
which the kernel deals with quite easily these days.

--=20
keith.packard@intel.com
From: Ingo Oeser
Date: Sunday, August 3, 2008 - 4:35 pm

Hi Keith,


You may just have identified a scaling problem here. 
I see no difference to many thousand TCP/IP connections in your description.

What actions on many thousand fds are supported poorly or not at all?
Are you only concerned about the memory requirements?

Please elaborate or point me to a place where you did already :-)


Best Regards

Ingo Oeser
--

From: Keith Packard
Date: Sunday, August 3, 2008 - 5:19 pm

I didn't notice that the change in the maximum number of fds per process
from 1024 to 1024*1024 back in February. That makes it possible,
although requiring root privs, to allocate enough fds for this to work.

The other issue is that several important applications (including the X
server) use select instead of poll, and they have a small maximum number
of fds that they support. It seems like this could be worked around by

I should have looked at fs/file.c.

--=20
keith.packard@intel.com

That would work yes. Switching to poll() would probably be even smarter,
or if you have a large number of fds being scanned take a look at epoll
which is likely to be far more efficient but wouldn't be available on so
many systems - poll is at least standard.
--

From: Arjan van de Ven
Date: Monday, August 4, 2008 - 6:51 am

On Mon, 4 Aug 2008 09:19:30 +0100

the hard part is that DRI is a library that gets linked into existing
applications... which already are using select ;-(

Fixing X is one thing and not impossible.. fixing all existing games in
the field is quite another


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--


X libraries provide the event loop so that isn't really as complex as it
seems. Shuffling the handles around for the benefit of the odd legacy app
isn't something I'd disagree with but DRI and X in the general case can
determine the event loop system used by the application.

Alan
--

From: Arjan van de Ven
Date: Monday, August 4, 2008 - 9:38 am

On Mon, 4 Aug 2008 15:11:46 +0100

they do?
I could have sworn things like glib etc do their own...
but I'm no desktop guy so I can totally be wrong as well
--

From: Keith Packard
Date: Monday, August 4, 2008 - 9:58 am

Uh, no, X doesn't. Gtk+ and Qt provide event loops that applications may

Yeah, it would be nice if we could just fix all of the existing
applications. When you get that working, let me know and I'll take you
for a ride in my flying car.

--=20
keith.packard@intel.com

Ok, how many need support for GEM? How many of them would change their
event loop, if they can get better performance?

Maybe have a slow path for legacy behavior?

Really, the sleeping part of of event loops is usually hidden 
in some libraries and the applications have a big switch statement 

That is never required as long as only performance suffers, 
not functionality.

And the applications doing every syscall themselves are used to the pain :-)


Best Regards

Ingo Oeser
--


GEM is under OpenGL libs so they don't request support for GEM.
However if the underlying OpenGL implemenation
suddenly starts doing something with a lot of fds it could potentially
confuse the crap out of these exisiting applications.

Dave.
--

From: Keith Packard
Date: Monday, August 4, 2008 - 5:34 pm

GEM will underlie the OpenGL implementation that applications use; we
aren't planning on writing two OpenGL implementations to work around
some file descriptor issues. And, even if we want to use fds for every
GEM object, we have a fairly simple way of moving them out of the way of
select -- dup2. Of course, it would be nice to have some way to get the

That's not historically true in desktop applications. Yes, most modern
open source applications are sensible and use a library-based event

Alas, GEM offers a huge increase in functionality; performance is really
just a modest side benefit.

In reality, as I want to avoid problems caused by ulimit, I suspect I'd
end up treating most of these objects as just a bag of pages and close
the related fd after passing them to the driver, effectively turning the
whole exercise into a mechanism for passing the struct file from shmem
to GEM through user mode instead of directly across the kernel API. I'm
not sure this is a win.

--=20
keith.packard@intel.com
From: Christoph Hellwig
Date: Sunday, August 10, 2008 - 6:23 pm

shmfs is perfectly happy to have thousands of files, there are workloads
that have a lot more than that.  Note that in other subthreads there is
the assumption that all of them are open and in the fd table at the same
time.  I haven't really seen a good explanation for that - objects on
shmfs are persistant until deleted, and when you keep your parent
directory fd open an openat to get at the actual object is cheap.

--

From: Keith Packard
Date: Sunday, August 10, 2008 - 8:03 pm

A typical GPU-intensive application will have a few thousand objects
full of GPU-bound data containing GPU commands, vertices, textures,
rendering buffers, constants and other misc GPU state.

When rendering a typical frame, all of these objects will be referenced
in turn -- each frame is much like the last, so the process of drawing
the frame ends up using almost exactly the same data each time. It's an
animation after all.

Any kind of resource constraint places severe pressure on an eviction
policy -- the obvious 'LRU' turns out to be almost exactly wrong as you
always eject the objects which are next in line for re-use. Obviously,
the right solution is to avoid artificial resource constraints, and
minimize the impact of real constraints.

Limiting the number of objects we can have in-use at a time (by limiting
the number of open FDs) is an artificial resource constraint, forcing
re-use of closed objects to pay an extra openat cost. That may be cheap,
but it's not free.

Plus, we don't want persistent objects here -- when the application
exits, we want the objects to disappear automatically; otherwise you'll
clutter shmem with thousands of spurious files when an application
crashes. So, if we create shmem files, we'll want to unlink them
immediately, leaving only the fd in place to refer to them.

Our alternative is to open the shmem file and pass the fd into our
driver to hold onto the file while the application closes the fd.
I see that as simply a kludge to work around the absence of anonymous
pageable file allocation -- it seems like an abuse of the current API.

Perhaps we should construct a different API to allocate a pageable
object? Having shmem expose these as 'files' doesn't make a lot of sense
in this context, it just happens to be an easy way to plug them into the
vm system, as is evidenced by ipc/shm using them as well.

--=20
keith.packard@intel.com
From: Keith Packard
Date: Sunday, August 3, 2008 - 6:54 pm

shmem_file_setup is already public, just not exposed to kernel modules.
I suppose we could have user space allocate the shmem file (either via
tmpfs or sysv ipc). tmpfs suffers from the maxfd issue, while sysv ipc
runs up against the SHMMAX value.

The other interface we use, shmem_getpage, doesn't seem supported
through existing interfaces though. I don't want to map the file into
any address space, I just need a pointer to the physical pages which
underlie it. I'll take those pages and hand them to the graphics engine.
shmem_getpage performs precisely the operation needed here.

--=20
keith.packard@intel.com
From: Nick Piggin
Date: Monday, August 4, 2008 - 2:02 am

This is not an argument. shmem_file_setup is public because that's
how ipc is implemented, not because it makes sense to be used
anywhere else. Lots of other things in mm/ are public that should

This is how I'd suggested it work as well. I think a little bit

Mapping the file into an address space might be a way to make it
work (using get_user_pages to get the struct page). splice might
also work. read_mapping_page or similar could also be something to
look at. But using shmem_getpage seems wrong because it circumvents
the vfs API.

If you genuinely have problems that can't be fit into existing
APIs without significant modification, and that is specific just to
your app, then we could always look at making special cases for you.
But it would be nice if we generically solve problems you have with
processes manipulating thousands of files.
--

From: Keith Packard
Date: Monday, August 4, 2008 - 3:26 am

What I may be able to do is create a file, then hand it to my driver and

It seems fairly ugly to map the object to user space just to get page
pointers; the expense of constructing that mapping will be entirely
wasted most of the time.

Would it be imprudent to use pagecache_write_begin/pagecache_write_end
here? For shmem, that appears to point at functions which will do what I
need. Of course, it will cause extra page-outs as each page will be
marked dirty, even if the GPU never writes them.=20

While shmem offers good semantics for graphics objects, it doesn't seem
like it is unique in any way, and it seems like it should be possible to

There are some unique aspects to this operation which don't really have
parallels in other environments.

I'm doing memory management for a co-processor which uses the same pages
as the CPU. So, I need to allocate many pages that are just handed to
the GPU and never used by the CPU at all. Most rendering buffers are of
this form -- if you ever need to access them from the CPU, you've done
something terribly wrong.

Then there are textures which are constructed by the CPU (usually) and
handed over to the GPU for the entire lifetime of the application. These
are numerous enough that we need to be able to page them to disk; the
kernel driver will fault them back in when the GPU needs them again.

On the other hand, there are command and vertex buffers which are
constructed in user space and passed to the GPU for execution. These
operate just like any bulk-data transfer, and, in fact, I'm using the
pwrite API to transmit this data. For these buffers, the entire key is
to make sure you respect the various caches to keep them from getting
trashed.

--=20
keith.packard@intel.com
From: Nick Piggin
Date: Monday, August 4, 2008 - 3:43 am

True. It would make it possible for the userspace program to pass in
anonymous pages, but maybe not a big deal if you're using files and

pagecache_write_begin/pagecache_write_end should be reasonable, but you
have to be careful of the semantics of it. For example, you can't really
read anything from the page inside the calls because the filesystem may
not bring it up to date.


Right, that's your specific implementation, but for some cases the
memory management can map or be implemented using generic primitives.
Using pagecache for your memory for example should work nicely. Making
it shmem specific and using internal APIs seems like a negative step
until you really have a good reason to.
--

From: Keith Packard
Date: Monday, August 4, 2008 - 4:45 am

We considered using anonymous pages, but as the user-mapping is not a
feature, it seemed like it wasn't the right model. Plus, many of these
objects need to be shared across multiple processes, so anonymous pages


That does look a lot more like what I want, as it returns an unlocked
page. And, makes my code look cleaner to boot:

        inode =3D obj->filp->f_path.dentry->d_inode;
        mapping =3D inode->i_mapping;
        for (i =3D 0; i < page_count; i++) {
                page =3D read_mapping_page(mapping, i, NULL);
                if (IS_ERR(page)) {
                        ret =3D PTR_ERR(page);
                        DRM_ERROR("read_mapping_page failed: %d\n", ret);
                        i915_gem_object_free_page_list(obj);
                        return ret;
                }
                obj_priv->page_list[i] =3D page;
        }

Does this look like it conforms to the vfs api? It appears to work when

Yup, I'm liking the general file mechanism, I used shmem only because
that seemed like the obvious file system you'd want underneath these
objects.

--=20
keith.packard@intel.com
From: Hugh Dickins
Date: Monday, August 4, 2008 - 10:09 am

Yes, that's a good suggestion from Nick, should work with shmem/tmpfs
(with *proviso below) and others, and big improvement to your generality.

Whether such usage conforms to VFS API I'm not so sure: as I understand
it, it's really for internal use by a filesystem - if it's going to be
used beyond that, we ought to add a check that the filesystem it's used
upon really has a ->readpage method (and I'd rather we add such a check
than you do it at your end, in case we change the implementation later
to use something other than a ->readpage method - Nick, you'll be
nauseated to hear I was looking to see if ->fault with a pseudo-vma
could do it).  But if the layering police are happy with this, I am.

The *proviso is that for tmpfs itself this actually isn't as convenient
as directly using shmem_getpage: because this way passes a page in to
shmem_getpage, when maybe the page wanted is already here but currently
marked as swapcache rather than filecache.  It's an awkward extension
that allows tmpfs to support ->readpage at all.  But that route is in
use and well-tested, and only an inefficiency when swapping, so should
not cause you any problems.

(I have been agonizing over the way __read_cache_page, like your
original i915_gem_object_get_page_list, uses find_get_page: whereas
shmem_getpage uses find_lock_page.  I remember the latter is important,
but don't quite remember all the why.  I'm believing that it's really
only the ->fault usage where it becomes vital: things get confused, on
2.4 more than 2.6, if a swapcache tmpfs page is mapped into userspace.)

You don't examine page->mapping at all: good, there's a race by which
it could go to NULL after you acquired the page by read_mapping_page:
not by file truncation, but by swapping out at the wrong instant.
Don't worry, you have the right page, just don't rely on page->mapping.

(Sorry if I'm rather talking aloud to myself here.)

Hugh
--

From: Keith Packard
Date: Monday, August 4, 2008 - 10:25 am

It seems like I should put a check into my code that is kernel version
dependent so that I can't oops if someone tries to use a filesystem that

Yeah, swapping performance isn't my primary concern; I looked through
the read_mapping_page codepath and it looked exactly like my existing
code in the fast path, which is why I was able to just delete all of
that from my driver and just call read_mapping_page.

So, when I release the pages from the page cache, I'm currently calling
mark_page_accessed for all pages, and set_page_dirty for pages which may
have been written by the GPU. Are those calls still needed?

--=20
keith.packard@intel.com
From: Hugh Dickins
Date: Monday, August 4, 2008 - 11:39 am

Well, I guess put the check on ->readpage into your code for now, and
by the time GEM gets into Linus's tree, we should have -EINVAL checks
on NULL filler() in __read_cache_page() and read_cache_page_async(),
so remove check at your end before final submission.

(You could leave it there, and strictly we ought to update GEM if we
make any change to our implementation; but it is the kind of detail
that gets overlooked - witness the way I failed to grasp the readahead
side-effects of adding ->readpage into tmpfs until recently.  I'm just
afraid we'd break you unwittingly: better not, though easily fixed.)

I'm not sending the patch right now, waiting to see if this direction

I think you should drop the mark_page_accessed(): that's done in
read_cache_page_async() as part of the initial read_mapping_page().
But do it again when releasing if you think there's a good chance
that object will be wanted again shortly.  set_page_dirty() if
modified by GPU, yes, that would still be needed.

For how long are these objects' pages pinned in memory like this?
I ask because Rik & Lee have patches in -mm, trying to avoid long
scans of LRUs cluttered with unevictable pages.  I've no idea
whether you're adding a lot or a little to that problem.

Hugh
--

From: Keith Packard
Date: Monday, August 4, 2008 - 12:20 pm

I'll just remove it then; pages pulled from the GTT are unlikely to be

Forever? The only reason an object would get unpinned would be on
eviction from the GTT, and objects will only be evicted if we fill the
aperture with other stuff. When we add a register_shrinker callback,
we'll also unpin pages at that point.

A busy system should have the GTT entirely full, and that will be

If the idea of 1GB of pinned memory doesn't scare you, then I don't see
a problem ;-)

--=20
keith.packard@intel.com
From: Hugh Dickins
Date: Monday, August 4, 2008 - 12:55 pm

Okay, thanks for the warning.  It sounds like the shrinker will be
important, but we'll also need to mark those pages as unevictable
while they're unshrunk.  (Usually when drivers grab a large number
of pages, they're not on any LRU to begin with: you're being nice
by choosing swappable LRU pages - in the tmpfs case - but enough
to upset the balance while they're not swappable.)  Offhand I can't
say what will be the appropriate way to do that, it's something we
need to revisit later (from your point of view it should amount to
one function call or flag set somewhere, not any grand redesign).

Hugh
--

From: Keith Packard
Date: Monday, August 4, 2008 - 2:37 pm

That seems like a sensible optimization; otherwise the system could

It's not a matter of 'being nice', of course, it's a matter of keeping
the system running under heavy firefox load. Check out the memory usage
by your X server with numerous tabs open to image-heavy pages someday.
Not making those pageable would result in OOM visiting far too often.

Right now, we make these images pageable by copying them in and out of a
set of pages bound permanently to the GTT. The performance implications
of doing that are fairly severe, enough so that in many cases it's

Ok, cool. I haven't added the shrinker support yet in any case, although
that doesn't look like a big job.

--=20
keith.packard@intel.com
From: John Stoffel
Date: Monday, August 4, 2008 - 7:25 pm

>>>>> "Hugh" == Hugh Dickins <hugh@veritas.com> writes:


Hugh> Okay, thanks for the warning.  It sounds like the shrinker will
Hugh> be important, but we'll also need to mark those pages as
Hugh> unevictable while they're unshrunk.  

What about the onboard memory of graphics cards?  Isn't that where
Textures and such are stored as well?  So once something is loaded to
the card, shouldn't you be able to free it in system memory?  Or swap
it out ahead of time?  

Hugh> (Usually when drivers grab a large number of pages, they're not
Hugh> on any LRU to begin with: you're being nice by choosing
Hugh> swappable LRU pages - in the tmpfs case - but enough to upset
Hugh> the balance while they're not swappable.)  Offhand I can't say
Hugh> what will be the appropriate way to do that, it's something we
Hugh> need to revisit later (from your point of view it should amount
Hugh> to one function call or flag set somewhere, not any grand
Hugh> redesign).

I just wonder here, since GPUs are different from other drivers in
that (I assume) they are written too much more often than they are
read to, that they should hopefully be much more amenable to drop
behind semantics for pages they've been sent, to free system
resources.

Now I admit I am most probably smoking something which is clouding my
understanding, so feel free to ingore my comments.  

John
--

From: Keith Packard
Date: Monday, August 4, 2008 - 9:28 pm

Right now, I'm working only on Intel integrated graphics, which doesn't
have any on-board memory. My thinking is that we'd best solve the
easiest case first before attempting the harder discrete graphics
driver. Plus, Intel pays me do do integrated graphics, so I have an
incentive.

Not that we haven't been thinking about how this plays with a discrete
card; we'll need to allocate backing store for any objects which are
placed in on-card memory in case we run out of on-card memory, and also
as a place to store objects when the system suspends. From the
perspective of shmem, it should be precisely the same; allocating
anonymous pages and handing them to the DRM driver to store video card

Yup, I think you understand the situation fairly well. I've got a big
pile of pages sitting idle in the GTT which could be pulled out and
given back to the system. I like to keep a bunch around as the cost of
getting pages out of the CPU cache and bound to the GTT is non-trivial,
but if there's any memory pressure, the cost to free them is quite low.

--=20
keith.packard@intel.com
From: Stephane Marchesin
Date: Wednesday, August 6, 2008 - 9:20 am

Right, but it sounds adventurous to merge this work upstream before it
is generic enough to work on discrete cards. Right now it only works
on intel integrated cards ; integrated cards are one business,
discrete are a completely different world and the issues afoot
completely different.

Should this go upstream, the kernel guys have to keep in mind and
accept the possibility that another memory manager might be needed at
some point.

Stephane
--

From: Arjan van de Ven
Date: Wednesday, August 6, 2008 - 10:24 am

On Wed, 6 Aug 2008 18:20:40 +0200

the ATI driver guys already have the interface working.

In addition.. since when is "oh you must also make it work for THAT" a
requirement? Traditionally, it's up to the second person to make
generic code work for them (within reason of course, but I'll argue
that the bar of reasonableness has been met here)

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Stephane Marchesin
Date: Wednesday, August 6, 2008 - 10:32 am

Right, but the current code will basically force the discrete card
drivers to implement backing store for all allocations. Do we want
this ? Also, for cards that can handle page-based memory allocations,
there is no way to make use of this feature, do we want this too ?

Stephane
--

From: Keith Packard
Date: Wednesday, August 6, 2008 - 10:56 am

Aside from not really forcing discrete card drivers to to anything
(they're welcome to use or not use this stuff as they prefer), I believe
discrete cards will need backing store to support paging objects to disk

I don't see how any of the current code directs how future drivers might
work; the user-level interface is reasonably abstract and should allow
all kinds of internal organizations.

Instead of complaining that the current code might not support some
abstract hardware, please build something that does work and we'll see
how to merge that with code for other drivers.

--=20
keith.packard@intel.com
From: Stephane Marchesin
Date: Wednesday, August 6, 2008 - 11:09 am

Well, this "abstract hardware" is in stores, it's called G80 and R600.
If intel can't do something on its hardware shouldn't prevent others
from using it. And I seriously doubt even the current interfaces would
be fit for this.

Stephane
--

From: Keith Packard
Date: Wednesday, August 6, 2008 - 2:22 pm

Write a driver and we'll talk. Until then, I don't think your argument
is well supported.

If the result is that your driver requires a completely separate
interface than the Intel driver, that's OK too  -- we can talk about
merging them together in some unified mechanism at that point.

I'm trying to get from zero to one kernel driver for Intel graphics at
this point; that seems a valuable goal in itself. I hope it leads to
other people building kernel drivers for other graphics hardware. Don't
let me stop you from building a better mousetrap if you think the code
I've written is wrong somewhere.

--=20
keith.packard@intel.com
From: Stephane Marchesin
Date: Wednesday, August 6, 2008 - 7:16 pm

Btw, I was wondering what prevented you (you being the intel people)
from fixing TTM instead of writing GEM ? That's what happened, there
was TTM first, intel came first, they should've fixed it...

Stephane
--

From: Keith Packard
Date: Wednesday, August 6, 2008 - 7:57 pm

Well, we (mostly Eric, but a bit of work by me as well) spent about a
year trying to fix it, with Thomas' help, but after that, we still
hadn't managed to make the 965 run OpenGL reasonably well.

Around April 1, we (Eric and I) evaluated how long it would take to
built just what we needed, rather than trying to work out a general
architecture that could serve every hardware need we could think of. We
figured it would take about a month to get something working, and by the
end of April, we had 3D working better on 965 than we had ever managed
with TTM.

Given the difficulty we had understanding the TTM internals, it's
possible that this was just a failure on our part. But as we expect a
wide range of people to maintain this code going forward, it seemed
reasonable to want something simple enough that even we could figure out
how it works.

I think TTM is over designed, but that is colored by experience with
very simple hardware. While I think something simple, like GEM, could
work on most hardware, I really don't know and invite you to try
whatever approach you want.

Also, we're talking about moving code into the Linux kernel source tree
to support Intel integrated graphics. There is no incumbent
implementation there that we're trying to replace, this is new
functionality for Linux.

--=20
keith.packard@intel.com
From: Christoph Hellwig
Date: Sunday, August 10, 2008 - 6:34 pm

Important difference between code and interfaces.  GEM is adding new
userspace interface, and it's current form rather ugly ones.  We'd
better think about these hard, and make sure they are useful for the
whole usecase and not a limited subset.

--

From: Nick Piggin
Date: Monday, August 4, 2008 - 9:28 pm

Yes, readpage is optional... A check will be required.

I don't exactly know if this would be deemed the best way to do it,
but it seems *relatively* filesystem agnostic (with the readpage
test being the exception), and it seems like it is to address_space
mappings what get_user_pages is to virtual memory mappings. And
there is no shortage of drivers using get_user_pages.

At any rate, it seems a bit nicer than using shmem_getpage directly.

--

From: Christoph Hellwig
Date: Sunday, August 10, 2008 - 6:30 pm

Using read_mapping_page is fine on any pagecache backed file.  What is
much more difficult is actually writing into pagecache on a sub-page
level where we don't have proper APIs for anything but the filesystem
itself.

--

From: Keith Packard
Date: Monday, August 4, 2008 - 2:58 pm

Well, I've spent a day thinking about using existing user-space APIs to
get at shmem files. While it's nice that we've discovered a
filesystem-independent mechanism to pin file pages, we haven't found
anything similar for creating the files. In particular, what I want is:

 1) Anonymous files backed by swap
 2) Freed when the last process using them exits
 3) That never appear in the file system
 4) Do not consume a low FD (yeah, I know, rewrite the desktop)

So, what I could do is

	char	template[] =3D "/dev/shm/drm-XXXXXX";
	int	fd;
	fd =3D mkstemp (template);
	unlink (template);
	ftruncate (fd, size)
	object =3D drm_create_an_object_for_a_file (fd);
	close (fd);

While I haven't written any code yet, this should work and will even be
compatible with my current user-space API. I have to create a DRM object
for the file in any case, and so I don't need to hold onto the fd.
Releasing the fd also eliminates any ulimit issues.=20

The drm_create_an_object_for_a_file call could return another fd. But,
note that the original shmem fd has no real value to the application in
this case.

I can imagine other cases where mapping non-shmem files would make sense
though, in particular it's fairly easy to envision mapping an image file
to the GTT and having the graphics process decode and display it without
any additional copies. I think this demonstrates the potential utility
of the general file mapping operation.

But, I'd like to have you reconsider whether it makes sense for user
space to go through the above dance to create anonymous shared objects
when the kernel already supports precisely the desired semantics and
even exposes them to the ipc/shm implementation.

We'd offer two paths in DRM -- one that used an existing file and
created an object using that as backing store, and a second one that
created anonymous objects using shmem as backing store. Transient data
would use anonymous objects while applications could directly map
arbitrary file contents as ...
From: Dave Airlie
Date: Monday, August 4, 2008 - 3:22 pm

We also have a need to create in-kernel objects for things like fbcon.
If we cannot
create these without doing a major dance around the vfs then it'll be
rather ugly.

Dave.
--

From: Nick Piggin
Date: Monday, August 4, 2008 - 9:43 pm

In my opinion, doing this little song and dance (which is a few lines
of quite well defined APIs, btw) in userspace is preferable to adding
a single line or exporting a single function in kernel space. Unless
there is a better reason than eliminating a few lines of userspace code.

I'm absolutely not against exporting a nice API for a swappable, object
based memory allocator using ipc or shm to the wider kernel (with a nice
API rather than just using shmem functions directly of course). But the
fact that most or all of this seems to be able to be done in userspace
just tells me that's where it should be prototyped first. It adds
nothing to maintainence costs of the kernel code, and might actually be
helpful to show some shortcomings of our user API definition or
implementation.

In the worst case it completely fails, the effort will still show much
better how and why it needs to be done in kernel.
--

From: Keith Packard
Date: Monday, August 4, 2008 - 10:19 pm

Yeah, Dave Airlie just reminded me of the real reason for creating an
API within the kernel. We need this to support kernel mode setting where
we want to provide a frame buffer console and smoothly transition to an
X desktop. Ideally, the console would start before user mode was running


Sure, let's see if Dave can address the need to allocate these objects
from within the kernel. Of course, that allocation could also be done
using existing kernel APIs.

--=20
keith.packard@intel.com
From: Jesse Barnes
Date: Wednesday, August 6, 2008 - 5:45 pm

Yeah, I like this approach too, but to echo what Keith & Dave already 
mentioned, it would make the in-kernel aspect of things much more difficult 
(without abusing VFS calls from within the kernel anyway).

It's not just the low level gfx libraries that need to create, map, operate 
on, and wait for objects.  The kernel also needs to create objects and map 
them at the very least (hopefully we can avoid most of the other stuff inside 
the kernel).

It seems like that's hard to do without duplicating big chunks of shmem.c.  
Any suggestions, Nick or Christoph?

I'm trying to think of analogues in other kernel subsystems, but the best I 
can come up with is shmem. :)  Some of the cluster fabric cards have similar 
issues (wanting to pin/map/unmap memory, and manage on-board IOMMUs) but I 
think gfx may be different in that it tends to use large chunks of memory for 
long periods of time; and even when a given GPU program is done with a piece 
of memory, the very next program run is likely to need it again.  And I think 
the in-kernel requirements in this case are fairly unique.

Jesse
--

From: Dave Airlie
Date: Monday, August 18, 2008 - 6:17 pm

Now I also have to do the same song and dance for in-kernel allocated objects?
still think this is acceptable... I'm not even sure what vfs calls I
really need in-kernel to do that.

If someone codes up the in-kernel path to do this using 4-5 API calls
instead of one exported
function that IPC/SHM already does then maybe we can gauge the uglyness vs that.

--

From: Nick Piggin
Date: Tuesday, August 19, 2008 - 3:00 am

Not exactly sure what you mean by this. But I would like to see an effort
made to use existing userspace APIs in order to do this swappable object
allocation over tmpfs scheme. As I said, I don't object to a nice kernel
implementation, but we would be in a much better position to assess it if
we had an existing userspace implementation to compare it with.
--

From: Keith Packard
Date: Tuesday, August 19, 2008 - 9:46 am

We need to allocate objects from kernel mode to get the console running.
I'd prefer to let that occur before user mode was available. Also,
emulating the existing fbdev syscall interface will require that we
allocate objects within the kernel.

Hence, the question about how we should create objects from kernel mode.
I think we can do this with a series of VFS function calls. Would that
series of VFS calls be preferable to directly accessing the existing
shmem API?

Another alternative is to improve the existing shmem API to better
capture what we're trying to do here. Both drm and sysv shm just want
anonymous pages that are backed by swap. If we started from scratch,
what API would we like to have here? Would we have it support both shmem
and hugetlbfs?

--=20
keith.packard@intel.com
From: Jesse Barnes
Date: Tuesday, August 19, 2008 - 11:50 am

Yeah there's no question we need early kernel space allocations of GEM 
objects.  We need to setup the frame buffer, ring buffer, hardware status 
page, and potentially other state at module init time so that people can see 

Improving the shmem API makes sense to me.  There's a flip side to using the 
ioctls as well; in doing the GTT mapping with the current code, I had to add 
a new ioctl and create a new core function that would allow me to do I/O 
remapping from something other than an ->mmap hook.  I think we could have 
conceptually cleaner code and less hassle if we extended shmem a bit to allow 
for shmem "drivers" that could hook into its open/close/mmap/etc. routines 
(though in the mmap case in particular we'd either need to abuse an existing 
mmap flag or create a new one to recognize the backing store/GTT mapping 
distinction).

What do you think, Nick?  Adding this stuff to shmem would probably involve 
using the file->private_data inside shmem, and creating a new sub-driver 
registration function, then checking for sub-ops in the various important 
shmem operations structs...

The advantage of all this is that we could probably use regular fds for the 
most part (assuming we get a "high fd" mapping function sometime soon), and 
all the regular file operations routines, making GEM look a like more like a 
regular file system driver.

As for in-kernel stuff, as long as we keep the GEM shmem hooks separate from 
the actual bookkeeping (like we do now with i915_gem_create_ioctl() vs 
drm_gem_object_alloc() for example) we should be able to do the in-kernel 
stuff w/o jumping through too many VFS/VM hoops.  That would also assume we 
don't care about swapping in the in-kernel case, which we don't; we want to 
pin the kernel allocated frame buffer and other memory anyway, so using the 
internal functions should be fine.

Thanks,
-- 
Jesse Barnes, Intel Open Source Technology Center
--

From: Jerome Glisse
Date: Thursday, August 21, 2008 - 6:42 am

On Tue, 19 Aug 2008 11:50:11 -0700

What about suspend to disk ? How do we save such buffers ? 

Btw i think that GTT looks a lot like IOMMU, i don't know the IOMMU kernel
side API that much, but from memory i think that you have call to ask IOMMU
mapping why not do somethings like that for GTT ?

You get normal mapping of object diret but userspace can ask some kind of
GTT mapping on a given object. Anyway new flag on fd sounds good enough too.

Cheers,
Jerome Glisse <glisse@freedesktop.org>
--

From: Jesse Barnes
Date: Thursday, August 21, 2008 - 9:15 am

We'll have to deal with it like other kernel memory, or in the case of cards 
with VRAM maybe have a special hook that copies out the front buffer.  
Depends on how Dave implemented the GEM stuff on a device with VRAM (I 

At a high level that's pretty much what we're doing.  We already have to map 
things through the GTT in the kernel for pwrite/pread etc., so that code is 
done.  It's just exposing it to userspace that's a bit of a pain, since we 
have an all-ioctl interface...

-- 
Jesse Barnes, Intel Open Source Technology Center
--

Previous thread: none

Next thread: [PATCH] ath5k : ath5k_config_interface deadlock fix by Dave Young on Friday, August 1, 2008 - 12:46 am. (15 messages)