[PATCH 42/45] NFS: Read pages from FS-Cache into an NFS inode [ver #35]

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

These patches add local caching for network filesystems such as NFS. A quick
overview of how the facility works:

        +---------+
        |         |
        |   NFS   |--+
        |         |  |
        +---------+  |   +----------+
                     |   |          |
        +---------+  +-->|          |
        |         |      |          |
        |   AFS   |----->| FS-Cache |
        |         |      |          |--+
        +---------+  +-->|          |  |
                     |   |          |  |   +--------------+   +--------------+
        +---------+  |   +----------+  |   |              |   |              |
        |         |  |                 +-->|  CacheFiles  |-->|  Ext3        |
        |  ISOFS  |--+                     |  /var/cache  |   |  /dev/sda6   |
        |         |                        +--------------+   +--------------+
        +---------+

 (1) NFS, say, asks FS-Cache to store/retrieve data for it;

 (2) FS-Cache asks the cache backend, in this case CacheFiles, to honour the
     operation;

 (3) CacheFiles 'opens' a file in a mounted filesystem, say Ext3, and does read
     and write operations of a sort on it;

 (4) Ext3 decides how the cache data is laid out on disk - CacheFiles just
     attempts to use one sparse file per netfs inode;

 (5) If NFS asks for data from the cache, but the file has a hole in it, NFS
     falls back to asking the server. The data obtained from the server is
     then written over the hole in the file.
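
The flow in steps (1) to (5) can be sketched in userspace C. This is a toy
model under stated assumptions, not FS-Cache code: struct toy_cache,
toy_server_read() and toy_read_page() are names invented for illustration, a
"page" is shrunk to 16 bytes, and the sparse backing file is reduced to an
array with a presence flag per page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ   16	/* tiny "page" so the example stays readable */
#define NR_PAGES  4

/* Toy stand-in for the sparse backing file: a page is present or a hole. */
struct toy_cache {
	bool present[NR_PAGES];
	char data[NR_PAGES][PAGE_SZ];
};

/* Toy stand-in for the server: fills the page with a letter per index. */
static void toy_server_read(size_t index, char *buf)
{
	memset(buf, 'A' + (int)index, PAGE_SZ);
}

/*
 * Read one page.  Returns true if it was served from the cache, false if
 * the cache had a hole and the server had to be asked instead.
 */
static bool toy_read_page(struct toy_cache *cache, size_t index, char *buf)
{
	if (cache->present[index]) {
		memcpy(buf, cache->data[index], PAGE_SZ);
		return true;
	}

	/* Hole: fall back to the server, then write the data over the hole. */
	toy_server_read(index, buf);
	memcpy(cache->data[index], buf, PAGE_SZ);
	cache->present[index] = true;
	return false;
}
```

With a zero-initialised struct toy_cache, the first toy_read_page() of a page
goes to the "server" and the second one is satisfied locally.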

To look at it another way:

        +---------+
        |         |
        | Server  |
        |         |
        +---------+
             |                  NETWORK
        ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             |
             |           +----------+
             V           |          |
        +---------+      |          |
        |         |      |          |
        |   NFS   |----->| FS-Cache |
        |         |      |          |--+
        +---------+      |          |  |   +--------------+   +--------------+
             |           |          |  |   |              |   |              |
             V           +----------+  +--...

Date: Friday, March 28, 2008 - 10:33 am

Define and create inode-level cache data storage objects (as managed by
nfs_inode structs).

Each inode-level object is created in a superblock-level index object and is
itself a data storage object into which pages from the inode are stored.

The inode object key is the NFS file handle for the inode.

The inode object is given coherency data to carry in the auxiliary data
permitted by the cache. This is a sequence made up of:

(1) i_mtime from the NFS inode.

(2) i_ctime from the NFS inode.

(3) i_size from the NFS inode.

As the cache is a persistent cache, the auxiliary data is checked when a new
NFS in-memory inode is set up that matches an already existing data storage
object in the cache. If the coherency data is the same, the on-disk object is
retained and used; if not, it is scrapped and a new one created.

Signed-off-by: David Howells <dhowells@redhat.com>
---
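
The coherency check described above (keep the on-disk object if mtime, ctime
and size all match, scrap it otherwise) can be modelled in userspace roughly as
follows. This is only a sketch: struct toy_auxdata, enum toy_checkaux and
toy_check_aux() are hypothetical names, and the two-valued result is an
assumption made for the example rather than the FS-Cache return type.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical userspace mirror of the auxiliary data: mtime, ctime, size. */
struct toy_auxdata {
	struct timespec mtime;
	struct timespec ctime;
	int64_t size;
};

/* Assumed verdicts for this sketch only. */
enum toy_checkaux {
	TOY_CHECKAUX_OKAY,	/* retain and use the on-disk object */
	TOY_CHECKAUX_OBSOLETE,	/* scrap it and create a new one */
};

static int toy_timespec_equal(const struct timespec *a,
			      const struct timespec *b)
{
	return a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec;
}

/* Compare what the cache stored against what the live inode reports. */
static enum toy_checkaux toy_check_aux(const struct toy_auxdata *on_disk,
				       const struct toy_auxdata *in_memory)
{
	if (!toy_timespec_equal(&on_disk->mtime, &in_memory->mtime) ||
	    !toy_timespec_equal(&on_disk->ctime, &in_memory->ctime) ||
	    on_disk->size != in_memory->size)
		return TOY_CHECKAUX_OBSOLETE;

	return TOY_CHECKAUX_OKAY;
}
```

The sketch compares field by field rather than with memcmp() so that any
padding bytes inside the struct cannot cause a spurious mismatch.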

fs/nfs/fscache-index.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 1
2 files changed, 115 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index e092a73..1c158d5 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -150,3 +150,117 @@ const struct fscache_cookie_def nfs_fscache_super_index_def = {
.type = FSCACHE_COOKIE_TYPE_INDEX,
.get_key = nfs_super_get_key,
};
+
+/*
+ * Definition of the auxiliary data attached to NFS inode storage objects
+ * within the cache.
+ *
+ * The contents of this struct are recorded in the on-disk local cache in the
+ * auxiliary data attached to the data storage object backing an inode. This
+ * permits coherency to be managed when a new inode binds to an already extant
+ * cache object.
+ */
+struct nfs_fscache_inode_auxdata {
+	struct timespec mtime;
+	struct timespec ctime;
+	loff_t size;
+};
+
+/*
+ * Generate a key to describe an NFS inode in an NFS server's index
+ */
+static uint16_t nfs_fscache_inode_get_key(const void *cookie_netfs...

Date: Friday, March 28, 2008 - 10:33 am

Invalidate the FsCache page flags on the pages belonging to an inode when the
cache backing that NFS inode is removed.

This allows a live cache to be withdrawn.

Signed-off-by: David Howells <dhowells@redhat.com>
---
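
The walk that clears the per-page marks works in batches: look up a handful of
pages starting at an index, unmark them, then restart the lookup just past the
last page seen. A userspace model of that loop, assuming a sorted array in
place of the page cache, might look like this. All names (struct toy_page,
toy_lookup(), toy_now_uncached(), TOY_BATCH) are made up for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BATCH 4	/* plays the role of PAGEVEC_SIZE */

struct toy_page {
	unsigned long index;
	bool cached;	/* plays the role of the FS-Cache page flag */
};

/* Gather up to max pages with index >= first from an index-sorted array. */
static size_t toy_lookup(struct toy_page *pages, size_t nr,
			 unsigned long first,
			 struct toy_page **batch, size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < nr && found < max; i++)
		if (pages[i].index >= first)
			batch[found++] = &pages[i];
	return found;
}

/* Clear the mark on every page; returns the number of pages visited. */
static size_t toy_now_uncached(struct toy_page *pages, size_t nr)
{
	struct toy_page *batch[TOY_BATCH];
	unsigned long first = 0;
	size_t visited = 0, n;

	while ((n = toy_lookup(pages, nr, first, batch, TOY_BATCH)) != 0) {
		for (size_t i = 0; i < n; i++)
			batch[i]->cached = false;
		visited += n;

		/* Resume just past the last page of this batch. */
		first = batch[n - 1]->index + 1;
	}
	return visited;
}
```

Working in small batches bounds the time spent between rescheduling points,
which is the role cond_resched() plays in the kernel loop.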

fs/nfs/fscache-index.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 1c158d5..1522496 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -248,6 +248,45 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data,
}

/*
+ * Indication from FS-Cache that the cookie is no longer cached
+ * - This function is called when the backing store currently caching a cookie
+ * is removed
+ * - The netfs should use this to clean up any markers indicating cached pages
+ * - This is mandatory for any object that may have data
+ */
+static void nfs_fscache_inode_now_uncached(void *cookie_netfs_data)
+{
+	struct nfs_inode *nfsi = cookie_netfs_data;
+	struct pagevec pvec;
+	pgoff_t first;
+	int loop, nr_pages;
+
+	pagevec_init(&pvec, 0);
+	first = 0;
+
+	dprintk("NFS: nfs_inode_now_uncached: nfs_inode 0x%p\n", nfsi);
+
+	for (;;) {
+		/* grab a bunch of pages to unmark */
+		nr_pages = pagevec_lookup(&pvec,
+					  nfsi->vfs_inode.i_mapping,
+					  first,
+					  PAGEVEC_SIZE - pagevec_count(&pvec));
+		if (!nr_pages)
+			break;
+
+		for (loop = 0; loop < nr_pages; loop++)
+			ClearPageFsCache(pvec.pages[loop]);
+
+		first = pvec.pages[nr_pages - 1]->index + 1;
+
+		pvec.nr = nr_pages;
+		pagevec_release(&pvec);
+		cond_resched();
+	}
+}
+
+/*
* Define the inode object for FS-Cache. This is used to describe an inode
* object to fscache_acquire_cookie(). It is keyed by the NFS file handle for
* an inode.
@@ -263,4 +302,5 @@ const struct fscache_cookie_def nfs_fscache_inode_object_def = {
.get_attr = nfs_fscache_inode_get_attr,
.get_aux = nfs_fscache_in...

Date: Friday, March 28, 2008 - 10:33 am

Read pages from an FS-Cache data storage object representing an inode into an
NFS inode.

Signed-off-by: David Howells <dhowells@redhat.com>
---
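
The completion logic for a cache read has three outcomes: the cache delivered
the page, the cache failed and the server read was started instead, or both
failed. A userspace sketch of that decision, with invented names (struct
toy_page, toy_server_read_async(), toy_cache_read_complete()) and a boolean
standing in for "can the server read be queued", is:

```c
#include <errno.h>
#include <stdbool.h>

struct toy_page {
	bool uptodate;
	bool locked;
	int server_reads;	/* counts fallbacks, for illustration only */
};

/*
 * Hypothetical stand-in for an asynchronous server read.  Returns 0 when
 * the read was queued; the queued read then owns the page lock.
 */
static int toy_server_read_async(struct toy_page *page, bool server_ok)
{
	if (!server_ok)
		return -EIO;
	page->server_reads++;
	return 0;
}

/* Called when the cache has finished (or failed) reading one page. */
static void toy_cache_read_complete(struct toy_page *page, int error,
				    bool server_ok)
{
	if (!error) {
		/* The cache supplied the data. */
		page->uptodate = true;
		page->locked = false;
		return;
	}

	/* Cache read failed: try the server; unlock only if that fails too. */
	if (toy_server_read_async(page, server_ok) != 0)
		page->locked = false;
}
```

In the fallback case the page stays locked on purpose: in this model the
queued server read is assumed to unlock it when its own I/O completes.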

fs/nfs/fscache.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 47 +++++++++++++++++++++++
fs/nfs/read.c | 18 +++++++++
3 files changed, 177 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index d06f837..d147bd0 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -391,3 +391,115 @@ void __nfs_fscache_invalidate_page(struct page *page, struct inode *inode)
nfs_add_fscache_stats(page->mapping->host,
NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
}
+
+/*
+ * Handle completion of a page being read from the cache.
+ * - Called in process (keventd) context.
+ */
+static void nfs_readpage_from_fscache_complete(struct page *page,
+					       void *context,
+					       int error)
+{
+	dfprintk(FSCACHE,
+		 "NFS: readpage_from_fscache_complete (0x%p/0x%p/%d)\n",
+		 page, context, error);
+
+	/* if the read completes with an error, fall back to reading the page
+	 * from the server; unlock the page only if that cannot be started */
+	if (!error) {
+		SetPageUptodate(page);
+		unlock_page(page);
+	} else {
+		error = nfs_readpage_async(context, page->mapping->host, page);
+		if (error)
+			unlock_page(page);
+	}
+}
+
+/*
+ * Retrieve a page from fscache
+ */
+int __nfs_readpage_from_fscache(struct nfs_open_context *ctx,
+ struct inode *inode, struct page *page)
+{
+ int ret;
+
+ dfprintk(FSCACHE,
+ "NFS: readpage_from_fscache(fsc:%p/p:%p(i:%lx f:%lx)/0x%p)\n",
+ NFS_I(inode)->fscache, page, page->index, page->flags, inode);
+
+ ret = fscache_read_or_alloc_page(NFS_I(inode)->fscache,
+ page,
+ nfs_readpage_from_fscache_complete,
+ ctx,
+ GFP_KERNEL);
+
+ switch (ret) {
+ case 0: /* read BIO submitted (page in fscache) */
+ dfprintk(FSCACHE,
+ "NFS: readpage_from_fscache: BIO submitted\n");
+ nfs_add_fs...

Date: Friday, March 28, 2008 - 10:32 am

Register NFS for caching and retrieve the top-level cache index object cookie.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/Makefile | 1 +
fs/nfs/fscache-index.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 35 ++++++++++++++++++++++++++++++++
fs/nfs/inode.c | 8 +++++++
4 files changed, 97 insertions(+), 0 deletions(-)
create mode 100644 fs/nfs/fscache-index.c
create mode 100644 fs/nfs/fscache.h

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index df0f41e..6d7176d 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
nfs4namespace.o
nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache-index.o
diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
new file mode 100644
index 0000000..d696c06
--- /dev/null
+++ b/fs/nfs/fscache-index.c
@@ -0,0 +1,53 @@
+/* NFS FS-Cache index structure definition
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs_fs_sb.h>
+#include <linux/in6.h>
+
+#include "internal.h"
+#include "fscache.h"
+
+#define NFSDBG_FACILITY NFSDBG_FSCACHE
+
+static const struct fscache_netfs_operations nfs_fscache_ops = {
+};
+
+/*
+ * Define the NFS filesystem for FS-Cache. Upon registration FS-Cache sticks
+ * the cookie for the top-level index object for NFS into here. The top-level
+ * ...

Date: Friday, March 28, 2008 - 10:33 am

FS-Cache page management for NFS. This includes hooking the releasing and
invalidation of pages marked with PG_fscache (aka PG_private_2) and waiting for
completion of the write-to-cache flag (PG_fscache_write aka PG_owner_priv_2).

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/file.c | 17 +++++++++++++----
fs/nfs/fscache.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 22 ++++++++++++++++++++++
3 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 26a073b..60db3ea 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -35,6 +35,7 @@
#include "delegation.h"
#include "internal.h"
#include "iostat.h"
+#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_FILE

@@ -358,7 +359,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
* Partially or wholly invalidate a page
* - Release the private state associated with a page if undergoing complete
* page invalidation
- * - Called if either PG_private or PG_private_2 is set on the page
+ * - Called if either PG_private or PG_fscache is set on the page
* - Caller holds page lock
*/
static void nfs_invalidate_page(struct page *page, unsigned long offset)
@@ -367,30 +368,35 @@ static void nfs_invalidate_page(struct page *page, unsigned long offset)
return;
/* Cancel any unstarted writes on this page */
nfs_wb_page_cancel(page->mapping->host, page);
+
+ nfs_fscache_invalidate_page(page, page->mapping->host);
}

/*
* Attempt to release the private state associated with a page
- * - Called if either PG_private or PG_private_2 is set on the page
+ * - Called if either PG_private or PG_fscache is set on the page
* - Caller holds page lock
* - Return true (may release page) or false (may not)
*/
static int nfs_release_page(struct page *page, gfp_t gfp)
{
/* If PagePrivate() is set, then the page is not freeable */
- return 0;
+ if (PagePrivate(p...

Date: Friday, March 28, 2008 - 10:33 am

nfs_readpage_async() needs to be non-static so that it can be used as a
fallback for the local on-disk caching should an EIO crop up when reading the
cache.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/read.c | 4 ++--
include/linux/nfs_fs.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 5a70be5..2b61a0b 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -114,8 +114,8 @@ static void nfs_readpage_truncate_uninitialised_page(struct nfs_read_data *data)
}
}

-static int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
- struct page *page)
+int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
+ struct page *page)
{
LIST_HEAD(one_request);
struct nfs_page *new;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index b41806b..7632732 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -507,6 +507,8 @@ extern int nfs_readpages(struct file *, struct address_space *,
struct list_head *, unsigned);
extern int nfs_readpage_result(struct rpc_task *, struct nfs_read_data *);
extern void nfs_readdata_release(void *data);
+extern int nfs_readpage_async(struct nfs_open_context *, struct inode *,
+ struct page *);

/*
* Allocate nfs_read_data structures

--

Date: Friday, March 28, 2008 - 10:33 am

Bind data storage objects in the local cache to NFS inodes.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/fscache.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 15 ++++
fs/nfs/inode.c | 39 +++++++++--
include/linux/nfs_fs.h | 11 +++
4 files changed, 233 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index ab2de2c..0c71bae 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -166,3 +166,177 @@ void nfs_fscache_release_super_cookie(struct super_block *sb)
nfss->fscache_key = NULL;
}
}
+
+/*
+ * Initialise the per-inode cache cookie pointer for an NFS inode.
+ */
+void nfs_fscache_init_inode_cookie(struct inode *inode)
+{
+	NFS_I(inode)->fscache = NULL;
+	if (S_ISREG(inode->i_mode))
+		set_bit(NFS_INO_FSCACHE, &NFS_I(inode)->flags);
+}
+
+/*
+ * Get the per-inode cache cookie for an NFS inode.
+ */
+static void nfs_fscache_enable_inode_cookie(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nfs_inode *nfsi = NFS_I(inode);
+
+	if (nfsi->fscache || !NFS_FSCACHE(inode))
+		return;
+
+	if ((NFS_SB(sb)->options & NFS_OPTION_FSCACHE)) {
+		nfsi->fscache = fscache_acquire_cookie(
+			NFS_SB(sb)->fscache,
+			&nfs_fscache_inode_object_def,
+			nfsi);
+
+		dfprintk(FSCACHE, "NFS: get FH cookie (0x%p/0x%p/0x%p)\n",
+			 sb, nfsi, nfsi->fscache);
+	}
+}
+
+/*
+ * Release a per-inode cookie.
+ */
+void nfs_fscache_release_inode_cookie(struct inode *inode)
+{
+	struct nfs_inode *nfsi = NFS_I(inode);
+
+	dfprintk(FSCACHE, "NFS: clear cookie (0x%p/0x%p)\n",
+		 nfsi, nfsi->fscache);
+
+	fscache_relinquish_cookie(nfsi->fscache, 0);
+	nfsi->fscache = NULL;
+}
+
+/*
+ * Retire a per-inode cookie, destroying the data attached to it.
+ */
+void nfs_fscache_zap_inode_cookie(struct inode *inode)
+{
+ struct nfs_inode *nfsi = NFS_I(inode);
+
+ dfprintk(FSCACHE, "NFS:...

Date: Friday, March 28, 2008 - 10:33 am

Define and create superblock-level cache index objects (as managed by
nfs_server structs).

Each superblock object is created in a server level index object and is itself
an index into which inode-level objects are inserted.

Ideally there would be one superblock-level object per server, and the former
would be folded into the latter; however, since the "nosharecache" option
exists this isn't possible.

The superblock object key is a sequence consisting of:

(1) Certain superblock s_flags.

(2) Various connection parameters that serve to distinguish superblocks for
sget().

(3) The volume FSID.

(4) The security flavour.

(5) The uniquifier length.

(6) The uniquifier text. This is normally an empty string, unless the fsc=xyz
mount option was used to explicitly specify a uniquifier.

The key blob is of variable length, depending on the length of (6).

The superblock object is given no coherency data to carry in the auxiliary data
permitted by the cache. It is assumed that the superblock is always coherent.

This patch also adds uniquification handling such that two otherwise identical
superblocks, at least one of which is marked "nosharecache", won't end up
trying to share the on-disk cache. It will be possible to manually provide a
uniquifier through a mount option with a later patch to avoid the error
otherwise produced.

Signed-off-by: David Howells <dhowells@redhat.com>
---
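
To illustrate why the key blob is of variable length, here is a userspace
sketch that packs a fixed header followed by the uniquifier text. The layout
is an assumption for illustration only: struct toy_sb_key, its field names and
the single rsize field standing in for "various connection parameters" are
hypothetical and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical key layout: fixed part, then the uniquifier text. */
struct toy_sb_key {
	struct {
		uint64_t fsid_major;
		uint64_t fsid_minor;
		uint32_t s_flags;
		uint32_t rsize;		/* stands in for connection parameters */
		uint32_t sec_flavour;
		uint32_t uniq_len;
	} fixed;
	char uniquifier[];		/* not NUL-terminated; length is uniq_len */
};

/* Build a key; the caller frees it.  *key_len receives the blob length. */
static struct toy_sb_key *toy_build_sb_key(uint32_t s_flags, uint32_t rsize,
					   uint64_t fsid_major,
					   uint64_t fsid_minor,
					   uint32_t sec_flavour,
					   const char *uniq, size_t *key_len)
{
	size_t ulen = uniq ? strlen(uniq) : 0;
	size_t len = sizeof(struct toy_sb_key) + ulen;
	struct toy_sb_key *key = calloc(1, len);	/* zeroed, so keys compare */

	if (!key)
		return NULL;

	key->fixed.fsid_major = fsid_major;
	key->fixed.fsid_minor = fsid_minor;
	key->fixed.s_flags = s_flags;
	key->fixed.rsize = rsize;
	key->fixed.sec_flavour = sec_flavour;
	key->fixed.uniq_len = (uint32_t)ulen;
	if (ulen)
		memcpy(key->uniquifier, uniq, ulen);

	*key_len = len;
	return key;
}
```

Two otherwise identical superblocks then produce byte-identical keys unless
their uniquifiers differ, which is the collision the uniquifier is there to
avoid.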

fs/nfs/fscache-index.c | 34 +++++++++++++
fs/nfs/fscache.c | 116 +++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 49 +++++++++++++++++++
fs/nfs/internal.h | 3 +
fs/nfs/super.c | 8 ++-
include/linux/nfs_fs_sb.h | 5 ++
6 files changed, 213 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 01f0f41..e092a73 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -116,3 +116,37 @@ const struct fscache_cookie_def nfs_fscache_serve...

Date: Friday, March 28, 2008 - 10:34 am

Add NFS mount options to allow the local caching support to be enabled.

The attached patch makes it possible for the NFS filesystem to be told to make
use of the network filesystem local caching service (FS-Cache).

To be able to use this, a recent nfs-utils package is required.

There are three variant NFS mount options that can be added to a mount command
to control caching for a mount. Only the last one specified takes effect:

(*) Adding "fsc" will request caching.

(*) Adding "fsc=<string>" will request caching and also specify a uniquifier.

(*) Adding "nofsc" will disable caching.

For example:

mount warthog:/ /a -o fsc

The cache of a particular superblock (NFS FSID) will be shared between all
mounts of that volume, provided they have the same connection parameters and
are not marked 'nosharecache'.

Where it is otherwise impossible to distinguish superblocks because all the
parameters are identical, but the 'nosharecache' option is supplied, a
uniquifying string must be supplied, else only the first mount will be
permitted to use the cache.

If there's a key collision, then the second mount will disable caching and log
a warning to the kernel log.

Signed-off-by: David Howells <dhowells@redhat.com>
---
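
The "last one specified takes effect" rule for fsc, fsc=<string> and nofsc can
be shown with a small userspace parser. This is a sketch, not the code in
fs/nfs/super.c: struct toy_fsc_opts and toy_parse_fsc() are invented names,
and treating "no option given" as caching disabled is an assumption of the
example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct toy_fsc_opts {
	bool enabled;
	char uniquifier[64];	/* empty string when none was given */
};

/* Scan a comma-separated option string; later fsc options override earlier. */
static void toy_parse_fsc(const char *options, struct toy_fsc_opts *out)
{
	const char *p = options;

	out->enabled = false;
	out->uniquifier[0] = '\0';

	while (*p) {
		size_t len = strcspn(p, ",");	/* length of this token */

		if (len == 3 && strncmp(p, "fsc", 3) == 0) {
			out->enabled = true;
			out->uniquifier[0] = '\0';
		} else if (len >= 4 && strncmp(p, "fsc=", 4) == 0) {
			size_t n = len - 4;

			if (n >= sizeof(out->uniquifier))
				n = sizeof(out->uniquifier) - 1;
			memcpy(out->uniquifier, p + 4, n);
			out->uniquifier[n] = '\0';
			out->enabled = true;
		} else if (len == 5 && strncmp(p, "nofsc", 5) == 0) {
			out->enabled = false;
			out->uniquifier[0] = '\0';
		}
		/* Any other option is ignored by this sketch. */

		p += len;
		if (*p == ',')
			p++;
	}
}
```

For example, "nofsc,fsc=xyz,ro" ends up with caching requested and the
uniquifier "xyz", while "fsc,nofsc" ends up with caching disabled.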

fs/nfs/client.c | 2 ++
fs/nfs/internal.h | 1 +
fs/nfs/super.c | 25 +++++++++++++++++++++++++
3 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index d67d52f..8357f68 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -669,6 +669,7 @@ static int nfs_init_server(struct nfs_server *server,

/* Initialise the client representation from the mount data */
server->flags = data->flags & NFS_MOUNT_FLAGMASK;
+ server->options = data->options;

if (data->rsize)
server->rsize = nfs_block_size(data->rsize, NULL);
@@ -1056,6 +1057,7 @@ static int nfs4_init_server(struct nfs_server *server,
/* Initialise the client representation from th...

Date: Friday, March 28, 2008 - 10:33 am

Define and create server-level cache index objects (as managed by nfs_client
structs).

Each server object is created in the NFS top-level index object and is itself
an index into which superblock-level objects are inserted.

Ideally there would be one superblock-level object per server, and the former
would be folded into the latter; however, since the "nosharecache" option
exists this isn't possible.

The server object key is a sequence consisting of:

(1) NFS version

(2) Server address family (eg: AF_INET or AF_INET6)

(3) Server port.

(4) Server IP address.

The key blob is of variable length, depending on the length of (4).

The server object is given no coherency data to carry in the auxiliary data
permitted by the cache.

Signed-off-by: David Howells <dhowells@redhat.com>
---
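
The server key's length depends on the address family because an IPv4 address
contributes 4 bytes and an IPv6 address 16. A userspace sketch of such a key
follows; struct toy_server_key and toy_build_server_key() are hypothetical,
and the field widths are assumptions rather than the layout used by the patch.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical key layout: version, family, port, then the address bytes. */
struct toy_server_key {
	uint16_t nfs_version;
	uint16_t family;
	uint16_t port;		/* kept in network byte order */
	uint8_t addr[16];	/* 4 bytes used for AF_INET, 16 for AF_INET6 */
};

/* Fill in the key; returns the number of significant bytes, 0 if unknown. */
static size_t toy_build_server_key(struct toy_server_key *key,
				   unsigned int version,
				   const struct sockaddr *sa)
{
	memset(key, 0, sizeof(*key));
	key->nfs_version = (uint16_t)version;
	key->family = (uint16_t)sa->sa_family;

	switch (sa->sa_family) {
	case AF_INET: {
		const struct sockaddr_in *sin = (const struct sockaddr_in *)sa;

		key->port = sin->sin_port;
		memcpy(key->addr, &sin->sin_addr, 4);
		return offsetof(struct toy_server_key, addr) + 4;
	}
	case AF_INET6: {
		const struct sockaddr_in6 *sin6 =
			(const struct sockaddr_in6 *)sa;

		key->port = sin6->sin6_port;
		memcpy(key->addr, &sin6->sin6_addr, 16);
		return offsetof(struct toy_server_key, addr) + 16;
	}
	default:
		return 0;
	}
}
```

The returned length is what a caller would treat as the key size, so the
unused tail of addr[] never takes part in a comparison.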

fs/nfs/Makefile | 2 +
fs/nfs/client.c | 5 +++
fs/nfs/fscache-index.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.c | 52 ++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 10 +++++++
include/linux/nfs_fs_sb.h | 4 +++
6 files changed, 137 insertions(+), 1 deletions(-)
create mode 100644 fs/nfs/fscache.c

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6d7176d..d848c97 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,4 +16,4 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
nfs4namespace.o
nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
nfs-$(CONFIG_SYSCTL) += sysctl.o
-nfs-$(CONFIG_NFS_FSCACHE) += fscache-index.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index c5c0175..51e9346 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -45,6 +45,7 @@
#include "delegation.h"
#include "iostat.h"
#include "internal.h"
+#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_CLIENT

@@ -151,6 +152,8 @@ static struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_
...

Date: Friday, March 28, 2008 - 10:32 am

Permit local filesystem caching to be enabled for NFS in the kernel
configuration.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/Kconfig | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 6ca14c1..944c34b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1644,6 +1644,14 @@ config NFS_V4

If unsure, say N.

+config NFS_FSCACHE
+	bool "Provide NFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want NFS data to be cached locally on disc through
+	  the general filesystem cache manager
+
config NFS_DIRECTIO
bool "Allow direct I/O on NFS files"
depends on NFS_FS

--

Date: Friday, March 28, 2008 - 10:33 am

Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/client.c | 7 ++++---
fs/nfs/fscache.h | 15 +++++++++++++++
2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 51e9346..d67d52f 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1451,7 +1451,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)

/* display header on line 1 */
if (v == &nfs_volume_list) {
- seq_puts(m, "NV SERVER PORT DEV FSID\n");
+ seq_puts(m, "NV SERVER PORT DEV FSID FSC\n");
return 0;
}
/* display one transport per line on subsequent lines */
@@ -1465,12 +1465,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
(unsigned long long) server->fsid.major,
(unsigned long long) server->fsid.minor);

- seq_printf(m, "v%u %s %s %-7s %-17s\n",
+ seq_printf(m, "v%u %s %s %-7s %-17s %s\n",
clp->rpc_ops->version,
rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_HEX_ADDR),
rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_HEX_PORT),
dev,
- fsid);
+ fsid,
+ nfs_server_fscache_state(server));

return 0;
}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 2f71d7e..027a4ca 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -144,6 +144,16 @@ static inline void nfs_readpage_to_fscache(struct inode *inode,
__nfs_readpage_to_fscache(inode, page, sync);
}

+/*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+	if (server->fscache && (server->options & NFS_OPTION_FSCACHE))
+		return "yes";
+	return "no ";
+}
+

#else /* CONFIG_NFS_FSCACHE */
static inline int nfs_fscache_register(void) { return 0; }
@@ -191,5 +201,10 @@ static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
static inli...

Date: Friday, March 28, 2008 - 10:32 am

Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised). The data source is a single page.

This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.

Supply a generic implementation for this that uses the write_begin() and
write_end() address_space operations to bind a copy directly into the page
cache.

Hook the Ext2 and Ext3 operations to the generic implementation.

Signed-off-by: David Howells <dhowells@redhat.com>
---
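
The shape of a generic "write one page" helper built on a begin/end pair can be
modelled in userspace as below. This is a toy under stated assumptions: struct
toy_mapping, toy_write_begin(), toy_write_end() and toy_write_one_page() are
invented for the sketch and do not reproduce the signatures of the kernel's
write_begin()/write_end() operations.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 8
#define TOY_NR_PAGES  4

/* Toy stand-in for a file's page cache. */
struct toy_mapping {
	char pages[TOY_NR_PAGES][TOY_PAGE_SIZE];
	int dirty[TOY_NR_PAGES];
	size_t size;		/* current file size in bytes */
};

/* Stand-in for write_begin(): hand back the destination page, or NULL. */
static char *toy_write_begin(struct toy_mapping *m, size_t index)
{
	return index < TOY_NR_PAGES ? m->pages[index] : NULL;
}

/* Stand-in for write_end(): mark the page dirty and extend the file. */
static void toy_write_end(struct toy_mapping *m, size_t index)
{
	size_t end = (index + 1) * TOY_PAGE_SIZE;

	m->dirty[index] = 1;
	if (end > m->size)
		m->size = end;
}

/* Copy exactly one source page to the page-aligned slot 'index'. */
static int toy_write_one_page(struct toy_mapping *m, size_t index,
			      const char *src)
{
	char *dst = toy_write_begin(m, index);

	if (!dst)
		return -1;
	memcpy(dst, src, TOY_PAGE_SIZE);
	toy_write_end(m, index);
	return 0;
}
```

Because the source is always a whole page and the destination is always
page-aligned, the helper needs no partial-page or offset handling, which is
where the room for optimisation comes from.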

fs/ext2/inode.c | 2 ++
fs/ext3/inode.c | 3 +++
include/linux/fs.h | 7 ++++++
mm/filemap.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c620068..f483014 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -792,6 +792,7 @@ const struct address_space_operations ext2_aops = {
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
};

const struct address_space_operations ext2_aops_xip = {
@@ -810,6 +811,7 @@ const struct address_space_operations ext2_nobh_aops = {
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
};

/*
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index c976123..0209f3b 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1776,6 +1776,7 @@ static const struct address_space_operations ext3_ordered_aops = {
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
};

static const struct address_space_operations ext3_writeback_aops = {
@@ -1790,6 +1791,7 @@ static const struct address_space_operations...

Date: Friday, March 28, 2008 - 10:32 am

The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and
through it any attached caches. The kAFS filesystem will use caching
automatically if it's available.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/Kconfig | 8 +
fs/afs/Makefile | 3
fs/afs/cache.c | 507 ++++++++++++++++++++++++++++++++++------------------
fs/afs/cache.h | 15 --
fs/afs/cell.c | 16 +-
fs/afs/file.c | 212 +++++++++++++---------
fs/afs/inode.c | 25 +--
fs/afs/internal.h | 53 ++---
fs/afs/main.c | 27 +--
fs/afs/mntpt.c | 4
fs/afs/vlocation.c | 23 +-
fs/afs/volume.c | 14 -
fs/afs/write.c | 19 ++
13 files changed, 553 insertions(+), 373 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index c42ec50..6ca14c1 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -2118,6 +2118,14 @@ config AFS_DEBUG

If unsure, say N.

+config AFS_FSCACHE
+	bool "Provide AFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on AFS_FS=m && FSCACHE || AFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want AFS data to be cached locally on disk through
+	  the generic filesystem cache manager
+
config 9P_FS
tristate "Plan 9 Resource Sharing Support (9P2000) (Experimental)"
depends on INET && NET_9P && EXPERIMENTAL
diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index a666710..4f64b95 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -2,7 +2,10 @@
# Makefile for Red Hat Linux AFS client.
#

+afs-cache-$(CONFIG_AFS_FSCACHE) := cache.o
+
kafs-objs := \
+ $(afs-cache-y) \
callback.o \
cell.o \
cmservice.o \
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index de0d7de..9b93466 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
@@ -1,6 +1,6 @@
/* AFS caching stuff
*
- * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
* Written by David Howells (d...

Date: Friday, March 28, 2008 - 10:31 am

Add missing consts to xattr function arguments.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/xattr.c | 41 ++++++++++++++++---------------
include/linux/security.h | 46 ++++++++++++++++++++---------------
include/linux/syscalls.h | 30 ++++++++++++-----------
include/linux/xattr.h | 6 ++---
security/commoncap.c | 6 ++---
security/dummy.c | 13 +++++-----
security/security.c | 12 +++++----
security/selinux/hooks.c | 14 ++++++-----
security/selinux/include/security.h | 2 +-
security/selinux/ss/services.c | 5 ++--
security/smack/smack_lsm.c | 12 +++++----
11 files changed, 100 insertions(+), 87 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 3acab16..391c752 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -68,7 +68,7 @@ xattr_permission(struct inode *inode, const char *name, int mask)
}

int
-vfs_setxattr(struct dentry *dentry, char *name, void *value,
+vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
size_t size, int flags)
{
struct inode *inode = dentry->d_inode;
@@ -132,7 +132,7 @@ out_noalloc:
EXPORT_SYMBOL_GPL(xattr_getsecurity);

ssize_t
-vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size)
+vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
{
struct inode *inode = dentry->d_inode;
int error;
@@ -188,7 +188,7 @@ vfs_listxattr(struct dentry *d, char *list, size_t size)
EXPORT_SYMBOL_GPL(vfs_listxattr);

int
-vfs_removexattr(struct dentry *dentry, char *name)
+vfs_removexattr(struct dentry *dentry, const char *name)
{
struct inode *inode = dentry->d_inode;
int error;
@@ -219,7 +219,7 @@ EXPORT_SYMBOL_GPL(vfs_removexattr);
* Extended attribute SET operations
*/
static long
-setxattr(struct dentry *d, char __user *name, void __user *value,
+setxattr(struc...

Date: Friday, March 28, 2008 - 10:32 am

Add FS-Cache option bit to nfs_server struct. This is set to indicate local
on-disk caching is enabled for a particular superblock.

Also add debug bit for local caching operations.

Signed-off-by: David Howells <dhowells@redhat.com>
---

include/linux/nfs_fs.h | 1 +
include/linux/nfs_fs_sb.h | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index f4a0e4c..720cdad 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -579,6 +579,7 @@ extern void * nfs_root_data(void);
#define NFSDBG_CALLBACK 0x0100
#define NFSDBG_CLIENT 0x0200
#define NFSDBG_MOUNT 0x0400
+#define NFSDBG_FSCACHE 0x0800
#define NFSDBG_ALL 0xFFFF

#ifdef __KERNEL__
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 3423c67..e7c4cdd 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -99,6 +99,8 @@ struct nfs_server {
unsigned int acdirmin;
unsigned int acdirmax;
unsigned int namelen;
+ unsigned int options; /* extra options enabled by mount */
+#define NFS_OPTION_FSCACHE 0x00000001 /* - local caching enabled */

struct nfs_fsid fsid;
__u64 maxfilesize; /* maximum file size */

--

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:32 am

Add comment banners to some NFS functions so that the NFS fscache patches can
later extend them with further information.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/file.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index ef57a5a..26a073b 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -354,6 +354,13 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
return copied;
}

+/*
+ * Partially or wholly invalidate a page
+ * - Release the private state associated with a page if undergoing complete
+ * page invalidation
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ */
static void nfs_invalidate_page(struct page *page, unsigned long offset)
{
if (offset != 0)
@@ -362,12 +369,26 @@ static void nfs_invalidate_page(struct page *page, unsigned long offset)
nfs_wb_page_cancel(page->mapping->host, page);
}

+/*
+ * Attempt to release the private state associated with a page
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ * - Return true (may release page) or false (may not)
+ */
static int nfs_release_page(struct page *page, gfp_t gfp)
{
/* If PagePrivate() is set, then the page is not freeable */
return 0;
}

+/*
+ * Attempt to clear the private state associated with a page when an error
+ * occurs that requires the cached contents of an inode to be written back or
+ * destroyed
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ * - Return 0 if successful, -error otherwise
+ */
static int nfs_launder_page(struct page *page)
{
return nfs_wb_page(page->mapping->host, page);
@@ -389,6 +410,11 @@ const struct address_space_operations nfs_file_aops = {
.launder_page = nfs_launder_page,
};

+/*
+ * Notification that a PTE pointing to an NFS...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

Change current->fs[ug]id to current_fs[ug]id() so that fsgid and fsuid can be
separated from the task_struct.
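
The point of the accessor can be sketched in userspace as follows. All names here (demo_task, demo_cred, demo_current_fsuid and so on) are hypothetical; the sketch only shows that once callers go through a function, the storage behind it can move out of the task structure without touching every call site.

```c
#include <assert.h>

typedef unsigned int demo_uid_t;

/* The filesystem IDs now live in a separate record (assumed layout). */
struct demo_cred {
	demo_uid_t fsuid;
	demo_uid_t fsgid;
};

/* The task only points at the record instead of embedding the IDs. */
struct demo_task {
	struct demo_cred *cred;
};

/* Stand-in for the kernel's "current" task pointer. */
static struct demo_task *demo_current;

/* Callers use these accessors rather than dereferencing the task. */
static demo_uid_t demo_current_fsuid(void)
{
	assert(demo_current != 0 && demo_current->cred != 0);
	return demo_current->cred->fsuid;
}

static demo_uid_t demo_current_fsgid(void)
{
	assert(demo_current != 0 && demo_current->cred != 0);
	return demo_current->cred->fsgid;
}
```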

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
---

arch/ia64/kernel/perfmon.c | 4 ++--
arch/powerpc/platforms/cell/spufs/inode.c | 4 ++--
drivers/isdn/capi/capifs.c | 4 ++--
drivers/usb/core/inode.c | 4 ++--
fs/9p/fid.c | 2 +-
fs/9p/vfs_inode.c | 4 ++--
fs/9p/vfs_super.c | 4 ++--
fs/affs/inode.c | 4 ++--
fs/anon_inodes.c | 4 ++--
fs/attr.c | 4 ++--
fs/bfs/dir.c | 4 ++--
fs/cifs/cifsproto.h | 2 +-
fs/cifs/dir.c | 12 ++++++------
fs/cifs/inode.c | 8 ++++----
fs/cifs/misc.c | 4 ++--
fs/coda/cache.c | 6 +++---
fs/coda/upcall.c | 4 ++--
fs/devpts/inode.c | 4 ++--
fs/dquot.c | 2 +-
fs/exec.c | 4 ++--
fs/ext2/balloc.c | 2 +-
fs/ext2/ialloc.c | 4 ++--
fs/ext2/ioctl.c | 2 +-
fs/ext3/balloc.c | 2 +-
fs/ext3/ialloc.c | 4 ++--
fs/ext4/balloc.c | 2 +-
fs/ext4/ialloc.c | 4 ++--
fs/fuse/dev.c | 4 ++--
fs/gfs2/inode.c | 10 +++++-----
fs/hfs/inode.c | 4 ++--
fs/hfsplus/inode.c | 4 ++--
fs/hpfs/namei.c | 24 ++++++++++++---------...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:32 am

Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.

CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling. This is called cachefilesd and lives in
/sbin. The source for the daemon can be downloaded from:

http://people.redhat.com/~dhowells/cachefs/cachefilesd.c

And an example configuration from:

http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf

The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services. Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.

CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
to communicate with the daemon. Only one thing may have this open at once,
and whilst it is open, a cache is at least partially in existence. The daemon
opens this and sends commands down it to control the cache.

CacheFiles is currently limited to a single cache.

CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section. This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.
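
The expand-and-contract behaviour described above might be modelled as a simple two-threshold decision. This is only a sketch under assumptions: the structure, the field names and the use of hysteresis between two percentages are hypothetical and are not taken from the CacheFiles source or its configuration file.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy: two free-space percentages giving hysteresis. */
struct demo_cull_policy {
	unsigned int cull_below_pct;	/* start culling below this much free space */
	unsigned int stop_above_pct;	/* stop culling once free space exceeds this */
};

/* Percentage of the backing filesystem that is currently free. */
static unsigned int demo_free_pct(unsigned long long free_blocks,
				  unsigned long long total_blocks)
{
	assert(total_blocks > 0 && free_blocks <= total_blocks);
	return (unsigned int)(free_blocks * 100 / total_blocks);
}

/* Return the new culling state given the old state and the free space. */
static bool demo_update_culling(const struct demo_cull_policy *policy,
				bool culling, unsigned int free_pct)
{
	if (free_pct < policy->cull_below_pct)
		return true;
	if (free_pct > policy->stop_above_pct)
		return false;
	return culling;	/* between the thresholds: keep the current state */
}
```

Using two thresholds rather than one avoids flapping between culling and not culling when the free space hovers around a single limit.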

============
REQUIREMENTS
============

The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:

- dnotify.

- extended attributes (xattrs).

- openat() and friends.

- bmap() support on files in the filesystem (FIBMAP ioctl).

- The use of bmap() to detect a partial page at the end of the file.

It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.

=========...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

Make the keyring quotas controllable through /proc/sys files:

 (*) /proc/sys/kernel/keys/root_maxkeys
     /proc/sys/kernel/keys/root_maxbytes

     Maximum number of keys that root may have and the maximum total number of
     bytes of data that root may have stored in those keys.

 (*) /proc/sys/kernel/keys/maxkeys
     /proc/sys/kernel/keys/maxbytes

     Maximum number of keys that each non-root user may have and the maximum
     total number of bytes of data that each of those users may have stored in
     their keys.

Also increase the quotas as a number of people have been complaining that
they're not big enough. I'm not sure that they're big enough now either, but on
the other hand, they can now be set in /etc/sysctl.conf.
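
The split between root and non-root limits could be sketched like this. The structures, the function name and the numeric limits are placeholders chosen for the example, not the values or code from the patch; the only idea carried over is that root gets its own pair of limits and that both the key count and the byte count are checked.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_key_quota {
	unsigned int maxkeys;	/* maximum number of keys */
	unsigned int maxbytes;	/* maximum total bytes of key data */
};

struct demo_key_user {
	unsigned int uid;
	unsigned int qnkeys;	/* keys currently charged to the user */
	unsigned int qnbytes;	/* bytes currently charged to the user */
};

/* Placeholder limits; in the patch these become sysctl-tunable. */
static struct demo_key_quota demo_root_quota = { 1000, 100000 };
static struct demo_key_quota demo_user_quota = { 100, 10000 };

/* Charge one key of the given size to the user if the quota allows it. */
static bool demo_key_reserve(struct demo_key_user *user, unsigned int bytes)
{
	const struct demo_key_quota *quota;

	assert(user != 0);
	quota = user->uid == 0 ? &demo_root_quota : &demo_user_quota;

	if (user->qnkeys + 1 > quota->maxkeys)
		return false;
	if (bytes > quota->maxbytes || user->qnbytes > quota->maxbytes - bytes)
		return false;

	user->qnkeys++;
	user->qnbytes += bytes;
	return true;
}
```

Raising demo_user_quota.maxkeys at run time plays the role that writing to the sysctl file would play on a real system.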

Signed-off-by: David Howells <dhowells@redhat.com>
---

Documentation/keys.txt | 24 +++++++++++++++++++++-
include/linux/key.h | 5 +++++
kernel/sysctl.c | 9 ++++++++
security/keys/Makefile | 1 +
security/keys/internal.h | 14 +++++++++----
security/keys/key.c | 23 +++++++++++++++++----
security/keys/keyctl.c | 12 ++++++++---
security/keys/proc.c | 9 ++++++--
security/keys/sysctl.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 131 insertions(+), 16 deletions(-)
create mode 100644 security/keys/sysctl.c

diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index b82d38d..38a90d9 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -170,7 +170,8 @@ The key service provides a number of features besides keys:
amount of description and payload space that can be consumed.

The user can view information on this and other statistics through procfs
- files.
+ files. The root user may also alter the quota limits through sysctl files
+ (see the section "New procfs files").

Process-specific and thread-specific keyrings are not counted towards a
user's quota.
@@ -329,6 +330,27 @@ about the status of the ...

To: David Howells <dhowells@...>
Cc: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>, <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, April 1, 2008 - 11:29 am

Hello David,

you're our hero! ;-)

We just hit this wall while migrating from RHEL 3 to RHEL 5 with some of
our webservers.

[root@lvr11 ~]# cat /proc/key-users
0: 99 98/98 96/100 1681/10000
32: 2 2/2 2/100 56/10000
38: 2 2/2 2/100 56/10000
43: 2 2/2 2/100 56/10000
51: 2 2/2 2/100 56/10000
68: 2 2/2 2/100 56/10000
81: 2 2/2 2/100 56/10000
99: 2 2/2 2/100 56/10000
348: 2 2/2 2/100 58/10000
42216: 2 2/2 2/100 62/10000
55188: 3 3/3 3/100 72/10000
56537: 2 2/2 2/100 62/10000
63743: 2 2/2 2/100 62/10000
68054: 2 2/2 2/100 62/10000

....

We're using OpenAFS on our systems and most of our webpages are stored
in AFS. We have a lot of small projects for which a separate server
would be a waste of 'metal'. Even in a virtual environment. So we're
hosting a lot of apache instances on a single machine. Because suexec
doesn't work in an AFS environment, each instance is started by root
with its own IP (to be able to talk HTTPS) and in a PAG with a separate
token for a service user (to isolate the projects). Although each apache
switches over to the service user, the initial tokens are acquired by root.

On RHEL 3 with the old 2.4 kernel this was never a problem. But now...

Btw.: We have some machines with about a hundred (!) different projects
which need tokens.

Best regards,

Berthold Cogel

--
Dr. Berthold Cogel University of Cologne
E-Mail: cogel@uni-koeln.de ZAIK-US (RRZK)
Tel.: +49(0)221/470-7873 Robert-Koch-Str. 10
FAX: +49(0)221/478-85845 D-50931 Cologne - Germany
--

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:33 am

Add read context retention so that FS-Cache can call back into NFS when a read
operation on the cache fails with EIO instead of returning data. This permits NFS to
then fetch the data from the server instead using the appropriate security
context.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/fscache-index.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 1522496..54b8d9e 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -287,6 +287,30 @@ static void nfs_fscache_inode_now_uncached(void *cookie_netfs_data)
}

/*
+ * Get an extra reference on a read context.
+ * - This function can be absent if the completion function doesn't require a
+ * context.
+ * - The read context is passed back to NFS in the event that a data read on the
+ * cache fails with EIO - in which case the server must be contacted to
+ * retrieve the data, which requires the read context for security.
+ */
+static void nfs_fh_get_context(void *cookie_netfs_data, void *context)
+{
+ get_nfs_open_context(context);
+}
+
+/*
+ * Release an extra reference on a read context.
+ * - This function can be absent if the completion function doesn't require a
+ * context.
+ */
+static void nfs_fh_put_context(void *cookie_netfs_data, void *context)
+{
+ if (context)
+ put_nfs_open_context(context);
+}
+
+/*
* Define the inode object for FS-Cache. This is used to describe an inode
* object to fscache_acquire_cookie(). It is keyed by the NFS file handle for
* an inode.
@@ -303,4 +327,6 @@ const struct fscache_cookie_def nfs_fscache_inode_object_def = {
.get_aux = nfs_fscache_inode_get_aux,
.check_aux = nfs_fscache_inode_check_aux,
.now_uncached = nfs_fscache_inode_now_uncached,
+ .get_context = nfs_fh_get_context,
+ .put_context = nfs_fh_put_context,
};

--
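
The get/put pairing in the patch follows the usual reference-counting pattern, which can be sketched in userspace as below. The demo_ names are hypothetical, the counter is a plain int rather than an atomic, and the freed_flag member exists only so the example can observe the release.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an NFS open context. */
struct demo_open_context {
	int refcount;
	int *freed_flag;	/* set to 1 when the context is released */
};

static struct demo_open_context *demo_ctx_alloc(int *freed_flag)
{
	struct demo_open_context *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return NULL;
	ctx->refcount = 1;
	ctx->freed_flag = freed_flag;
	if (freed_flag)
		*freed_flag = 0;
	return ctx;
}

/* Take an extra reference so the context outlives the cache read. */
static void demo_get_context(void *context)
{
	struct demo_open_context *ctx = context;

	assert(ctx != NULL && ctx->refcount > 0);
	ctx->refcount++;
}

/* Drop a reference; a NULL context is tolerated, as in the patch. */
static void demo_put_context(void *context)
{
	struct demo_open_context *ctx = context;

	if (!ctx)
		return;
	assert(ctx->refcount > 0);
	if (--ctx->refcount == 0) {
		if (ctx->freed_flag)
			*ctx->freed_flag = 1;
		free(ctx);
	}
}
```

The cache takes its own reference before starting the read and drops it on completion, so the context remains valid if the read fails and the request has to be sent to the server instead.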

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:32 am

Add a MAINTAINERS record for AFS

Signed-off-by: David Howells <dhowells@redhat.com>
---

MAINTAINERS | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index f1ed75c..38659ac 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -163,6 +163,12 @@ M: A2232@gmx.net
L: linux-m68k@lists.linux-m68k.org
S: Maintained

+AFS FILESYSTEM & AF_RXRPC SOCKET DOMAIN
+P: David Howells
+M: dhowells@redhat.com
+L: linux-afs@lists.infradead.org
+S: Supported
+
AIO
P: Benjamin LaHaise
M: bcrl@kvack.org

--

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:32 am

From: Sven Schnelle <svens@stackframe.org>

kafs doesn't check if the cell already exists - so if you do an
echo "add newcell.org 1.2.3.4" >/proc/fs/afs/cells it will try to
create this cell again. kobject will also complain about a double
registration. To prevent such problems, return -EEXIST in that case.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/afs/cell.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/afs/cell.c b/fs/afs/cell.c
index 970d38f..788865d 100644
--- a/fs/afs/cell.c
+++ b/fs/afs/cell.c
@@ -127,14 +127,20 @@ struct afs_cell *afs_cell_create(const char *name, char *vllist)

_enter("%s,%s", name, vllist);

+ down_write(&afs_cells_sem);
+ read_lock(&afs_cells_lock);
+ list_for_each_entry(cell, &afs_cells, link) {
+ if (strcasecmp(cell->name, name) == 0)
+ goto duplicate_name;
+ }
+ read_unlock(&afs_cells_lock);
+
cell = afs_cell_alloc(name, vllist);
if (IS_ERR(cell)) {
_leave(" = %ld", PTR_ERR(cell));
return cell;
}

- down_write(&afs_cells_sem);
-
/* add a proc directory for this cell */
ret = afs_proc_cell_setup(cell);
if (ret < 0)
@@ -167,6 +173,11 @@ error:
kfree(cell);
_leave(" = %d", ret);
return ERR_PTR(ret);
+
+duplicate_name:
+ read_unlock(&afs_cells_lock);
+ up_write(&afs_cells_sem);
+ return ERR_PTR(-EEXIST);
}

/*

--
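
The duplicate check can be illustrated with a small userspace sketch. The list, the helper names and the name-length limit are hypothetical; only the idea of scanning the existing cells case-insensitively and returning -EEXIST comes from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct demo_cell {
	struct demo_cell *next;
	char name[64];
};

static struct demo_cell *demo_cells;

/* Case-insensitive equality, standing in for the kernel's strcasecmp(). */
static int demo_name_eq(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	return *a == *b;
}

/* Create a cell unless one with the same name already exists. */
static int demo_cell_create(const char *name)
{
	struct demo_cell *cell;

	assert(name != NULL);
	if (strlen(name) >= sizeof(cell->name))
		return -ENAMETOOLONG;

	for (cell = demo_cells; cell; cell = cell->next)
		if (demo_name_eq(cell->name, name))
			return -EEXIST;

	cell = calloc(1, sizeof(*cell));
	if (!cell)
		return -ENOMEM;
	strcpy(cell->name, name);
	cell->next = demo_cells;
	demo_cells = cell;
	return 0;
}
```

The sketch is single-threaded and omits locking. In the kernel code, any return taken after down_write(&afs_cells_sem) has to release the semaphore, as the duplicate_name path does.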

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

Remove the temporarily embedded task security record from task_struct. Instead
it is made to dangle from the task_struct::sec and task_struct::act_as pointers
with references counted for each.

do_coredump() is made to create a copy of the security record, modify it and
then use that to override the main one for a task. sys_faccessat() is made to
do the same.

The process and session keyrings are moved from signal_struct into a new
thread_group_security struct. This is then refcounted, with pointers coming
from the task_security struct instead of from signal_struct.

The keyring functions then take pointers to task_security structs rather than
task_structs for their security contexts. This is so that request_key() can
proceed asynchronously without having to worry about the initiator task's
act_as pointer changing.

The LSM hooks for dealing with task security are modified to deal with the task
security struct directly rather than going via the task_struct as appropriate.

This permits the subjective security context of a task to be overridden by
changing its act_as pointer without altering its objective security pointer,
and thus not breaking signalling, ptrace, etc. whilst the override is in force.
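
A simplified model of the two pointers may help. Everything below is a hypothetical userspace sketch (the demo_ names, the plain int reference count, the single fsuid field); it only illustrates that the subjective pointer can be swapped and later restored while the objective pointer stays put.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_task_security {
	int usage;		/* reference count (not atomic in this sketch) */
	unsigned int fsuid;
};

struct demo_task {
	struct demo_task_security *sec;		/* objective: how others see the task */
	struct demo_task_security *act_as;	/* subjective: how the task acts */
};

static struct demo_task_security *demo_get_sec(struct demo_task_security *s)
{
	assert(s != NULL && s->usage > 0);
	s->usage++;
	return s;
}

static void demo_put_sec(struct demo_task_security *s)
{
	assert(s != NULL && s->usage > 0);
	if (--s->usage == 0)
		free(s);
}

/* Copy a record so that it can be modified privately. */
static struct demo_task_security *demo_dup_sec(const struct demo_task_security *old)
{
	struct demo_task_security *copy = malloc(sizeof(*copy));

	if (!copy)
		return NULL;
	*copy = *old;
	copy->usage = 1;
	return copy;
}

/* Install an override; the caller keeps the old pointer for the revert. */
static struct demo_task_security *demo_override(struct demo_task *task,
						struct demo_task_security *new_sec)
{
	struct demo_task_security *old = task->act_as;

	task->act_as = demo_get_sec(new_sec);
	return old;
}

/* Restore the previous subjective context and drop the override. */
static void demo_revert(struct demo_task *task, struct demo_task_security *old)
{
	struct demo_task_security *override = task->act_as;

	task->act_as = old;
	demo_put_sec(override);
}
```

While the override is installed, anything that inspects task->sec (signalling or ptrace in the description above) still sees the unmodified record.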

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/exec.c | 15 +-
fs/open.c | 37 ++---
include/linux/init_task.h | 18 --
include/linux/key-ui.h | 10 -
include/linux/key.h | 29 +---
include/linux/sched.h | 40 ++++-
include/linux/security.h | 43 ++++-
kernel/Makefile | 2
kernel/cred.c | 140 +++++++++++++++++
kernel/exit.c | 1
kernel/fork.c | 40 +----
kernel/kmod.c | 10 -
kernel/sys.c | 16 +-
net/rxrpc/ar-key.c | 4
security/dummy.c | 14 +-
security/keys/internal....

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

Recruit a couple of page flags to aid in cache management. The following extra
flags are defined:

(1) PG_fscache (PG_private_2)

The marked page is backed by a local cache and is pinning resources in the
cache driver.

(2) PG_fscache_write (PG_owner_priv_2)

The marked page is being written to the local cache. The page may not be
modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that. This includes things like truncation and page invalidation.
The function page_has_private() has been added to make the checks for both
PG_private and PG_private_2 at the same time.
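
The combined check amounts to testing two bits with one mask. The sketch below is a userspace approximation: the bit positions and the demo_ names are placeholders, not the values assigned in page-flags.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit positions, chosen only for the example. */
enum {
	DEMO_PG_private = 11,
	DEMO_PG_private_2 = 12,
};

struct demo_page {
	unsigned long flags;
};

/* True if either private bit is set, so one test covers both owners. */
static bool demo_page_has_private(const struct demo_page *page)
{
	const unsigned long mask = (1UL << DEMO_PG_private) |
				   (1UL << DEMO_PG_private_2);

	assert(page != 0);
	return (page->flags & mask) != 0;
}
```

Call sites that used to test only the first bit, such as the splice change below, can switch to the combined helper without caring which subsystem marked the page.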

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/splice.c | 2 +-
include/linux/page-flags.h | 39 +++++++++++++++++++++++++++++++++++++--
include/linux/pagemap.h | 11 +++++++++++
mm/filemap.c | 18 ++++++++++++++++++
mm/migrate.c | 2 +-
mm/page_alloc.c | 3 +++
mm/readahead.c | 9 +++++----
mm/swap.c | 4 ++--
mm/swap_state.c | 4 ++--
mm/truncate.c | 10 +++++-----
mm/vmscan.c | 2 +-
11 files changed, 86 insertions(+), 18 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 0670c91..40fdc28 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
*/
wait_on_page_writeback(page);

- if (PagePrivate(page))
+ if (page_has_private(page))
try_to_release_page(page, GFP_KERNEL);

/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5b30f1..3c16772 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -77,25 +77,32 @@
#define PG_active 6
#define PG_slab 7 /* slab debug (Suparna wants this) */

-#define PG_owner_priv_1 8 /* Owner use. If pagecache, fs may use*/
+#define PG_owner_priv_1...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

Don't generate the per-UID user and user session keyrings unless they're
explicitly accessed. This solves a problem during a login process whereby
set*uid() is called before the SELinux PAM module, resulting in the per-UID
keyrings having the wrong security labels.

This also cures the problem of multiple per-UID keyrings sometimes appearing
due to PAM modules (including pam_keyinit) setuiding and causing user_structs
to come into and go out of existence whilst the session keyring pins the user
keyring. This is achieved by first searching for extant per-UID keyrings before
inventing new ones.

The serial bound argument is also dropped from find_keyring_by_name() as it's
not currently made use of (setting it to 0 disables the feature).
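
The search-before-create behaviour can be sketched as a lazy lookup. The list, the counter, the function names and the "_uid.%u" naming scheme used here are assumptions made for the example rather than details confirmed by the patch text above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct demo_keyring {
	struct demo_keyring *next;
	char name[32];
};

static struct demo_keyring *demo_keyrings;
static unsigned int demo_keyrings_created;

/* Look for an existing keyring with the given name. */
static struct demo_keyring *demo_find_keyring(const char *name)
{
	struct demo_keyring *k;

	assert(name != NULL);
	for (k = demo_keyrings; k; k = k->next)
		if (strcmp(k->name, name) == 0)
			return k;
	return NULL;
}

/* Return the per-UID keyring, creating it only on first access. */
static struct demo_keyring *demo_lookup_user_keyring(unsigned int uid)
{
	struct demo_keyring *k;
	char name[32];

	snprintf(name, sizeof(name), "_uid.%u", uid);

	k = demo_find_keyring(name);
	if (k)
		return k;	/* reuse the extant keyring */

	k = calloc(1, sizeof(*k));
	if (!k)
		return NULL;
	strcpy(k->name, name);
	k->next = demo_keyrings;
	demo_keyrings = k;
	demo_keyrings_created++;
	return k;
}
```

Because nothing is allocated until the first lookup, a login sequence that changes UID before its security label is set never creates a keyring under the wrong label, and repeated lookups cannot produce duplicates.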

Signed-off-by: David Howells <dhowells@redhat.com>
---

include/linux/key.h | 8 --
kernel/user.c | 15 +---
security/keys/internal.h | 4 -
security/keys/key.c | 45 -------------
security/keys/keyring.c | 19 ++----
security/keys/process_keys.c | 142 +++++++++++++++++++++++++-----------------
security/selinux/hooks.c | 8 --
7 files changed, 96 insertions(+), 145 deletions(-)

diff --git a/include/linux/key.h b/include/linux/key.h
index 8b0bd33..2effd03 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -268,9 +268,6 @@ extern struct key *key_lookup(key_serial_t id);
/*
* the userspace interface
*/
-extern struct key root_user_keyring, root_session_keyring;
-extern int alloc_uid_keyring(struct user_struct *user,
- struct task_struct *ctx);
extern void switch_uid_keyring(struct user_struct *new_user);
extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk);
extern int copy_thread_group_keys(struct task_struct *tsk);
@@ -299,7 +296,6 @@ extern void key_init(void);
#define make_key_ref(k, p) ({ NULL; })
#define key_ref_to_ptr(k) ({ NULL; })
#define is_key_possessed(k) 0
-#define alloc_uid_keyring(u,c) 0
#define swi...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:32 am

Export a number of functions for CacheFiles's use.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/super.c | 1 +
security/security.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 09008db..8030909 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -267,6 +267,7 @@ int fsync_super(struct super_block *sb)
__fsync_super(sb);
return sync_blockdev(sb->s_bdev);
}
+EXPORT_SYMBOL_GPL(fsync_super);

/**
* generic_shutdown_super - common helper for ->kill_sb()
diff --git a/security/security.c b/security/security.c
index 87cd150..deb2e65 100644
--- a/security/security.c
+++ b/security/security.c
@@ -358,6 +358,7 @@ int security_inode_create(struct inode *dir, struct dentry *dentry, int mode)
return 0;
return security_ops->inode_create(dir, dentry, mode);
}
+EXPORT_SYMBOL_GPL(security_inode_create);

int security_inode_link(struct dentry *old_dentry, struct inode *dir,
struct dentry *new_dentry)
@@ -388,6 +389,7 @@ int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
return 0;
return security_ops->inode_mkdir(dir, dentry, mode);
}
+EXPORT_SYMBOL_GPL(security_inode_mkdir);

int security_inode_rmdir(struct inode *dir, struct dentry *dentry)
{

--

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

Allow kernel services to override LSM settings appropriate to the actions
performed by a task by duplicating a security record, modifying it and then
using task_struct::act_as to point to it when performing operations on behalf
of a task.

This is used, for example, by CacheFiles which has to transparently access the
cache on behalf of a process that thinks it is doing, say, NFS accesses with a
potentially inappropriate (with respect to accessing the cache) set of
security data.

This patch provides two LSM hooks for modifying a task security record:

(*) security_kernel_act_as() which allows modification of the security datum
with which a task acts on other objects (most notably files).

(*) security_create_files_as() which allows modification of the security
datum that is used to initialise the security data on a file that a task
creates.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [Smack changes]
Signed-off-by: David Howells <dhowells@redhat.com>
---

include/linux/capability.h | 12 ++---
include/linux/cred.h | 23 +++++++++
include/linux/security.h | 43 ++++++++++++++++-
kernel/cred.c | 112 ++++++++++++++++++++++++++++++++++++++++++++
security/dummy.c | 17 ++++++-
security/security.c | 15 ++++++
security/selinux/hooks.c | 51 ++++++++++++++++++++
security/smack/smack_lsm.c | 43 +++++++++++++++++
8 files changed, 303 insertions(+), 13 deletions(-)
create mode 100644 include/linux/cred.h

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 7d50ff6..424de01 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -364,12 +364,12 @@ typedef struct kernel_cap_struct {
# error Fix up hand-coded capability macro initializers
#else /* HAND-CODED capability initializers */

-# define CAP_EMPTY_SET {{ 0, 0 }}
-# define CAP_FULL_SET {{ ~0, ~0 }}
-# define CAP_INIT_EFF_SET {{ ~CAP_TO_MASK(CAP_SETPCAP), ~0 }}
-# define CA...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:32 am

Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.

Signed-off-by: David Howells <dhowells@redhat.com>
---

include/linux/pagemap.h | 5 +++++
mm/filemap.c | 18 ++++++++++++++++++
2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c8bd762..76b5307 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -242,6 +242,11 @@ static inline void wait_on_page_owner_priv_2(struct page *page)
extern void end_page_owner_priv_2(struct page *page);

/*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
* Fault a userspace page into pagetables. Return non-zero on a fault.
*
* This assumes that two userspace pages are always sufficient. That's
diff --git a/mm/filemap.c b/mm/filemap.c
index bd3ab83..5cba32b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -547,6 +547,24 @@ void wait_on_page_bit(struct page *page, int bit_nr)
EXPORT_SYMBOL(wait_on_page_bit);

/**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page: Page defining the wait queue of interest
+ * @waiter: Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+ wait_queue_head_t *q = page_waitqueue(page);
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->lock, flags);
+ __add_wait_queue(q, waiter);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
* unlock_page - unlock a locked page
* @page: the page
*

--
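
The monitor idea can be modelled in userspace as a list of waiters, each carrying a callback that runs when the page is unlocked. All names below are hypothetical, there is no locking, and the wake function removes every waiter after calling it, which is a simplification of how kernel wait queues behave.

```c
#include <assert.h>
#include <stddef.h>

struct demo_waiter {
	struct demo_waiter *next;
	void (*func)(struct demo_waiter *waiter, void *page);
};

struct demo_wait_queue {
	struct demo_waiter *head;
};

/* A monitor embeds a waiter and records which page it is watching. */
struct demo_read_monitor {
	struct demo_waiter waiter;
	void *page;
	int completed;
};

/* Add an arbitrary waiter to the queue. */
static void demo_add_wait_queue(struct demo_wait_queue *q, struct demo_waiter *w)
{
	assert(q != NULL && w != NULL && w->func != NULL);
	w->next = q->head;
	q->head = w;
}

/* Called on unlock: run each waiter's callback once and empty the queue. */
static unsigned int demo_wake_all(struct demo_wait_queue *q, void *page)
{
	struct demo_waiter *w = q->head;
	unsigned int woken = 0;

	q->head = NULL;
	while (w) {
		struct demo_waiter *next = w->next;

		w->next = NULL;
		w->func(w, page);
		woken++;
		w = next;
	}
	return woken;
}

/* Callback: recover the monitor from its embedded waiter and mark it done. */
static void demo_read_complete(struct demo_waiter *waiter, void *page)
{
	struct demo_read_monitor *monitor = (struct demo_read_monitor *)
		((char *)waiter - offsetof(struct demo_read_monitor, waiter));

	if (monitor->page == page)
		monitor->completed = 1;
}
```

In the CacheFiles case described above, the callback would be the point at which the data read into the backing page is copied to the waiting netfs page.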

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:33 am

Store pages from an NFS inode into the cache data storage object associated
with that inode.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/fscache.c | 28 ++++++++++++++++++++++++++++
fs/nfs/fscache.h | 16 ++++++++++++++++
fs/nfs/read.c | 5 +++++
3 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index d147bd0..5ca900a 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -503,3 +503,31 @@ int __nfs_readpages_from_fscache(struct nfs_open_context *ctx,

return ret;
}
+
+/*
+ * Store a newly fetched page in fscache
+ * - PG_fscache must be set on the page
+ */
+void __nfs_readpage_to_fscache(struct inode *inode, struct page *page, int sync)
+{
+ int ret;
+
+ dfprintk(FSCACHE,
+ "NFS: readpage_to_fscache(fsc:%p/p:%p(i:%lx f:%lx)/%d)\n",
+ NFS_I(inode)->fscache, page, page->index, page->flags, sync);
+
+ ret = fscache_write_page(NFS_I(inode)->fscache, page, GFP_KERNEL);
+ dfprintk(FSCACHE,
+ "NFS: readpage_to_fscache: p:%p(i:%lu f:%lx) ret %d\n",
+ page, page->index, page->flags, ret);
+
+ if (ret != 0) {
+ fscache_uncache_page(NFS_I(inode)->fscache, page);
+ nfs_add_fscache_stats(inode,
+ NFSIOS_FSCACHE_PAGES_WRITTEN_FAIL, 1);
+ nfs_add_fscache_stats(inode, NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
+ } else {
+ nfs_add_fscache_stats(inode,
+ NFSIOS_FSCACHE_PAGES_WRITTEN_OK, 1);
+ }
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 52f0ccf..2f71d7e 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -92,6 +92,7 @@ extern int __nfs_readpage_from_fscache(struct nfs_open_context *,
extern int __nfs_readpages_from_fscache(struct nfs_open_context *,
struct inode *, struct address_space *,
struct list_head *, unsigned *);
+extern void __nfs_readpage_to_fscache(struct inode *, struct page *, int);

/*
* release the caching state associated with a page if undergoing complete page
@@ -131,6 +132,19 @...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

The attached patch adds a generic intermediary (FS-Cache) by which filesystems
may call on local caching capabilities, and by which local caching backends may
make caches available:

+---------+
|         |                        +--------------+
|   NFS   |--+                     |              |
|         |  |                 +-->|   CacheFS    |
+---------+  |   +----------+  |   |  /dev/hda5   |
             |   |          |  |   +--------------+
+---------+  +-->|          |  |
|         |      |          |--+
|   AFS   |----->| FS-Cache |
|         |      |          |--+
+---------+  +-->|          |  |
             |   |          |  |   +--------------+
+---------+  |   +----------+  |   |              |
|         |  |                 +-->|  CacheFiles  |
|  ISOFS  |--+                     |  /var/cache  |
|         |                        +--------------+
+---------+

The patch also documents the netfs interface and the cache backend
interface provided by the facility.

There are a number of reasons why I'm not using i_mapping to do this.
These have been discussed a lot on the LKML and CacheFS mailing lists,
but to summarise the basics:

(1) Most filesystems don't do hole reportage. Holes in files are treated as
blocks of zeros and can't be distinguished otherwise, making it difficult
to distinguish blocks that have been read from the network and cached from
those that haven't.

(2) The backing inode must be fully populated before being exposed to
userspace through the main inode because the VM/VFS goes directly to the
backing inode and does not interrogate the front inode on VM ops.

Therefore:

(a) The backing inode must fit entirely within the cache.

(b) All backed files currently open must fit entirely within the cache at
the same time.

(c) A working set of files in total larger than the cache may not be
cached.

(d) A file may not grow larger than the...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

Make NFSD work with detached security, using the patches that excise the
security information from task_struct to struct task_security as a base.

Each time NFSD wants a new security descriptor (to do NFS4 recovery or just to
do NFS operations), a task_security record is derived from NFSD's *objective*
security, modified and then applied as the *subjective* security. This means
(a) the changes are not visible to anyone looking at NFSD through /proc, (b)
there is no leakage between two consecutive ops with different security
configurations.

Consideration should probably be given to caching the task_security record on
the basis that there'll probably be several ops that will want to use any
particular security configuration.

Furthermore, nfs4recover.c perhaps ought to set an appropriate LSM context on
the record pointed to by rec_security so that the disk is accessed
appropriately (see set_security_override[_from_ctx]()).

NOTE! This patch must be rolled in to one of the earlier security patches to
make it compile fully.

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfsd/auth.c | 37 +++++++++++++++++++---------
fs/nfsd/nfs4recover.c | 64 +++++++++++++++++++++++++++++++------------------
2 files changed, 65 insertions(+), 36 deletions(-)

diff --git a/fs/nfsd/auth.c b/fs/nfsd/auth.c
index 5586157..ebdc562 100644
--- a/fs/nfsd/auth.c
+++ b/fs/nfsd/auth.c
@@ -6,6 +6,7 @@

#include <linux/types.h>
#include <linux/sched.h>
+#include <linux/cred.h>
#include <linux/sunrpc/svc.h>
#include <linux/sunrpc/svcauth.h>
#include <linux/nfsd/nfsd.h>
@@ -26,12 +27,17 @@ int nfsexp_flags(struct svc_rqst *rqstp, struct svc_export *exp)

 int nfsd_setuser(struct svc_rqst *rqstp, struct svc_export *exp)
 {
-	struct task_security *act_as = current->act_as;
+	struct task_security *sec, *old;
 	struct svc_cred cred = rqstp->rq_cred;
 	int i;
 	int flags = nfsexp_flags(rqstp, exp);
 	int ret;

+	/* de...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

From: Prarit Bhargava <prarit@redhat.com>

This one-line patch adds the export of copy_page that the CacheFiles patches
need. The patch is not yet upstream, but is required for CacheFiles on ia64.
It will be pushed upstream when CacheFiles goes upstream.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---

arch/ia64/kernel/ia64_ksyms.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/kernel/ia64_ksyms.c b/arch/ia64/kernel/ia64_ksyms.c
index 8e7193d..3e544f4 100644
--- a/arch/ia64/kernel/ia64_ksyms.c
+++ b/arch/ia64/kernel/ia64_ksyms.c
@@ -46,6 +46,7 @@ EXPORT_SYMBOL(__do_clear_user);
EXPORT_SYMBOL(__strlen_user);
EXPORT_SYMBOL(__strncpy_from_user);
EXPORT_SYMBOL(__strnlen_user);
+EXPORT_SYMBOL(copy_page);

/* from arch/ia64/lib */
extern void __divsi3(void);

--

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

Check the starting keyring as part of the search (a) to see if that is what
we're searching for, and (b) to check it is still valid for searching.

The scenario: A user in process A performs operations that cause keys to be
created in its process session keyring. The user then runs su to become
another user and starts a new process, B. The two processes now
share the same process session keyring.

Process B does an NFS access which results in an upcall to gssd.
When gssd attempts to instantiate the context key (to be linked
into the process session keyring), it is denied access even though it
has an authorization key.

The order of calls is:

keyctl_instantiate_key()
lookup_user_key() (the default: case)
search_process_keyrings(current)
search_process_keyrings(rka->context) (recursive call)
keyring_search_aux()

keyring_search_aux() verifies the keys and keyrings underneath the
top-level keyring it is given, but that top-level keyring is neither
fully validated nor checked to see if it is the thing being searched for.

This patch changes keyring_search_aux() to:
1) do more validation on the top keyring it is given and
2) check whether that top-level keyring is the thing being searched for
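
The two changes listed above can be sketched with a small userspace model.
Everything here (struct model_key, model_search()) is hypothetical and far
simpler than keyring_search_aux(): it assumes an acyclic tree and uses a
single "revoked" flag as the only validity check.

```c
#include <stddef.h>

#define MAX_LINKS 4

/* Hypothetical model of a key or keyring; not the kernel's struct key. */
struct model_key {
	int serial;
	int revoked;
	size_t nr_links;
	struct model_key *links[MAX_LINKS];
};

/*
 * Search for a key by serial number, starting at 'top'.  The starting
 * keyring itself is validated and compared before its contents are
 * examined, so it gets the same treatment as everything beneath it.
 */
struct model_key *model_search(struct model_key *top, int serial)
{
	size_t i;

	/* 1) validate the keyring we were handed */
	if (!top || top->revoked)
		return NULL;

	/* 2) the starting keyring may itself be the thing searched for */
	if (top->serial == serial)
		return top;

	for (i = 0; i < top->nr_links; i++) {
		struct model_key *found = model_search(top->links[i], serial);

		if (found)
			return found;
	}
	return NULL;
}
```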

Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
---

security/keys/keyring.c | 35 +++++++++++++++++++++++++++++++----
1 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 88292e3..76b89b2 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -292,7 +292,7 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,

 	struct keyring_list *keylist;
 	struct timespec now;
-	unsigned long possessed;
+	unsigned long possessed, kflags;
 	struct key *keyring, *key;
 	key_ref_t key_ref;
 	long err;
@@ -318,6 +318,32 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 	now = current_kernel_time();
...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

Make key_serial() an inline function rather than a macro if CONFIG_KEYS=y.
This prevents double evaluation of the key pointer and also provides better
type checking.
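
The double evaluation is easy to demonstrate outside the kernel. The sketch
below uses made-up names (struct demo_key, fetch_key() and the two counting
helpers); only the macro body and the inline function body mirror the patch.

```c
#include <stddef.h>

struct demo_key { int serial; };

/* Same shape as the old macro: 'key' appears twice in the expansion. */
#define KEY_SERIAL_MACRO(key) ((key) ? (key)->serial : 0)

/* Same shape as the new inline: the argument is evaluated exactly once. */
static inline int key_serial_inline(struct demo_key *key)
{
	return key ? key->serial : 0;
}

static int fetch_count;
static struct demo_key the_key = { 42 };

/* An argument expression with a side effect: it counts its own calls. */
struct demo_key *fetch_key(void)
{
	fetch_count++;
	return &the_key;
}

int evaluations_with_macro(void)
{
	fetch_count = 0;
	(void)KEY_SERIAL_MACRO(fetch_key());
	return fetch_count;
}

int evaluations_with_inline(void)
{
	fetch_count = 0;
	(void)key_serial_inline(fetch_key());
	return fetch_count;
}
```

With the macro, fetch_key() runs once for the test and once for the
dereference; with the inline function it runs once. The inline version also
rejects an argument that is not a pointer to the expected type.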

Signed-off-by: David Howells <dhowells@redhat.com>
---

include/linux/key.h | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/key.h b/include/linux/key.h
index ad02d9c..c45c962 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -264,7 +264,10 @@ extern int keyring_add_key(struct key *keyring,

 extern struct key *key_lookup(key_serial_t id);

-#define key_serial(key) ((key) ? (key)->serial : 0)
+static inline key_serial_t key_serial(struct key *key)
+{
+	return key ? key->serial : 0;
+}

 #ifdef CONFIG_SYSCTL
 extern ctl_table key_sysctls[];

--

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

Add a 'kernel_service' object class to SELinux and give this object class two
access vectors: 'use_as_override' and 'create_files_as'.

The first vector is used to grant a process the right to nominate an alternate
process security ID for the kernel to use as an override for the SELinux
subjective security when accessing stuff on behalf of another process.

For example, CacheFiles, when accessing the cache on behalf of a process
accessing an NFS file, needs to use a subjective security ID appropriate to the
cache rather than the one the calling process is using. The cachefilesd
daemon will nominate the security ID to be used.

The second vector is used to grant a process the right to nominate a file
creation label for a kernel service to use.

Signed-off-by: David Howells <dhowells@redhat.com>
---

security/selinux/include/av_perm_to_string.h | 2 ++
security/selinux/include/av_permissions.h | 2 ++
security/selinux/include/class_to_string.h | 1 +
security/selinux/include/flask.h | 1 +
4 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/security/selinux/include/av_perm_to_string.h b/security/selinux/include/av_perm_to_string.h
index d569669..fd6bef7 100644
--- a/security/selinux/include/av_perm_to_string.h
+++ b/security/selinux/include/av_perm_to_string.h
@@ -171,3 +171,5 @@
S_(SECCLASS_DCCP_SOCKET, DCCP_SOCKET__NAME_CONNECT, "name_connect")
S_(SECCLASS_MEMPROTECT, MEMPROTECT__MMAP_ZERO, "mmap_zero")
S_(SECCLASS_PEER, PEER__RECV, "recv")
+ S_(SECCLASS_KERNEL_SERVICE, KERNEL_SERVICE__USE_AS_OVERRIDE, "use_as_override")
+ S_(SECCLASS_KERNEL_SERVICE, KERNEL_SERVICE__CREATE_FILES_AS, "create_files_as")
diff --git a/security/selinux/include/av_permissions.h b/security/selinux/include/av_permissions.h
index 75b4131..02ddf8d 100644
--- a/security/selinux/include/av_permissions.h
+++ b/security/selinux/include/av_permissions.h
@@ -836,3 +836,5 @@
#define DCCP_SOCKET__NAME_CONNECT 0x00800000UL
#defi...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:33 am

Add some new NFS I/O counters for the work FS-Cache does on behalf of NFS. If
caching is enabled, a new line is emitted into /proc/pid/mountstats that looks
like:

fsc: <rok> <rfl> <wok> <wfl> <unc>

Where <rok> is the number of pages read successfully from the cache, <rfl> is
the number of failed page reads against the cache, <wok> is the number of
successful page writes to the cache, <wfl> is the number of failed page writes
to the cache, and <unc> is the number of NFS pages that have been disconnected
from the cache.
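
A monitoring tool could pick the five counters out of such a line as sketched
below. This is a userspace illustration only: struct fsc_counters and
parse_fsc_line() are invented names, and the sketch assumes the values are
printed as whitespace-separated unsigned decimal numbers.

```c
#include <stdio.h>

/* Field names follow the <rok> <rfl> <wok> <wfl> <unc> description. */
struct fsc_counters {
	unsigned long long rok, rfl, wok, wfl, unc;
};

/*
 * Parse one mountstats line.  Returns 0 if it is an "fsc:" line carrying
 * all five counters, and -1 otherwise.
 */
int parse_fsc_line(const char *line, struct fsc_counters *c)
{
	int n = sscanf(line, " fsc: %llu %llu %llu %llu %llu",
		       &c->rok, &c->rfl, &c->wok, &c->wfl, &c->unc);

	return n == 5 ? 0 : -1;
}
```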

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/nfs/iostat.h | 30 ++++++++++++++++++++++++++++++
fs/nfs/super.c | 11 +++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/iostat.h b/fs/nfs/iostat.h
index 6350ecb..20dbc4f 100644
--- a/fs/nfs/iostat.h
+++ b/fs/nfs/iostat.h
@@ -107,6 +107,18 @@ enum nfs_stat_eventcounters {
 	__NFSIOS_COUNTSMAX,
 };

+/*
+ * NFS local caching servicing counters
+ */
+enum nfs_stat_fscachecounters {
+	NFSIOS_FSCACHE_PAGES_READ_OK,
+	NFSIOS_FSCACHE_PAGES_READ_FAIL,
+	NFSIOS_FSCACHE_PAGES_WRITTEN_OK,
+	NFSIOS_FSCACHE_PAGES_WRITTEN_FAIL,
+	NFSIOS_FSCACHE_PAGES_UNCACHED,
+	__NFSIOS_FSCACHEMAX,
+};
+
 #ifdef __KERNEL__

 #include <linux/percpu.h>
@@ -114,6 +126,9 @@ enum nfs_stat_eventcounters {

 struct nfs_iostats {
 	unsigned long long bytes[__NFSIOS_BYTESMAX];
+#ifdef CONFIG_NFS_FSCACHE
+	unsigned long long fscache[__NFSIOS_FSCACHEMAX];
+#endif
 	unsigned long events[__NFSIOS_COUNTSMAX];
 } ____cacheline_aligned;

@@ -149,6 +164,21 @@ static inline void nfs_add_stats(struct inode *inode, enum nfs_stat_bytecounters
 	nfs_add_server_stats(NFS_SERVER(inode), stat, addend);
 }

+#ifdef CONFIG_NFS_FSCACHE
+static inline void nfs_add_fscache_stats(struct inode *inode,
+					 enum nfs_stat_fscachecounters stat,
+					 unsigned long addend)
+{
+	struct nfs_iostats *iostats;
+	int cpu;
+
+	cpu = ge...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

Change all the usages of file->f_mapping in ext3_*write_end() functions to use
the mapping argument directly. This has two consequences:

(*) Consistency. Without this patch sometimes one is used and sometimes the
other is.

(*) A NULL file pointer can be passed. This feature is then made use of by
the generic hook in the next patch, which is used by CacheFiles to write
pages to a file without setting up a file struct.
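
The second consequence can be shown with a small userspace model. The struct
and function names below are hypothetical stand-ins for struct file, struct
address_space and struct inode; only the two access patterns correspond to
the patch.

```c
#include <stddef.h>

struct model_inode { long i_ino; };
struct model_mapping { struct model_inode *host; };
struct model_file { struct model_mapping *f_mapping; };

/* Old pattern: dereferences 'file', so the caller must supply one. */
struct model_inode *host_via_file(struct model_file *file,
				  struct model_mapping *mapping)
{
	(void)mapping;
	return file->f_mapping->host;
}

/* New pattern: uses only 'mapping', so 'file' may be NULL. */
struct model_inode *host_via_mapping(struct model_file *file,
				     struct model_mapping *mapping)
{
	(void)file;
	return mapping->host;
}
```

A caller that has a mapping but no open file, as CacheFiles does, can only use
the second form safely.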

Signed-off-by: David Howells <dhowells@redhat.com>
---

fs/ext3/inode.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index eb95670..c976123 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1215,7 +1215,7 @@ static int ext3_generic_write_end(struct file *file,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata)
 {
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;

 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

@@ -1240,7 +1240,7 @@ static int ext3_ordered_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	unsigned from, to;
 	int ret = 0, ret2;

@@ -1281,7 +1281,7 @@ static int ext3_writeback_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 	loff_t new_i_size;

--

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

Provide an add_wait_queue_tail() function to add a waiter to the back of a
wait queue instead of the front.
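
The difference between the two insertion points can be sketched with a
minimal circular list. This is a userspace model with invented names (struct
waiter, struct wait_head, wq_*()); it has no locking and is not the kernel's
wait queue implementation.

```c
#include <stddef.h>

struct waiter {
	int id;
	struct waiter *prev, *next;
};

struct wait_head {
	struct waiter list;	/* sentinel node of a circular list */
};

void wq_init(struct wait_head *q)
{
	q->list.prev = q->list.next = &q->list;
}

static void insert_between(struct waiter *w,
			   struct waiter *prev, struct waiter *next)
{
	w->prev = prev;
	w->next = next;
	prev->next = w;
	next->prev = w;
}

/* Add at the front: the newest waiter is reached first. */
void wq_add_front(struct wait_head *q, struct waiter *w)
{
	insert_between(w, &q->list, q->list.next);
}

/* Add at the back: the newest waiter is reached last. */
void wq_add_tail(struct wait_head *q, struct waiter *w)
{
	insert_between(w, q->list.prev, &q->list);
}

/* The waiter a walk from the front of the queue would visit first. */
struct waiter *wq_first(struct wait_head *q)
{
	return q->list.next == &q->list ? NULL : q->list.next;
}
```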

Signed-off-by: David Howells <dhowells@redhat.com>
---

include/linux/pagemap.h | 7 +++++--
include/linux/wait.h | 1 +
kernel/wait.c | 18 ++++++++++++++++++
mm/filemap.c | 2 +-
4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c5df3ae..ad9484f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -225,8 +225,11 @@ static inline void wait_on_page_writeback(struct page *page)

extern void end_page_writeback(struct page *page);

-/*
- * Wait for a PG_owner_priv_2 to become clear
+/**
+ * wait_on_page_owner_priv_2 - Wait for PG_owner_priv_2 to become clear
+ * @page: The page to monitor
+ *
+ * Wait for a PG_owner_priv_2 to become clear on the specified page.
*/
static inline void wait_on_page_owner_priv_2(struct page *page)
{
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 0081147..a6a6607 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -118,6 +118,7 @@ static inline int waitqueue_active(wait_queue_head_t *q)
#define is_sync_wait(wait) (!(wait) || ((wait)->private))

extern void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
+extern void add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t *wait);
extern void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait);
extern void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);

diff --git a/kernel/wait.c b/kernel/wait.c
index c275c56..191df0d 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -29,6 +29,24 @@ void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
}
EXPORT_SYMBOL(add_wait_queue);

+/**
+ * add_wait_queue_tail - Add a waiter to the back of a waitqueue
+ * @q: the wait queue to append the waiter to
+ * @wait: the waiter to be queued
+ *
+ * Add a waiter to the back of a wa...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

The attached patch causes read_cache_pages() to release page-private data on a
page for which add_to_page_cache() fails or the filler function fails. This
permits pages with caching references associated with them to be cleaned up.

The invalidatepage() address space op is called (indirectly) to do the honours.

Signed-off-by: David Howells <dhowells@redhat.com>
---

mm/readahead.c | 39 +++++++++++++++++++++++++++++++++++++--
1 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 8762e89..d6b14c1 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -44,6 +44,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);

 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))

+/*
+ * see if a page needs releasing upon read_cache_pages() failure
+ * - the caller of read_cache_pages() may have set PG_private before calling,
+ *   such as the NFS fs marking pages that are cached locally on disk, thus we
+ *   need to give the fs a chance to clean up in the event of an error
+ */
+static void read_cache_pages_invalidate_page(struct address_space *mapping,
+					     struct page *page)
+{
+	if (PagePrivate(page)) {
+		if (TestSetPageLocked(page))
+			BUG();
+		page->mapping = mapping;
+		do_invalidatepage(page, 0);
+		page->mapping = NULL;
+		unlock_page(page);
+	}
+	page_cache_release(page);
+}
+
+/*
+ * release a list of pages, invalidating them first if need be
+ */
+static void read_cache_pages_invalidate_pages(struct address_space *mapping,
+					      struct list_head *pages)
+{
+	struct page *victim;
+
+	while (!list_empty(pages)) {
+		victim = list_to_page(pages);
+		list_del(&victim->lru);
+		read_cache_pages_invalidate_page(mapping, victim);
+	}
+}
+
 /**
  * read_cache_pages - populate an address space with some pages & start reads against them
  * @mapping: the address_space
@@ -65,14 +100,14 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:31 am

Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Alpha needs further alteration as it refers to UID & GID in entry.S via asm
offsets.

Sparc needs further alteration as it refers to UID & GID in sclow.S via asm
offsets.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org> [SELinux stuff mostly]
---

arch/parisc/kernel/signal.c | 2
arch/powerpc/mm/fault.c | 2
arch/s390/hypfs/inode.c | 4 -
arch/s390/kernel/compat_linux.c | 28 ++--
arch/sparc64/kernel/sys_sparc32.c | 28 ++--
drivers/block/loop.c | 5 -
drivers/char/drm/drm_fops.c | 2
drivers/char/tty_audit.c | 5 -
drivers/connector/cn_proc.c | 8 +
drivers/media/video/cpia.c | 2
drivers/net/tun.c | 4 -
drivers/net/wan/sbni.c | 8 +
drivers/usb/core/devio.c | 8 +
fs/affs/super.c | 4 -
fs/autofs/inode.c | 4 -
fs/autofs4/inode.c | 4 -
fs/autofs4/waitq.c | 4 -
fs/binfmt_elf.c | 12 +-
fs/binfmt_elf_fdpic.c | 12 +-
fs/cifs/connect.c | 5 -
fs/cifs/ioctl.c | 2
fs/dquot.c | 3
fs/ecryptfs/messaging.c | 15 +-
fs/exec.c | 20 +--
fs/fat/inode.c | 4 -
fs/fcntl.c | 7 +
fs/file_table.c | 4 -
fs/fuse/dir.c | 12 +-
fs/hfs/super.c | 4 -
fs/hfsplus/options.c | 4 -
fs/hpfs/super.c | 4 -
fs/hugetlbfs/inode.c | 4 -
fs/inotify_user.c | 2
fs/ioprio.c | 12 +-
fs/namei.c ...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

Add a keyctl() function to get the security label of a key.

The following is added to Documentation/keys.txt:

(*) Get the LSM security context attached to a key.

long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
size_t buflen)

This function returns a string that represents the LSM security context
attached to a key in the buffer provided.

Unless there's an error, it always returns the amount of data it could
produce, even if that's too big for the buffer, but it won't copy more
than requested to userspace. If the buffer pointer is NULL then no copy
will take place.

A NUL character is included at the end of the string if the buffer is
sufficiently big. This is included in the returned count. If no LSM is
in force then an empty string will be returned.

A process must have view permission on the key for this function to be
successful.
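
The return-value rules described above can be modelled in userspace. The
function below is a hypothetical stand-in written for this illustration, not
the kernel implementation: it takes the context string directly instead of
looking it up from a key.

```c
#include <stddef.h>
#include <string.h>

/*
 * Model of the documented semantics: always report the full size of the
 * context including its trailing NUL, copy no more than 'buflen' bytes,
 * and copy nothing at all if 'buffer' is NULL.
 */
long model_get_security(const char *context, char *buffer, size_t buflen)
{
	size_t need = strlen(context) + 1;	/* count includes the NUL */

	if (buffer && buflen > 0)
		memcpy(buffer, context, buflen < need ? buflen : need);
	return (long)need;
}
```

Under these rules a caller can first pass a NULL buffer to learn the required
size, allocate that many bytes, and then call again. A return value larger
than buflen means the string in the buffer was truncated and is not
NUL-terminated.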

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
---

Documentation/keys.txt | 21 +++++++++++++++
include/linux/keyctl.h | 1 +
include/linux/security.h | 20 +++++++++++++-
security/dummy.c | 8 ++++++
security/keys/compat.c | 3 ++
security/keys/internal.h | 2 +
security/keys/keyctl.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
security/security.c | 5 +++
security/selinux/hooks.c | 21 +++++++++++++--
9 files changed, 142 insertions(+), 5 deletions(-)

diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index 38a90d9..d5c7a57 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -733,6 +733,27 @@ The keyctl syscall functions are:
The assumed authoritative key is inherited across fork and exec.

+ (*) Get the LSM security context attached to a key.
+
+ long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
+ size_t buflen)
+
+ This function returns a string that represents the LS...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

From: Arun Raghavan <arunsr@cse.iitk.ac.in>

The key_create_or_update() function provided by the keyring code has a default
set of permissions that are always applied to the key when created. This might
not be desirable to all clients.

Here's a patch that adds a "perm" parameter to the function to address this,
which can be set to KEY_PERM_UNDEF to revert to the current behaviour.
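
The sentinel convention can be sketched as follows. All names here are
invented for the example, and MODEL_DEFAULT_PERM is a placeholder value, not
the default mask that security/keys/key.c actually computes.

```c
#include <stdint.h>

typedef uint32_t model_perm_t;

/* Sentinel meaning "caller has no preference"; mirrors KEY_PERM_UNDEF. */
#define MODEL_PERM_UNDEF	0xffffffffu

/* Placeholder default for this sketch only. */
#define MODEL_DEFAULT_PERM	0x19000000u

/* Use the caller's permissions unless the sentinel was passed. */
model_perm_t choose_perm(model_perm_t perm)
{
	return perm == MODEL_PERM_UNDEF ? MODEL_DEFAULT_PERM : perm;
}
```

Existing callers pass the sentinel and see no change in behaviour; callers
that need different permissions pass an explicit mask.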

Signed-off-by: Arun Raghavan <arunsr@cse.iitk.ac.in>
Acked-by: David Howells <dhowells@redhat.com>
---

include/linux/key.h | 3 +++
security/keys/key.c | 18 ++++++++++--------
security/keys/keyctl.c | 3 ++-
3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/key.h b/include/linux/key.h
index 163f864..8b0bd33 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -67,6 +67,8 @@ struct key;
#define KEY_OTH_SETATTR 0x00000020
#define KEY_OTH_ALL 0x0000003f

+#define KEY_PERM_UNDEF 0xffffffff
+
struct seq_file;
struct user_struct;
struct signal_struct;
@@ -232,6 +234,7 @@ extern key_ref_t key_create_or_update(key_ref_t keyring,
const char *description,
const void *payload,
size_t plen,
+ key_perm_t perm,
unsigned long flags);

extern int key_update(key_ref_t key,
diff --git a/security/keys/key.c b/security/keys/key.c
index 654d23b..d98c619 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -757,11 +757,11 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
const char *description,
const void *payload,
size_t plen,
+ key_perm_t perm,
unsigned long flags)
{
struct key_type *ktype;
struct key *keyring, *key = NULL;
- key_perm_t perm;
key_ref_t key_ref;
int ret;

@@ -806,15 +806,17 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
goto found_matching_key;
}

- /* decide on the permissions we want */
- perm = KEY_POS_VIEW | KEY_POS_SEARCH | KEY_POS_LINK | KEY_POS...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

Allow the callout data to be passed as a blob rather than a string for internal
kernel services that call any request_key_*() interface other than
request_key(). request_key() itself still takes a NUL-terminated string.

The functions that change are:

request_key_with_auxdata()
request_key_async()
request_key_async_with_auxdata()
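
Why a blob needs an explicit length can be seen in a few lines of userspace C.
The two helper functions are hypothetical and exist only for this
illustration; they are not the request_key_*() interfaces.

```c
#include <stddef.h>
#include <string.h>

/* String-style interface: the data ends at the first NUL byte. */
size_t copy_callout_string(char *dst, const char *callout_info)
{
	size_t n = strlen(callout_info);

	memcpy(dst, callout_info, n);
	return n;
}

/* Blob-style interface: the caller states exactly how many bytes to use. */
size_t copy_callout_blob(void *dst, const void *callout_info,
			 size_t callout_len)
{
	memcpy(dst, callout_info, callout_len);
	return callout_len;
}
```

For binary callout data containing a zero byte, the string form silently
stops early while the blob form carries all of it.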

Signed-off-by: David Howells <dhowells@redhat.com>
---

Documentation/keys-request-key.txt | 11 +++++---
Documentation/keys.txt | 14 +++++++---
include/linux/key.h | 9 ++++---
security/keys/internal.h | 9 ++++---
security/keys/keyctl.c | 7 ++++-
security/keys/request_key.c | 49 ++++++++++++++++++++++--------------
security/keys/request_key_auth.c | 12 +++++----
7 files changed, 70 insertions(+), 41 deletions(-)

diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt
index 266955d..09b55e4 100644
--- a/Documentation/keys-request-key.txt
+++ b/Documentation/keys-request-key.txt
@@ -11,26 +11,29 @@ request_key*():

struct key *request_key(const struct key_type *type,
const char *description,
- const char *callout_string);
+ const char *callout_info);

or:

struct key *request_key_with_auxdata(const struct key_type *type,
const char *description,
- const char *callout_string,
+ const char *callout_info,
+ size_t callout_len,
void *aux);

or:

struct key *request_key_async(const struct key_type *type,
const char *description,
- const char *callout_string);
+ const char *callout_info,
+ size_t callout_len);

or:

struct key *request_key_async_with_auxdata(const struct key_type *type,
const char *description,
- const char *callout_string,
+ const char *callout_info,
+ size_t callout_len,
void *aux);

Or by userspace invoking the request_key system ...

To: <torvalds@...>, <akpm@...>, <trond.myklebust@...>, <chuck.lever@...>
Cc: <nfsv4@...>, <linux-kernel@...>, <linux-fsdevel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Friday, March 28, 2008 - 10:30 am

Increase the size of a payload that can be used to instantiate a key in
add_key() and keyctl_instantiate_key(). This permits huge CIFS SPNEGO blobs to
be passed around. The limit is raised to 1MB. If kmalloc() can't allocate a
buffer of sufficient size, vmalloc() will be tried instead.
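
The allocation strategy can be modelled in userspace as below. This is a
sketch under stated assumptions: contig_alloc() and fallback_alloc() stand in
for kmalloc() and vmalloc(), both are backed by malloc(), and the 128KB
failure threshold is invented so that the fallback path can be exercised.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE		4096u
#define MODEL_MAX_PAYLOAD	(1024u * 1024u - 1u)
/* Assumption for the demo: "contiguous" allocations fail above this. */
#define MODEL_CONTIG_LIMIT	(128u * 1024u)

static void *contig_alloc(size_t n)
{
	return n <= MODEL_CONTIG_LIMIT ? malloc(n) : NULL;
}

static void contig_free(void *p)   { free(p); }
static void *fallback_alloc(size_t n) { return malloc(n); }
static void fallback_free(void *p) { free(p); }

struct payload_buf {
	void *data;
	bool vm;	/* true if 'data' came from the fallback allocator */
};

/* Returns 0 on success, -1 if the payload is too big or memory ran out. */
int payload_alloc(struct payload_buf *b, size_t plen)
{
	b->data = NULL;
	b->vm = false;
	if (plen > MODEL_MAX_PAYLOAD)
		return -1;

	b->data = contig_alloc(plen);
	if (!b->data) {
		/* small requests gain nothing from the fallback */
		if (plen <= MODEL_PAGE_SIZE)
			return -1;
		b->vm = true;
		b->data = fallback_alloc(plen);
		if (!b->data)
			return -1;
	}
	return 0;
}

/* The buffer must be released by the allocator that produced it. */
void payload_free(struct payload_buf *b)
{
	if (b->vm)
		fallback_free(b->data);
	else
		contig_free(b->data);
	b->data = NULL;
}
```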

Signed-off-by: David Howells <dhowells@redhat.com>
---

security/keys/keyctl.c | 38 ++++++++++++++++++++++++++++++--------
1 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index d9ca15c..8ec8432 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -19,6 +19,7 @@
#include <linux/capability.h>
#include <linux/string.h>
#include <linux/err.h>
+#include <linux/vmalloc.h>
#include <asm/uaccess.h>
#include "internal.h"

@@ -62,9 +63,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 	char type[32], *description;
 	void *payload;
 	long ret;
+	bool vm;

 	ret = -EINVAL;
-	if (plen > 32767)
+	if (plen > 1024 * 1024 - 1)
 		goto error;

 	/* draw all the data into kernel space */
@@ -81,11 +83,18 @@ asmlinkage long sys_add_key(const char __user *_type,
 	/* pull the payload in if one was supplied */
 	payload = NULL;

+	vm = false;
 	if (_payload) {
 		ret = -ENOMEM;
 		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload)
-			goto error2;
+		if (!payload) {
+			if (plen <= PAGE_SIZE)
+				goto error2;
+			vm = true;
+			payload = vmalloc(plen);
+			if (!payload)
+				goto error2;
+		}

 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -113,7 +122,10 @@ asmlinkage long sys_add_key(const char __user *_type,

 	key_ref_put(keyring_ref);
 error3:
-	kfree(payload);
+	if (!vm)
+		kfree(payload);
+	else
+		vfree(payload);
 error2:
 	kfree(description);
 error:
@@ -821,9 +833,10 @@ long keyctl_instantiate_key(key_serial_t id,
 	key_ref_t keyring_ref;
 	void *payload;
 	long ret;
+	bool vm = false;
...
