Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]

To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:38 pm

These patches add local caching for network filesystems such as NFS and AFS.

The patches can roughly be broken down into a number of sets:

  (*) 01-keys-inc-payload.diff
  (*) 02-keys-search-keyring.diff
  (*) 03-keys-callout-blob.diff

      Three patches to the keyring code made to help the CIFS people.
      Included because of patches 05-08.

  (*) 04-keys-get-label.diff

      A patch to allow the security label of a key to be retrieved.
      Included because of patches 05-08.

  (*) 05-security-current-fsugid.diff
  (*) 06-security-separate-task-bits.diff
  (*) 07-security-subjective.diff
  (*) 08-security-kernel-service.diff

      Patches to permit the subjective security of a task to be overridden.
      All the security details in task_struct are decanted into a new struct
      that task_struct then has two pointers to: one that defines the
      objective security of that task (how other tasks may affect it) and one
      that defines the subjective security (how it may affect other objects).

      Note that I have dropped the idea of struct cred for the moment.  With
      the amount of stuff that was excluded from it, it wasn't actually any
      use to me.  However, it can be added later.

      Required for cachefiles.

  (*) 09-release-page.diff
  (*) 10-fscache-page-flags.diff
  (*) 11-add_wait_queue_tail.diff
  (*) 12-fscache.diff

      Patches to provide a local caching facility for network filesystems.

  (*) 13-cachefiles-ia64.diff
  (*) 14-cachefiles-ext3-f_mapping.diff
  (*) 15-cachefiles-write.diff
  (*) 16-cachefiles-monitor.diff
  (*) 17-cachefiles-export.diff
  (*) 18-cachefiles.diff

      Patches to provide a local cache in a directory of an already mounted
      filesystem.

  (*) 19-fscache-nfs.diff
  (*) 20-fscache-nfs-mount.diff
  (*) 21-fscache-nfs-display.diff

      Patches to provide NFS with local caching.

  (*) 22-fcrypt-bit-annotate.diff

      A fix for AFS.

  (*) 23-afs-testsetpageerror.diff
  (*) 2...
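
The task-security split described for patches 05-08 above can be pictured with a small userspace model.  This is only a sketch under stated assumptions: the names task_security_model, task_model, sec, act_as and the helper functions are hypothetical and are not claimed to be the identifiers used in the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the security details decanted out of task_struct. */
struct task_security_model {
	unsigned int fsuid;	/* filesystem UID used for access checks */
	unsigned int fsgid;	/* filesystem GID used for access checks */
};

/* Hypothetical stand-in for task_struct with its two pointers. */
struct task_model {
	struct task_security_model *sec;	/* objective: how others may affect this task */
	struct task_security_model *act_as;	/* subjective: how this task affects objects */
};

/* Initially both pointers refer to the same security record. */
static void task_model_init(struct task_model *t, struct task_security_model *own)
{
	t->sec = own;
	t->act_as = own;
}

/* Override only the subjective side; return the old pointer for later restore. */
static struct task_security_model *
override_subjective(struct task_model *t, struct task_security_model *service)
{
	struct task_security_model *old = t->act_as;

	t->act_as = service;
	return old;
}

/* Put back the subjective security saved by override_subjective(). */
static void restore_subjective(struct task_model *t, struct task_security_model *old)
{
	t->act_as = old;
}
```

In this model a kernel service would call override_subjective() before touching cache files and restore_subjective() afterwards, while the objective pointer stays untouched throughout.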
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and
through it any attached caches.  The kAFS filesystem will use caching
automatically if it's available.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/Kconfig         |    8 +
 fs/afs/Makefile    |    3 
 fs/afs/cache.c     |  505 ++++++++++++++++++++++++++++++++++------------------
 fs/afs/cache.h     |   15 --
 fs/afs/cell.c      |   16 +-
 fs/afs/file.c      |  212 +++++++++++++---------
 fs/afs/inode.c     |   26 +--
 fs/afs/internal.h  |   53 ++---
 fs/afs/main.c      |   27 +--
 fs/afs/mntpt.c     |    4 
 fs/afs/vlocation.c |   23 +-
 fs/afs/volume.c    |   14 -
 fs/afs/write.c     |    6 -
 13 files changed, 537 insertions(+), 375 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 83d1227..7f3278f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -2120,6 +2120,14 @@ config AFS_DEBUG
 
 	  If unsure, say N.
 
+config AFS_FSCACHE
+	bool "Provide AFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on AFS_FS=m && FSCACHE || AFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want AFS data to be cached locally on disk through
+	  the generic filesystem cache manager
+
 config 9P_FS
 	tristate "Plan 9 Resource Sharing Support (9P2000) (Experimental)"
 	depends on INET && NET_9P && EXPERIMENTAL
diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index a666710..4f64b95 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -2,7 +2,10 @@
 # Makefile for Red Hat Linux AFS client.
 #
 
+afs-cache-$(CONFIG_AFS_FSCACHE) := cache.o
+
 kafs-objs := \
+	$(afs-cache-y) \
 	callback.o \
 	cell.o \
 	cmservice.o \
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index de0d7de..8e179a9 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
@@ -9,248 +9,399 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_cell_cache_match(void *target,
-						const...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

Implement shared-writable mmap for AFS.

The key with which to access the file is obtained from the VMA at the point
where the PTE is made writable by the page_mkwrite() VMA op and cached in the
affected page.

If there's an outstanding write on the page made with a different key, then
page_mkwrite() will flush it before attaching a record of the new key.
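
The key handling described above can be modelled in userspace as follows.  This is a minimal sketch, not the patch's code: key_model, page_model, flush_page_model() and page_mkwrite_model() are hypothetical names, and the "flush" is reduced to clearing state and bumping a counter.

```c
#include <assert.h>
#include <stddef.h>

struct key_model { int id; };	/* hypothetical stand-in for an access key */

struct page_model {
	const struct key_model *write_key;	/* key recorded for the pending write */
	int dirty;				/* non-zero if a write is outstanding */
	int flushes;				/* counts flushes, for illustration */
};

/* Pretend to write the page back using the key it was dirtied under. */
static void flush_page_model(struct page_model *pg)
{
	pg->dirty = 0;
	pg->write_key = NULL;
	pg->flushes++;
}

/*
 * Model of the page_mkwrite() step: called when the PTE is about to be made
 * writable, with the key obtained from the mapping (the VMA in the patch).
 */
static int page_mkwrite_model(struct page_model *pg, const struct key_model *vma_key)
{
	/* An outstanding write made with a different key is flushed first. */
	if (pg->dirty && pg->write_key != vma_key)
		flush_page_model(pg);

	pg->write_key = vma_key;	/* attach a record of the new key */
	pg->dirty = 1;
	return 0;
}
```

Calling page_mkwrite_model() twice with the same key leaves the flush counter alone; a call with a second key flushes once and then records the new key.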

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c     |   20 +++++++++++++++++++-
 fs/afs/internal.h |    1 +
 fs/afs/write.c    |   35 +++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 525f7c5..1323df4 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -22,6 +22,7 @@ static int afs_readpage(struct file *file, struct page *page);
 static void afs_invalidatepage(struct page *page, unsigned long offset);
 static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 static int afs_launder_page(struct page *page);
+static int afs_mmap(struct file *file, struct vm_area_struct *vma);
 
 const struct file_operations afs_file_operations = {
 	.open		= afs_open,
@@ -31,7 +32,7 @@ const struct file_operations afs_file_operations = {
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
 	.aio_write	= afs_file_write,
-	.mmap		= generic_file_readonly_mmap,
+	.mmap		= afs_mmap,
 	.splice_read	= generic_file_splice_read,
 	.fsync		= afs_fsync,
 	.lock		= afs_lock,
@@ -56,6 +57,11 @@ const struct address_space_operations afs_fs_aops = {
 	.writepages	= afs_writepages,
 };
 
+static struct vm_operations_struct afs_file_vm_ops = {
+	.fault		= filemap_fault,
+	.page_mkwrite	= afs_page_mkwrite,
+};
+
 /*
  * open an AFS file or directory and attach a key to it
  */
@@ -295,3 +301,15 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags)
 	_leave(" = 0");
 	return 0;
 }
+
+/*
+ * memory map part of an AFS file
+ */
+static int afs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

Save the operation ID to be used with a call that we're making for display
through /proc/net/rxrpc_calls.  This helps debugging stuck operations as we
then know what they are.
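
The diff below stores the operation ID with htonl(), so any comparison has to convert the constant too, and a display routine has to convert back.  A small userspace sketch of that convention follows; call_model, the helper names, the opcode value 130 and the hex output format are all assumptions made for illustration, not taken from the patch.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define OP_FETCH_EXAMPLE 130u	/* hypothetical opcode value, illustration only */

struct call_model {
	uint32_t operation_ID;	/* kept in network byte order */
};

/* Store the opcode in network byte order, as the patch does with htonl(). */
static void call_set_op(struct call_model *call, uint32_t op)
{
	call->operation_ID = htonl(op);
}

/* Comparisons convert the constant rather than the stored field. */
static int call_is_op(const struct call_model *call, uint32_t op)
{
	return call->operation_ID == htonl(op);
}

/* A display routine converts back to host order before formatting. */
static int call_format_op(const struct call_model *call, char *buf, size_t len)
{
	return snprintf(buf, len, "%08x", (unsigned int)ntohl(call->operation_ID));
}
```

Keeping the field in one byte order everywhere avoids the mixed comparisons that the FSFETCHDATA64 hunk corrects.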

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/fsclient.c       |   32 +++++++++++++++++++++++---------
 fs/afs/rxrpc.c          |    1 +
 fs/afs/vlclient.c       |    2 ++
 include/net/af_rxrpc.h  |    1 +
 net/rxrpc/af_rxrpc.c    |    3 +++
 net/rxrpc/ar-internal.h |    1 +
 net/rxrpc/ar-proc.c     |    7 ++++---
 7 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 04584c0..a468f2d 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -287,6 +287,7 @@ int afs_fs_fetch_file_status(struct afs_server *server,
 	call->reply2 = volsync;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSFETCHSTATUS);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -316,7 +317,7 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 	case 0:
 		call->offset = 0;
 		call->unmarshall++;
-		if (call->operation_ID != FSFETCHDATA64) {
+		if (call->operation_ID != htonl(FSFETCHDATA64)) {
 			call->unmarshall++;
 			goto no_msw;
 		}
@@ -464,7 +465,7 @@ static int afs_fs_fetch_data64(struct afs_server *server,
 	call->reply3 = buffer;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
-	call->operation_ID = FSFETCHDATA64;
+	call->operation_ID = htonl(FSFETCHDATA64);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -509,7 +510,7 @@ int afs_fs_fetch_data(struct afs_server *server,
 	call->reply3 = buffer;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
-	call->operation_ID = FSFETCHDATA;
+	call->operation_ID = htonl(FSFETCHDATA);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -577,6 +578,7 @@ int afs_fs_give_up_callbacks(struct afs_server *ser...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

Improve the handling of the case of a server rejecting an attempt to write back
a cached write.  AFS operates a write-back cache, so the following sequence of
events can theoretically occur:

	CLIENT 1		CLIENT 2
	=======================	=======================
	cat data >/the/file
	 (sits in pagecache)
				fs setacl -dir /the/dir/of/the/file \
					-acl system:administrators rlidka
				 (write permission removed for client 1)
	sync
	 (writeback attempt fails)

The way AFS attempts to handle this is:

 (1) The affected region will be excised and discarded on the basis that it
     can't be written back, yet we don't want it lurking in the page cache
     either.  The contents of the affected region will be reread from the
     server when called for again.

 (2) The EOF size will be set to the current server-based file size - usually
     that which it was before the affected write was made - assuming no
     conflicting write has been appended, and assuming the affected write
     extended the file.
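
The two recovery steps above can be illustrated with a toy model of a file's cached pages.  This is a sketch only: file_model, NPAGES and handle_rejected_write() are hypothetical, and real page-cache excision involves locking and PTE teardown that are not modelled here.

```c
#include <assert.h>

#define NPAGES 8

struct file_model {
	int cached[NPAGES];	/* 1 if the page is present in the page cache */
	int dirty[NPAGES];	/* 1 if the page holds data not yet written back */
	long i_size;		/* locally believed EOF */
};

/*
 * Model of the recovery: discard pages first..last (inclusive) and fall back
 * to the file size the server reports.  first is assumed to be >= 0.
 */
static void handle_rejected_write(struct file_model *f, int first, int last,
				  long server_size)
{
	for (int i = first; i <= last && i < NPAGES; i++) {
		f->dirty[i] = 0;	/* the data can never be written back */
		f->cached[i] = 0;	/* so drop it; it is reread on demand */
	}
	f->i_size = server_size;	/* step (2): EOF follows the server */
}
```

Pages outside the rejected range keep their cached and dirty state in this model.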


This patch makes the following changes:

 (1) Zero-length short reads don't produce EBADMSG now just because the OpenAFS
     server puts a silly value as the size of the returned data.  This prevents
     excised pages beyond the revised EOF being reinstantiated with a surprise
     PG_error.

 (2) Writebacks can now be put into a 'rejected' state in which all further
     attempts to write them back will result in excision of the affected pages
     instead.

 (3) Preparing a page for overwriting now reads the whole page instead of just
     those parts of it that aren't to be covered by the copy to be made.  This
     handles the possibility that the copy might fail on EFAULT.  Corollary to
     this, PG_update can now be set by afs_prepare_page() on behalf of
     afs_prepare_write() rather than setting it in afs_commit_write().

 (4) In the case of a conflicting write, afs_prepare_write() will attempt to
     flush the write to the server, and will then wait for P...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

Add a function - cancel_rejected_write() - to excise a rejected write from the
pagecache.  This function is related to the truncation family of routines.  It
permits the pages modified by a network filesystem client (such as AFS) to be
excised and discarded from the pagecache if the attempt to write them back to
the server fails.

The dirty and writeback states of the afflicted pages are cancelled and the
pages themselves are detached for recycling.  All PTEs referring to those
pages are removed.

Note that the locking is tricky as it's very easy to deadlock against
truncate() and other routines once the pages have been unlocked as part of the
writeback process.  To this end, the PG_error flag is set, then the
PG_writeback flag is cleared, and only *then* can lock_page() be called.
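
The ordering described above (set the error flag, clear the writeback flag, only then take the page lock) can be sketched with C11 atomics.  This is a userspace model under stated assumptions: the PGM_* bits, page_model and both functions are hypothetical, and the lock is a busy-wait try-lock where the kernel would sleep in lock_page().

```c
#include <assert.h>
#include <stdatomic.h>

enum {
	PGM_ERROR	= 1 << 0,	/* models PG_error */
	PGM_WRITEBACK	= 1 << 1,	/* models PG_writeback */
	PGM_LOCKED	= 1 << 2,	/* models PG_locked */
};

struct page_model { atomic_uint flags; };

/* Try to take the page lock; returns 1 on success, 0 if already held. */
static int trylock_page_model(struct page_model *pg)
{
	return !(atomic_fetch_or(&pg->flags, PGM_LOCKED) & PGM_LOCKED);
}

/* Begin excising a rejected page, following the order given in the text. */
static unsigned int begin_excision(struct page_model *pg)
{
	/* 1: mark the page as rejected before anyone else can look at it */
	atomic_fetch_or(&pg->flags, PGM_ERROR);
	/* 2: end writeback so that waiters on writeback can make progress */
	atomic_fetch_and(&pg->flags, ~(unsigned int)PGM_WRITEBACK);
	/* 3: only now take the page lock (busy-wait in this model) */
	while (!trylock_page_model(pg))
		;
	return atomic_load(&pg->flags);
}
```

Taking the lock last means a concurrent truncate that holds the page lock while waiting for writeback to finish is not blocked by the excision path.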

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/mm.h |    5 ++-
 mm/truncate.c      |   83 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 520238c..438270f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1005,12 +1005,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
-/* filemap.c */
-extern unsigned long page_unuse(struct page *);
+/* truncate.c */
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
 
+/* filemap.c */
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index 5b7d1c5..95fc1a8 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -465,3 +465,86 @@ int invalidate_inode_pages2(struct address_space *mapping)
 	return inval...
To: David Howells <dhowells@...>
Cc: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Friday, December 14, 2007 - 12:21 am

[Empty message]
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

Add a TestSetPageError() macro to the suite of page flag manipulators.  This
can be used by AFS to prevent over-excision of rejected writes from the page
cache.
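
One way a test-and-set primitive can prevent doing the same work twice is sketched below in userspace C.  The names page_model, test_set_error() and reject_page() are hypothetical, and the patch excerpt does not show how AFS actually uses the macro, so treat the excision counter as an illustration of the idea only.

```c
#include <assert.h>
#include <stdatomic.h>

#define PGM_ERROR_BIT 1u	/* models PG_error */

struct page_model {
	atomic_uint flags;
	int excisions;		/* how many times the page was excised */
};

/* Userspace analogue of test_and_set_bit(): returns the previous bit value. */
static int test_set_error(struct page_model *pg)
{
	return (atomic_fetch_or(&pg->flags, PGM_ERROR_BIT) & PGM_ERROR_BIT) != 0;
}

/* Only the caller that flips the bit from 0 to 1 performs the excision. */
static void reject_page(struct page_model *pg)
{
	if (!test_set_error(pg))
		pg->excisions++;
}
```

A second reject_page() call on the same page sees the bit already set and does nothing.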

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/page-flags.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fcc9e23..0350c37 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -130,6 +130,7 @@
 #define PageError(page)		test_bit(PG_error, &(page)->flags)
 #define SetPageError(page)	set_bit(PG_error, &(page)->flags)
 #define ClearPageError(page)	clear_bit(PG_error, &(page)->flags)
+#define TestSetPageError(page)	test_and_set_bit(PG_error, &(page)->flags)
 
 #define PageReferenced(page)	test_bit(PG_referenced, &(page)->flags)
 #define SetPageReferenced(page)	set_bit(PG_referenced, &(page)->flags)

--
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

 crypto/fcrypt.c |   88 ++++++++++++++++++++++++++++---------------------------
 1 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/crypto/fcrypt.c b/crypto/fcrypt.c
index d161949..a32cb68 100644
--- a/crypto/fcrypt.c
+++ b/crypto/fcrypt.c
@@ -51,7 +51,7 @@
 #define ROUNDS 16
 
 struct fcrypt_ctx {
-	u32 sched[ROUNDS];
+	__be32 sched[ROUNDS];
 };
 
 /* Rotate right two 32 bit numbers as a 56 bit number */
@@ -73,8 +73,8 @@ do {								\
  * /afs/transarc.com/public/afsps/afs.rel31b.export-src/rxkad/sboxes.h
  */
 #undef Z
-#define Z(x) __constant_be32_to_cpu(x << 3)
-static const u32 sbox0[256] = {
+#define Z(x) __constant_cpu_to_be32(x << 3)
+static const __be32 sbox0[256] = {
 	Z(0xea), Z(0x7f), Z(0xb2), Z(0x64), Z(0x9d), Z(0xb0), Z(0xd9), Z(0x11),
 	Z(0xcd), Z(0x86), Z(0x86), Z(0x91), Z(0x0a), Z(0xb2), Z(0x93), Z(0x06),
 	Z(0x0e), Z(0x06), Z(0xd2), Z(0x65), Z(0x73), Z(0xc5), Z(0x28), Z(0x60),
@@ -110,8 +110,8 @@ static const u32 sbox0[256] = {
 };
 
 #undef Z
-#define Z(x) __constant_be32_to_cpu((x << 27) | (x >> 5))
-static const u32 sbox1[256] = {
+#define Z(x) __constant_cpu_to_be32((x << 27) | (x >> 5))
+static const __be32 sbox1[256] = {
 	Z(0x77), Z(0x14), Z(0xa6), Z(0xfe), Z(0xb2), Z(0x5e), Z(0x8c), Z(0x3e),
 	Z(0x67), Z(0x6c), Z(0xa1), Z(0x0d), Z(0xc2), Z(0xa2), Z(0xc1), Z(0x85),
 	Z(0x6c), Z(0x7b), Z(0x67), Z(0xc6), Z(0x23), Z(0xe3), Z(0xf2), Z(0x89),
@@ -147,8 +147,8 @@ static const u32 sbox1[256] = {
 };
 
 #undef Z
-#define Z(x) __constant_be32_to_cpu(x << 11)
-static const u32 sbox2[256] = {
+#define Z(x) __constant_cpu_to_be32(x << 11)
+static const __be32 sbox2[256] = {
 	Z(0xf0), Z(0x37), Z(0x24), Z(0x53), Z(0x2a), Z(0x03), Z(0x83), Z(0x86),
 	Z(0xd1), Z(0xec), Z(0x50), Z(0xf0), Z(0x42), Z(0x78), Z(0x2f), Z(0x6d),
 	Z(0xbf), Z(0x80), Z(0x87), Z(0x27), Z(0x95), Z(0xe2), Z(0xc5), Z(0x5d),
@@ -184,8 +184,8 @@ static const u32 ...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/client.c  |    7 ++++---
 fs/nfs/fscache.h |   15 +++++++++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index be38c3c..91ecea3 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1335,7 +1335,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 
 	/* display header on line 1 */
 	if (v == &nfs_volume_list) {
-		seq_puts(m, "NV SERVER   PORT DEV     FSID\n");
+		seq_puts(m, "NV SERVER   PORT DEV     FSID              FSC\n");
 		return 0;
 	}
 	/* display one transport per line on subsequent lines */
@@ -1349,12 +1349,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 		 (unsigned long long) server->fsid.major,
 		 (unsigned long long) server->fsid.minor);
 
-	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s\n",
+	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s %s\n",
 		   clp->cl_nfsversion,
 		   NIPQUAD(clp->cl_addr.sin_addr),
 		   ntohs(clp->cl_addr.sin_port),
 		   dev,
-		   fsid);
+		   fsid,
+		   nfs_server_fscache_state(server));
 
 	return 0;
 }
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 144fb58..9a735fc 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -53,6 +53,17 @@ extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
 extern int nfs_fscache_release_page(struct page *, gfp_t);
 
 /*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+	if (server->nfs_client->fscache &&
+	    (server->options & NFS_OPTION_FSCACHE))
+		return "yes";
+	return "no ";
+}
+
+/*
  * release the caching state associated with a page if undergoing complete page
  * invalidation
  */
@@ -109,6 +120,10 @@ static inline void nfs4_fscache_get_client_cookie(struct nfs_client *...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:40 pm

Changes to the kernel configuration definitions and to the NFS mount options to
allow the local caching support added by the previous patch to be enabled.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/Kconfig        |    8 ++++++++
 fs/nfs/client.c   |    2 ++
 fs/nfs/internal.h |    1 +
 fs/nfs/super.c    |   14 ++++++++++++++
 4 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 215b0d6..83d1227 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1650,6 +1650,14 @@ config NFS_V4
 
 	  If unsure, say N.
 
+config NFS_FSCACHE
+	bool "Provide NFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want NFS data to be cached locally on disc through
+	  the general filesystem cache manager
+
 config NFS_DIRECTIO
 	bool "Allow direct I/O on NFS files"
 	depends on NFS_FS
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index acb2179..be38c3c 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -575,6 +575,7 @@ static int nfs_init_server(struct nfs_server *server,
 
 	/* Initialise the client representation from the mount data */
 	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
+	server->options = data->options;
 
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
@@ -931,6 +932,7 @@ static int nfs4_init_server(struct nfs_server *server,
 	/* Initialise the client representation from the mount data */
 	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
 	server->caps |= NFS_CAP_ATOMIC_OPEN;
+	server->options = data->options;
 
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f3acf48..ef09e00 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -35,6 +35,7 @@ struct nfs_parsed_mount_data {
 	int			acregmin, acregmax,
 				acdirmin, acdi...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

The attached patch makes it possible for the NFS filesystem to make use of the
network filesystem local caching service (FS-Cache).

To be able to use this, an updated mount program is required.  This can be
obtained from:

	http://people.redhat.com/steved/fscache/util-linux/

To mount an NFS filesystem to use caching, add an "fsc" option to the mount:

	mount warthog:/ /a -o fsc

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/Makefile           |    1 
 fs/nfs/client.c           |    5 +
 fs/nfs/file.c             |   37 ++++
 fs/nfs/fscache-def.c      |  289 +++++++++++++++++++++++++++++++++
 fs/nfs/fscache.c          |  391 +++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/fscache.h          |  148 +++++++++++++++++
 fs/nfs/inode.c            |   47 +++++
 fs/nfs/read.c             |   28 +++
 fs/nfs/super.c            |    3 
 fs/nfs/sysctl.c           |    1 
 include/linux/nfs_fs.h    |    9 +
 include/linux/nfs_fs_sb.h |   18 ++
 12 files changed, 968 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index df0f41e..073d04c 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4)	+= nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
 			   nfs4namespace.o
 nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 70587f3..acb2179 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -43,6 +43,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_CLIENT
 
@@ -139,6 +140,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname,
 	clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED;
 #endif
 
+	nfs_fscache_get_client_cookie(clp);
+
 	return clp;
 
 error_3:
@@ -170,6 +173,8 @@ static void nfs_free_client(struct nfs_client *clp)
 
 	nfs4_shutdown_client(clp);
...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.


CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling.  This is called cachefilesd and lives in
/sbin.  The source for the daemon can be downloaded from:

	http://people.redhat.com/~dhowells/cachefs/cachefilesd.c

And an example configuration from:

	http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf

The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services.  Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.

CacheFiles creates a proc-file - "/proc/fs/cachefiles" - that is used for
communication with the daemon.  Only one thing may have this open at once, and
whilst it is open, a cache is at least partially in existence.  The daemon
opens this and sends commands down it to control the cache.

CacheFiles is currently limited to a single cache.

CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section.  This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.
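
The free-space policy described above can be sketched in userspace with statvfs().  This is an illustration under stated assumptions: free_space_percent() and should_cull() are hypothetical helpers, the threshold is a parameter invented for the example, and nothing here is claimed to match how cachefilesd or the kernel module computes its limits.

```c
#include <assert.h>
#include <sys/statvfs.h>

/* Percentage of blocks available to unprivileged users, or -1 on error. */
static int free_space_percent(const char *path)
{
	struct statvfs st;

	if (statvfs(path, &st) != 0 || st.f_blocks == 0)
		return -1;
	return (int)((unsigned long long)st.f_bavail * 100 / st.f_blocks);
}

/*
 * Decide whether the cache should shrink: cull when free space has dropped
 * below the configured minimum.  An unknown value (-1) never triggers culling.
 */
static int should_cull(int free_percent, int min_free_percent)
{
	return free_percent >= 0 && free_percent < min_free_percent;
}
```

For example, should_cull(free_space_percent("/var/cache"), 7) would request culling once less than 7% of the filesystem is free; both the path and the 7% figure are made up for this example.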


============
REQUIREMENTS
============

The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:

	- dnotify.

	- extended attributes (xattrs).

	- openat() and friends.

	- bmap() support on files in the filesystem (FIBMAP ioctl).

	- The use of bmap() to detect a partial page at the end of the file.

It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.


=============
C...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

Export a number of functions for CacheFiles's use.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/super.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index ceaf2e3..cd199ae 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -266,6 +266,7 @@ int fsync_super(struct super_block *sb)
 	__fsync_super(sb);
 	return sync_blockdev(sb->s_bdev);
 }
+EXPORT_SYMBOL_GPL(fsync_super);
 
 /**
  *	generic_shutdown_super	-	common helper for ->kill_sb()

--
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.
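
The monitoring idea can be shown with a simplified userspace model of a wait queue whose entries carry a callback.  Everything here is an assumption made for illustration: waiter_model, page_model and the function names are hypothetical, there is no locking, and waiters are dropped from the queue once called, which is a simplification of real wait queue behaviour.

```c
#include <assert.h>
#include <stddef.h>

struct waiter_model {
	void (*func)(struct waiter_model *w);	/* called when the page is unlocked */
	struct waiter_model *next;
	int fired;				/* set by the example callback */
};

struct page_model {
	int locked;
	struct waiter_model *waiters;		/* singly linked wait queue */
};

/* Add an arbitrary waiter to the page's wait queue. */
static void add_page_waiter(struct page_model *pg, struct waiter_model *w)
{
	w->next = pg->waiters;
	pg->waiters = w;
}

/* Unlock the page and run every queued callback once. */
static void unlock_page_model(struct page_model *pg)
{
	struct waiter_model *w = pg->waiters;

	pg->locked = 0;
	pg->waiters = NULL;
	while (w) {
		struct waiter_model *next = w->next;

		w->func(w);
		w = next;
	}
}

/* Example monitor: note that the backing page's read has completed. */
static void read_done(struct waiter_model *w)
{
	w->fired = 1;
}
```

In the CacheFiles case the callback would be the point at which the data is copied from the backing page to the waiting netfs page.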

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/pagemap.h |    5 +++++
 mm/filemap.c            |   18 ++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6a1b317..21c35e2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -223,6 +223,11 @@ static inline void wait_on_page_fscache_write(struct page *page)
 extern void end_page_fscache_write(struct page *page);
 
 /*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
diff --git a/mm/filemap.c b/mm/filemap.c
index bea1ba6..6872d1b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -521,6 +521,24 @@ void fastcall wait_on_page_bit(struct page *page, int bit_nr)
 EXPORT_SYMBOL(wait_on_page_bit);
 
 /**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page - Page defining the wait queue of interest
+ * @waiter - Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+	wait_queue_head_t *q = page_waitqueue(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue(q, waiter);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
  * unlock_page - unlock a locked page
  * @page: the page
  *

--
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised).  The data source is a single page.

This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.

Supply a generic implementation for this that uses the write_begin() and
write_end() address_space operations to bind a copy directly into the page
cache.

Hook the Ext2 and Ext3 operations to the generic implementation.
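
The shape of the generic implementation described above (begin the write, copy one whole page, end the write) can be modelled in userspace as follows.  This is a sketch only: mapping_model, aops_model and the simplified write_begin/write_end signatures are hypothetical and much narrower than the kernel's address_space operations.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SZ   4096
#define MAX_PAGES 4

struct mapping_model;

/* Hypothetical, cut-down analogue of the write_begin/write_end pair. */
struct aops_model {
	int (*write_begin)(struct mapping_model *m, unsigned long index,
			   unsigned char **dst);
	int (*write_end)(struct mapping_model *m, unsigned long index,
			 unsigned copied);
};

struct mapping_model {
	const struct aops_model *a_ops;
	unsigned char pages[MAX_PAGES][PAGE_SZ];	/* toy page cache */
	long size;					/* file size in bytes */
};

/* Hand out the destination page, or fail if it is out of range. */
static int model_write_begin(struct mapping_model *m, unsigned long index,
			     unsigned char **dst)
{
	if (index >= MAX_PAGES)
		return -1;
	*dst = m->pages[index];
	return 0;
}

/* Commit the copy and extend the file size if needed. */
static int model_write_end(struct mapping_model *m, unsigned long index,
			   unsigned copied)
{
	long end = (long)(index * PAGE_SZ + copied);

	if (end > m->size)
		m->size = end;
	return (int)copied;
}

static const struct aops_model model_aops = {
	.write_begin	= model_write_begin,
	.write_end	= model_write_end,
};

/* Write exactly one page-aligned page from src; returns 0 on success. */
static int write_one_page_model(struct mapping_model *m, unsigned long index,
				const unsigned char *src)
{
	unsigned char *dst;
	int ret = m->a_ops->write_begin(m, index, &dst);

	if (ret < 0)
		return ret;
	memcpy(dst, src, PAGE_SZ);
	ret = m->a_ops->write_end(m, index, PAGE_SZ);
	return ret == PAGE_SZ ? 0 : -1;
}
```

Because the source and destination are both whole, aligned pages, the model needs no partial-page handling, which is the optimisation opportunity the description mentions.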

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/ext2/inode.c    |    2 ++
 fs/ext3/inode.c    |    3 +++
 include/linux/fs.h |    7 ++++++
 mm/filemap.c       |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b1ab32a..cfa56e6 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -796,6 +796,7 @@ const struct address_space_operations ext2_aops = {
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
+	.write_one_page		= generic_file_buffered_write_one_page,
 };
 
 const struct address_space_operations ext2_aops_xip = {
@@ -814,6 +815,7 @@ const struct address_space_operations ext2_nobh_aops = {
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
+	.write_one_page		= generic_file_buffered_write_one_page,
 };
 
 /*
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index bc918d3..435c684 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1780,6 +1780,7 @@ static const struct address_space_operations ext3_ordered_aops = {
 	.releasepage	= ext3_releasepage,
 	.direct_IO	= ext3_direct_IO,
 	.migratepage	= buffer_migrate_page,
+	.write_one_page	= generic_file_buffered_write_one_page,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
@@ -1794,6 +1795,7 @@ static const struct address_space_operations e...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

Change all the usages of file->f_mapping in ext3_*write_end() functions to use
the mapping argument directly.  This has two consequences:

 (*) Consistency.  Without this patch sometimes one is used and sometimes the
     other is.

 (*) A NULL file pointer can be passed.  This feature is then made use of by
     the generic hook in the next patch, which is used by CacheFiles to write
     pages to a file without setting up a file struct.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/ext3/inode.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 9b162cd..bc918d3 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1227,7 +1227,7 @@ static int ext3_generic_write_end(struct file *file,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata)
 {
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
@@ -1252,7 +1252,7 @@ static int ext3_ordered_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	unsigned from, to;
 	int ret = 0, ret2;
 
@@ -1293,7 +1293,7 @@ static int ext3_writeback_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 	loff_t new_i_size;
 

--
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

This one-line patch fixes the missing export of copy_page introduced
by the cachefile patches.  This patch is not yet upstream, but is required
for cachefile on ia64.  It will be pushed upstream when cachefile goes
upstream.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/ia64/kernel/ia64_ksyms.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/kernel/ia64_ksyms.c b/arch/ia64/kernel/ia64_ksyms.c
index bd17190..20c3546 100644
--- a/arch/ia64/kernel/ia64_ksyms.c
+++ b/arch/ia64/kernel/ia64_ksyms.c
@@ -43,6 +43,7 @@ EXPORT_SYMBOL(__do_clear_user);
 EXPORT_SYMBOL(__strlen_user);
 EXPORT_SYMBOL(__strncpy_from_user);
 EXPORT_SYMBOL(__strnlen_user);
+EXPORT_SYMBOL(copy_page);
 
 /* from arch/ia64/lib */
 extern void __divsi3(void);

--
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

The attached patch adds a generic intermediary (FS-Cache) by which filesystems
may call on local caching capabilities, and by which local caching backends may
make caches available:

	+---------+
	|         |                        +--------------+
	|   NFS   |--+                     |              |
	|         |  |                 +-->|   CacheFS    |
	+---------+  |   +----------+  |   |  /dev/hda5   |
	             |   |          |  |   +--------------+
	+---------+  +-->|          |  |
	|         |      |          |--+
	|   AFS   |----->| FS-Cache |
	|         |      |          |--+
	+---------+  +-->|          |  |
	             |   |          |  |   +--------------+
	+---------+  |   +----------+  |   |              |
	|         |  |                 +-->|  CacheFiles  |
	|  ISOFS  |--+                     |  /var/cache  |
	|         |                        +--------------+
	+---------+

The patch also documents the netfs interface and the cache backend
interface provided by the facility.


There are a number of reasons why I'm not using i_mapping to do this.
These have been discussed a lot on the LKML and CacheFS mailing lists,
but to summarise the basics:

 (1) Most filesystems don't do hole reportage.  Holes in files are treated as
     blocks of zeros and can't be distinguished otherwise, making it difficult
     to distinguish blocks that have been read from the network and cached from
     those that haven't.

 (2) The backing inode must be fully populated before being exposed to
     userspace through the main inode because the VM/VFS goes directly to the
     backing inode and does not interrogate the front inode on VM ops.

     Therefore:

     (a) The backing inode must fit entirely within the cache.

     (b) All backed files currently open must fit entirely within the cache at
     	 the same time.

     (c) A working set of files in total larger than the cache may not be
     	 cached.

     (d) A file may not grow larger than the...
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

Provide an add_wait_queue_tail() function to add a waiter to the back of a
wait queue instead of the front.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/wait.h |    2 ++
 kernel/wait.c        |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 0e68628..f1038d0 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -118,6 +118,8 @@ static inline int waitqueue_active(wait_queue_head_t *q)
 #define is_sync_wait(wait)	(!(wait) || ((wait)->private))
 
 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_tail(wait_queue_head_t *q,
+					 wait_queue_t *wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 
diff --git a/kernel/wait.c b/kernel/wait.c
index 444ddbf..7acc9cc 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -29,6 +29,24 @@ void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 }
 EXPORT_SYMBOL(add_wait_queue);
 
+/**
+ * add_wait_queue_tail - Add a waiter to the back of a waitqueue
+ * @q: the wait queue to append the waiter to
+ * @wait: the waiter to be queued
+ *
+ * Add a waiter to the back of a waitqueue so that it gets woken up last.
+ */
+void fastcall add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	unsigned long flags;
+
+	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue_tail(q, wait);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(add_wait_queue_tail);
+
 void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;
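
To make the ordering concrete, here is a small userspace model of front versus back insertion. It is only a sketch: the demo_* types are hypothetical stand-ins for wait_queue_t and wait_queue_head_t, there is no locking, and it assumes (as the kerneldoc comment above implies) that wake-ups walk the queue from the front.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for wait_queue_t / wait_queue_head_t. */
struct demo_waiter {
	int id;
	struct demo_waiter *next;
};

struct demo_wait_queue {
	struct demo_waiter *first;
	struct demo_waiter *last;
};

static void demo_queue_init(struct demo_wait_queue *q)
{
	assert(q);
	q->first = q->last = NULL;
}

/* Models add_wait_queue(): the new waiter goes to the front. */
static void demo_add_head(struct demo_wait_queue *q, struct demo_waiter *w)
{
	assert(q && w);
	w->next = q->first;
	q->first = w;
	if (!q->last)
		q->last = w;
}

/* Models add_wait_queue_tail(): the new waiter goes to the back. */
static void demo_add_tail(struct demo_wait_queue *q, struct demo_waiter *w)
{
	assert(q && w);
	w->next = NULL;
	if (q->last)
		q->last->next = w;
	else
		q->first = w;
	q->last = w;
}

/* Walk from the front, recording ids in wake-up order; returns the count. */
static size_t demo_wake_order(const struct demo_wait_queue *q,
			      int *ids, size_t max)
{
	size_t n = 0;

	for (const struct demo_waiter *w = q->first; w && n < max; w = w->next)
		ids[n++] = w->id;
	return n;
}
```

A waiter queued with demo_add_tail() stays behind every waiter queued with demo_add_head(), including ones added later, so it is visited last.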

--
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

Recruit a couple of page flags to aid in cache management.  The following extra
flags are defined:

 (1) PG_fscache (PG_owner_priv_2)

     The marked page is backed by a local cache and is pinning resources in the
     cache driver.

 (2) PG_fscache_write (PG_owner_priv_3)

     The marked page is being written to the local cache.  The page may not be
     modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that.  This includes things like truncation and page invalidation.
The function page_has_private() has been added to detect this.
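
The combined test can be pictured with the userspace sketch below. The bit positions and demo_* names are illustrative assumptions, not the values or definitions from page-flags.h, and the real helper in the patch may be written differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions only; not the values used in page-flags.h. */
#define DEMO_PG_PRIVATE		(1UL << 0)
#define DEMO_PG_FSCACHE		(1UL << 1)
#define DEMO_PG_FSCACHE_WRITE	(1UL << 2)

struct demo_page {
	unsigned long flags;
};

/*
 * True if the page carries either kind of attached state: filesystem
 * private data or a pin on resources in the local cache.
 */
static bool demo_page_has_private(const struct demo_page *page)
{
	assert(page);
	return (page->flags & (DEMO_PG_PRIVATE | DEMO_PG_FSCACHE)) != 0;
}
```

In this model a page with only the write-in-progress bit set does not count as having private state; only the two flags named in the test do.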

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/splice.c                |    2 +-
 include/linux/page-flags.h |   38 ++++++++++++++++++++++++++++++++++++--
 include/linux/pagemap.h    |   11 +++++++++++
 mm/filemap.c               |   16 ++++++++++++++++
 mm/migrate.c               |    2 +-
 mm/page_alloc.c            |    3 +++
 mm/readahead.c             |    9 +++++----
 mm/swap.c                  |    4 ++--
 mm/swap_state.c            |    4 ++--
 mm/truncate.c              |   10 +++++-----
 mm/vmscan.c                |    2 +-
 11 files changed, 83 insertions(+), 18 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6bdcb61..61edad7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 		 */
 		wait_on_page_writeback(page);
 
-		if (PagePrivate(page))
+		if (page_has_private(page))
 			try_to_release_page(page, GFP_KERNEL);
 
 		/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 209d3a4..fcc9e23 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -77,25 +77,30 @@
 #define PG_active		 6
 #define PG_slab			 7	/* slab debug (Suparna wants this) */
 
-#define PG_owner_priv_1		 8	/* Owner use. If pagecache, fs may use*/
+#define PG_owner_priv_1		 8	/* Owner use. fs may use in pagecache */
 #define PG_ar...
To: David Howells <dhowells@...>
Cc: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Friday, December 14, 2007 - 12:08 am

I'd much prefer if you would handle this in the filesystem, and have it
set PG_private whenever fscache needs to receive a callback, and DTRT
depending on whether PG_fscache etc. is set or not.

Also, this wait_on_page_fscache_write / end_page_fscache_write stuff
seems like it would belong in your fscache headers rather than generic
mm code (ditto for your PG_fscache checks in the page allocator -- you
--
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:39 pm

[Empty message]
To: David Howells <dhowells@...>
Cc: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Thursday, December 13, 2007 - 11:51 pm

This is pretty nasty. I would suggest either to have the function
return the number of pages that were added to pagecache, or just
--
To: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <sds@...>, <casey@...>
Cc: <linux-kernel@...>, <selinux@...>, <linux-security-module@...>, <dhowells@...>
Date: Wednesday, December 5, 2007 - 3:38 pm

Allow kernel services to override LSM settings appropriate to the actions
performed by a task by duplicating a security record, modifying it and then
using task_struct::act_as to point to it when performing operations on behalf
of a task.

This is used, for example, by CacheFiles which has to transparently access the
cache on behalf of a process that thinks it is doing, say, NFS accesses with a
potentially inappropriate (with respect to accessing the cache) set of
security data.

This patch provides two LSM hooks for modifying a task security record:

 (*) security_kernel_act_as() which allows modification of the security datum
     with which a task acts on other objects (most notably files).

 (*) security_create_files_as() which allows modification of the security
     datum that is used to initialise the security data on a file that a task
     creates.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/cred.h     |   22 ++++++++++++
 include/linux/security.h |   35 +++++++++++++++++++
 kernel/cred.c            |   86 ++++++++++++++++++++++++++++++++++++++++++++++
 security/dummy.c         |   15 ++++++++
 security/security.c      |   13 +++++++
 security/selinux/hooks.c |   45 ++++++++++++++++++++++++
 6 files changed, 216 insertions(+), 0 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
new file mode 100644
index 0000000..c9f8906
--- /dev/null
+++ b/include/linux/cred.h
@@ -0,0 +1,22 @@
+/* Credential management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CRED_H
+#define _LINUX_CRED_H
+
+struct task_security;
+struct inode;
+
+extern struct task_security *get_ke...
To: David Howells <dhowells@...>
Cc: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, December 10, 2007 - 12:46 pm

-- 
Stephen Smalley
National Security Agency

--
To: Stephen Smalley <sds@...>
Cc: <dhowells@...>, <viro@...>, <hch@...>, <Trond.Myklebust@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, December 10, 2007 - 1:07 pm

Cleared means what?  Setting to 0?  Or is there some other constant I should
use for that?

David
--
To: David Howells <dhowells@...>
Cc: <viro@...>, <hch@...>, <Trond.Myklebust@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, December 10, 2007 - 1:23 pm

Yes, setting to 0.

Otherwise, only other issue I have with this interface is it won't
generalize to dealing with nfsd, where we want to set the acting context
to a context we obtain from or determine based upon the client.

Why can't cachefilesd just push a context into the kernel and pass that
into the hook as the acting context, and then nfsd can do likewise using
the context provided by the client or obtained locally from exports for
ordinary clients?  Avoids the transition SID computation altogether
within the kernel and makes this more generic.

-- 
Stephen Smalley
National Security Agency

--
To: Stephen Smalley <sds@...>, <kmacmill@...>
Cc: <dhowells@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Wednesday, January 9, 2008 - 12:51 pm

Okay.  I can:

 (1) Have cachefilesd (the daemon) pass a security context string to the
     cachefiles kernel module, which can then convert it to a secID.  It'll
     require a security_secctx_to_secid() function, but I'm fairly certain I
     have a patch to add such kicking around somewhere.

 (2) Make security_task_kernel_act_as() take a task_security struct and a
     secID and just assign the latter to the former.  I'm not sure it makes
     sense to do any checks here, other than checking that under SELinux the
     secID is of SECCLASS_PROCESS class.

However, I need to write a check that the cachefilesd daemon is permitted to
nominate the secID it did.  Can someone tell me how to do this?  The obvious
way to do this is to add another PROCESS__xxx security permit specifically for
cachefiles, but that seems like a waste of a bit when there are only two spare
bits.

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_PROCESS, PROCESS__CACHEFILES_USE, NULL);

Now, I recall the addition of another security class being mentioned, which
presumably would give something like:

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_CACHE, CACHE__USE_AS_OVERRIDE, NULL);

And I assume this doesn't care if one, the other or both of the two SIDs
mentioned are of SECCLASS_PROCESS rather than of SECCLASS_CACHE.

David
--
To: David Howells <dhowells@...>
Cc: Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Wednesday, January 9, 2008 - 2:11 pm

Already planned for 2.6.25, see:
http://marc.info/?l=selinux&m=119973017423487&w=2


Right, the latter is reasonable.
Requires adding the class and permission definition to
policy/flask/security_classes and policy/flask/access_vectors and then
regenerating the kernel headers from those files, ala:
  svn co http://oss.tresys.com/repos/refpolicy/trunk refpolicy
  cd refpolicy/policy/flask
  vi security_classes access_vectors
  <add new class to end>
  make
  make LINUX_D=/path/to/linux-2.6 tokern
 
Dan knows how to do that.

-- 
Stephen Smalley
National Security Agency

--
To: Stephen Smalley <sds@...>
Cc: <dhowells@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, January 14, 2008 - 10:01 am

Okay...  It looks like I want four security operations/hooks for cachefiles:

 (1) Check that a daemon can nominate a secid for use by the kernel to override
     the process subjective secid.

 (2) Set the secid mentioned in (1).

 (3) Check that the kernel may create files as a particular secid (this could
     be specified indirectly by specifying an inode, which would hide the secid
     inside the LSM).

 (4) Set the fscreate secid mentioned in (3).

Now, it's possible to condense (1) and (2) into a single op, and condense (3)
and (4) into a single op.  That, however, might make the ops unusable by nfsd,
which may well want to bypass the checks or do them elsewhere.
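
To compare the two shapes, here is a userspace sketch of ops (1) and (2) in combined and split form. Everything in it is hypothetical: the demo_* names, the structure layout and the policy callback are assumptions for illustration and are not the LSM interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_task_security {
	uint32_t sid;		/* subjective secid used when acting on objects */
	uint32_t create_sid;	/* secid given to newly created files */
};

/* Hypothetical policy callback: may 'subject' nominate 'target'? */
typedef bool (*demo_check_fn)(uint32_t subject, uint32_t target);

/* Combined form of (1) and (2): check first, set only on success. */
static int demo_kernel_act_as(struct demo_task_security *tsec,
			      uint32_t daemon_sid, uint32_t secid,
			      demo_check_fn may_nominate)
{
	assert(tsec && may_nominate);
	if (!may_nominate(daemon_sid, secid))
		return -EACCES;
	tsec->sid = secid;
	return 0;
}

/* Split form of (2): set only; the caller is trusted to have checked. */
static void demo_set_act_as(struct demo_task_security *tsec, uint32_t secid)
{
	assert(tsec);
	tsec->sid = secid;
}

/* Example policy: only secid 2 may be nominated, and only by secid 1. */
static bool demo_policy(uint32_t subject, uint32_t target)
{
	return subject == 1 && target == 2;
}
```

The combined form leaves the record untouched on failure, whereas the split setter would let a caller such as nfsd perform its own check elsewhere or skip it.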

Any thoughts?

David
--
To: David Howells <dhowells@...>
Cc: Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, January 15, 2008 - 10:56 am

I don't think this check is on the kernel per se but rather the ability
of the daemon to nominate a secid for use on files created later by the

I think it is fine to combine them.

-- 
Stephen Smalley
National Security Agency

--
To: Stephen Smalley <sds@...>
Cc: <dhowells@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, January 15, 2008 - 12:03 pm

Hmmm...  At the moment the cachefiles module works out for itself what the
file label should be by looking at the root directory it was given and
assuming the label on that is what it's going to be using.  Are you suggesting
this should be specified directly instead by the daemon?

David
--
To: David Howells <dhowells@...>, Stephen Smalley <sds@...>
Cc: <dhowells@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, January 15, 2008 - 2:10 pm

Oh my. While there will be cases where the label of the file
will match the label of the containing directory, and in fact
for most label based LSMs that will usually be the case, you
certainly can't count on it. The only place that you can find
the correct label for a file with any confidence is from the
xattr (assuming the LSM uses xattrs) on the file itself. I can
imagine an LSM for which it would make sense to derive the
label from the root directory, but I know Smack isn't one of
them, and I don't think that SELinux is either, although I
would defer a definitive answer on that to Stephen.


Casey Schaufler
casey@schaufler-ca.com
--
To: <casey@...>
Cc: David Howells <dhowells@...>, Daniel J Walsh <dwalsh@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, January 15, 2008 - 3:15 pm

The cache files are created by the cachefiles kernel module, not by the
userspace daemon, and the userspace daemon doesn't need to directly
read/write them at all (but I think it does need to be able to unlink
them?).  The userspace daemon merely identifies the directory where the
cache should live as part of configuring the cache when enabling it.

Hence, it is fine to use a fixed label for the cache files (systemhigh
in a MLS world), and to let the directory's label serve as the basis for
it.  Only the cachefiles kernel module directly reads and writes the
files.
 
-- 
Stephen Smalley
National Security Agency

--
To: Stephen Smalley <sds@...>
Cc: <dhowells@...>, <casey@...>, Daniel J Walsh <dwalsh@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, January 15, 2008 - 5:55 pm

That is what I currently do.  SELinux rules are provided to grant the
appropriate file accesses to the override label used by the kernel module, so

Correct.
--
To: David Howells <dhowells@...>, Stephen Smalley <sds@...>
Cc: <dhowells@...>, <casey@...>, Daniel J Walsh <dwalsh@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, January 15, 2008 - 6:23 pm

Well, my bad, and thank you for clearing up my misunderstanding.


Casey Schaufler
casey@schaufler-ca.com
--
To: David Howells <dhowells@...>
Cc: Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, January 15, 2008 - 12:08 pm

No, just that however the secid is determined (whether indirectly via
specification of a directory or directly via specification of a secid),
the ability of the daemon to control what secid gets used ought to be
controlled.  Or, alternatively, the ability of the daemon to enable
caching in a given directory ought to be controlled.

-- 
Stephen Smalley
National Security Agency

--
To: David Howells <dhowells@...>, Stephen Smalley <sds@...>
Cc: <dhowells@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, January 14, 2008 - 10:52 am

Yes, and I would recommend doing so to avoid permission races.
You're going to have to deal with the case where step (2) fails
even if you have step (1), so the "test and set" mindset seems

Again, I don't think you're doing yourself any favors with a separate
test operation.

On (4) are you suggesting a third attribute value? There's the secid
of the task originally, the secid you're going to use to do the access

Let me see if I understand your current scheme.

You want a (object) secid that is used to access the task.
You want a (subject) secid that the task uses to access objects.
You want a (newobject) secid that an object gets on creation.
And you want them all to be distinct and settable.
Did I get that right?

Thank you.


Casey Schaufler
casey@schaufler-ca.com
--
To: <casey@...>
Cc: <dhowells@...>, Stephen Smalley <sds@...>, Daniel J Walsh <dwalsh@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, January 14, 2008 - 11:19 am

Looking at SELinux, that doesn't get rid of the permission race because there's
no locking.  This may be different for other models.

I was thinking of having steps (2) and (4) not do any checking, but rather
assume that the caller has done the checks before calling the set routines,
possibly by calling the hooks mentioned in (1) and (3).

My main problem is that I don't know how NFSd wants to do things.  I suppose


That's correct.  Let me summarise:

 (1) The daemon has an active process security ID (say A).  When the daemon
     nominates an override process security ID (say B) to be used by the
     kernel, the cachefiles module asks the LSM to check that A is allowed to
     nominate B for this purpose.

 (2) The cachefiles module is given a path under which its cache exists.  The
     directory at the base of this path has its own security ID (say C).
     cachefiles wants to create new files in the cache with the same security
     ID as that directory (ie. C).

     However, when cachefiles is creating files in the cache, the security of
     whatever process is doing the access will be overridden with B, so
     cachefiles asks the LSM to check that B is allowed to create files as C.

     Note that this is an instantaneous check in the cache startup stage.  This
     allows caching to be aborted early if the security policy does not permit
     B to create Cs.  Technically this check is superfluous as it's re-checked

That depends on what you mean.  cachefilesd (the daemon) will be run with a
security label because there's a security model in place.

I don't actually need to access the daemon, but the daemon does need to do

Correct.  This is used as an override by any task that accesses the cache
indirectly through the cachefiles module.

The cachefilesd daemon has its own secid with which it accesses the cache
directly.  The sets of permissions that must be granted by the module's
override subjective secid and by the daemon's subjective secid aren't

File and d...
To: <unlisted-recipients@...>, <@...>
Cc: <dhowells@...>, Stephen Smalley <sds@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, January 14, 2008 - 10:06 am

FYI, I added the following vectors:

	# kernel services that need to override task security
	class kernel_service
	{
		use_as_override
		create_files_as
	}

The first allows:

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_KERNEL_SERVICE,
		     KERNEL_SERVICE__USE_AS_OVERRIDE,
		     NULL);

And the second something like:

	avc_has_perm(tsec->sid, inode->sid,
		     SECCLASS_KERNEL_SERVICE,
		     KERNEL_SERVICE__CREATE_FILES_AS,
		     NULL);

Rather than specifically dedicating them to the cache, I made them general.
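
For readers unfamiliar with how such a check resolves, the toy model below treats the policy as a flat table of (source, target, class, permissions) rules. It is a sketch only: the demo_* names and numeric values are assumptions, and the real AVC caches decisions computed by the security server rather than scanning a table.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values; the real ones are generated from the policy sources. */
#define DEMO_CLASS_KERNEL_SERVICE	1
#define DEMO_PERM_USE_AS_OVERRIDE	(1u << 0)
#define DEMO_PERM_CREATE_FILES_AS	(1u << 1)

struct demo_rule {
	uint32_t ssid;		/* source (subject) secid */
	uint32_t tsid;		/* target secid */
	uint16_t tclass;	/* object class */
	uint32_t allowed;	/* bitmask of granted permissions */
};

/*
 * Return 0 if a single matching rule grants every requested
 * permission, otherwise -EACCES.
 */
static int demo_has_perm(const struct demo_rule *rules, size_t nrules,
			 uint32_t ssid, uint32_t tsid,
			 uint16_t tclass, uint32_t requested)
{
	assert(rules || nrules == 0);
	for (size_t i = 0; i < nrules; i++) {
		const struct demo_rule *r = &rules[i];

		if (r->ssid == ssid && r->tsid == tsid &&
		    r->tclass == tclass &&
		    (r->allowed & requested) == requested)
			return 0;
	}
	return -EACCES;
}
```

With one rule letting the daemon's secid use the nominated secid as an override, and another letting the nominated secid create files as the cache directory's secid, both startup checks pass and any other pairing is refused.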

David
--
To: David Howells <dhowells@...>
Cc: Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Tuesday, January 15, 2008 - 10:58 am

Make sure that you or Dan submits a policy patch to register these
classes and permissions in the policy when the kernel patch is queued
for merge.

-- 
Stephen Smalley
National Security Agency

--
To: Stephen Smalley <sds@...>
Cc: <dhowells@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Wednesday, January 23, 2008 - 4:52 pm

Do I just send the attached patch to <selinux@tycho.nsa.gov>?  Or do I need to
make a diff from a point in the tree nearer the root?  Is there anything else
I need to alter whilst I'm at it?

David
---
Index: policy/flask/security_classes
===================================================================
--- policy/flask/security_classes	(revision 2573)
+++ policy/flask/security_classes	(working copy)
@@ -109,4 +109,7 @@
 # network peer labels
 class peer
 
+# kernel services that need to override task security
+class kernel_service
+
 # FLASK
Index: policy/flask/access_vectors
===================================================================
--- policy/flask/access_vectors	(revision 2573)
+++ policy/flask/access_vectors	(working copy)
@@ -736,3 +736,10 @@
 {
 	recv
 }
+
+# kernel services that need to override task security
+class kernel_service
+{
+	use_as_override
+	create_files_as
+}
--
To: David Howells <dhowells@...>
Cc: Stephen Smalley <sds@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Wednesday, January 23, 2008 - 6:03 pm

-- 
James Morris
<jmorris@namei.org>
--
To: Stephen Smalley <sds@...>
Cc: <dhowells@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Wednesday, January 9, 2008 - 2:56 pm

Does this require rebuilding and updating all the SELinux rpms to know about
the new class?

David
--
To: David Howells <dhowells@...>
Cc: Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Wednesday, January 9, 2008 - 3:19 pm

Policy ultimately has to be updated in order to start writing allow
rules based on the new class/perm.  libselinux et al doesn't have to
change.

If you have a "SELinux:  policy loaded with handle_unknown=allow"
message in your /var/log/messages, then new classes/perms that are not
yet known to the policy will be allowed by default, so the operation
will be permitted by the kernel.
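
As a simplified picture of that behaviour, the sketch below models a loaded policy as a list of known class names plus an allow-unknown switch. The demo_* names are hypothetical and this is not how the kernel stores policy; it only shows the decision for a class the policy has never heard of.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct demo_policy {
	const char *const *known_classes;
	size_t nclasses;
	bool allow_unknown;	/* models handle_unknown=allow */
};

static bool demo_class_known(const struct demo_policy *p, const char *name)
{
	assert(p && name);
	for (size_t i = 0; i < p->nclasses; i++)
		if (strcmp(p->known_classes[i], name) == 0)
			return true;
	return false;
}

/*
 * A class unknown to the loaded policy is permitted only when
 * allow_unknown is set; a known class follows its rules, which are
 * represented here by the caller-supplied rule_allows.
 */
static bool demo_permitted(const struct demo_policy *p, const char *tclass,
			   bool rule_allows)
{
	if (!demo_class_known(p, tclass))
		return p->allow_unknown;
	return rule_allows;
}
```

Under this model a check against kernel_service succeeds on an old policy only while the allow-unknown switch is on; once the policy defines the class, explicit allow rules decide.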

-- 
Stephen Smalley
National Security Agency

--
To: Stephen Smalley <sds@...>
Cc: <dhowells@...>, Daniel J Walsh <dwalsh@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Thursday, January 10, 2008 - 7:09 am

I don't.  How do I set it?

David
--
To: <unlisted-recipients@...>, <@...>
Cc: <dhowells@...>, Stephen Smalley <sds@...>, <kmacmill@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Wednesday, January 9, 2008 - 1:27 pm

Hmmmm...  I can't see how to add a new security class.  I can see that
security classes are defined in various autogenerated header files, but
autogenerated from what?  The "This file is automatically generated.  Do not
edit." message at the top of these files seems to belie the fact they're
actually checked in to GIT as is.

David
--
To: Stephen Smalley <sds@...>, Karl MacMillan <kmacmill@...>
Cc: <dhowells@...>, <viro@...>, <hch@...>, <Trond.Myklebust@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, December 10, 2007 - 5:08 pm

Are you speaking of security_kernel_act_as() and security_create_files_as()
specifically?  Or the task_struct::act_as override pointer in general?

I don't really know how nfsd wants to obtain and set its LSM context, so it's
a bit difficult for me to make something that works for nfsd as well as

How does cachefilesd come up with such a context?  Grab it from
/etc/cachefilesd.conf?


I seem to remember that I was told that it should be done this way, possibly
by Karl MacMillan, but I don't remember exactly.

Now it's configured by cachefilesd.te:

	type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;

David
--
To: David Howells <dhowells@...>
Cc: Karl MacMillan <kmacmill@...>, <viro@...>, <hch@...>, <Trond.Myklebust@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, December 10, 2007 - 5:27 pm

It would get a context from the client or from a local configuration
that would map security-unaware clients to a default context, and then
want to assume that context for the particular operation.  No transition
the way in which dbusd imports contexts), or directly as a context
returned by a libselinux function.  Has to be done that way so that it
can be set differently for different policy types (strict, targeted,
mls).

Naturally, cachefiles (the kernel module) would invoke a security hook


It doesn't fit with how other users of security_kernel_act_as() will
likely want to work (they will want to just set the context to a
specified value, whether one obtained from the client or from some local
source), nor with how type transitions normally work (exec, with the
program type as the second type field).  I think it will just cause
confusion and subtle breakage.

-- 
Stephen Smalley
National Security Agency

--
To: Stephen Smalley <sds@...>, David Howells <dhowells@...>
Cc: Karl MacMillan <kmacmill@...>, <viro@...>, <hch@...>, <Trond.Myklebust@...>, <casey@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, December 10, 2007 - 6:26 pm

I would expect that the operation would be more sophisticated
than that. You certainly aren't going to use what comes from
the other side without any processing, and I expect you'll have
some sort of operation on anything you pull from a config file

Unless you've got an LSM other than SELinux, of course. If
cachefilesd is going to be responsible for maintaining this
magic context there needs to be an LSM interface for it, not

I think that I agree with Stephen, although I could be merely confused.
That happens to me when interfaces are described in SELinux terms. I
still don't care much for multiple contexts, and I don't have a good
grasp of how you'll deal with Smack, or any LSM other than SELinux.
Just as Stephen mentions, I also don't see the generality that a change
of this magnitude really ought to provide.



Casey Schaufler
casey@schaufler-ca.com
--
To: <casey@...>
Cc: <dhowells@...>, Stephen Smalley <sds@...>, Karl MacMillan <kmacmill@...>, <viro@...>, <hch@...>, <Trond.Myklebust@...>, <linux-kernel@...>, <selinux@...>, <linux-security-module@...>
Date: Monday, December 10, 2007 - 7:44 pm

Me neither.  I understand SELinux somewhat, though it's got a lot of wibbly
bits, and WinN