Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]

Previous thread: none

Next thread: [bug?] ALSA sound/core/info.c:852: BUG? (root) by Ingo Molnar on Wednesday, December 5, 2007 - 1:00 pm. (2 messages)
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

These patches add local caching for network filesystems such as NFS and AFS.

The patches can roughly be broken down into a number of sets:

  (*) 01-keys-inc-payload.diff
  (*) 02-keys-search-keyring.diff
  (*) 03-keys-callout-blob.diff

      Three patches to the keyring code made to help the CIFS people.
      Included because of patches 05-08.

  (*) 04-keys-get-label.diff

      A patch to allow the security label of a key to be retrieved.
      Included because of patches 05-08.

  (*) 05-security-current-fsugid.diff
  (*) 06-security-separate-task-bits.diff
  (*) 07-security-subjective.diff
  (*) 08-security-kernel-service.diff

      Patches to permit the subjective security of a task to be overridden.
      All the security details in task_struct are decanted into a new struct
      that task_struct then has two pointers two: one that defines the
      objective security of that task (how other tasks may affect it) and one
      that defines the subjective security (how it may affect other objects).

      Note that I have dropped the idea of struct cred for the moment.  With
      the amount of stuff that was excluded from it, it wasn't actually any
      use to me.  However, it can be added later.

      Required for cachefiles.

  (*) 09-release-page.diff
  (*) 10-fscache-page-flags.diff
  (*) 11-add_wait_queue_tail.diff
  (*) 12-fscache.diff

      Patches to provide a local caching facility for network filesystems.

  (*) 13-cachefiles-ia64.diff
  (*) 14-cachefiles-ext3-f_mapping.diff
  (*) 15-cachefiles-write.diff
  (*) 16-cachefiles-monitor.diff
  (*) 17-cachefiles-export.diff
  (*) 18-cachefiles.diff

      Patches to provide a local cache in a directory of an already mounted
      filesystem.

  (*) 19-fscache-nfs.diff
  (*) 20-fscache-nfs-mount.diff
  (*) 21-fscache-nfs-display.diff

      Patches to provide NFS with local caching.

  (*) 22-fcrypt-bit-annotate.diff

      A fix for AFS.

  (*) 23-afs-testsetpageerror.diff
  ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

Increase the size of a payload that can be used to instantiate a key in
add_key() and keyctl_instantiate_key().  This permits huge CIFS SPNEGO blobs to
be passed around.  The limit is raised to 1MB.  If kmalloc() can't allocate a
buffer of sufficient size, vmalloc() will be tried instead.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 security/keys/keyctl.c |   38 ++++++++++++++++++++++++++++++--------
 1 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index d9ca15c..8ec8432 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -19,6 +19,7 @@
 #include <linux/capability.h>
 #include <linux/string.h>
 #include <linux/err.h>
+#include <linux/vmalloc.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -62,9 +63,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 	char type[32], *description;
 	void *payload;
 	long ret;
+	bool vm;
 
 	ret = -EINVAL;
-	if (plen > 32767)
+	if (plen > 1024 * 1024 - 1)
 		goto error;
 
 	/* draw all the data into kernel space */
@@ -81,11 +83,18 @@ asmlinkage long sys_add_key(const char __user *_type,
 	/* pull the payload in if one was supplied */
 	payload = NULL;
 
+	vm = false;
 	if (_payload) {
 		ret = -ENOMEM;
 		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload)
-			goto error2;
+		if (!payload) {
+			if (plen <= PAGE_SIZE)
+				goto error2;
+			vm = true;
+			payload = vmalloc(plen);
+			if (!payload)
+				goto error2;
+		}
 
 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -113,7 +122,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 
 	key_ref_put(keyring_ref);
  error3:
-	kfree(payload);
+	if (!vm)
+		kfree(payload);
+	else
+		vfree(payload);
  error2:
 	kfree(description);
  error:
@@ -821,9 +833,10 @@ long keyctl_instantiate_key(key_serial_t id,
 	key_ref_t keyring_ref;
 	void *payload;
 	long ret;
+	bool vm = false;
 
 	ret = -EINVAL;
-	if (plen > ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

Check the starting keyring as part of the search to (a) see if that is what
we're searching for, and (b) to check it is still valid for searching.

The scenario:  User in process A does things that cause things to be
created in its process session keyring.  The user then does an su to
another user and starts a new process, B.  The two processes now
share the same process session keyring.

Process B does an NFS access which results in an upcall to gssd.
When gssd attempts to instantiate the context key (to be linked
into the process session keyring), it is denied access even though it
has an authorization key.

The order of calls is:

   keyctl_instantiate_key()
      lookup_user_key()				    (the default: case)
         search_process_keyrings(current)
	    search_process_keyrings(rka->context)   (recursive call)
	       keyring_search_aux()

keyring_search_aux() verifies the keys and keyrings underneath the
top-level keyring it is given, but that top-level keyring is neither
fully validated nor checked to see if it is the thing being searched for.

This patch changes keyring_search_aux() to:
1) do more validation on the top keyring it is given and
2) check whether that top-level keyring is the thing being searched for


Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 security/keys/keyring.c |   35 +++++++++++++++++++++++++++++++----
 1 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 88292e3..76b89b2 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -292,7 +292,7 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 
 	struct keyring_list *keylist;
 	struct timespec now;
-	unsigned long possessed;
+	unsigned long possessed, kflags;
 	struct key *keyring, *key;
 	key_ref_t key_ref;
 	long err;
@@ -318,6 +318,32 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 	now = current_kernel_time();
 	err = ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

Allow the callout data to be passed as a blob rather than a string for internal
kernel services that call any request_key_*() interface other than
request_key().  request_key() itself still takes a NUL-terminated string.

The functions that change are:

	request_key_with_auxdata()
	request_key_async()
	request_key_async_with_auxdata()

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/keys-request-key.txt |   11 +++++---
 Documentation/keys.txt             |   14 +++++++---
 include/linux/key.h                |    9 ++++---
 security/keys/internal.h           |    9 ++++---
 security/keys/keyctl.c             |    7 ++++-
 security/keys/request_key.c        |   49 ++++++++++++++++++++++--------------
 security/keys/request_key_auth.c   |   12 +++++----
 7 files changed, 70 insertions(+), 41 deletions(-)

diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt
index 266955d..09b55e4 100644
--- a/Documentation/keys-request-key.txt
+++ b/Documentation/keys-request-key.txt
@@ -11,26 +11,29 @@ request_key*():
 
 	struct key *request_key(const struct key_type *type,
 				const char *description,
-				const char *callout_string);
+				const char *callout_info);
 
 or:
 
 	struct key *request_key_with_auxdata(const struct key_type *type,
 					     const char *description,
-					     const char *callout_string,
+					     const char *callout_info,
+					     size_t callout_len,
 					     void *aux);
 
 or:
 
 	struct key *request_key_async(const struct key_type *type,
 				      const char *description,
-				      const char *callout_string);
+				      const char *callout_info,
+				      size_t callout_len);
 
 or:
 
 	struct key *request_key_async_with_auxdata(const struct key_type *type,
 						   const char *description,
-						   const char *callout_string,
+						   const char *callout_info,
+					     	   size_t callout_len,
 						   void *aux);
 
 Or by userspace invoking the request_key system ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

Add a keyctl() function to get the security label of a key.

The following is added to Documentation/keys.txt:

 (*) Get the LSM security context attached to a key.

	long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
		    size_t buflen)

     This function returns a string that represents the LSM security context
     attached to a key in the buffer provided.

     Unless there's an error, it always returns the amount of data it could
     produce, even if that's too big for the buffer, but it won't copy more
     than requested to userspace. If the buffer pointer is NULL then no copy
     will take place.

     A NUL character is included at the end of the string if the buffer is
     sufficiently big.  This is included in the returned count.  If no LSM is
     in force then an empty string will be returned.

     A process must have view permission on the key for this function to be
     successful.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/keys.txt   |   21 +++++++++++++++
 include/linux/keyctl.h   |    1 +
 include/linux/security.h |   20 +++++++++++++-
 security/dummy.c         |    8 ++++++
 security/keys/compat.c   |    3 ++
 security/keys/keyctl.c   |   66 ++++++++++++++++++++++++++++++++++++++++++++++
 security/security.c      |    5 +++
 security/selinux/hooks.c |   21 +++++++++++++--
 8 files changed, 141 insertions(+), 4 deletions(-)

diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index b82d38d..be424b0 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -711,6 +711,27 @@ The keyctl syscall functions are:
      The assumed authoritative key is inherited across fork and exec.
 
 
+ (*) Get the LSM security context attached to a key.
+
+	long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
+		    size_t buflen)
+
+     This function returns a string that represents the LSM security context
+     attached to a key in the buffer provided.
+
+     Unless there's ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

Change current->fs[ug]id to current_fs[ug]id() so that fsgid and fsuid can be
separated from the task_struct.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/ia64/kernel/perfmon.c                |    4 ++--
 arch/powerpc/platforms/cell/spufs/inode.c |    4 ++--
 drivers/isdn/capi/capifs.c                |    4 ++--
 drivers/usb/core/inode.c                  |    4 ++--
 fs/9p/fid.c                               |    2 +-
 fs/9p/vfs_inode.c                         |    4 ++--
 fs/9p/vfs_super.c                         |    4 ++--
 fs/affs/inode.c                           |    4 ++--
 fs/anon_inodes.c                          |    4 ++--
 fs/attr.c                                 |    4 ++--
 fs/bfs/dir.c                              |    4 ++--
 fs/cifs/cifsproto.h                       |    2 +-
 fs/cifs/dir.c                             |   12 ++++++------
 fs/cifs/inode.c                           |    8 ++++----
 fs/cifs/misc.c                            |    4 ++--
 fs/coda/cache.c                           |    6 +++---
 fs/coda/upcall.c                          |    4 ++--
 fs/devpts/inode.c                         |    4 ++--
 fs/dquot.c                                |    2 +-
 fs/exec.c                                 |    4 ++--
 fs/ext2/balloc.c                          |    2 +-
 fs/ext2/ialloc.c                          |    4 ++--
 fs/ext2/ioctl.c                           |    2 +-
 fs/ext3/balloc.c                          |    2 +-
 fs/ext3/ialloc.c                          |    4 ++--
 fs/ext4/balloc.c                          |    2 +-
 fs/ext4/ialloc.c                          |    4 ++--
 fs/fuse/dev.c                             |    4 ++--
 fs/gfs2/inode.c                           |   10 +++++-----
 fs/hfs/inode.c                            |    4 ++--
 fs/hfsplus/inode.c                        |    4 ++--
 fs/hpfs/namei.c                           |   24 ++++++++++++------------
 fs/hugetlbfs/inode.c                      |   16 ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

Separate the task security context from task_struct.  At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 drivers/block/loop.c             |    5 -
 drivers/char/agp/frontend.c      |    2 
 drivers/char/drm/drm_fops.c      |    2 
 drivers/char/tty_audit.c         |    2 
 fs/affs/super.c                  |    4 -
 fs/autofs/inode.c                |    4 -
 fs/autofs4/inode.c               |    4 -
 fs/autofs4/waitq.c               |    4 -
 fs/binfmt_elf.c                  |   12 +-
 fs/cifs/connect.c                |    5 -
 fs/cifs/ioctl.c                  |    2 
 fs/ecryptfs/messaging.c          |   15 +-
 fs/exec.c                        |   20 ++-
 fs/fat/inode.c                   |    4 -
 fs/fcntl.c                       |    7 +
 fs/file_table.c                  |    4 -
 fs/fuse/dir.c                    |   12 +-
 fs/hfs/super.c                   |    4 -
 fs/hfsplus/options.c             |    4 -
 fs/hpfs/super.c                  |    4 -
 fs/hugetlbfs/inode.c             |    4 -
 fs/inotify_user.c                |    2 
 fs/ioprio.c                      |   12 +-
 fs/namei.c                       |    6 +
 fs/ncpfs/ioctl.c                 |   32 ++---
 fs/open.c                        |   22 ++-
 fs/proc/array.c                  |   14 +-
 fs/proc/base.c                   |   16 +-
 fs/proc/proc_sysctl.c            |    4 -
 fs/quota.c                       |    4 -
 fs/smbfs/dir.c                   |    4 -
 fs/smbfs/inode.c                 |    2 
 fs/smbfs/proc.c                  |    2 
 include/linux/init_task.h        |   23 +++
 include/linux/sched.h            |   78 +++++++++---
 include/net/scm.h                |    4 -
 ipc/mqueue.c                     |    4 -
 ipc/msg.c                        |    4 -
 ipc/sem.c                        |    4 -
 ipc/shm.c                        |   16 +-
 ipc/util.c                       | ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

Remove the temporarily embedded task security record from task_struct.  Instead
it is made to dangle from the task_struct::sec and task_struct::act_as pointers
with references counted for each.

do_coredump() is made to create a copy of the security record, modify it and
then use that to override the main one for a task.  sys_faccessat() is made to
do the same.

The process and session keyrings are moved from signal_struct into a new
thread_group_security struct.  This is then refcounted, with pointers coming
from the task_security struct instead of from signal_struct.

The keyring functions then take pointers to task_security structs rather than
task_structs for their security contexts.  This is so that request_key() can
proceed asynchronously without having to worry about the initiator task's
act_as pointer changing.

The LSM hooks for dealing with task security are modified to deal with the task
security struct directly rather than going via the task_struct as appopriate.

This permits the subjective security context of a task to be overridden by
changing its act_as pointer without altering its objective security pointer,
and thus not breaking signalling, ptrace, etc. whilst the override is in force.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/exec.c                        |   15 +-
 fs/open.c                        |   37 ++---
 include/linux/init_task.h        |   17 --
 include/linux/key-ui.h           |   10 +
 include/linux/key.h              |   31 +---
 include/linux/sched.h            |   40 ++++-
 include/linux/security.h         |   43 ++++--
 kernel/Makefile                  |    2 
 kernel/cred.c                    |  139 ++++++++++++++++++
 kernel/exit.c                    |    1 
 kernel/fork.c                    |   40 ++---
 kernel/kmod.c                    |   10 +
 kernel/sys.c                     |   16 +-
 kernel/user.c                    |    2 
 net/rxrpc/ar-key.c               |    4 -
 security/dummy.c                 |   14 ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:38 pm

Allow kernel services to override LSM settings appropriate to the actions
performed by a task by duplicating a security record, modifying it and then
using task_struct::act_as to point to it when performing operations on behalf
of a task.

This is used, for example, by CacheFiles which has to transparently access the
cache on behalf of a process that thinks it is doing, say, NFS accesses with a
potentially inappropriate (with respect to accessing the cache) set of
security data.

This patch provides two LSM hooks for modifying a task security record:

 (*) security_kernel_act_as() which allows modification of the security datum
     with which a task acts on other objects (most notably files).

 (*) security_create_files_as() which allows modification of the security
     datum that is used to initialise the security data on a file that a task
     creates.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/cred.h     |   22 ++++++++++++
 include/linux/security.h |   35 +++++++++++++++++++
 kernel/cred.c            |   86 ++++++++++++++++++++++++++++++++++++++++++++++
 security/dummy.c         |   15 ++++++++
 security/security.c      |   13 +++++++
 security/selinux/hooks.c |   45 ++++++++++++++++++++++++
 6 files changed, 216 insertions(+), 0 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
new file mode 100644
index 0000000..c9f8906
--- /dev/null
+++ b/include/linux/cred.h
@@ -0,0 +1,22 @@
+/* Credential management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CRED_H
+#define _LINUX_CRED_H
+
+struct task_security;
+struct inode;
+
+extern struct task_security ...
From: Stephen Smalley
Date: Monday, December 10, 2007 - 9:46 am

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Monday, December 10, 2007 - 10:07 am

Cleared means what?  Setting to 0?  Or is there some other constant I should
use for that?

David
--

From: Stephen Smalley
Date: Monday, December 10, 2007 - 10:23 am

Yes, setting to 0.

Otherwise, only other issue I have with this interface is it won't
generalize to dealing with nfsd, where we want to set the acting context
to a context we obtain from or determine based upon the client.

Why can't cachefilesd just push a context into the kernel and pass that
into the hook as the acting context, and then nfsd can do likewise using
the context provided by the client or obtained locally from exports for
ordinary clients?  Avoids the transition SID computation altogether
within the kernel and makes this more generic.

-- 
Stephen Smalley
National Security Agency

--

From: Stephen Smalley
Date: Monday, December 10, 2007 - 2:27 pm

It would get a context from the client or from a local configuration
that would map security-unaware clients to a default context, and then
want to assume that context for the particular operation.  No transition
the way in which dbusd imports contexts), or directly as a context
returned by a libselinux function.  Has to be done that way so that it
can be set differently for different policy types (strict, targeted,
mls).

Naturally, cachefiles (the kernel module) would invoke a security hook


It doesn't fit with how other users of security_kernel_act_as() will
likely want to work (they will want to just set the context to a
specified value, whether one obtained from the client or from some local
source), nor with how type transitions normally work (exec, with the
program type as the second type field).  I think it will just cause
confusion and subtle breakage.

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Monday, December 10, 2007 - 4:36 pm

That sounds too SELinux specific.  How do I do it so that it works for any
LSM?

Is linking against libselinux is a viable option if it's not available under
all LSM models?  Is it available under all LSM models?  Perhaps Casey can


It's causing me lots of confusion as it is.  I have been / am being told by
different people to do different things just in dealing with SELinux, and
various people are raising extra requirements or restrictions beyond that.
There doesn't seem to be a consensus.

It sounds like the best option is just to have the kernel nick the userspace
daemon's security context and use that as is, and junk all the restrictions on
what the daemon can do so that the kernel isn't too restricted.

David
--

From: Stephen Smalley
Date: Tuesday, December 11, 2007 - 12:37 pm

You can't.  There is no LSM for userspace; LSM specifically disavowed
any common userspace API, and that was one of our original

Nope, they would all have their own libraries, if they have a library at
all.  But that isn't your problem - your kernel interface should be
generic, and your LSM hooks should be generic, but your userspace isn't
required to be.  Have a look at how many programs in the distribution

Karl isn't a maintainer of the SELinux kernel code.  And I had thought

Well, you could do that, if that meets your needs, but it doesn't sound
very optimal either.  Why are you opposed to having userspace determine
the context and write it to a cachefiles interface, and just have the
kernel authorize it (invoke a hook to check permissions between the
daemon's context and the specified label), and make it the acting
context when appropriate (invoke a different hook to set it as the
acting context)?

-- 
Stephen Smalley
National Security Agency

--

From: Karl MacMillan
Date: Wednesday, December 12, 2007 - 7:41 am

That's what I remember as well - I suggested the transition idea and
then, after discussion, agreed that it wasn't the best approach.

And, as Steve points out, it's not my call to make.

Karl

--

From: David Howells
Date: Wednesday, December 12, 2007 - 7:53 am

Sigh.

Can you tell me then how to do it now?  I don't know very much about using
SELinux userspace stuff libraries or configuring them.

David
--

From: Karl MacMillan
Date: Wednesday, December 12, 2007 - 7:59 am

I don't have enough context anymore to help with design, but I can help
with the actual usage of the libraries if you have specific questions.

Karl

--

From: David Howells
Date: Tuesday, December 11, 2007 - 1:42 pm

So, basically, userspace programs (outside of security tools) aren't supposed

It is if I have to maintain a special pieces of code for each possible LSM.
One piece for SELinux, one piece for AppArmour, one piece for Smack, one piece

In /usr/bin ldd reports approximately 297 binaries link to libselinux, though
I can't say how many of those linked against it directly rather than picking
it up by contamination through a shared library.  Furthermore, I've no idea



Because, from what I gather, that means my userspace program needs to do
something different, depending on the security model that's currently in force
on a system.  Furthermore, I would have to have separate code, as far as I
know, for each security model as there's no commonality in userspace.

I can't just link against libselinux.  It might not be there.  I'm not going
to tie my program to SELinux either.

Furthermore, I worked out "the right way to do this" with some apparent
SELinux person, and you seemed reasonably accepting of it.  Now I have to go
and redo all the work, having had to redo the security stuff a couple of times
already because someone objected.  *That* is the main obstacle to getting my
code accepted at the moment, I think.

How about I just stick the context in /etc/cachefilesd.conf as a textual
configuration item and have the daemon pass that as a string to the cachefiles
kernel module, which can then ask LSM if it's valid to set this context as an
override, given the daemon's own security context?  That seems entirely
reasonable to me.

David
--

From: Stephen Smalley
Date: Tuesday, December 11, 2007 - 2:34 pm

I didn't say that.  I said that LSM provides no standard way for
userspace to access the security infrastructure, at least beyond the
raw /proc/self/attr and xattr APIs, and I view that as a defect in LSM.
SELinux does have a userspace API, because we think there should be a
way for userspace to do that, and it is being used by userspace programs

All your code has to do is invoke a function provided by libselinux.  If
at some later time a liblsm is introduced that provides a common
front-end to a libselinux, libsmack, ..., then you can use that.  But it
doesn't exist today.   But it all just becomes a simple function call

The point being that userspace is already using the SELinux API directly
- cachefilesd certainly won't be the first or most crucial application


You can certainly make the linking conditional at compile time based on
a build flag, and/or you can dlopen it at runtime and fall back
gracefully if not present.  It isn't tying your program to libselinux,
just using it if present, in this case to obtain an input to supply to
your kernel module so that the "policy" of deciding what context to use
is completely determined in userspace and not hardcoded at all into the

I did express reservations about it, and I really don't think that this
is the main obstacle - the determination of the acting task security
label and how to set it is really only a small part of your patch set,

That mostly works, but it means that an update to policy may require an
update to /etc/cachefilesd.conf, or that switching from one policy to
another might likewise require changing that file.  Versus using a
separate policy-provided config file for the label.

BTW, as should be obvious, some LSMs aren't label-based at all, so it
would need to be optional.

-- 
Stephen Smalley
National Security Agency

--

From: Casey Schaufler
Date: Tuesday, December 11, 2007 - 4:04 pm

That seems like an awful lot of work. I suggest that what you
put in /etc/cachefilesd.conf is a line like:

   security_context:"<whatever>"

and have your daemon pass "<whatever>" into the kernel using
a cachefile mechanism. The kernel code can call
security_secctx_to_secid("<whatever>") to determine if it's valid.
No need to invoke LSM specific code in your daemon. You may need
to have an application, say cachefileselinuxcontext, that will
read the current policy and spit out an appropriate value of
"<whatever>", but that can be separate and LSM specific without

For LSM's that don't use labels what you will have to pass in
won't be a label, it will be something else. But since any LSM
that wants to do networking or audit will have to deal with
secid's and secctx's the method outlined above ought to fit the
bill.



Casey Schaufler
casey@schaufler-ca.com
--

From: Stephen Smalley
Date: Wednesday, December 12, 2007 - 8:25 am

That sounds workable, although I think he will want a more specific hook
than security_secctx_to_secid(), or possibly a second hook call, that
would not only validate the context but authorize the use of it by the
cachefilesd process.  And then the security_task_kernel_act_as() hook
just takes the secid as input rather than the task struct of the daemon,
and applies it.  At that point, nfsd can use the same mechanism for
setting the acting SID based on the client process after doing its own

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Wednesday, December 12, 2007 - 11:29 am

I thought using secids was verboten as it made things too specific.

Have you example code for the security hook you mention?  I'm not sure I
understand why security_secctx_to_secid() is not sufficient.

Or is it that I need something that takes a secctx, converts it to a secid and
authorises its use all in one go?  If it's this, why can't that be rolles into
security_task_kernel_act_as()?  That sets up a task_security struct which is
then switched in and out without consultation of the LSM.

David
--

From: Stephen Smalley
Date: Wednesday, December 12, 2007 - 12:33 pm

Well, that has been Casey's objection in the past to it, but he seems to
have accepted their use now for certain purposes, and they are already

security_secctx_to_secid() would just validate and map a context string
to a secid.  It wouldn't perform any permission check, as the caller
might a kernel-internal user that is just mapping back and forth like
current users of security_secid_to_secctx, or it might be something that
ultimately originated from userspace but the hook has no way of knowing
why or what set of checks would be appropriate.  You'd need a more
specific hook for the authorization, one that would perform a permission
check, e.g. an avc_has_perm() call.  Which likely requires defining a

I was under the impression that security_task_kernel_act_as() was being
used to switch the current task to an acting context, not to initially
set up a struct for later use.  If you go with the latter approach, then
what is the lifecycle on that struct?

BTW, it gets a little confusing with your use of task_security for the
full task security state vs our existing use of task_security_struct
within SELinux for the task's LSM security blob.  I suppose ours could
be renamed to task_selinux.

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Wednesday, December 12, 2007 - 3:49 pm

Hmmm...  This is sounding very not-simple.  I don't know how to do this.  I
can probably guess the kernel side by looking at how SECCLASS_KEY is done, but
it sounds like it involves changes to the userspace policy processing tools
too.

It also sounds a bit like overkill, but if it's the right way then I guess it
has to be done.

What does the security class represent in this case?  And can it be generalised

Definitely the latter.  I guess I wasn't very clear in the patch description.

 (1) You create a new task_security struct

 (2) You fill in the fsuid, fsgid, etc.

 (3) You request that the LSM security pointer in it be set to point to the
     context you want (at the moment this is done by attempting a transition
     from the daemon's context):

	ret = security_transition_sid(dtsec->sid, SECINITSID_KERNEL,
				      SECCLASS_PROCESS, &ksid);

     and, in the current code, it returns an error if you're not allowed to do
     that.  But instead you'd ask it to set a specific context, and it'd set
     that if you're permitted to do that, and give you an error if you're not.

 (4) You then use the task_security at will to override the task->act_as
     pointer in whatever task(s) you're operating on behalf of at the moment.

 (5) When you cease operating on the behalf of a task, you revert its act_as
     pointer and drop a reference to your task_security struct.

 (6) When the last ref to the task_security struct goes away, the LSM data

I know.  I thought about it quite a bit, but the problem is there's so much
overloading of various words (eg: security, context), and I wanted to avoid

Better still, perhaps, would be to prefix things with selinux_ to make it
namespace clean.

David
--

From: Stephen Smalley
Date: Thursday, December 13, 2007 - 7:49 am

Not the tools, just the policy definitions.  Dan can help with that.
You add the definitions to the policy, then there is a script to
regenerate the SELinux kernel headers that #define the SECCLASS_ and

It is just a way of carving up the permission space, typically based on
object type, but it can essentially be arbitrary.  The check in this
case seems specific to cachefiles since it is controlling an operation
on the /dev/cachefiles interface that only applies to cachefiles
internal operations, so making a cachefiles class seems reasonable.

If this was instead just a generic "set my acting context to <value>"
operation, then it could be a generic /proc/self/attr/active interface
with an generic implementation and permission check, but here we aren't
setting the active context of the cachefilesd daemon but rather of the
cachefiles kernel module.

The other approach that I suggested a long time ago is to exempt the
cachefiles kernel module internal operations from SELinux permission
checks altogether by setting some task flag when performing those
operations and checking for that flag either on entry to the security_
static inlines, similar to the inode private flag, or within SELinux
itself.  Rationale being that the cachefiles kernel module can already
do what it wants and the SELinux permission checks are really about
controlling what userspace can do.  Then we don't have to invent a
context for the kernel module at all or worry about subtle breakage when

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Thursday, December 13, 2007 - 8:36 am

Can you specify what sort of permissions you're thinking of providing for
tasks to operate on this class?  Can an object of this class 'operate' on
other objects, or can only process-class objects do that?

How does an object of this class acquire a label?  What is an object of this
class?  Is it a "cache"?  Or were you thinking of a "module"?

David
--

From: Stephen Smalley
Date: Thursday, December 13, 2007 - 9:23 am

They would correspond with the operations provided by
the /dev/cachefiles interface, at the granularity you want to support
distinctions to be made.  Could just be a single 'setcontext' permission
if that is all you want to control distinctly, or could be a permission

In this case, I wouldn't expect a cachefiles object to act on anything
else.  Some objects are also used as subjects, especially in the

I was thinking the latter since the only goal was to control what
contexts could be set by a given task, but you could support per-cache
"objects" with their own labels (in which case the label would likely be
determined from the creating task).

If the latter, you don't really need a label for the object, and can
just use the supplied context/secid as the object of the permission
check, ala:
	rc = avc_has_perm(tsec->sid, secid, SECCLASS_CACHEFILES,
CACHEFILES__SETCONTEXT);

If the former, then you'd need more than one check, as you then have to
check whether the task can act on the cache in question, and then check
whether it can set the context for that cache to the specified value.

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Thursday, December 13, 2007 - 10:01 am

Can this be made simpler by the fact that /dev/cachefiles has its own unique

There is only one operation that makes sense to have a permission: "set
context and begin caching".

All the other operations on a file descriptor attached to /dev/cachfiles are
necessary for there to be a managed cache at all, and given that you've

Ummm.   I was under the impression that the target SID had to be a member of
target class.

David
--

From: Stephen Smalley
Date: Thursday, December 13, 2007 - 10:27 am

That lets you control who can use the interface at all, but not what

Do any of the interfaces allow a task to act on a cache other than one
it has created?  How does the task identify the desired cache?  What if

Not necessarily.

secid is being applied as the acting context for the cachefiles kernel
module, so the above makes sense, even though there isn't really any
"object" in view here.  Abstractly, the question we are asking above is:

Can this task set the context of the cachefiles kernel module to this
value?

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Thursday, December 13, 2007 - 11:04 am

Each file descriptor opened creates one separate cache instance.  Any commands
sent over that filedescriptor affect only the cache instance it is attached
to; similarly, any status data you read only refers to that one cache
instance.

Closing the file descriptor makes the cache go away as far as the kernel is
concerned.  The cachefiles daemon retains its cache dev file descriptor for

As far as the cache daemon is concerned, the file descriptor is its handle to

So the following (taken from cachefilesd.te):

	allow cachefilesd_t cachefiles_var_t : file { getattr rename unlink };

says, for example, allow:

	avc_has_perm("cachefilesd_t",
		     "cachefiles_var_t",
		     SECCLASS_FILE,
		     FILE__RENAME,
		     ...);

David
--

From: Casey Schaufler
Date: Wednesday, December 12, 2007 - 12:37 pm

I have argued that in the past. I'm reasonably convinced that I have
lost that argument at least for the immediate future as audit, usb,
and networking are all dependent on them. I can't image an LSM that
manages to avoid them, at least for the short term. If secid's are
ever expundged from the kernel cachefiles will require reeducation,

It would seem to me that security_secctx_to_secid() ought to suffice
if the application code was written correctly. Be aware that factors
outside the LSM may be important, too. As Stephen points out elsewhere,
Smack will require you have particular capabilities (CAP_MAC_OVERRIDE,
CAP_MAC_ADMIN) while a DAC LSM may require CAP_DAC_OVERRIDE. SELinux
is likely to be the odd duck in this pond in that it does not use the
capability mechanism in the way Nature intends it to be, opting to
treat "privilege" with a completely different model.


Casey Schaufler
casey@schaufler-ca.com
--

From: Casey Schaufler
Date: Wednesday, December 12, 2007 - 9:51 am

What sort of authorization are you thinking of? I would expect
that to have been done by cachefileselinuxcontext (or
cachefilesspiffylsmcontext) up in userspace. If you're going to
rely on userspace applications for policy enforcement they need

good points all, in spite of my personal distaste for secids.


Casey Schaufler
casey@schaufler-ca.com
--

From: Stephen Smalley
Date: Wednesday, December 12, 2007 - 11:12 am

In Smack, I'd expect that you'd want to apply a CAP_MAC_OVERRIDE check.
In SELinux, we'd apply a permission check between the task's security
context and the specified security context so that we can control the
pairwise relationship between them via allow rules and constraints.

The kernel has no way of knowing whether the context was determined by
cachefileselinuxcontext or not; it only knows that some task is trying
to write some value to /cachefiles/context or whatever the kernel
interface is, and it needs to apply some authorization check there,

-- 
Stephen Smalley
National Security Agency

--

From: Casey Schaufler
Date: Wednesday, December 12, 2007 - 12:44 pm

Yes, but I would expect that interface to be protected (owned by root,
mode 0400). If /dev/cachefiles has to be publicly accessable make it


Yes, but forgive me being slow, I don't see the problem.


Casey Schaufler
casey@schaufler-ca.com
--

From: Stephen Smalley
Date: Wednesday, December 12, 2007 - 12:49 pm

-- 
Stephen Smalley
National Security Agency

--

From: Casey Schaufler
Date: Wednesday, December 12, 2007 - 1:09 pm

Yes, but we're talking about writing the configuration information
to the kernel, not actually making any access checks with it. I
think. What I think we're talking about (and please correct me David
if I've stepped into the wrong theatre) is getting the magic
secctx that cachefiles will use instead of the secctx that the task
would have otherwise. I don't think we're talking about recomputing
it on every access, I think David is looking for the blunderbuss
secctx that he can use any time he needs one.

And it would be CAP_MAC_ADMIN, since you're changing the MAC
characteristics of the system, not doing an access check.

Casey Schaufler
casey@schaufler-ca.com
--

From: David Howells
Date: Wednesday, December 12, 2007 - 3:29 pm

Indeed.

The way I do it is:

 (1) The daemon opens /dev/cachefiles to being an instance of a cache.

 (2) The daemon negotiates a security context for the module to use.

 (3) The security context is place in a task_security structure.

 (4) This task_security struct is attached temporarily to task->act_as each
     time any task attempts to access the cache through the module.

 (5) The task_security struct is discarded when the file descriptor that was
     created in (1) is closed and the cache is withdrawn at the same time.

David
--

From: David Howells
Date: Wednesday, December 12, 2007 - 11:25 am

What would I do with such a thing?  How would it get run?  Spat out to where?

David
--

From: Casey Schaufler
Date: Wednesday, December 12, 2007 - 12:20 pm

Put it in /etc/init.d/cachefiles and run it at boot time. Put the
result into /etc/cachefiles.conf. Have cachefilesd read it and pass
it downward.



Casey Schaufler
casey@schaufler-ca.com
--

From: David Howells
Date: Wednesday, December 12, 2007 - 12:29 pm

Ewww.  Runtime mangling of the configuration.  I suppose it doesn't have to be
in that file with the rest of the config.

David
--

From: Stephen Smalley
Date: Wednesday, December 12, 2007 - 12:35 pm

More likely, run it at build time in your .spec file to generate
cachefiles.conf, then run it again maybe upon a policy update or if the
user selects a different policy.

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Wednesday, December 12, 2007 - 3:55 pm

I don't think sticking it in cachefiles.conf is a good idea necessarily.
That has to be an administrator modifiable file.  Is there a program I could
make cachefiles run directly and capture the output of that could give me the

How do I do that?

David
--

From: Stephen Smalley
Date: Thursday, December 13, 2007 - 7:51 am

Yes, we could easily make a simple program that just invokes a
libselinux function that in turn grabs the proper context from some
context configuration file under /etc/selinux/$SELINUXTYPE/contexts/ and

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Thursday, December 13, 2007 - 9:03 am

That sounds nicely genericisable, perhaps even for any LSM.

	/usr/bin/lsm-get-context cachefiles

It does have to be able to come up with different contexts for different
caches, but that can be controlled by changing the name supplied to it.

David
--

From: Casey Schaufler
Date: Tuesday, December 11, 2007 - 2:18 pm

--- David Howells <dhowells@redhat.com> wrote:


Works for Smack. I can't say definitively, but I think it will
work for SELinux. Beyond that and we're into the fuzzy bit of the
LSM.


Casey Schaufler
casey@schaufler-ca.com
--

From: Casey Schaufler
Date: Monday, December 10, 2007 - 4:46 pm

Linking against libselinux is not now, nor will it ever be, a viable
option. There's just too much sophistication contained in libselinux


That would be consistant with the (perhaps archaic now) behavior
of nfsd on Unix, which did nothing but "lend it's credential" to the
underlying kernel code. I think it's a rational approach, although I
expect that in may have troubles under SELinux.


Casey Schaufler
casey@schaufler-ca.com
--

From: Stephen Smalley
Date: Tuesday, December 11, 2007 - 12:52 pm

nfsd needs to able to set the acting label to a value determined based
on the client so that file operations performed on behalf of the client
are subjected to the right set of permission checks and new files are
labeled properly, just as it already does for uid and gid (via fsuid and
fsgid).  So merely inheriting the label from the nfsd daemon doesn't
help with that purpose.

Both nfsd and cachefiles need a way to set the acting label, so having a
common hook for both to do that makes sense.  The authorization of that
label will differ, so splitting the authorization into a separate hook
also makes sense.
 
-- 
Stephen Smalley
National Security Agency

--

From: Casey Schaufler
Date: Monday, December 10, 2007 - 3:26 pm

I would expect that the operation would be more sophisticated
than that. You certainly aren't going to use what comes from
the other side without any processing, and I expect you'll have
some sort of operation on anything you pull from a config file

Unless you've got an LSM other than SELinux, of course. If
cachefilesd is going to be responsible for maintaining this
magic context there needs to be an LSM interface for it, not

I think that I agree with Stephen, although I could be mirely confused.
That happens to me when interfaces are described in SELinux terms. I
still don't care much for multiple contexts, and I don't have a good
grasp of how you'll deal with Smack, or any LSM other than SELinux.
Just as Stephen mentions, I also don't see the generality that a change
of this magnitude really ought to provide.



Casey Schaufler
casey@schaufler-ca.com
--

From: Stephen Smalley
Date: Tuesday, December 11, 2007 - 11:34 am

Yes, that's true - the contexts would be subjected to a permission
check.  But that's separable from the act of setting it as the task's
acting security state (and needs to be separated, as the precise check
will vary depending on the situation - cachefiles is going to apply a

LSM is an in-kernel interface.  Here we are talking about a userspace
interface for obtaining the right security label to use.  There is no
equivalent to LSM in userspace as of yet.  Feel free to invent one, but

-- 
Stephen Smalley
National Security Agency

--

From: Casey Schaufler
Date: Tuesday, December 11, 2007 - 12:26 pm

I am much more concerned with the interfaces used to pass the
information into the kernel. I would expect that to be LSM
independent, not a call into libselinux that resolves into a
selinuxfs operation, or it's netlink equivilant. It would be
unfortunate if the userland/kernel interface became an obstacle


Casey Schaufler
casey@schaufler-ca.com
--

From: Stephen Smalley
Date: Tuesday, December 11, 2007 - 12:56 pm

That wasn't the issue.  The interface to the cachefiles module would
just consist of cachefilesd writing a string label to some pseudo file
tell cachefiles what label to apply as the acting label for operations
performed by cachefiles.  Which isn't SELinux-specific at all.

David was asking though how cachefilesd (the userspace agent) would
obtain such a label to use.  And that may very well be LSM-specific, and
as there is no LSM userspace API, it makes sense for him to invoke a
libselinux function at present.  If a liblsm is later created and
provides a common front-end API (internally dlopen'ing the right shared
library based on some configuration, whether libselinux or libsmack or
whatever), then cachefilesd can instead call the liblsm interface, but
that doesn't exist today.

-- 
Stephen Smalley
National Security Agency

--

From: Casey Schaufler
Date: Tuesday, December 11, 2007 - 1:40 pm

I am certainly not in favor of adding such complexity.
I suggest that cachefilesd get the context it wants using
whatever scheme works. I think there should be a cachefiles
specific (LSM independent) scheme for getting it into the
kernel, and a LSM hooks for setting it, if that's what he
really wants to do.



Casey Schaufler
casey@schaufler-ca.com
--

From: David Howells
Date: Monday, December 10, 2007 - 4:44 pm

Me neither.  I understand SELinux somewhat, though it's got a lot of wibbly
bits, and WinNT's security system, but I have no experience of the other

Perhaps it should be a specific interface, solely for cachefiles's use then.

David
--

From: Casey Schaufler
Date: Monday, December 10, 2007 - 4:56 pm

That would help focus things, to be sure. I don't know if that
focus will speed things up or slow them down, but I think that
attempting to accomodate SELinux/NFS, with the state that effort
is in, will only lead to tears.


Casey Schaufler
casey@schaufler-ca.com
--

From: David Howells
Date: Wednesday, January 9, 2008 - 9:51 am

Okay.  I can:

 (1) Have cachefilesd (the daemon) pass a security context string to the
     cachefiles kernel module, which can then convert it to a secID.  It'll
     require a security_secctx_to_secid() function, but I'm fairly certain I
     have a patch to add such kicking around somewhere.

 (2) Make security_task_kernel_act_as() take a task_security struct and a
     secID and just assign the latter to the former.  I'm not sure it makes
     sense to do any checks here, other than checking that under SELinux the
     secID is of SECCLASS_PROCESS class.

However, I need to write a check that the cachefilesd daemon is permitted to
nominate the secID it did.  Can someone tell me how to do this?  The obvious
way to do this is to add another PROCESS__xxx security permit specifically for
cachefiles, but that seems like a waste of a bit when there are only two spare
bits.

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_PROCESS, PROCESS__CACHEFILES_USE, NULL);

Now, I recall the addition of another security class being mentioned, which
presumably would give something like:

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_CACHE, CACHE__USE_AS_OVERRIDE, NULL);

And I assume this doesn't care if one, the other or both of the two SIDs
mentioned are of SECCLASS_PROCESS rather than of SECCLASS_CACHE.

David
--

From: David Howells
Date: Wednesday, January 9, 2008 - 10:27 am

Hmmmm...  I can't see how to add a new security class.  I can see that
security classes are defined in various autogenerated header files, but
autogenerated from what?  The "This file is automatically generated.  Do not
edit." message at the top of these files seems to belie the fact they're
actually checked in to GIT as is.

David
--

From: Stephen Smalley
Date: Wednesday, January 9, 2008 - 11:11 am

Already planned for 2.6.25, see:
http://marc.info/?l=selinux&m=119973017423487&w=2


Right, the latter is reasonable.
Requires adding the class and permission definition to
policy/flask/security_classes and policy/flask/access_vectors and then
regenerating the kernel headers from those files, ala:
  svn co http://oss.tresys.com/repos/refpolicy/trunk refpolicy
  cd refpolicy/policy/flask
  vi security_classes access_vectors
  <add new class to end>
  make
  make LINUX_D=/path/to/linux-2.6 tokern
 
Dan knows how to do that.

-- 
Stephen Smalley
National Security Agency

--

From: Stephen Smalley
Date: Wednesday, January 9, 2008 - 12:19 pm

Policy ultimately has to be updated in order to start writing allow
rules based on the new class/perm.  libselinux et al doesn't have to
change.

If you have a "SELinux:  policy loaded with handle_unknown=allow"
message in your /var/log/messages, then new classes/perms that are not
yet known to the policy will be allowed by default, so the operation
will be permitted by the kernel.

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Thursday, January 10, 2008 - 4:09 am

I don't.  How do I set it?

David
--

From: David Howells
Date: Monday, January 14, 2008 - 7:01 am

Okay...  It looks like I want four security operations/hooks for cachefiles:

 (1) Check that a daemon can nominate a secid for use by the kernel to override
     the process subjective secid.

 (2) Set the secid mentioned in (1).

 (3) Check that the kernel may create files as a particular secid (this could
     be specified indirectly by specifying an inode, which would hide the secid
     inside the LSM).

 (4) Set the fscreate secid mentioned in (3).

Now, it's possible to condense (1) and (2) into a single op, and condense (3)
and (4) into a single op.  That, however, might make the ops unusable by nfsd,
which may well want to bypass the checks or do them elsewhere.

Any thoughts?

David
--

From: David Howells
Date: Monday, January 14, 2008 - 7:06 am

FYI, I added the following vectors:

	# kernel services that need to override task security
	class kernel_service
	{
		use_as_override
		create_files_as
	}

The first allows:

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_KERNEL_SERVICE,
		     KERNEL_SERVICE__USE_AS_OVERRIDE,
		     NULL);

And the second something like:

	avc_has_perm(tsec->sid, inode->sid,
		     SECCLASS_KERNEL_SERVICE,
		     KERNEL_SERVICE__CREATE_FILES_AS,
		     NULL);

Rather than specifically dedicating them to the cache, I made them general.

David
--

From: Stephen Smalley
Date: Tuesday, January 15, 2008 - 7:58 am

Make sure that you or Dan submits a policy patch to register these
classes and permissions in the policy when the kernel patch is queued
for merge.

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Wednesday, January 23, 2008 - 1:52 pm

Do I just send the attached patch to <selinux@tycho.nsa.gov>?  Or do I need to
make a diff from a point in the tree nearer the root?  Is there anything else
I need to alter whilst I'm at it?

David
---
Index: policy/flask/security_classes
===================================================================
--- policy/flask/security_classes	(revision 2573)
+++ policy/flask/security_classes	(working copy)
@@ -109,4 +109,7 @@
 # network peer labels
 class peer
 
+# kernel services that need to override task security
+class kernel_service
+
 # FLASK
Index: policy/flask/access_vectors
===================================================================
--- policy/flask/access_vectors	(revision 2573)
+++ policy/flask/access_vectors	(working copy)
@@ -736,3 +736,10 @@
 {
 	recv
 }
+
+# kernel services that need to override task security
+class kernel_service
+{
+	use_as_override
+	create_files_as
+}
--

From: James Morris
Date: Wednesday, January 23, 2008 - 3:03 pm

-- 
James Morris
<jmorris@namei.org>
--

From: Casey Schaufler
Date: Monday, January 14, 2008 - 7:52 am

Yes, and I would recommend doing so to avoid permission races.
You're going to have to deal with the case where step (2) fails
even if you have step (1), so the "test and set" mindset seems

Again, I don't think you're doing yourself any favors with a separate
test operation.

On (4) are you suggesting a third attribute value? There's the secid
of the task originally, the secid you're going to use to do the access

Let me see if I understand your current scheme.

You want a (object) secid that is used to access the task.
You want a (subject) secid that the task uses to accesses objects.
You want a (newobject) secid that an object gets on creation.
And you want them all to be distinct and settable.
Did I get that right?

Thank you.


Casey Schaufler
casey@schaufler-ca.com
--

From: David Howells
Date: Monday, January 14, 2008 - 8:19 am

Looking at SELinux, that doesn't get rid of the permission race because there's
no locking.  This may be different for other models.

I was thinking of having steps (2) and (4) not do any checking, but rather
assume that the caller has done the checks before calling the set routines,
possibly by calling the hooks mentioned in (1) and (3).

My main problem is that I don't know how NFSd wants to do things.  I suppose


That's correct.  Let me summarise:

 (1) The daemon has an active process security ID (say A).  When the daemon
     nominates an override process security ID (say B) to be used by the
     kernel, the cachefiles module asks the LSM to check that A is allowed to
     nominate B for this purpose.

 (2) The cachefiles module is given a path under which its cache exists.  The
     directory at the base of this path has its own security ID (say C).
     cachefiles wants to create new files in the cache with the same security
     ID as that directory (ie. C).

     However, when cachefiles is creating files in the cache, the security of
     whatever process is doing the access will be overridden with B, so
     cachefiles asks the LSM to check that B is allowed create files as C.

     Note that this is an instantaneous check in the cache startup stage.  This
     allows caching to be aborted early if the security policy does not permit
     B to create Cs.  Technically this check is superfluous as it's re-checked

That depends on what you mean.  cachefilesd (the daemon) will be run with a
security label because there's a security model in place.

I don't actually need to access the daemon, but the daemon does need to do

Correct.  This is used as an override by any task that accesses the cache
indirectly through the cachefiles module.

The cachefilesd daemon has its own secid with which it accesses the cache
directly.  The sets of permissions that must be granted by the module's
override subjective secid and by the daemon's subjective secid aren't

File ...
From: Stephen Smalley
Date: Tuesday, January 15, 2008 - 7:56 am

I don't think this check is on the kernel per se but rather the ability
of the daemon to nominate a secid for use on files created later by the

I think it is fine to combine them.

-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Tuesday, January 15, 2008 - 9:03 am

Hmmm...  At the moment the cachefiles module works out for itself what the
file label should be by looking at the root directory it was given and
assuming the label on that is what it's going to be using.  Are you suggesting
this should be specified directly instead by the daemon?

David
--

From: Stephen Smalley
Date: Tuesday, January 15, 2008 - 9:08 am

No, just that however the secid is determined (whether indirectly via
specification of a directory or directly via specification of a secid),
the ability of the daemon to control what secid gets used ought to be
controlled.  Or, alternatively, the ability of the daemon to enable
caching in a given directory ought to be controlled.

-- 
Stephen Smalley
National Security Agency

--

From: Casey Schaufler
Date: Tuesday, January 15, 2008 - 11:10 am

Oh my. While there will be cases where the label of the file
will match the label of the containing directory, and in fact
for most label based LSMs that will usually be the case, you
certainly can't count on it. The only place that you can find
the correct label for a file with any confidence in from the
xattr (assuming the LSM uses xattrs) on the file itself. I can
imaging an LSM for which it would make sense to derive the
label from the root directory, but I know Smack isn't one of
them, and I don't think that SELinux is either, although I
would defer a definitive answer on that to Stephen.


Casey Schaufler
casey@schaufler-ca.com
--

From: Stephen Smalley
Date: Tuesday, January 15, 2008 - 12:15 pm

The cache files are created by the cachefiles kernel module, not by the
userspace daemon, and the userspace daemon doesn't need to directly
read/write them at all (but I think it does need to be able to unlink
them?).  The userspace daemon merely identifies the directory where the
cache should live as part of configuring the cache when enabling it.

Hence, it is fine to use a fixed label for the cache files (systemhigh
in a MLS world), and to let the directory's label serve as the basis for
it.  Only the cachefiles kernel module directly reads and writes the
files.
 
-- 
Stephen Smalley
National Security Agency

--

From: David Howells
Date: Tuesday, January 15, 2008 - 2:55 pm

That is what I currently do.  SELinux rules are provided to grant the
appropriate file accesses to the override label used by the kernel module, so

Correct.
--

From: Casey Schaufler
Date: Tuesday, January 15, 2008 - 3:23 pm

Well, my bad, and thank you for clearing up my misunderstanding.


Casey Schaufler
casey@schaufler-ca.com
--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

[Empty message]
From: Nick Piggin
Date: Thursday, December 13, 2007 - 8:51 pm

This is pretty nasty. I would suggest either to have the function
return the number of pages that were added to pagecache, or just
--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

Recruit a couple of page flags to aid in cache management.  The following extra
flags are defined:

 (1) PG_fscache (PG_owner_priv_2)

     The marked page is backed by a local cache and is pinning resources in the
     cache driver.

 (2) PG_fscache_write (PG_owner_priv_3)

     The marked page is being written to the local cache.  The page may not be
     modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that.  This includes things like truncation and page invalidation.
The function page_has_private() had been added to detect this.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/splice.c                |    2 +-
 include/linux/page-flags.h |   38 ++++++++++++++++++++++++++++++++++++--
 include/linux/pagemap.h    |   11 +++++++++++
 mm/filemap.c               |   16 ++++++++++++++++
 mm/migrate.c               |    2 +-
 mm/page_alloc.c            |    3 +++
 mm/readahead.c             |    9 +++++----
 mm/swap.c                  |    4 ++--
 mm/swap_state.c            |    4 ++--
 mm/truncate.c              |   10 +++++-----
 mm/vmscan.c                |    2 +-
 11 files changed, 83 insertions(+), 18 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6bdcb61..61edad7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 		 */
 		wait_on_page_writeback(page);
 
-		if (PagePrivate(page))
+		if (page_has_private(page))
 			try_to_release_page(page, GFP_KERNEL);
 
 		/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 209d3a4..fcc9e23 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -77,25 +77,30 @@
 #define PG_active		 6
 #define PG_slab			 7	/* slab debug (Suparna wants this) */
 
-#define PG_owner_priv_1		 8	/* Owner use. If pagecache, fs may use*/
+#define PG_owner_priv_1		 8	/* Owner use. fs may use in pagecache */
 #define ...
From: Nick Piggin
Date: Thursday, December 13, 2007 - 9:08 pm

I'd much prefer if you would handle this in the filesystem, and have it
set PG_private whenever fscache needs to receive a callback, and DTRT
depending on whether PG_fscache etc. is set or not.

Also, this wait_on_page_fscache_write / end_page_fscache_write stuff
seems like it would belong in your fscache headers rather than generic
mm code (ditto for your PG_fscache checks in the page allocator -- you
--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

Provide an add_wait_queue_tail() function to add a waiter to the back of a
wait queue instead of the front.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/wait.h |    2 ++
 kernel/wait.c        |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 0e68628..f1038d0 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -118,6 +118,8 @@ static inline int waitqueue_active(wait_queue_head_t *q)
 #define is_sync_wait(wait)	(!(wait) || ((wait)->private))
 
 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_tail(wait_queue_head_t *q,
+					 wait_queue_t *wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 
diff --git a/kernel/wait.c b/kernel/wait.c
index 444ddbf..7acc9cc 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -29,6 +29,24 @@ void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 }
 EXPORT_SYMBOL(add_wait_queue);
 
+/**
+ * add_wait_queue_tail - Add a waiter to the back of a waitqueue
+ * @q: the wait queue to append the waiter to
+ * @wait: the waiter to be queued
+ *
+ * Add a waiter to the back of a waitqueue so that it gets woken up last.
+ */
+void fastcall add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	unsigned long flags;
+
+	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue_tail(q, wait);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(add_wait_queue_tail);
+
 void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;

--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

The attached patch adds a generic intermediary (FS-Cache) by which filesystems
may call on local caching capabilities, and by which local caching backends may
make caches available:

	+---------+
	|         |                        +--------------+
	|   NFS   |--+                     |              |
	|         |  |                 +-->|   CacheFS    |
	+---------+  |   +----------+  |   |  /dev/hda5   |
	             |   |          |  |   +--------------+
	+---------+  +-->|          |  |
	|         |      |          |--+
	|   AFS   |----->| FS-Cache |
	|         |      |          |--+
	+---------+  +-->|          |  |
	             |   |          |  |   +--------------+
	+---------+  |   +----------+  |   |              |
	|         |  |                 +-->|  CacheFiles  |
	|  ISOFS  |--+                     |  /var/cache  |
	|         |                        +--------------+
	+---------+

The patch also documents the netfs interface and the cache backend
interface provided by the facility.


There are a number of reasons why I'm not using i_mapping to do this.
These have been discussed a lot on the LKML and CacheFS mailing lists,
but to summarise the basics:

 (1) Most filesystems don't do hole reportage.  Holes in files are treated as
     blocks of zeros and can't be distinguished otherwise, making it difficult
     to distinguish blocks that have been read from the network and cached from
     those that haven't.

 (2) The backing inode must be fully populated before being exposed to
     userspace through the main inode because the VM/VFS goes directly to the
     backing inode and does not interrogate the front inode on VM ops.

     Therefore:

     (a) The backing inode must fit entirely within the cache.

     (b) All backed files currently open must fit entirely within the cache at
     	 the same time.

     (c) A working set of files in total larger than the cache may not be
     	 cached.

     (d) A file may not grow larger than the available ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

This one-line patch fixes the missing export of copy_page introduced
by the cachefile patches.  This patch is not yet upstream, but is required
for cachefile on ia64.  It will be pushed upstream when cachefile goes
upstream.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/ia64/kernel/ia64_ksyms.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/kernel/ia64_ksyms.c b/arch/ia64/kernel/ia64_ksyms.c
index bd17190..20c3546 100644
--- a/arch/ia64/kernel/ia64_ksyms.c
+++ b/arch/ia64/kernel/ia64_ksyms.c
@@ -43,6 +43,7 @@ EXPORT_SYMBOL(__do_clear_user);
 EXPORT_SYMBOL(__strlen_user);
 EXPORT_SYMBOL(__strncpy_from_user);
 EXPORT_SYMBOL(__strnlen_user);
+EXPORT_SYMBOL(copy_page);
 
 /* from arch/ia64/lib */
 extern void __divsi3(void);

--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

Change all the usages of file->f_mapping in ext3_*write_end() functions to use
the mapping argument directly.  This has two consequences:

 (*) Consistency.  Without this patch sometimes one is used and sometimes the
     other is.

 (*) A NULL file pointer can be passed.  This feature is then made use of by
     the generic hook in the next patch, which is used by CacheFiles to write
     pages to a file without setting up a file struct.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/ext3/inode.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 9b162cd..bc918d3 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1227,7 +1227,7 @@ static int ext3_generic_write_end(struct file *file,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata)
 {
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
@@ -1252,7 +1252,7 @@ static int ext3_ordered_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	unsigned from, to;
 	int ret = 0, ret2;
 
@@ -1293,7 +1293,7 @@ static int ext3_writeback_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 	loff_t new_i_size;
 

--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/pagemap.h |    5 +++++
 mm/filemap.c            |   18 ++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6a1b317..21c35e2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -223,6 +223,11 @@ static inline void wait_on_page_fscache_write(struct page *page)
 extern void end_page_fscache_write(struct page *page);
 
 /*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
diff --git a/mm/filemap.c b/mm/filemap.c
index bea1ba6..6872d1b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -521,6 +521,24 @@ void fastcall wait_on_page_bit(struct page *page, int bit_nr)
 EXPORT_SYMBOL(wait_on_page_bit);
 
 /**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page - Page defining the wait queue of interest
+ * @waiter - Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+	wait_queue_head_t *q = page_waitqueue(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue(q, waiter);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
  * unlock_page - unlock a locked page
  * @page: the page
  *

--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

Export a number of functions for CacheFiles's use.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/super.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index ceaf2e3..cd199ae 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -266,6 +266,7 @@ int fsync_super(struct super_block *sb)
 	__fsync_super(sb);
 	return sync_blockdev(sb->s_bdev);
 }
+EXPORT_SYMBOL_GPL(fsync_super);
 
 /**
  *	generic_shutdown_super	-	common helper for ->kill_sb()

--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.


CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling.  This is called cachefilesd and lives in
/sbin.  The source for the daemon can be downloaded from:

	http://people.redhat.com/~dhowells/cachefs/cachefilesd.c

And an example configuration from:

	http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf

The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services.  Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.

CacheFiles creates a proc-file - "/proc/fs/cachefiles" - that is used for
communication with the daemon.  Only one thing may have this open at once, and
whilst it is open, a cache is at least partially in existence.  The daemon
opens this and sends commands down it to control the cache.

CacheFiles is currently limited to a single cache.

CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section.  This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.


============
REQUIREMENTS
============

The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:

	- dnotify.

	- extended attributes (xattrs).

	- openat() and friends.

	- bmap() support on files in the filesystem (FIBMAP ioctl).

	- The use of bmap() to detect a partial page at the end of the file.

It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:39 pm

The attached patch makes it possible for the NFS filesystem to make use of the
network filesystem local caching service (FS-Cache).

To be able to use this, an updated mount program is required.  This can be
obtained from:

	http://people.redhat.com/steved/fscache/util-linux/

To mount an NFS filesystem to use caching, add an "fsc" option to the mount:

	mount warthog:/ /a -o fsc

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/Makefile           |    1 
 fs/nfs/client.c           |    5 +
 fs/nfs/file.c             |   37 ++++
 fs/nfs/fscache-def.c      |  289 +++++++++++++++++++++++++++++++++
 fs/nfs/fscache.c          |  391 +++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/fscache.h          |  148 +++++++++++++++++
 fs/nfs/inode.c            |   47 +++++
 fs/nfs/read.c             |   28 +++
 fs/nfs/super.c            |    3 
 fs/nfs/sysctl.c           |    1 
 include/linux/nfs_fs.h    |    9 +
 include/linux/nfs_fs_sb.h |   18 ++
 12 files changed, 968 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index df0f41e..073d04c 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4)	+= nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
 			   nfs4namespace.o
 nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 70587f3..acb2179 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -43,6 +43,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_CLIENT
 
@@ -139,6 +140,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname,
 	clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED;
 #endif
 
+	nfs_fscache_get_client_cookie(clp);
+
 	return clp;
 
 error_3:
@@ -170,6 +173,8 @@ static void nfs_free_client(struct nfs_client *clp)
 
 	nfs4_shutdown_client(clp);
 ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:40 pm

Changes to the kernel configuration defintions and to the NFS mount options to
allow the local caching support added by the previous patch to be enabled.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/Kconfig        |    8 ++++++++
 fs/nfs/client.c   |    2 ++
 fs/nfs/internal.h |    1 +
 fs/nfs/super.c    |   14 ++++++++++++++
 4 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 215b0d6..83d1227 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1650,6 +1650,14 @@ config NFS_V4
 
 	  If unsure, say N.
 
+config NFS_FSCACHE
+	bool "Provide NFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want NFS data to be cached locally on disc through
+	  the general filesystem cache manager
+
 config NFS_DIRECTIO
 	bool "Allow direct I/O on NFS files"
 	depends on NFS_FS
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index acb2179..be38c3c 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -575,6 +575,7 @@ static int nfs_init_server(struct nfs_server *server,
 
 	/* Initialise the client representation from the mount data */
 	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
+	server->options = data->options;
 
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
@@ -931,6 +932,7 @@ static int nfs4_init_server(struct nfs_server *server,
 	/* Initialise the client representation from the mount data */
 	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
 	server->caps |= NFS_CAP_ATOMIC_OPEN;
+	server->options = data->options;
 
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f3acf48..ef09e00 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -35,6 +35,7 @@ struct nfs_parsed_mount_data {
 	int			acregmin, acregmax,
 				acdirmin, acdirmax;
 	int			namlen;
+	unsigned int		options;
 	unsigned ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:40 pm

Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/client.c  |    7 ++++---
 fs/nfs/fscache.h |   15 +++++++++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index be38c3c..91ecea3 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1335,7 +1335,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 
 	/* display header on line 1 */
 	if (v == &nfs_volume_list) {
-		seq_puts(m, "NV SERVER   PORT DEV     FSID\n");
+		seq_puts(m, "NV SERVER   PORT DEV     FSID              FSC\n");
 		return 0;
 	}
 	/* display one transport per line on subsequent lines */
@@ -1349,12 +1349,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 		 (unsigned long long) server->fsid.major,
 		 (unsigned long long) server->fsid.minor);
 
-	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s\n",
+	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s %s\n",
 		   clp->cl_nfsversion,
 		   NIPQUAD(clp->cl_addr.sin_addr),
 		   ntohs(clp->cl_addr.sin_port),
 		   dev,
-		   fsid);
+		   fsid,
+		   nfs_server_fscache_state(server));
 
 	return 0;
 }
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 144fb58..9a735fc 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -53,6 +53,17 @@ extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
 extern int nfs_fscache_release_page(struct page *, gfp_t);
 
 /*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+	if (server->nfs_client->fscache &&
+	    (server->options & NFS_OPTION_FSCACHE))
+		return "yes";
+	return "no ";
+}
+
+/*
  * release the caching state associated with a page if undergoing complete page
  * invalidation
  */
@@ -109,6 +120,10 @@ static inline void nfs4_fscache_get_client_cookie(struct nfs_client *clp) {}
 static inline void ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:40 pm

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

 crypto/fcrypt.c |   88 ++++++++++++++++++++++++++++---------------------------
 1 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/crypto/fcrypt.c b/crypto/fcrypt.c
index d161949..a32cb68 100644
--- a/crypto/fcrypt.c
+++ b/crypto/fcrypt.c
@@ -51,7 +51,7 @@
 #define ROUNDS 16
 
 struct fcrypt_ctx {
-	u32 sched[ROUNDS];
+	__be32 sched[ROUNDS];
 };
 
 /* Rotate right two 32 bit numbers as a 56 bit number */
@@ -73,8 +73,8 @@ do {								\
  * /afs/transarc.com/public/afsps/afs.rel31b.export-src/rxkad/sboxes.h
  */
 #undef Z
-#define Z(x) __constant_be32_to_cpu(x << 3)
-static const u32 sbox0[256] = {
+#define Z(x) __constant_cpu_to_be32(x << 3)
+static const __be32 sbox0[256] = {
 	Z(0xea), Z(0x7f), Z(0xb2), Z(0x64), Z(0x9d), Z(0xb0), Z(0xd9), Z(0x11),
 	Z(0xcd), Z(0x86), Z(0x86), Z(0x91), Z(0x0a), Z(0xb2), Z(0x93), Z(0x06),
 	Z(0x0e), Z(0x06), Z(0xd2), Z(0x65), Z(0x73), Z(0xc5), Z(0x28), Z(0x60),
@@ -110,8 +110,8 @@ static const u32 sbox0[256] = {
 };
 
 #undef Z
-#define Z(x) __constant_be32_to_cpu((x << 27) | (x >> 5))
-static const u32 sbox1[256] = {
+#define Z(x) __constant_cpu_to_be32((x << 27) | (x >> 5))
+static const __be32 sbox1[256] = {
 	Z(0x77), Z(0x14), Z(0xa6), Z(0xfe), Z(0xb2), Z(0x5e), Z(0x8c), Z(0x3e),
 	Z(0x67), Z(0x6c), Z(0xa1), Z(0x0d), Z(0xc2), Z(0xa2), Z(0xc1), Z(0x85),
 	Z(0x6c), Z(0x7b), Z(0x67), Z(0xc6), Z(0x23), Z(0xe3), Z(0xf2), Z(0x89),
@@ -147,8 +147,8 @@ static const u32 sbox1[256] = {
 };
 
 #undef Z
-#define Z(x) __constant_be32_to_cpu(x << 11)
-static const u32 sbox2[256] = {
+#define Z(x) __constant_cpu_to_be32(x << 11)
+static const __be32 sbox2[256] = {
 	Z(0xf0), Z(0x37), Z(0x24), Z(0x53), Z(0x2a), Z(0x03), Z(0x83), Z(0x86),
 	Z(0xd1), Z(0xec), Z(0x50), Z(0xf0), Z(0x42), Z(0x78), Z(0x2f), Z(0x6d),
 	Z(0xbf), Z(0x80), Z(0x87), Z(0x27), Z(0x95), Z(0xe2), Z(0xc5), Z(0x5d),
@@ -184,8 +184,8 @@ static const u32 sbox2[256] = {
 };
 
 #undef Z
-#define Z(x) ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:40 pm

Add a TestSetPageError() macro to the suite of page flag manipulators.  This
can be used by AFS to prevent over-excision of rejected writes from the page
cache.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/page-flags.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fcc9e23..0350c37 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -130,6 +130,7 @@
 #define PageError(page)		test_bit(PG_error, &(page)->flags)
 #define SetPageError(page)	set_bit(PG_error, &(page)->flags)
 #define ClearPageError(page)	clear_bit(PG_error, &(page)->flags)
+#define TestSetPageError(page)	test_and_set_bit(PG_error, &(page)->flags)
 
 #define PageReferenced(page)	test_bit(PG_referenced, &(page)->flags)
 #define SetPageReferenced(page)	set_bit(PG_referenced, &(page)->flags)

--

From: David Howells
Date: Wednesday, December 5, 2007 - 12:40 pm

Add a function - cancel_rejected_write() - to excise a rejected write from the
pagecache.  This function is related to the truncation family of routines.  It
permits the pages modified by a network filesystem client (such as AFS) to be
excised and discarded from the pagecache if the attempt to write them back to
the server fails.

The dirty and writeback states of the afflicted pages are cancelled and the
pages themselves are detached for recycling.  All PTEs referring to those
pages are removed.

Note that the locking is tricky as it's very easy to deadlock against
truncate() and other routines once the pages have been unlocked as part of the
writeback process.  To this end, the PG_error flag is set, then the
PG_writeback flag is cleared, and only *then* can lock_page() be called.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/mm.h |    5 ++-
 mm/truncate.c      |   83 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 520238c..438270f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1005,12 +1005,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
-/* filemap.c */
-extern unsigned long page_unuse(struct page *);
+/* truncate.c */
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
 
+/* filemap.c */
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index 5b7d1c5..95fc1a8 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -465,3 +465,86 @@ int invalidate_inode_pages2(struct address_space *mapping)
 	return ...
From: Nick Piggin
Date: Thursday, December 13, 2007 - 9:21 pm

[Empty message]
From: David Howells
Date: Wednesday, December 5, 2007 - 12:40 pm

Improve the handling of the case of a server rejecting an attempt to write back
a cached write.  AFS operates a write-back cache, so the following sequence of
events can theoretically occur:

	CLIENT 1		CLIENT 2
	=======================	=======================
	cat data >/the/file
	 (sits in pagecache)
				fs setacl -dir /the/dir/of/the/file \
					-acl system:administrators rlidka
				 (write permission removed for client 1)
	sync
	 (writeback attempt fails)

The way AFS attempts to handle this is:

 (1) The affected region will be excised and discarded on the basis that it
     can't be written back, yet we don't want it lurking in the page cache
     either.  The contents of the affected region will be reread from the
     server when called for again.

 (2) The EOF size will be set to the current server-based file size - usually
     that which it was before the affected write was made - assuming no
     conflicting write has been appended, and assuming the affected write
     extended the file.


This patch makes the following changes:

 (1) Zero-length short reads don't produce EBADMSG now just because the OpenAFS
     server puts a silly value as the size of the returned data.  This prevents
     excised pages beyond the revised EOF being reinstantiated with a surprise
     PG_error.

 (2) Writebacks can now be put into a 'rejected' state in which all further
     attempts to write them back will result in excision of the affected pages
     instead.

 (3) Preparing a page for overwriting now reads the whole page instead of just
     those parts of it that aren't to be covered by the copy to be made.  This
     handles the possibility that the copy might fail on EFAULT.  Corollary to
     this, PG_update can now be set by afs_prepare_page() on behalf of
     afs_prepare_write() rather than setting it in afs_commit_write().

 (4) In the case of a conflicting write, afs_prepare_write() will attempt to
     flush the write to the server, and will then wait for ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:40 pm

Save the operation ID to be used with a call that we're making for display
through /proc/net/rxrpc_calls.  This helps debugging stuck operations as we
then know what they are.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/fsclient.c       |   32 +++++++++++++++++++++++---------
 fs/afs/rxrpc.c          |    1 +
 fs/afs/vlclient.c       |    2 ++
 include/net/af_rxrpc.h  |    1 +
 net/rxrpc/af_rxrpc.c    |    3 +++
 net/rxrpc/ar-internal.h |    1 +
 net/rxrpc/ar-proc.c     |    7 ++++---
 7 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 04584c0..a468f2d 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -287,6 +287,7 @@ int afs_fs_fetch_file_status(struct afs_server *server,
 	call->reply2 = volsync;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSFETCHSTATUS);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -316,7 +317,7 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 	case 0:
 		call->offset = 0;
 		call->unmarshall++;
-		if (call->operation_ID != FSFETCHDATA64) {
+		if (call->operation_ID != htonl(FSFETCHDATA64)) {
 			call->unmarshall++;
 			goto no_msw;
 		}
@@ -464,7 +465,7 @@ static int afs_fs_fetch_data64(struct afs_server *server,
 	call->reply3 = buffer;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
-	call->operation_ID = FSFETCHDATA64;
+	call->operation_ID = htonl(FSFETCHDATA64);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -509,7 +510,7 @@ int afs_fs_fetch_data(struct afs_server *server,
 	call->reply3 = buffer;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
-	call->operation_ID = FSFETCHDATA;
+	call->operation_ID = htonl(FSFETCHDATA);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -577,6 +578,7 @@ int afs_fs_give_up_callbacks(struct afs_server *server,
 
 	call->service_id = FS_SERVICE;
 	call->port = ...
From: David Howells
Date: Wednesday, December 5, 2007 - 12:40 pm

The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and
through it any attached caches.  The kAFS filesystem will use caching
automatically if it's available.

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 fs/Kconfig         |    8 +
 fs/afs/Makefile    |    3 
 fs/afs/cache.c     |  505 ++++++++++++++++++++++++++++++++++------------------
 fs/afs/cache.h     |   15 --
 fs/afs/cell.c      |   16 +-
 fs/afs/file.c      |  212 +++++++++++++---------
 fs/afs/inode.c     |   26 +--
 fs/afs/internal.h  |   53 ++---
 fs/afs/main.c      |   27 +--
 fs/afs/mntpt.c     |    4 
 fs/afs/vlocation.c |   23 +-
 fs/afs/volume.c    |   14 -
 fs/afs/write.c     |    6 -
 13 files changed, 537 insertions(+), 375 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 83d1227..7f3278f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -2120,6 +2120,14 @@ config AFS_DEBUG
 
 	  If unsure, say N.
 
+config AFS_FSCACHE
+	bool "Provide AFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on AFS_FS=m && FSCACHE || AFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want AFS data to be cached locally on disk through
+	  the generic filesystem cache manager
+
 config 9P_FS
 	tristate "Plan 9 Resource Sharing Support (9P2000) (Experimental)"
 	depends on INET && NET_9P && EXPERIMENTAL
diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index a666710..4f64b95 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -2,7 +2,10 @@
 # Makefile for Red Hat Linux AFS client.
 #
 
+afs-cache-$(CONFIG_AFS_FSCACHE) := cache.o
+
 kafs-objs := \
+	$(afs-cache-y) \
 	callback.o \
 	cell.o \
 	cmservice.o \
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index de0d7de..8e179a9 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
@@ -9,248 +9,399 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_cell_cache_match(void *target,
-						const void *entry);
-static void ...
Previous thread: none

Next thread: [bug?] ALSA sound/core/info.c:852: BUG? (root) by Ingo Molnar on Wednesday, December 5, 2007 - 1:00 pm. (2 messages)