Hi Al, Christoph, Trond, Stephen, Casey,
Here's a set of patches that implement a very basic set of COW credentials. It
compiles, links and runs for x86_64 with EXT3, (V)FAT, NFS, AFS, SELinux and
keyrings all enabled. Most other filesystems are disabled, apart from things
like proc. It is not intended to completely cover the kernel at this point.

The cred struct contains the credentials that the kernel needs to act upon
something or to create something. Credentials that govern how a task may be
acted upon remain in the task struct.

In essence, the introduction of the cred struct separates a task's subjective
context (the authority with which it acts) from its objective context (the
authorisation required by others that want to act upon it), and permits
overriding of the subjective context by a kernel service so that the service
can act on the task's behalf to do something the task couldn't do on its own
authority.

Because keyrings and effective capabilities can be installed or changed in one
process by another process, they are shadowed by the cred structure rather than
residing there. Additionally, the session and process keyrings are shared
between all the threads of a process. The shadowing is performed by
update_current_cred() which is invoked on entry to any system call that might
need it.

A thread's cred struct may be read by that thread without any RCU precautions
as only that thread may replace its own cred struct. To change a thread's
credentials, dup_cred() should be called to create a new copy, the copy should
be changed, and then set_current_cred() should be called to make it live. Once
live, it may not be changed as it may then be shared with file descriptors, RPC
calls and other threads. RCU will be used to dispose of the old structure.The four patches are:
(1) Introduce struct cred and migrate fsuid, fsgid, the groups list and the
keyrings pointer to it.(2) Introduce a security pointer into the cred struct and add LSM hooks t...
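The copy, modify, publish rule described in this cover letter can be sketched in userspace C. This is only an illustration under stated assumptions: the struct below is a cut-down stand-in for struct cred, the signatures of dup_cred() and set_current_cred() are guessed rather than taken from the patches, change_fsuid_cow() is a hypothetical helper, and a plain free() stands in for RCU-deferred disposal.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the kernel's struct cred. */
struct cred {
	int usage;		/* reference count */
	unsigned int fsuid;
	unsigned int fsgid;
};

/* Stand-in for the current thread's credentials pointer. */
static struct cred *current_cred;

/* Take an extra reference on a live credentials record. */
static struct cred *get_cred(struct cred *cred)
{
	cred->usage++;
	return cred;
}

/* Drop a reference, freeing the record when the last one goes. */
static void put_cred(struct cred *cred)
{
	assert(cred->usage > 0);
	if (--cred->usage == 0)
		free(cred);	/* the kernel would defer this through RCU */
}

/* Make a private, modifiable copy of a credentials record. */
static struct cred *dup_cred(const struct cred *old)
{
	struct cred *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	*new = *old;
	new->usage = 1;
	return new;
}

/* Publish a new record as the thread's credentials and drop the old one. */
static void set_current_cred(struct cred *new)
{
	struct cred *old = current_cred;

	current_cred = new;
	if (old)
		put_cred(old);
}

/* Change fsuid by copy, modify, publish; the live record is never edited. */
static int change_fsuid_cow(unsigned int fsuid)
{
	struct cred *new = dup_cred(current_cred);

	if (!new)
		return -1;
	new->fsuid = fsuid;
	set_current_cred(new);
	return 0;
}
```

A holder of an older reference (an open file, say) keeps seeing the values that were live when it took the reference, which is why a record can be treated as immutable once it has been published.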
The attached patch adds a generic intermediary (FS-Cache) by which filesystems
may call on local caching capabilities, and by which local caching backends may
make caches available:

	+---------+
	|         |                        +--------------+
	|   NFS   |--+                     |              |
	|         |  |                 +-->|   CacheFS    |
	+---------+  |   +----------+  |   |  /dev/hda5   |
	             |   |          |  |   +--------------+
	+---------+  +-->|          |  |
	|         |      |          |--+
	|   AFS   |----->| FS-Cache |
	|         |      |          |--+
	+---------+  +-->|          |  |
	             |   |          |  |   +--------------+
	+---------+  |   +----------+  |   |              |
	|         |  |                 +-->|  CacheFiles  |
	|  ISOFS  |--+                     |  /var/cache  |
	|         |                        +--------------+
	+---------+

The patch also documents the netfs interface and the cache backend
interface provided by the facility.

There are a number of reasons why I'm not using i_mapping to do this.
These have been discussed a lot on the LKML and CacheFS mailing lists,
but to summarise the basics:

 (1) Most filesystems don't do hole reportage. Holes in files are treated as
     blocks of zeros and can't be distinguished otherwise, making it difficult
     to distinguish blocks that have been read from the network and cached from
     those that haven't.

 (2) The backing inode must be fully populated before being exposed to
     userspace through the main inode because the VM/VFS goes directly to the
     backing inode and does not interrogate the front inode on VM ops.

     Therefore:

     (a) The backing inode must fit entirely within the cache.

     (b) All backed files currently open must fit entirely within the cache at
         the same time.

     (c) A working set of files in total larger than the cache may not be
         cached.

     (d) A file may not grow larger than the...
Introduce a copy on write credentials record (struct cred). The fsuid, fsgid
and supplementary groups list move into it (DAC security). The session, process
and thread keyrings are reflected in it, but don't primarily reside there as
they aren't per-thread and occasionally need to be instantiated or replaced by
other threads or processes.

The LSM security information (MAC security) does *not* migrate from task_struct
at this point, but will be addressed by a later patch.

task_struct then gains an RCU-governed pointer to the credentials as a
replacement for the members it lost.

struct file gains a pointer to (f_cred) and a reference on the cred struct that
the opener was using at the time the file was opened. This replaces f_uid and
f_gid.

To alter the credentials record, a copy must be made. This copy may then be
altered and then the pointer in the task_struct redirected to it. From that
point on the new record should be considered immutable.

In addition, the default setting of i_uid and i_gid to fsuid and fsgid has been
moved from the callers of new_inode() into new_inode() itself.

Signed-off-by: David Howells <dhowells@redhat.com>
---
arch/x86_64/kernel/sys_x86_64.c | 4 +
fs/aio.c | 25 +++++
fs/anon_inodes.c | 2
fs/attr.c | 4 -
fs/compat.c | 65 +++++++++++++-
fs/compat_ioctl.c | 7 +
fs/dcookies.c | 11 +-
fs/devpts/inode.c | 6 +
fs/dquot.c | 2
fs/eventfd.c | 4 +
fs/eventpoll.c | 16 +++
fs/exec.c | 37 +++++++-
fs/ext3/balloc.c | 2
fs/ext3/ialloc.c | 4 -
fs/fcntl.c | 11 ++
fs/file_table.c | 3 -
fs/filesystems.c | 7 +
fs/inode.c | 6 +
fs/inotify_user.c | ...
Move into the cred struct the part of the task security data that defines how a
task acts upon an object. The part that defines how something acts upon a task
remains attached to the task.

For SELinux this requires some of task_security_struct to be split off into
cred_security_struct which is then attached to struct cred. Note that the
contents of cred_security_struct may not be changed without the generation of a
new struct cred.

The split is as follows:

 (*) create_sid, keycreate_sid and sockcreate_sid just move across.

 (*) sid is split into victim_sid - which remains - and action_sid - which
     migrates.

 (*) osid, exec_sid and ptrace_sid remain.

victim_sid is the SID used to govern actions upon the task. action_sid is used
to govern actions made by the task.

When accessing the cred_security_struct of another process, RCU read procedures
must be observed.

Signed-off-by: David Howells <dhowells@redhat.com>
---
include/linux/cred.h | 1
include/linux/security.h | 33 ++
kernel/cred.c | 7 +
security/dummy.c | 11 +
security/selinux/exports.c | 6
security/selinux/hooks.c | 497 +++++++++++++++++++++++--------------
security/selinux/include/objsec.h | 16 +
security/selinux/selinuxfs.c | 8 -
security/selinux/xfrm.c | 6
9 files changed, 379 insertions(+), 206 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index f3d98a8..98d5279 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -26,6 +26,7 @@ struct cred {
gid_t gid; /* fsgid as was */
struct rcu_head exterminate; /* cred destroyer */
struct group_info *group_info;
+ void *security;

/* caches for references to the three task keyrings
* - note that key_ref_t isn't typedef'd at this point, hence the odd
diff --git a/include/linux/security.h b/include/linux/security.h
index 1a15526..74cc204 100644
--- a/include/linux/securi...
Move the effective capabilities mask from the task struct into the credentials
record.

Note that the effective capabilities mask in the cred struct shadows that in
the task_struct because a thread can have its capabilities masks changed by
another thread. The shadowing is performed by update_current_cred() which is
invoked on entry to any system call that might need it.

Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/buffer.c | 3 +++
fs/ioprio.c | 3 +++
fs/open.c | 27 +++++++++------------------
fs/proc/array.c | 2 +-
fs/readdir.c | 3 +++
include/linux/cred.h | 2 ++
include/linux/init_task.h | 2 +-
include/linux/sched.h | 2 +-
ipc/msg.c | 3 +++
ipc/sem.c | 3 +++
ipc/shm.c | 3 +++
kernel/acct.c | 3 +++
kernel/capability.c | 3 +++
kernel/compat.c | 3 +++
kernel/cred.c | 36 +++++++++++++++++++++++++++++-------
kernel/exit.c | 2 ++
kernel/fork.c | 6 +++++-
kernel/futex.c | 3 +++
kernel/futex_compat.c | 3 +++
kernel/kexec.c | 3 +++
kernel/module.c | 6 ++++++
kernel/ptrace.c | 3 +++
kernel/sched.c | 9 +++++++++
kernel/signal.c | 6 ++++++
kernel/sys.c | 39 +++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 3 +++
kernel/time.c | 9 +++++++++
kernel/uid16.c | 3 +++
mm/mempolicy.c | 6 ++++++
mm/migrate.c | 3 +++
mm/mlock.c | 4 ++++
mm/mmap.c | 3 +++
mm/mremap.c | 3 +++
mm/oom_kill.c | 9 +++++++--
mm/swapfile.c | 6 ++++++
net/compat.c | 6 ++++++
net/socket.c | 45 ++++++++++++++++++++++++++++++++++++...
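The shadowing described in this patch's changelog might look roughly like the userspace sketch below. It is a guess at the shape of the mechanism, not the patch's code: the structs are cut-down stand-ins, the real update_current_cred() presumably acts on current rather than taking a task argument, and plain free() stands in for RCU disposal.

```c
#include <stdlib.h>

/* Hypothetical, cut-down credentials record. */
struct cred {
	int usage;			/* reference count */
	unsigned int cap_effective;	/* shadow of the task's mask */
};

/* Hypothetical, cut-down task. */
struct task {
	unsigned int cap_effective;	/* master copy; others may change it */
	struct cred *cred;		/* record the task acts with */
};

/*
 * Bring the task's cred back into line with its master capabilities mask.
 * Called on entry to a system call; returns 0 on success, -1 if no memory.
 */
static int update_current_cred(struct task *tsk)
{
	struct cred *old = tsk->cred;
	struct cred *new;

	if (old->cap_effective == tsk->cap_effective)
		return 0;		/* shadow is up to date */

	new = malloc(sizeof(*new));
	if (!new)
		return -1;
	*new = *old;			/* copy, never edit the live record */
	new->usage = 1;
	new->cap_effective = tsk->cap_effective;

	tsk->cred = new;		/* publish the replacement */
	if (--old->usage == 0)
		free(old);		/* the kernel would defer this via RCU */
	return 0;
}
```

The point of the design is that another thread only ever writes the master copy in the task; the cred record itself is replaced solely by its owning thread, so it can still be read without RCU precautions.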
The attached patch causes read_cache_pages() to release page-private data on a
page for which add_to_page_cache() fails or the filler function fails. This
permits pages with caching references associated with them to be cleaned up.

The invalidatepage() address space op is called (indirectly) to do the honours.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
mm/readahead.c | 40 ++++++++++++++++++++++++++++++++++++++--
1 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 39bf45d..12d1378 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -15,6 +15,7 @@
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/pagevec.h>
+#include <linux/buffer_head.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -51,6 +52,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+/*
+ * see if a page needs releasing upon read_cache_pages() failure
+ * - the caller of read_cache_pages() may have set PG_private before calling,
+ *   such as the NFS fs marking pages that are cached locally on disk, thus we
+ *   need to give the fs a chance to clean up in the event of an error
+ */
+static void read_cache_pages_invalidate_page(struct address_space *mapping,
+					     struct page *page)
+{
+	if (PagePrivate(page)) {
+		if (TestSetPageLocked(page))
+			BUG();
+		page->mapping = mapping;
+		do_invalidatepage(page, 0);
+		page->mapping = NULL;
+		unlock_page(page);
+	}
+	page_cache_release(page);
+}
+
+/*
+ * release a list of pages, invalidating them first if need be
+ */
+static void read_cache_pages_invalidate_pages(struct address_space *mapping,
+					      struct list_head *pages)
+{
+	struct page *victim;
+
+	while (!list_empty(pages)) {
+		victim = list_to_page(pages);
+		list_del(&victim->lru);
+		read_cache_pages_invalidate_page...
Recruit a couple of page flags to aid in cache management. The following extra
flags are defined:

 (1) PG_fscache (PG_owner_priv_2)

     The marked page is backed by a local cache and is pinning resources in the
     cache driver.

 (2) PG_fscache_write (PG_owner_priv_3)

     The marked page is being written to the local cache. The page may not be
     modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that. This includes things like truncation and page invalidation.
The function page_has_private() has been added to detect this.

Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/splice.c | 2 +-
include/linux/page-flags.h | 30 +++++++++++++++++++++++++++++-
include/linux/pagemap.h | 11 +++++++++++
mm/filemap.c | 16 ++++++++++++++++
mm/migrate.c | 2 +-
mm/page_alloc.c | 3 +++
mm/readahead.c | 9 +++++----
mm/swap.c | 4 ++--
mm/swap_state.c | 4 ++--
mm/truncate.c | 10 +++++-----
mm/vmscan.c | 2 +-
11 files changed, 76 insertions(+), 17 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index ceb1f07..1a8b80c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 		 */
 		wait_on_page_writeback(page);
 
-		if (PagePrivate(page))
+		if (page_has_private(page))
 			try_to_release_page(page, GFP_KERNEL);
 
 		/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 209d3a4..eaf9854 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -83,19 +83,24 @@
#define PG_private 11 /* If pagecache, has fs-private data */

#define PG_writeback 12 /* Page is under writeback */
+#define PG_owner_priv_2 13 /* Owner use. If pagecache, fs may use */
#define PG_compound 14 /* Part of a compound page */
...
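As a rough illustration of the page_has_private() idea from the page-flags patch above, the check can be thought of as a test against a combined mask. This is a sketch only: the bit numbers are the ones quoted in the hunk, but the struct, the PAGE_PRIVATE_MASK name and the function body are assumptions and may not match what the patch actually does.

```c
#include <stdbool.h>

/* Bit numbers quoted in the page-flags.h hunk above. */
#define PG_private	11	/* If pagecache, has fs-private data */
#define PG_owner_priv_2	13	/* Owner use. If pagecache, fs may use */
#define PG_fscache	PG_owner_priv_2	/* page is backed by a local cache */

/* Hypothetical, cut-down page with only a flags word. */
struct page {
	unsigned long flags;
};

/* Hypothetical name for the set of bits that call for release handling. */
#define PAGE_PRIVATE_MASK ((1UL << PG_private) | (1UL << PG_fscache))

/* True if the page carries fs-private data or pins cache resources. */
static bool page_has_private(const struct page *page)
{
	return (page->flags & PAGE_PRIVATE_MASK) != 0;
}
```

Callers such as truncation and invalidation can then swap a PagePrivate() test for page_has_private() without needing to know which of the two flags triggered it.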
Request a credential record for the named kernel service. This produces a
cred struct with appropriate DAC and MAC controls for effecting that service.
It may be used to override the credentials on a task to do work on that task's
behalf.

Signed-off-by: David Howells <dhowells@redhat.com>
---
include/linux/cred.h | 2 +
include/linux/security.h | 45 ++++++++++++++++++++++++++++++
kernel/cred.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
security/dummy.c | 13 +++++++++
security/selinux/hooks.c | 47 ++++++++++++++++++++++++++++++++
5 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index f2df1c3..fcbfc89 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -49,6 +49,8 @@ extern void change_fsgid(struct cred *, gid_t);
extern void change_groups(struct cred *, struct group_info *);
extern void change_cap(struct cred *, kernel_cap_t);
extern struct cred *dup_cred(const struct cred *);
+extern struct cred *get_kernel_cred(const char *, struct task_struct *);
+extern int change_create_files_as(struct cred *, struct inode *);

/**
* get_cred - Get an extra reference on a credentials record
diff --git a/include/linux/security.h b/include/linux/security.h
index 74cc204..7f11e6d 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -514,6 +514,20 @@ struct request_sock;
* @cred_destroy:
* Destroy the credentials attached to a cred structure.
* @cred points to the credentials structure that is to be destroyed.
+ * @cred_kernel_act_as:
+ * Set the credentials for a kernel service to act as (subjective context).
+ * @cred points to the credentials structure to be filled in.
+ * @service names the service making the request.
+ * @daemon: A userspace daemon to be used as a base for the context.
+ * @dentry: A file or dir to be used as a base for the file creation
+ * context.
+ * Return 0 if successful.
+ * @cred_create_files_as:
...
This one-line patch fixes the missing export of copy_page introduced
by the cachefile patches. This patch is not yet upstream, but is required
for cachefile on ia64. It will be pushed upstream when cachefile goes
upstream.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-Off-By: David Howells <dhowells@redhat.com>
---
arch/ia64/kernel/ia64_ksyms.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/kernel/ia64_ksyms.c b/arch/ia64/kernel/ia64_ksyms.c
index bd17190..20c3546 100644
--- a/arch/ia64/kernel/ia64_ksyms.c
+++ b/arch/ia64/kernel/ia64_ksyms.c
@@ -43,6 +43,7 @@ EXPORT_SYMBOL(__do_clear_user);
EXPORT_SYMBOL(__strlen_user);
EXPORT_SYMBOL(__strncpy_from_user);
EXPORT_SYMBOL(__strnlen_user);
+EXPORT_SYMBOL(copy_page);

/* from arch/ia64/lib */
extern void __divsi3(void);
-
This patch set is available for download as a tarball from:
http://people.redhat.com/~dhowells/nfs/nfs+fscache-23.tar.bz2
David
-
Provide an add_wait_queue_tail() function to add a waiter to the back of a
wait queue instead of the front.

Signed-off-by: David Howells <dhowells@redhat.com>
---
include/linux/wait.h | 1 +
kernel/wait.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 0e68628..4cae7db 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -118,6 +118,7 @@ static inline int waitqueue_active(wait_queue_head_t *q)
#define is_sync_wait(wait) (!(wait) || ((wait)->private))

extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t * wait));
extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));

diff --git a/kernel/wait.c b/kernel/wait.c
index 444ddbf..7acc9cc 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -29,6 +29,24 @@ void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
}
 EXPORT_SYMBOL(add_wait_queue);
 
+/**
+ * add_wait_queue_tail - Add a waiter to the back of a waitqueue
+ * @q: the wait queue to append the waiter to
+ * @wait: the waiter to be queued
+ *
+ * Add a waiter to the back of a waitqueue so that it gets woken up last.
+ */
+void fastcall add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	unsigned long flags;
+
+	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue_tail(q, wait);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(add_wait_queue_tail);
+
 void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;
-
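The effect of queuing at the back rather than the front, as in the add_wait_queue_tail() patch above, can be shown with a small userspace list. Everything here is a hypothetical stand-in (struct names, helper names, no locking); it only models the ordering, on the assumption that a wake-up scan walks the queue from front to back.

```c
#include <stddef.h>

/* Hypothetical, minimal stand-ins for wait_queue_t and wait_queue_head_t. */
struct waiter {
	int id;
	struct waiter *prev, *next;
};

struct wait_queue_head {
	struct waiter head;	/* sentinel node of a circular list */
};

static void init_waitqueue(struct wait_queue_head *q)
{
	q->head.prev = q->head.next = &q->head;
}

/* Link @w between two adjacent nodes. */
static void link_between(struct waiter *w, struct waiter *prev,
			 struct waiter *next)
{
	w->prev = prev;
	w->next = next;
	prev->next = w;
	next->prev = w;
}

/* Queue at the front: this waiter is reached first by a wake-up scan. */
static void add_waiter_front(struct wait_queue_head *q, struct waiter *w)
{
	link_between(w, &q->head, q->head.next);
}

/* Queue at the back: this waiter is reached last by a wake-up scan. */
static void add_waiter_tail(struct wait_queue_head *q, struct waiter *w)
{
	link_between(w, q->head.prev, &q->head);
}

/* Record the order in which a front-to-back wake-up scan visits waiters. */
static size_t wake_order(const struct wait_queue_head *q, int *ids, size_t max)
{
	size_t n = 0;
	const struct waiter *w;

	for (w = q->head.next; w != &q->head && n < max; w = w->next)
		ids[n++] = w->id;
	return n;
}
```

Queuing waiter 1 at the front, waiter 2 at the tail and then waiter 3 at the front leaves waiter 2 as the last one visited, which is the "woken up last" behaviour the kerneldoc comment describes.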
I think you want the effective values to show up in /proc.
Casey Schaufler
casey@schaufler-ca.com
-
Perhaps - but bear in mind that in the override case they weren't set by the
process itself.

David
-
They are nonetheless in effect and (heaven forbid) should they be
abused you don't want to hide the facts from concerned observers.

Casey Schaufler
casey@schaufler-ca.com
-
Because, I suspect, what the observer through /proc should see is what the
process thinks it is doing, not what is transparently going on behind the
scenes.

David
-
The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and
through it any attached caches. The kAFS filesystem will use caching
automatically if it's available.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
fs/Kconfig | 8 +
fs/afs/Makefile | 3
fs/afs/cache.c | 505 ++++++++++++++++++++++++++++++++++------------------
fs/afs/cache.h | 15 --
fs/afs/cell.c | 16 +-
fs/afs/file.c | 212 +++++++++++++---------
fs/afs/fsclient.c | 32 ++-
fs/afs/inode.c | 25 +--
fs/afs/internal.h | 53 ++---
fs/afs/main.c | 27 +--
fs/afs/mntpt.c | 4
fs/afs/rxrpc.c | 1
fs/afs/vlclient.c | 2
fs/afs/vlocation.c | 23 +-
fs/afs/volume.c | 14 -
fs/afs/write.c | 6 -
16 files changed, 563 insertions(+), 383 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index ebc7341..158a8d8 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -2059,6 +2059,14 @@ config AFS_DEBUG
 
 	  If unsure, say N.
 
+config AFS_FSCACHE
+	bool "Provide AFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on AFS_FS=m && FSCACHE || AFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want AFS data to be cached locally on disk through
+	  the generic filesystem cache manager
+
config 9P_FS
tristate "Plan 9 Resource Sharing Support (9P2000) (Experimental)"
depends on INET && NET_9P && EXPERIMENTAL
diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index a666710..4f64b95 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -2,7 +2,10 @@
# Makefile for Red Hat Linux AFS client.
#

+afs-cache-$(CONFIG_AFS_FSCACHE) := cache.o
+
kafs-objs := \
+ $(afs-cache-y) \
callback.o \
cell.o \
cmservice.o \
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index de0d7de..a5d6a70 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
@@ -9,248 +9,399 @@
* 2 of the License, or (at your option) any later version.
*/

-#ifdef AFS_CAC...
Save the operation ID to be used with a call that we're making for display
through /proc/net/rxrpc_calls. This helps debugging stuck operations as we
then know what they are.

Signed-off-by: David Howells <dhowells@redhat.com>
---
include/net/af_rxrpc.h | 1 +
net/rxrpc/af_rxrpc.c | 3 +++
net/rxrpc/ar-internal.h | 1 +
net/rxrpc/ar-proc.c | 7 ++++---
4 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/net/af_rxrpc.h b/include/net/af_rxrpc.h
index 00c2eaa..7e99733 100644
--- a/include/net/af_rxrpc.h
+++ b/include/net/af_rxrpc.h
@@ -38,6 +38,7 @@ extern void rxrpc_kernel_intercept_rx_messages(struct socket *,
extern struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *,
struct sockaddr_rxrpc *,
struct key *,
+ u32,
unsigned long,
gfp_t);
extern int rxrpc_kernel_send_data(struct rxrpc_call *, struct msghdr *,
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index c58fa0d..621c1dd 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -251,6 +251,7 @@ static struct rxrpc_transport *rxrpc_name_to_transport(struct socket *sock,
* @sock: The socket on which to make the call
* @srx: The address of the peer to contact (defaults to socket setting)
* @key: The security context to use (defaults to socket setting)
+ * @operation_ID: The operation ID for this call (debugging only)
* @user_call_ID: The ID to use
*
* Allow a kernel service to begin a call on the nominated socket. This just
@@ -263,6 +264,7 @@ static struct rxrpc_transport *rxrpc_name_to_transport(struct socket *sock,
struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
struct sockaddr_rxrpc *srx,
struct key *key,
+ u32 operation_ID,
unsigned long user_call_ID,
gfp_t gfp)
{
@@ -311,6 +313,7 @@ struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
call = rxrpc_get_client_call(rx, trans, bundle, user_call_ID, true,
...
Implement shared-writable mmap for AFS.

The key with which to access the file is obtained from the VMA at the point
where the PTE is made writable by the page_mkwrite() VMA op and cached in the
affected page.

If there's an outstanding write on the page made with a different key, then
page_mkwrite() will flush it before attaching a record of the new key.

Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/afs/file.c | 20 +++++++++++++++++++-
fs/afs/internal.h | 1 +
fs/afs/write.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 525f7c5..1323df4 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -22,6 +22,7 @@ static int afs_readpage(struct file *file, struct page *page);
static void afs_invalidatepage(struct page *page, unsigned long offset);
static int afs_releasepage(struct page *page, gfp_t gfp_flags);
static int afs_launder_page(struct page *page);
+static int afs_mmap(struct file *file, struct vm_area_struct *vma);

const struct file_operations afs_file_operations = {
.open = afs_open,
@@ -31,7 +32,7 @@ const struct file_operations afs_file_operations = {
.write = do_sync_write,
.aio_read = generic_file_aio_read,
.aio_write = afs_file_write,
- .mmap = generic_file_readonly_mmap,
+ .mmap = afs_mmap,
.splice_read = generic_file_splice_read,
.fsync = afs_fsync,
.lock = afs_lock,
@@ -56,6 +57,11 @@ const struct address_space_operations afs_fs_aops = {
.writepages = afs_writepages,
};

+static struct vm_operations_struct afs_file_vm_ops = {
+	.fault		= filemap_fault,
+	.page_mkwrite	= afs_page_mkwrite,
+};
+
/*
* open an AFS file or directory and attach a key to it
*/
@@ -295,3 +301,15 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags)
_leave(" = 0");
return 0;
}
+
+/*
+ * memory map part of an AFS file
+ */
+static int afs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+...
Improve the handling of the case of a server rejecting an attempt to write back
a cached write. AFS operates a write-back cache, so the following sequence of
events can theoretically occur:

        CLIENT 1                CLIENT 2
        ======================= =======================
        cat data >/the/file
        (sits in pagecache)
                                fs setacl -dir /the/dir/of/the/file \
                                        -acl system:administrators rlidka
                                (write permission removed for client 1)
        sync
        (writeback attempt fails)

The way AFS attempts to handle this is:

 (1) The affected region will be excised and discarded on the basis that it
     can't be written back, yet we don't want it lurking in the page cache
     either. The contents of the affected region will be reread from the
     server when called for again.

 (2) The EOF size will be set to the current server-based file size - usually
     that which it was before the affected write was made - assuming no
     conflicting write has been appended, and assuming the affected write
     extended the file.

This patch makes the following changes:

 (1) Zero-length short reads don't produce EBADMSG now just because the OpenAFS
     server puts a silly value as the size of the returned data. This prevents
     excised pages beyond the revised EOF being reinstantiated with a surprise
     PG_error.

 (2) Writebacks can now be put into a 'rejected' state in which all further
     attempts to write them back will result in excision of the affected pages
     instead.

 (3) Preparing a page for overwriting now reads the whole page instead of just
     those parts of it that aren't to be covered by the copy to be made. This
     handles the possibility that the copy might fail on EFAULT. Corollary to
     this, PG_update can now be set by afs_prepare_page() on behalf of
     afs_prepare_write() rather than setting it in afs_commit_write().

 (4) In the case of a conflicting write, afs_prepare_write() will attempt to
     flush the write to the server, and will then wait for P...
Add a function - cancel_rejected_write() - to excise a rejected write from the
pagecache. This function is related to the truncation family of routines. It
permits the pages modified by a network filesystem client (such as AFS) to be
excised and discarded from the pagecache if the attempt to write them back to
the server fails.

The dirty and writeback states of the afflicted pages are cancelled and the
pages themselves are detached for recycling. All PTEs referring to those
pages are removed.

Note that the locking is tricky as it's very easy to deadlock against
truncate() and other routines once the pages have been unlocked as part of the
writeback process. To this end, the PG_error flag is set, then the
PG_writeback flag is cleared, and only *then* can lock_page() be called.

Signed-off-by: David Howells <dhowells@redhat.com>
---
include/linux/mm.h | 5 ++-
mm/truncate.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1692dd6..49863df 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1091,12 +1091,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
-/* filemap.c */
-extern unsigned long page_unuse(struct page *);
+/* truncate.c */
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
 
+/* filemap.c */
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index 5555cb0..92a68f7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -462,3 +462,86 @@ int invalidate_inode_pages2(struct address_space *mapping)
return inval...
Add a TestSetPageError() macro to the suite of page flag manipulators. This
can be used by AFS to prevent over-excision of rejected writes from the page
cache.

Signed-off-by: David Howells <dhowells@redhat.com>
---
include/linux/page-flags.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index eaf9854..b59506b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -130,6 +130,7 @@
#define PageError(page) test_bit(PG_error, &(page)->flags)
#define SetPageError(page) set_bit(PG_error, &(page)->flags)
#define ClearPageError(page) clear_bit(PG_error, &(page)->flags)
+#define TestSetPageError(page) test_and_set_bit(PG_error, &(page)->flags)

#define PageReferenced(page) test_bit(PG_referenced, &(page)->flags)
#define SetPageReferenced(page) set_bit(PG_referenced, &(page)->flags)
-
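The test-and-set semantics behind TestSetPageError() can be modelled in userspace with C11 atomics. This is a sketch under assumptions: the bit number is arbitrary, the struct is a stand-in, and claim_page_for_excision() is a hypothetical illustration of how a "first caller wins" check could avoid excising the same page twice; how AFS really uses the macro is not shown in the patch above.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PG_error 1	/* hypothetical bit number, for illustration only */

/* Hypothetical, cut-down page with only an atomic flags word. */
struct page {
	atomic_ulong flags;
};

/*
 * Atomically set PG_error and report whether it was already set, in the
 * manner of test_and_set_bit().
 */
static bool test_set_page_error(struct page *page)
{
	unsigned long mask = 1UL << PG_error;
	unsigned long old = atomic_fetch_or(&page->flags, mask);

	return (old & mask) != 0;
}

/* Returns true only for the first caller to flag the page. */
static bool claim_page_for_excision(struct page *page)
{
	return !test_set_page_error(page);
}
```

Because the read and the write happen as one atomic step, two racing callers cannot both see the flag as clear, which a separate PageError() test followed by SetPageError() would not guarantee.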
Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/nfs/client.c | 7 ++++---
fs/nfs/fscache.h | 12 ++++++++++++
2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 0de4db4..d350668 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1319,7 +1319,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 
 	/* display header on line 1 */
 	if (v == &nfs_volume_list) {
-		seq_puts(m, "NV SERVER PORT DEV FSID\n");
+		seq_puts(m, "NV SERVER PORT DEV FSID FSC\n");
 		return 0;
 	}
 	/* display one transport per line on subsequent lines */
@@ -1333,12 +1333,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 		 (unsigned long long) server->fsid.major,
 		 (unsigned long long) server->fsid.minor);
 
-	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s\n",
+	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s %s\n",
 		   clp->cl_nfsversion,
 		   NIPQUAD(clp->cl_addr.sin_addr),
 		   ntohs(clp->cl_addr.sin_port),
 		   dev,
-		   fsid);
+		   fsid,
+		   nfs_server_fscache_state(server));
 
 	return 0;
 }
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 44bb0d1..77f3450 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -56,6 +56,17 @@ extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
 extern int nfs_fscache_release_page(struct page *, gfp_t);
 
 /*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+	if (server->nfs_client->fscache &&
+	    (server->options & NFS_OPTION_FSCACHE))
+		return "yes";
+	return "no ";
+}
+
+/*
  * release the caching state associated with a page if undergoing complete page
  * invalidation
  */
@@ -110,6 +121,7 @@ static inline void nfs_fscache_unregister(void) {}
static inline void n...
Changes to the kernel configuration definitions and to the NFS mount options to
allow the local caching support added by the previous patch to be enabled.

Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/Kconfig | 8 ++++++++
fs/nfs/client.c | 14 ++++++++++----
fs/nfs/internal.h | 2 ++
fs/nfs/super.c | 40 ++++++++++++++++++++++++++++++++++------
4 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 8ae7eda..ebc7341 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1597,6 +1597,14 @@ config NFS_V4
 
 	  If unsure, say N.
 
+config NFS_FSCACHE
+	bool "Provide NFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want NFS data to be cached locally on disc through
+	  the general filesystem cache manager
+
config NFS_DIRECTIO
bool "Allow direct I/O on NFS files"
depends on NFS_FS
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index f1783b2..0de4db4 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -543,7 +543,8 @@ error:
/*
* Create a version 2 or 3 client
*/
-static int nfs_init_server(struct nfs_server *server, const struct nfs_mount_data *data)
+static int nfs_init_server(struct nfs_server *server, const struct nfs_mount_data *data,
+ unsigned int extra_options)
{
struct nfs_client *clp;
int error, nfsvers = 2;
@@ -580,6 +581,7 @@ static int nfs_init_server(struct nfs_server *server, const struct nfs_mount_dat
server->acregmax = data->acregmax * HZ;
server->acdirmin = data->acdirmin * HZ;
server->acdirmax = data->acdirmax * HZ;
+ server->options = extra_options;

/* Start lockd here, before we might error out */
error = nfs_start_lockd(server);
@@ -776,6 +778,7 @@ void nfs_free_server(struct nfs_server *server)
* - keyed on server and FSID
*/
struct nfs_server *nfs_create_server(const struct nfs_mount_data ...
The attached patch makes it possible for the NFS filesystem to make use of the
network filesystem local caching service (FS-Cache).

To be able to use this, an updated mount program is required. This can be
obtained from:

	http://people.redhat.com/steved/fscache/util-linux/

To mount an NFS filesystem to use caching, add an "fsc" option to the mount:

	mount warthog:/ /a -o fsc

Signed-Off-By: David Howells <dhowells@redhat.com>
---
fs/nfs/Makefile | 1
fs/nfs/client.c | 5 +
fs/nfs/file.c | 51 ++++++
fs/nfs/fscache-def.c | 288 +++++++++++++++++++++++++++++++++++
fs/nfs/fscache.c | 372 +++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 144 +++++++++++++++++
fs/nfs/inode.c | 48 +++++-
fs/nfs/read.c | 28 +++
fs/nfs/sysctl.c | 44 +++++
include/linux/nfs_fs.h | 8 +
include/linux/nfs_fs_sb.h | 7 +
11 files changed, 986 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index b55cb23..07c9345 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,4 +16,5 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
nfs4namespace.o
nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o
nfs-objs := $(nfs-y)
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index a49f9fe..f1783b2 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -41,6 +41,7 @@
#include "delegation.h"
#include "iostat.h"
#include "internal.h"
+#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_CLIENT
@@ -137,6 +138,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname,
clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED;
#endif

+ nfs_fscache_get_client_cookie(clp);
+
return clp;

error_3:
@@ -168,6 +171,8 @@ static void nfs_free_client(struct nfs_client *clp)

nfs4_shutdown_client(clp);
...
Did I miss the section where the modified semantics about which
mounted file systems can use the cache and which ones can not
was implemented? For example, mounts of the same file system
from the server with "fsc", but with different mount options
such as "rw" or "ro" or NFS dependent mount options, must fail
because of the way that the cache is accessed. Also, perhaps
a little confusing, that mounts of different paths on a server
which land on the same mounted file system on the server, but
with these differing mount options must also fail?

Thanx...
ps
-
Yes.
David
-
fs/nfs/super.c:
case Opt_sharecache:
mnt->flags &= ~NFS_MOUNT_UNSHARED;
break;
case Opt_nosharecache:
mnt->flags |= NFS_MOUNT_UNSHARED;
mnt->options &= ~NFS_OPTION_FSCACHE;
break;
case Opt_fscache:
/* sharing is mandatory with fscache */
mnt->options |= NFS_OPTION_FSCACHE;
mnt->flags &= ~NFS_MOUNT_UNSHARED;
break;
case Opt_nofscache:
mnt->options &= ~NFS_OPTION_FSCACHE;
break;

Hmmm... Actually, I'm not sure this is sufficient.
David
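
The interaction between the four options quoted above can be modelled in userspace. This is only a sketch: the bit values and the sketch_* names are hypothetical, and only the switch body mirrors the quoted fs/nfs/super.c fragment. It shows that whichever of "fsc" and "nosharecache" comes last wins, so caching is never left enabled on an unshared mount. It does not address the other mount options raised in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for illustration; the real ones live in the NFS headers. */
#define NFS_MOUNT_UNSHARED 0x8000u
#define NFS_OPTION_FSCACHE 0x0001u

enum sketch_opt {
	Opt_sharecache,
	Opt_nosharecache,
	Opt_fscache,
	Opt_nofscache
};

/* Stand-in for the parsed mount data (hypothetical type). */
struct sketch_mount {
	unsigned int flags;
	unsigned int options;
};

/* Apply one parsed option, mirroring the quoted switch statement. */
static void sketch_apply_opt(struct sketch_mount *mnt, enum sketch_opt opt)
{
	assert(mnt != NULL);

	switch (opt) {
	case Opt_sharecache:
		mnt->flags &= ~NFS_MOUNT_UNSHARED;
		break;
	case Opt_nosharecache:
		mnt->flags |= NFS_MOUNT_UNSHARED;
		mnt->options &= ~NFS_OPTION_FSCACHE;
		break;
	case Opt_fscache:
		/* sharing is mandatory with fscache */
		mnt->options |= NFS_OPTION_FSCACHE;
		mnt->flags &= ~NFS_MOUNT_UNSHARED;
		break;
	case Opt_nofscache:
		mnt->options &= ~NFS_OPTION_FSCACHE;
		break;
	}
}

/* Invariant the switch is meant to keep: caching on implies sharing on. */
static int sketch_is_consistent(const struct sketch_mount *mnt)
{
	return !((mnt->options & NFS_OPTION_FSCACHE) &&
		 (mnt->flags & NFS_MOUNT_UNSHARED));
}
```

Applying Opt_fscache and then Opt_nosharecache leaves the mount unshared and uncached; the reverse order leaves it shared and cached. Either way sketch_is_consistent() holds.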
-
This doesn't seem to take into account any of the other options
which can cause sharing to be disabled. Perhaps SteveD can add
his patch to the mix which does resolve the issues?

Thanx...
ps
-
Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.

CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling. This is called cachefilesd and lives in
/sbin. The source for the daemon can be downloaded from:

http://people.redhat.com/~dhowells/cachefs/cachefilesd.c

And an example configuration from:

http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf

The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services. Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.

CacheFiles creates a proc-file - "/proc/fs/cachefiles" - that is used for
communication with the daemon. Only one thing may have this open at once, and
whilst it is open, a cache is at least partially in existence. The daemon
opens this and sends commands down it to control the cache.

CacheFiles is currently limited to a single cache.

CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section. This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.

============
REQUIREMENTS
============

The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:

- dnotify.
- extended attributes (xattrs).
- openat() and friends.
- bmap() support on files in the filesystem (FIBMAP ioctl).
- The use of bmap() to detect a partial page at the end of the file.

It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.

=============
C...
Export a number of functions for CacheFiles's use.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
fs/super.c | 2 ++
kernel/auditsc.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 28e7370..0e8c0e2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -270,6 +270,8 @@ int fsync_super(struct super_block *sb)
return sync_blockdev(sb->s_bdev);
}

+EXPORT_SYMBOL_GPL(fsync_super);
+
/**
* generic_shutdown_super - common helper for ->kill_sb()
* @sb: superblock to kill
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 282e041..4448a33 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1531,6 +1531,8 @@ add_names:
}
}

+EXPORT_SYMBOL_GPL(__audit_inode_child);
+
/**
* auditsc_get_stamp - get local copies of audit_context values
* @ctx: audit_context for the task
-
Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
include/linux/pagemap.h | 5 +++++
mm/filemap.c | 19 +++++++++++++++++++
2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d1049b6..452fdcf 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -220,6 +220,11 @@ static inline void wait_on_page_fscache_write(struct page *page)
extern void end_page_fscache_write(struct page *page);

/*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
* Fault a userspace page into pagetables. Return non-zero on a fault.
*
* This assumes that two userspace pages are always sufficient. That's
diff --git a/mm/filemap.c b/mm/filemap.c
index 21aeee9..e48e862 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -518,6 +518,25 @@ void fastcall wait_on_page_bit(struct page *page, int bit_nr)
EXPORT_SYMBOL(wait_on_page_bit);

/**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page: Page defining the wait queue of interest
+ * @waiter: Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+ wait_queue_head_t *q = page_waitqueue(page);
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->lock, flags);
+ __add_wait_queue(q, waiter);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
* unlock_page - unlock a locked page
* @page: the page
*
-
Won't it in any case want to lock the page too? That would be the only
way to ensure that the page is still mapped into the address space when
you're writing it out...
-
No. Why would it? All it wants to do is to read the page (copying it to the
I don't understand what you're getting at. Write the page out where? We've
just read it in from the cache, so why would we be writing it back out?

David
-
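
The monitoring idea in the add_page_wait_queue() patch can be illustrated with a small userspace model. This is a sketch only: every name here (page_model, waiter_model, copy_monitor and the model_* functions) is hypothetical, there is no locking, and waiters are simply detached when woken, which is a simplification of how kernel wait queues behave. It shows a caller attaching a callback to a page's queue and learning about the unlock without ever taking the page lock itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a wait queue entry carrying a wake callback. */
struct waiter_model {
	void (*func)(struct waiter_model *w);	/* run when the page is unlocked */
	struct waiter_model *next;
};

/* Hypothetical stand-in for a page with its own wait queue. */
struct page_model {
	int locked;
	struct waiter_model *waiters;
};

/* Model of add_page_wait_queue(): push a waiter onto the page's queue. */
static void model_add_page_wait_queue(struct page_model *page,
				      struct waiter_model *w)
{
	assert(page != NULL && w != NULL);
	w->next = page->waiters;
	page->waiters = w;
}

/* Model of unlock_page(): clear the lock, then wake every queued waiter. */
static void model_unlock_page(struct page_model *page)
{
	struct waiter_model *w = page->waiters;

	page->locked = 0;
	page->waiters = NULL;
	while (w != NULL) {
		struct waiter_model *next = w->next;

		w->next = NULL;
		w->func(w);
		w = next;
	}
}

/* A monitor that records that the backing page read has completed. */
struct copy_monitor {
	struct waiter_model waiter;	/* first member, so the cast below is valid */
	int read_done;
};

static void copy_monitor_wake(struct waiter_model *w)
{
	struct copy_monitor *mon = (struct copy_monitor *)w;

	/* In the described use, the copy to the netfs page would start here. */
	mon->read_done = 1;
}
```

A caller would initialise a copy_monitor with copy_monitor_wake as its callback, attach it with model_add_page_wait_queue(), and find read_done set once model_unlock_page() has run.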
Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised).

This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.

Supply a generic implementation for this that uses the prepare_write() and
commit_write() address_space operations to bound a copy directly into the page
cache.

Hook the Ext2 and Ext3 operations to the generic implementation.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
fs/ext2/inode.c | 2 +
fs/ext3/inode.c | 3 ++
include/linux/fs.h | 7 ++++
mm/filemap.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 107 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0079b2c..b3e4b50 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -695,6 +695,7 @@ const struct address_space_operations ext2_aops = {
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
};

const struct address_space_operations ext2_aops_xip = {
@@ -713,6 +714,7 @@ const struct address_space_operations ext2_nobh_aops = {
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
};

/*
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index de4e316..93809eb 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1713,6 +1713,7 @@ static const struct address_space_operations ext3_ordered_aops = {
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
};

static const struct address_space_operations ext3_writeback_aops = {
@@ -1727,6 +1728,7 @@ static const struct address_space_operations ext3_writeback_aops = {
.relea...
So why do you need a new address space operation? AFAICS the generic
implementation will work for pretty much everyone who supports the
existing prepare_write()/commit_write().
Furthermore, you don't appear to supply any alternative "optimised"
implementations...
-
Because Christoph decreed that I wasn't allowed to call prepare_write() and
commit_write() directly. It's possible that the method should be in the

Optimised in what fashion?
David
-
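
The shape of the generic write_one_page implementation discussed above, a whole-page copy bounded by prepare_write() and commit_write(), can be modelled in userspace as follows. This is a sketch under stated assumptions and not the patch's code: the *_model types, MODEL_PAGE_SIZE and the trivial hook implementations are all hypothetical, and the real address_space operations take different arguments.

```c
#include <assert.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* hypothetical page size for the model */

/* Hypothetical stand-in for a page cache page. */
struct wpage_model {
	unsigned char data[MODEL_PAGE_SIZE];
	int prepared;
	int committed;
};

/* Hypothetical stand-in for the two address_space operations being used. */
struct aops_model {
	int (*prepare_write)(struct wpage_model *pg, unsigned from, unsigned to);
	int (*commit_write)(struct wpage_model *pg, unsigned from, unsigned to);
};

/* Generic helper: bound a full-page copy with prepare and commit. */
static int model_write_one_page(const struct aops_model *ops,
				struct wpage_model *dst,
				const unsigned char *src)
{
	int ret;

	assert(ops != NULL && dst != NULL && src != NULL);

	ret = ops->prepare_write(dst, 0, MODEL_PAGE_SIZE);
	if (ret < 0)
		return ret;	/* nothing was copied */

	memcpy(dst->data, src, MODEL_PAGE_SIZE);

	return ops->commit_write(dst, 0, MODEL_PAGE_SIZE);
}

/* Trivial hooks that only record that they ran over the whole page. */
static int model_prepare(struct wpage_model *pg, unsigned from, unsigned to)
{
	if (from != 0 || to != MODEL_PAGE_SIZE)
		return -1;
	pg->prepared = 1;
	return 0;
}

static int model_commit(struct wpage_model *pg, unsigned from, unsigned to)
{
	if (!pg->prepared || from != 0 || to != MODEL_PAGE_SIZE)
		return -1;
	pg->committed = 1;
	return 0;
}
```

Because the helper only goes through the two hooks, any filesystem model that supplies them can share it, which is the point made in the question above about the generic implementation working for most users of prepare_write()/commit_write().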
