Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create

Previous thread: [AppArmor 18/41] call lsm hook before unhashing dentry in vfs_rmdir() by jjohansen on Thursday, April 12, 2007 - 2:08 am. (1 message)

Next thread: [AppArmor 21/41] Add struct vfsmount parameters to vfs_rename() by jjohansen on Thursday, April 12, 2007 - 2:08 am. (1 message)
From: jjohansen
Date: Thursday, April 12, 2007 - 2:08 am

This is needed for computing pathnames in the AppArmor LSM.

Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>

---
 fs/namei.c               |    2 +-
 include/linux/security.h |    9 ++++++---
 security/dummy.c         |    2 +-
 security/selinux/hooks.c |    3 ++-
 4 files changed, 10 insertions(+), 6 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1503,7 +1503,7 @@ int vfs_create(struct inode *dir, struct
 		return -EACCES;	/* shouldn't it be ENOSYS? */
 	mode &= S_IALLUGO;
 	mode |= S_IFREG;
-	error = security_inode_create(dir, dentry, mode);
+	error = security_inode_create(dir, dentry, nd ? nd->mnt : NULL, mode);
 	if (error)
 		return error;
 	DQUOT_INIT(dir);
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -283,6 +283,7 @@ struct request_sock;
  *	Check permission to create a regular file.
  *	@dir contains inode structure of the parent of the new file.
  *	@dentry contains the dentry structure for the file to be created.
+ *	@mnt is the vfsmount corresponding to @dentry (may be NULL).
  *	@mode contains the file mode of the file to be created.
  *	Return 0 if permission is granted.
  * @inode_link:
@@ -1204,8 +1205,8 @@ struct security_operations {
 	void (*inode_free_security) (struct inode *inode);
 	int (*inode_init_security) (struct inode *inode, struct inode *dir,
 				    char **name, void **value, size_t *len);
-	int (*inode_create) (struct inode *dir,
-	                     struct dentry *dentry, int mode);
+	int (*inode_create) (struct inode *dir, struct dentry *dentry,
+			     struct vfsmount *mnt, int mode);
 	int (*inode_link) (struct dentry *old_dentry,
 	                   struct inode *dir, struct dentry *new_dentry);
 	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
@@ -1611,11 +1612,12 @@ static inline int security_inode_init_se
 	
 static inline int security_inode_create (struct inode *dir,
 					 struct ...
From: Christoph Hellwig
Date: Thursday, April 12, 2007 - 3:06 am

Once again very strong NACK.  Every conditional passing of vfsmounts get my
veto.  As mentioned last time if you really want this send a patch series
first that passed the vfsmount consistantly.

-

From: Andreas Gruenbacher
Date: Monday, April 16, 2007 - 9:11 am

I don't consider it fair to NACK this patch just because some other part of 
the kernel uses vfs_create() in the wrong way -- those are really independent 
issues. The bug is not that hard to fix though; here is a proposed patch on 
top of the other AppArmor patches.

The offenders are nfsd and the mqueue filesystem. Instantiate a struct 
nameidata there.

The .dentry and .mnt fields can easily be filled in in nfsd. The mqueue 
filesystem is somewhat special: files that mq_open creates are on an internal 
mount and the mqueue filesystem is not generally visible (it may or may not 
be mounted). When passing a regular vfsmount to vfs_create the files would 
appear disconnected while they actually are kernel-internal objects. Use a 
NULL vfsmount there to avoid that -- passing the kernel-internal vfsmount 
there wouldn't help, anyway.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

---
 fs/namei.c    |    7 ++++---
 fs/nfsd/vfs.c |   23 +++++++++++++++++++----
 ipc/mqueue.c  |    6 +++++-
 3 files changed, 28 insertions(+), 8 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1503,7 +1503,7 @@ int vfs_create(struct inode *dir, struct
 		return -EACCES;	/* shouldn't it be ENOSYS? */
 	mode &= S_IALLUGO;
 	mode |= S_IFREG;
-	error = security_inode_create(dir, dentry, nd ? nd->mnt : NULL, mode);
+	error = security_inode_create(dir, dentry, nd->mnt, mode);
 	if (error)
 		return error;
 	DQUOT_INIT(dir);
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1108,6 +1108,18 @@ nfsd_commit(struct svc_rqst *rqstp, stru
 }
 #endif /* CONFIG_NFSD_V3 */
 
+static inline int
+nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt,
+	       int mode)
+{
+	struct nameidata nd = {
+		.dentry = child,
+		.mnt = mnt,
+	};
+
+	return vfs_create(dir, child, mode, &nd);
+}
+
 /*
  * Create a file (regular, directory, device, fifo); UNIX sockets 
  * not yet implemented.
@@ -1192,7 +1204,8 @@ nfsd_create(struct svc_rqst *rqstp, stru
 	err = 0;
 	switch ...
From: Christoph Hellwig
Date: Monday, April 16, 2007 - 9:21 am

Note that it's not just this particular hunk, all these kinds of conditional

But anyway, creating fake nameidata structures is not really helpful.
If there is a nameidata passed people expect it to be complete, and
if you pass them to an LSM people will e.g. try to look into lookup
intents.  But similar to the per-mountpoint r/o work I suspect you're
simply operating on the wrong level, but then again this might not
be fixable due to the braindead apparmor design.
-

From: Andreas Gruenbacher
Date: Monday, April 16, 2007 - 9:40 am

I don't actually agree with that: when nfsd creates a file, it still is a file 
create no matter where it originates from, and so it does make sense to 
provide the appropriate intent information too. Struct nameidata contains 
other crap only needed during an actual lookup too --- that's a mess, and the 
nameidata2 patch shows one possibility how to deal with it. (It's the 
cleanest approach I could think of right now without cluttering the vfs code 
with insane amounts of dereferences.)

Thanks,
Andreas
-

From: Christoph Hellwig
Date: Monday, April 16, 2007 - 9:45 am

You should provide intent information, yes - which your patch didn't :)

And yes, I didn't like doing the ugly intent in nameidata hack, it's
creating loads of problems for various parties, e.g. the stackable
filesystem folks.  Now the basic intent in nameidata mistaken has been
made even worse by passing back a struct file in it conditionally and
doing lots of work in ->lookup that shouldn't be there.  (Which btw,
I expect to cause quite a few problems for apparmor or other lsms,
but I guess so far no one has tried them on NFSv4)

-

From: Andreas Gruenbacher
Date: Tuesday, April 17, 2007 - 5:09 am

Well, the information implicitly provided is "no intent": In do_create() in
ipc/mqueue.c intents would be pretty pointless because the mqueue filesystem
is local. In fs/nfsd/vfs.c, intents would make slightly more sense assuming
that the underlying filesystem eported via nfsd is remote. That's an

Pathname wise, NFSv4 should look like any other filesystem on the client side. 
On the Server side, the concept of pathnames doesn't really fly for nfs: if a 
directory contains more than one link to the same file, there is no way to 
tell those aliases from each other from the file descriptor. In addition, 
computing even the somewhat ambiguous pathnames that can be computed would
require subtree checking. But trying to confine nfsd is pretty pointless 
anyway: the deamon is privileged and can do whatever it wants. It makes more
sense to export the right directories with the right permissions in the first
place.

Thanks,
Andreas
-

From: Andreas Gruenbacher
Date: Friday, May 11, 2007 - 8:59 am

Yes, the vfs uses or passes through a NULL nameidata or vfsmount pointer and 
no intent information in several places:

 (1) vfs_create() is called with a NULL nameidata in two places,

 (2) lookup_one_len() and lookup_one_len_kern() pass a NULL nameidata to
     permission(), d_op->d_revalidate(), and i_op->lookup(),

 (3) file_permission() passes a NULL nameidata to permission().

 (4) permission() is directly called with NULL nameidata in some places, and

In general it is incorrect to use NULL nameidata or vfsmounts because then the
vfsmount->mnt_flags cannot be checked, and no intent information is available. 
Intents are an optional optimization for remote filesystems; we can simply 
ignore them for local filesystems. There are some cases where we are passing 
NULL which are real bugs, but there are also other cases in which the 
operations are not mount related: for example, filesystems like pipefs have 
the MS_NOUSER flag set which indicates that they cannot be mounted at all. On 
filesystems like sysfs, files are created and removed whether or not that 
filesystem is mounted.

The places where we should pass a proper nameidata / vfsmount but don't should 
be fixed. Those are long-standing issues in the vfs though, and at least some 
are not easy to correct. AppArmor is affected by some of the NULL vfsmount 
passing, and we found a few more problems when looking into this more 
closely, but most of that NULL passing doesn't affect AppArmor:

 * In some places, vfs functions are called without nameidata / vfsmount,
   followed by calling lsm hooks with a vfsmount. In those cases, the vfs
   functions cannot check the vfsmount flags, but the lsm hooks will still
   do the appropriate checks.

 * In the AppArmor security model, directory searching is always allowed, and
   the access check are done once the leaf dentry + vfsmount has been looked
   up. For example, granting write access to /var/tmp/foo in a profile is
   sufficient to allow that access; search access ...
From: Matthew Wilcox
Date: Monday, April 16, 2007 - 9:25 am

Wouldn't it normally result in fewer instructions (on most architectures
... maybe not x86) to keep the same argument order as vfs_create?  ie:

static inline int nfsd_do_create(struct inode *dir, struct dentry *child,
				 int mode, struct vfsmount *mnt)

-

From: Andreas Gruenbacher
Date: Monday, April 16, 2007 - 9:29 am

Here is a patch with request for comment.

The create, lookup, and permission inode operations are all passed a full
nameidata.  This is unfortunate because in nfsd and the mqueue filesystem we
must instantiate a struct nameidata, but cannot provide all of the same
information of a regular lookup.  The unused fields take up space on the
stack, but more importantly, it is not obvious which fields have meaningful
values and which don't, and so things might easily break.

This patch introduces struct nameidata2 with only the fields that make sense
independent of an actual lookup, and uses that struct in those places where a
full nameidata is not needed.

The entire patch is a little big so I'm only including the key parts here. The
complete version is here:

 http://forgeftp.novell.com/apparmor/LKML_Submission-April_07/split-up-nameidata.diff

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -14,21 +14,39 @@ struct open_intent {
 
 enum { MAX_NESTED_LINKS = 8 };
 
+/**
+ * Fields shared between nameidata and nameidata2 -- nameidata2 could
+ * be embedded in nameidata, but then the vfs code would become
+ * cluttered with dereferences.
+ */
+#define __NAMEIDATA2				\
+	struct dentry	*dentry;		\
+	struct vfsmount *mnt;			\
+	unsigned int	flags;			\
+						\
+	union {					\
+		struct open_intent open;	\
+	} intent;
+
 struct nameidata {
-	struct dentry	*dentry;
-	struct vfsmount *mnt;
+	__NAMEIDATA2
 	struct qstr	last;
-	unsigned int	flags;
 	int		last_type;
 	unsigned	depth;
 	char *saved_names[MAX_NESTED_LINKS + 1];
+};
 
-	/* Intent data */
-	union {
-		struct open_intent open;
-	} intent;
+struct nameidata2 {
+	__NAMEIDATA2
 };
 
+#undef __NAMEIDATA2
+
+static inline struct nameidata2 *ND2(struct nameidata *nd)
+{
+	return container_of(&nd->dentry, struct nameidata2, dentry);
+}
+
 struct path {
 	struct vfsmount *mnt;
 	struct dentry *dentry;
@@ -76,10 +94,9 @@ extern void ...
From: Christoph Hellwig
Date: Monday, April 16, 2007 - 9:39 am

Or better just pass argument directly.  We really
should only pass down the the dentry and the
intent to the filesystem.  The filesystem has
not business looking at mnt in the operations,
and the relevant bits of flags (mostly whether it's
O_EXCLUSIVE) should be stored in the intent, because
that's the only way it should be used.

Doing it that way also allows to fix some braindead calling conventions


or this one.

etc..
-

From: Randy Dunlap
Date: Monday, April 16, 2007 - 9:42 am

Please don't use the kernel-doc begin-marker "/**" when the comment block

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

From: Andreas Gruenbacher
Date: Monday, April 16, 2007 - 9:44 am

Sigh, kernel-doc should improve things, not get in the way ...
-

From: Randy Dunlap
Date: Monday, April 16, 2007 - 9:50 am

Well, I see it as sort of a language that is embedded in C source code.

But what are you suggesting?

-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

From: Al Viro
Date: Thursday, April 12, 2007 - 3:12 am

is a clear sign that interface is wrong.  Leaving aside the general
idiocy of "we prohibit to do something with file if mounted here,
but if there's another mountpoint, well, we just miss", an API of that
kind is just plain wrong.  Either you can live without seeing vfsmount
in that method (in which case you shouldn't pass it at all), or you
have a hole.

NACK.
-

From: Andreas Gruenbacher
Date: Wednesday, May 23, 2007 - 12:06 pm

The fundamental model of AppArmor is to perform access checks based on the 
location of a file in the filesystem namespace, i.e., the pathname, and this 
can only be done with both the dentry and vfsmount. My understanding was that 
there was agreement at the last kernel summit that pathname based approaches 

No. There are callers of vfs_create() that use a NULL nameidata, and that's 
what causes the problem here. Struct nameidata is pretty big, and so we don't 
want to allocate temporary nameidata objects all over the place. So to me the 
above NULL check seems the lesser evil.

One way to deal with the nameidata size problem is to split it up into one 
part for the real path lookups, and a much smaller part that only contains 
the dentry, vfsmount, and intent flags. This would allow to pass around the 
smaller nameidata much more consistently.

John has posted patches for that on May 14; subject [RFD Patch 0/4]. Feedback 
appreciated.

In several places, the NULL nameidata is wrong because we can't check the 
vfsmount flags or intent. Not having this information is causing problems for 

This is backwards from what AppArmor does. The policy defines which paths may 
be accessed; all paths not explicitly listed are denied. If files are mounted 
at multiple locations, then the policy may allow access to some locations but 
not to others. That's not a hole.

In fact this is not much different from traditional permissions on parent 
directories: even if the same files are mounted at several locations, parent 
directory permissions may allow accessing only some of those locations.

Thanks,
Andreas
-

From: James Morris
Date: Wednesday, May 23, 2007 - 6:28 pm

I don't know what else you'd call it.

Would you mind providing some concrete examples of how such a model would 
be useful?


- James
-- 
James Morris
<jmorris@namei.org>
-

From: Andreas Gruenbacher
Date: Thursday, May 24, 2007 - 2:16 am

AppArmor doesn't label files; it's a different model from SELinux. Its policy 
defines which processes may access which paths. Even if for some reson the 
same files were visible elsewhere, the policy wouldn't cover those other 
paths, and so accessing them would be denied. So again, that's not a security 

The model is explained, with examples, in the technical documentation at 
http://forgeftp.novell.com//apparmor/LKML_Submission-May_07/.

Thanks,
Andreas
-

From: Tetsuo Handa
Date: Thursday, May 24, 2007 - 5:51 am

Hello.

I think bind mounts were discussed when shared subtree
( http://lwn.net/Articles/159092/ ) was introduced.

For systems that allow users mount their CD/DVDs freely,
bind mounts are used and labeling files is a convenient way
to deny accessing somebody else's files.

But systems that don't allow users mount their CD/DVDs freely,
bind mounts needn't to be used and using pathnames is a convenient way
to deny accessing somebody else's files.

Pathname based access control/auditing system
works if the system doesn't use bind mounts.

However, there are distributions (e.g. Debian Etch)
that always use bind mounts. In such distributions,
pathname based access control/auditing system doesn't work.

This is not the fault of distributions nor
pathname based access control/auditing system.
It is possible to solve by passing vfsmount to VFS and LSM functions.

SELinux users are having a lot of trouble because pathnames in audit logs
are not always complete.
AppArmor users are having a lot of trouble because pathnames which
a process requested are ambiguous when bind mounts are used.

Being able to report pathnames that a process requested is not surprising
when considering user friendliness.
I beleive passing vfsmount makes both users happy.

Thanks.
-

Previous thread: [AppArmor 18/41] call lsm hook before unhashing dentry in vfs_rmdir() by jjohansen on Thursday, April 12, 2007 - 2:08 am. (1 message)

Next thread: [AppArmor 21/41] Add struct vfsmount parameters to vfs_rename() by jjohansen on Thursday, April 12, 2007 - 2:08 am. (1 message)