Hello,
I'm sending a rediffed patch implementing sending of quota messages via the
netlink interface (some rationale in patch description). I've already posted
it to LKML some time ago and there were no objections, so I guess it's fine
to put it to -mm. Andrew, would you be so kind? Thanks.
A userspace daemon that reads the messages from the kernel and sends them to
dbus and/or the user console is also written (it's part of quota-tools). The
only remaining problem is that the userspace daemon needs a few changes to
libnl. They were basically acked by the maintainer but it seems he has not
merged the patches yet. So this will take a bit more time.
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
The access to seq is racy, isn't it?
If so, that can be solved with a lock, or with atomic_add_return().
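A minimal userspace sketch of the second option, assuming C11 atomics stand in for the kernel's atomic_t and atomic_add_return(). The names quota_seq and next_seq are illustrative and are not taken from the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared message sequence counter (hypothetical name). */
static atomic_uint_fast32_t quota_seq;

/*
 * Return the next sequence number without a lock.
 * atomic_fetch_add() returns the value before the addition, so 1 is
 * added to get the value after it, which is what the kernel's
 * atomic_add_return(1, &seq) would return.
 */
static uint32_t next_seq(void)
{
	return (uint32_t)(atomic_fetch_add(&quota_seq, 1) + 1);
}
```

Two concurrent callers can never get the same number, because the read and the increment happen as one atomic step.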
-
Attached is an incremental patch solving the issues you've spotted.
Thanks for the review. The result of the discussion with namespace guys is
that the id used as an identity for filesystem operations should be fine.
If it will ever be something different than a number, we can change the
protocol which should be no problem...
Also after some more reading, I've found out that we can even easily find
out which attributes have been sent in the netlink message. So I don't see
a real reason for some versioning of the protocol - either the message has
all the attributes we are interested in and then we report it, or it does not
and then we complain that tools are too old and don't understand the
protocol...
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
Mutt knows how to send patches inline vs. attachments... :(
Anyway, on to the patch. Thanks for adding the new doc file.
+This command is used to send a notification about any of the above mentioned
+events. Each message has six attributes. These are (type of the argument is
+in braces):
s/braces/parentheses/
+ QUOTA_NL_A_QTYPE (u32)
+ - type of quota beging exceeded (one of USRQUOTA, GRPQUOTA)
s/beging/being/
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
Hmm, I thought Andrew either does not mind or prefers attachments. If
it isn't the case, I can inline patches. Andrew? BTW: I personally prefer
attachments - mutt inlines text attachments for me anyway and sometimes
Thanks for reading it. Andrew, should I resend the patch or will you
substitute it in the patch?
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
inlined is a bit better, mainly because one can reply to it and the email
I'll sort it out, thanks.
-
You're right. I've made seq an atomic_t. Thanks for spotting this.
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
So it's a new kernel->userspace interface.
This is it. Normally netlink payloads are represented as a struct. How
come this one is built-by-hand?
It doesn't appear to be versioned. Should it be?
Does it have (or need) reserved-set-to-zero space for expansion? Again,
hard to tell..
I guess it's OK to send a major and minor out of the kernel like this.
What's it for? To represent a filesystem? I wonder if there's a more
modern and useful way of describing the fs. Path to mountpoint or
something?
I suspect the namespace virtualisation guys would be interested in a new
interface which is sending current->user->uid up to userspace. uids are
per-namespace now. What are the implications? (cc's added)
Is it worth adding a comment explaining why GFP_NOFS is used here?
-
No netlink fields (unless I'm confused) are represented as a struct,
Well. If it is using netlink properly each field should have a tag.
So it should not need to be versioned, because each field is strictly
That we definitely would be. Although the user namespaces is rather
-
And could we have some description of the context under which all the message
exchanges take place. When are these messages sent out -- what event
One problem we've seen is losing notifications. It does not happen for us
due to the cpumask interface (which allows us to have parallel sockets
Have you looked at ensuring that the data structure works across 32 bit
and 64 bit systems (in terms of binary compatibility)? That's usually
The memory controller or VM would also be interested in notifications
of OOM. At OLS this year interest was shown in getting OOM notifications
and allowing user space a chance to handle the notification and take
action (especially for containers). We already have containerstats for
containers (which I was planning to reuse), but I was told that we would
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
The user is notified about either exceeding his quota softlimit or
Generic netlink should take care of this - arguments are typed so it
knows how many bits the numbers have. So this should be no issue. Are there any
Generic netlink can be used to pass this information (although in an OOM
situation, it may be a bit hairy to get the network stack working...). But
I guess it's not related to my patch.
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
Yes, but apart from that, if I remember Jamal Hadi's initial comments
on taskstats, he recommended that we align everything to 64 bit so
that the data is well aligned for 64 bit systems. You could also consider
creating a data structure, document its members, align them and use
We could have a pre-allocated buffer stored at startup and use that for
OOM notification. In the case of container OOM, we are likely to have
free global memory. Working towards an infrastructure so that anybody can
build on top of it and sending notifications on interesting events becomes
easier would be nice. We can reuse code that way and add fewer bugs :-)
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
But each attribute is just one number (either 32 or 64 bit) so there's
not much to align. Also each attribute has its netlink header so alignment
is anyway hard to predict. Finally, this is by no means performance
critical - average system using quotas may get say 1 notification per user
I don't like sending one structure - by doing that you lose the
How does generic netlink support versioning? I have not found this
feature. Looking into Documentation/accounting/taskstats.txt it seems that
taskstats are versioning only the structure taskstats itself but not the
Yes, but generic netlink itself is such an infrastructure, isn't it? It
is about 70 lines of code to implement notification for quota subsystem so
it's really simple...
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
Oops, forgotten about it. I'll write one. Do we have some standard place
where to document such interfaces? I could create some file in
I use "generic netlink", which is in fact a layer built on top of
netlink. As far as I've read its documentation, creating a message
argument by argument is the preferred way. As David writes, this way
we can add new arguments without worries about backward compatibility,
We don't need a version for future additions. Also each attribute sent
has its identifier (e.g. QUOTA_NL_A_CAUSED_ID) and userspace checks these
identifiers and unknown attributes are ignored. But in case we would like
to remove some attribute, versioning would be probably useful so that
I also find the major/minor pair a bit old-fashioned. But identifying it
by a mountpoint is problematic - quota does not care about namespaces and
such and so it works with superblocks. It's not trivial to get a mountpoint
from a superblock (and generally it's frowned upon, isn't it?). Also if a
filesystem is mounted on several places, we have to pick one (OK, userspace
I know there's something going on in this area but I don't know any
details. If somebody has some advice what should be passed into userspace
Probably yes. Added.
Thanks for all your comments.
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
For non networking stuff netlink is a pain to use in this area.
Although if we are very careful we may be ok. But this requires
some thinking through.
In principle the uid that corresponds to a struct user depends
on which user namespace you are in.
Now there is a cheap trick we can play. A traditional filesystem
belongs to exactly one user namespace. So we can return the uid
in the filesystem's user namespace.
Wait, you are returning current->user->uid? Shouldn't we return
the user whose quota is exceeded? I.e. if alice owns a file
and makes it world writable, and bob writes to the file, wouldn't
that file still be billed to alice's quota? So shouldn't we complain
about alice and not bob?
Anyway if the goal is to return a user who maps to the filesystem we
can just always return uids in the filesystem's uid namespace.
Although if filesystems start supporting multiple user namespaces
natively we might have a challenge on our hands.
Let me see if I can think of a concrete example here.
We have a nfs server with quotas.
We have clients who mount the nfs filesystem without synchronizing
their /etc/passwd files, so we have separate user namespaces.
What are the ways to make this work?
- Everyone who has write access to the NFS mount on all
  machines must have their uid synchronized across all machines
  (the easiest case).
- Each different kernel has a mapping from its local uids to
  the uids of the nfs filesystem. (ick if we do much more than
  root squash).
- The nfs filesystem knows about the situation and remembers the
  uid source (the uid namespace) as well as the uid when storing
  owners of files. NFSv4 allows for this by treating users
  as user@domain.
Generally synchronizing uid namespaces (with possibly a root squash
exception) is the sanest and simplest thing to do in a case like this,
but it isn't always what is done.
As long as we are returning the filesystem's idea of users we
shouldn't have to worry much about uid namespaces....
Yes, the quota will still be billed to Alice and originally we complained
only about Alice. Now, we are actually passing identities of two users: The
one who actually caused the quota to be exceeded and the one whose quota is
exceeded. Userspace app can then decide what to do with the information...
For example it makes sense to display the message to both Alice and Bob in
OK, quota kind of works for NFSv4 - we simply enforce quotas on the
server on a traditional filesystem and there are some RPC calls to get
quota status. For 9p, it does not work. But we should probably design the
interface generic enough so that it accommodates those untraditional
I see it's a complicated matter :). What I need to somehow pass to
userspace is something (and I don't really care whether it will be number,
string or whatever) that userspace can read and e.g. find a terminal
window or desktop the affected user has open and also translate the
identity to some user-understandable name (average user Joe has to
understand that he should quickly clean up his home directory ;).
Thinking more about it, we could probably pass a string to userspace in
the format:
<namespace type>:<user identification>
So for example we can have something like:
unix:1000 (traditional unix UIDs)
nfs4:joe@machine
The problem is: Are we able to find out in which "namespace type" we are
and send enough identifying information from the context of an unprivileged
user?
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
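For illustration, a userspace listener could split an identity string of the proposed "<namespace type>:<user identification>" form as sketched below. The function name parse_identity and the buffer-based interface are assumptions made for this example; nothing like it exists in the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Split "<namespace type>:<user identification>" at the first ':'.
 * The two parts are copied into ns and id as NUL-terminated strings.
 * Returns 0 on success, -1 if there is no ':', if either part is
 * empty, or if a part does not fit into its buffer.
 */
static int parse_identity(const char *s, char *ns, size_t ns_len,
			  char *id, size_t id_len)
{
	const char *colon = strchr(s, ':');
	size_t n;

	if (!colon)
		return -1;
	n = (size_t)(colon - s);
	if (n == 0 || n >= ns_len)
		return -1;
	if (colon[1] == '\0' || strlen(colon + 1) >= id_len)
		return -1;
	memcpy(ns, s, n);
	ns[n] = '\0';
	strcpy(id, colon + 1);	/* length was checked above */
	return 0;
}
```

Splitting at the first ':' keeps identifiers such as joe@machine intact and leaves their interpretation to whatever handles that namespace type.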
-
Ok. This provides enough context to understand what you are trying to do.
You do want the unix user id, not the filesystem notion. Because you
are looking for the user.
So we have to figure out how to do the hard thing, which is look at
who opened our netlink broadcast and see if they are in the same user
namespace as current->user. Which is a pain and we don't currently
have the infrastructure for.
Eric
-
Provision also needs to be made for things that are listening to the
netlink broadcasts that don't match the user doing the operation or
the owner of the file - similar to the way auditd wants events.
There can be arbitrary number of listeners (potentially from different
namespaces if I understand it correctly) listening to broadcasts. So I
think we should pass some universal identifier rather than try to find out
who is listening etc. I think such identifiers would be useful for other
things too, won't they?
BTW: Do you have some idea when the infrastructure would be clearer?
Does it make sense to proceed with UIDs for now and later change it
to something generic, or should I wait until you sort it out :).
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
So internal to the kernel we have such a universal identifier.
struct user.
There are two practical questions.
1) How do we present that information to user space?
2) How does user space want to process this information?
If we only want user space to be able to look up a user and send
him a message, it probably makes sense to do the struct user to
uid conversion in the proper context in the kernel because we have
that information.
If this is a general feature that happens to allow us to look up
the user, then giving the filesystem's view of what is going on would be
easier in the kernel, and would not require translation. But it means
that we can't support 9p and nfs for now. But since we don't support
quotas on the client end anyway that doesn't sound like a big deal.
The problem with the filesystem view is that there will be occasions
where we simply can not map a user into it, because the filesystem
won't have a concept of that particular user.
So we could run into the situation where alice owns the file. Bob
writes to the file and pushes it over quota. But the filesystem
has no concept of who bob is. So we won't be able to report that
So the plan is to get to the point where all uid comparisons in the
kernel are (user namespace, uid) comparisons. Or possibly struct
user comparisons (depending on the context). And struct mount will
contain the user namespace of whoever mounted the filesystem.
Adding infrastructure to netlink to allow us to do conversions
as the packets are enqueued for a specific user is something I
would rather avoid, but that is a path we can go down if we have
A good question. I think things are clear enough that it at least
makes sense to sketch a solution to the problem even if we don't
implement it at this point.
I have been hoping Cedric or Serge would jump in because I think those
are the guys who have been working on the implementation.
Eric
-
Just fyi Eric,
Note that given the amount of churn going on due to pid and network
namespaces, I was seeing completion of user namespaces as something to
be done sometime next year. In the meantime I was only going to do
something with capabilities to restrict root in user namespaces (which I
think will take the form of per-process non-expandable cap_bsets, which
I plan to start basically right now).
But I'll gladly do the userns enhancements earlier if it's actually
wanted. They promise to be great fun :)
-
Sorry, I've lost the original patch from two separate mailboxes...
The proper behavior depends on how we end up tying filesystems to user
namespaces, which isn't actually decided yet.
The way I was recommending doing that was:
A filesystem is tied to a user namespace. If a uid in another namespace
is to be allowed to access the filesystem, it will actually - through a
key in its keyring (which acts like a capability) - be mapped to a uid
in the filesystem's uid namespace. So in Eric's example, if Alice
brings Bob over quota, Alice would have done so through some user
Charlie who she is authorized to act as through her keyring. So Charlie
should be the id which would be logged over netlink.
Of course there is currently no support for this. So I'd recommend one
of two options: either just punt on uid namespace for now and we'll fix
it when we improve user namespaces - so log Alice's userid. Or we can
try to do it somewhat correctly now, which might be done as follows:
1. introduce get_uid_in_userns(tsk). For now this just returns
   tsk->uid if current->userns == tsk->userns, else it returns
   0.
   This way in Eric's scenario, Bob would be told that root,
   not an invalid user (Alice) had brought him over quota.
   Eventually, this would walk tsk's keychain for a uid entry
   in current's active user namespace.
2. Add the userns to the netlink message.
Again I need to find Jan's original patch, but I'll take a look at this.
-serge
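A rough userspace model of step 1, only to make the proposed fallback concrete. The struct layouts are simplified stand-ins invented for this sketch, and since there is no `current` outside the kernel, the observing task is passed explicitly as cur.

```c
/* Simplified stand-ins for kernel structures; both are hypothetical. */
struct user_ns {
	int id;
};

struct task {
	unsigned int uid;
	const struct user_ns *userns;
};

/*
 * Report tsk's uid as seen from cur's user namespace.
 * Only the same-namespace case is handled for now; across namespaces
 * the function falls back to 0 (root), as proposed above. A later
 * version would look up a uid entry valid in cur's namespace instead.
 */
static unsigned int get_uid_in_userns(const struct task *cur,
				      const struct task *tsk)
{
	if (cur->userns == tsk->userns)
		return tsk->uid;
	return 0;
}
```

The namespaces are compared by pointer identity, mirroring the current->userns == tsk->userns test in the description.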
-
Currently that is true, but I think isolating netlink sockets is going
to have to be done pretty soon.
On the one hand cloning a new netlink socket ns when you unshare
CLONE_NEWNET may seem 'obvious', but I think doing so when you unshare
CLONE_NEWUSER makes much more sense considering netlink's use for audit
Even with isolating netlink we still may want to send out an identifier.
However, just as with mounts extensions we're printing out the memory
address of vfsmounts, we might just want to print out the memory address
of the userns. It's not universal, but should be good enough.
-
Maybe before proceeding further with the discussion I'd like to
understand the following: What are these user namespaces supposed to be
good for?
I imagine it so that you have a machine and on it several virtual
machines which are sharing a filesystem (or it could be a cluster). Now you
want UIDs to be independent between these virtual machines. That's it,
right?
Now to continue the example: Alice has UID 100 on machineA, Bob has
UID 100 on machineB. These translate to UIDs 1000 and 1001 on the common
filesystem. A process of Alice's writes to a file and Bob goes over
quota. In this situation, there would probably be two processes (from
machineA and machineB) listening on the netlink socket. We want to send a
message so that on Alice's desktop we can show a message: "You caused
Bob to exceed his quotas" and on Bob's desktop: "Alice has caused you to
go over quota.".
Because there may not be a notion of Bob on machineA or of Alice on
machineB, we are in trouble, right? What I like the most is to use the
filesystem identities (as you suggested in some other email). I.e. because
both Alice and Bob share a filesystem, identities of both have to make sense
to it (for example for purposes of permission checking). So we can probably
send these via netlink (in our example ids 1000 and 1001) and hope that
inside machineA and machineB there will be a way to translate these
identities to names "Alice" and "Bob", so that the user can understand what
is happening. Does this sound plausible?
If we go this route, then we only need a kernel function that will,
for a pair ($filesystem, $task), return the identity of that $task used
for operations on $filesystem...
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
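To make the Alice/Bob example concrete, here is a small userspace sketch of such a lookup. It assumes the translation is a plain table keyed by (machine or namespace id, local uid); the table layout and the name fs_identity are invented for illustration, and how the kernel would really store this was still open in the discussion.

```c
#include <stddef.h>

/*
 * One translation entry: a uid local to some machine/namespace and
 * the uid the shared filesystem uses for it. Hypothetical layout.
 */
struct uid_map_entry {
	int ns_id;
	unsigned int local_uid;
	unsigned int fs_uid;
};

/*
 * Find the filesystem identity for (ns_id, local_uid).
 * Returns 0 and stores the result in *fs_uid, or -1 if the
 * filesystem has no notion of that user.
 */
static int fs_identity(const struct uid_map_entry *map, size_t n,
		       int ns_id, unsigned int local_uid,
		       unsigned int *fs_uid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (map[i].ns_id == ns_id && map[i].local_uid == local_uid) {
			*fs_uid = map[i].fs_uid;
			return 0;
		}
	}
	return -1;
}
```

With entries for (machineA, 100) and (machineB, 100), the two local UID 100 users resolve to the distinct filesystem ids 1000 and 1001 from the example.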
-
(Please skip to the message end first, as I think you may not care about
the next bit of my blathering)
Right now they are only good for providing some separate accounting for
uid 1000 in one user namespace versus uid 1000 in another namespace.
All security enforcement must be done by actually providing separate
filesystems and separate pid namespaces and, hopefully, with a selinux
policy.
Eventually the idea will be that uid 1000 in one user namespace and uid
1000 in another namespace will be completely separate entities. A
mounted filesystem will be tied to a particular user namespace, and
the kernel will provide any cross-userns access perhaps the way I
described, with uid equivalence implemented through the keyring.
But note that this isn't really relevant when we get to NFS. Two user
namespaces on one machine should have different network namespaces and
network addresses as well, and so should look to the NFS server like two
separate machines.
So the user namespaces are only really relevant when talking about local
Since this is over NFS, you handle it the way you would any other time
Right, so long as we're talking about local filesystems that's the way
to go. If a file write was allowed which brought bob over quota,
clearly the person responsible had some uid valid on the filesystem to
Ok, now I see. This is again unrelated to user namespaces, it's an
issue regardless.
Is there no way to just report Alice as the guilty party to Bob on his
machine as (host=nfsserver,uid=1000)?
-serge
-
I meant this would actually happen over a local filesystem (imagine
Fine. So I'll keep UID in the quota netlink protocol with the meaning
You know, in fact this contains all the information but it is quite useless
for an ordinary user. The message should be understandable to average desktop
user so it should contain some name rather than UID - but resolving the
"filesystem" UID to some meaningful name is a completely different issue
and I'd probably leave that for the moment when the kernel infrastructure
and use cases would be clearer...
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
Ok, then that is where I was previously suggesting that we use an api to
report a uid meaningful in bob's context, where we currently (in the
absence of meaningful mount uids and uid equivalence) tell Bob that root
was the one who brought him over quota. From a user pov 'nobody' would
make more sense, but I don't think we want the kernel to know about user
nobody, right?
So if the msg weren't broadcast, or netlink sockets were tied to one
user namespace, we could call a
int uid_in_user_ns(struct user *, struct user_ns *)
sending in Alice's user struct and Bob's userns, and use the result in
the netlink message. Otherwise I'm not sure what is the right answer.
We just might need the equivalent of 'struct pid' to struct user, or
persistent global user namespace ids (persistent after user namespace
destruction, not across reboot) so we can safely send the user_ns * in a
I think that's ok.
Hopefully when that changes to accommodate user namespaces, we can use
netlink field versioning to make that transition pretty seamless?
If not, then we probably should in fact make some decision now so as not
What is the ordinary user going to do about it? If the user didn't set
up the nfsserver and/or the second client, the only thing he can do is
report the guilty user to an admin. In which case the tuple
thanks,
-serge
-
But what is the problem with using the filesystem ids? All virtual
Yes, we'd just assign the attribute a different number and teach
Maybe write him an email or go and bang him with a baseball bat ;)
Seriously, if someone (like admin) is able to find a physical identity of the
guilty user, then we should be able to do this in software too, shouldn't
we?
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
I don't know what you mean by filesystem ids. Do you mean the uid
stored on the fs? I imagine a network fs could get fancy and store
something more detailed than the unix uid, based on the user's keys.
Do you mean the inode->i_uid? Nothing wrong with that. Then we just
assume that either you are in the superblock or mount's user namespace
(depending on how we implement it, probably superblock), or can figure
Sure, and in many ways. But if working with NFS, as far as I know the
most common way to solve it is to enforce a common /etc/passwd across
all the valid NFS clients :)
-serge
-
I meant the identity the process uses to access the filesystem (to
identify the user who caused the limit excess) and also the identity stored
in the quota file (to identify whose quota was exceeded).
Anyway, any identity more complicated than just a number needs changes in
both quota file format and filesystems so at that moment, we can also
Then one wonders whether user namespaces are really what users want ;).
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
Absolutely.
You use nfs to share filesystems among separate machines that you want
to have look similar.
You use user namespaces to pretend one machine is a bunch of separate
machines. So if you're just going to split up your machine into 5
vms and then have them all share disk over nfs, you may just want to
keep it as one machine :)
Ideally each vm would have completely separate disk space, so file
access across user namespaces wouldn't happen. More realistically,
file trees will be shared read-only - i.e. /lib, /usr, etc. Some of
that can be handled simply using read-only bind mounts. We'd like
to allow users to create vm's as well, so then we want uid 500 in
the initial user namespace to be uid 0 in a newly created user
namespace.
So what Eric and I are worried about are corner cases and admin
mistakes, not regular function.
(And again I really do think we'll want to tie netlink sockets to a user
namespace, not a network namespace, so there may be no issue at all
so long as proper filesystem access checks are implemented so that every
action on some filesystem is done with credentials valid in that
filesystem's user namespace)
-serge
-
It looks like other quota documentation is in Documentation/filesystems/,
and that seems reasonable to me for the other quota docs & this one.
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
From: Andrew Morton <akpm@linux-foundation.org>
He is using attributes, which is perfect and arbitrarily
extensible with zero backwards compatibility concerns.
If he wants to provide a new attribute, he just adds it
without any issues.
When new attributes are added, older apps simply ignore the attributes
they don't understand.
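A self-contained sketch of why this works, modeled loosely on netlink's length/type attribute header with 4-byte padding. It does not use the kernel or libnl API, and the attribute numbers below are made up for the example rather than taken from the patch. The parser skips any type it does not know and returns a bitmask of the known attributes it found, so a tool can also tell whether everything it needs was present.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attribute header: total length (header + payload) and a type tag. */
struct attr_hdr {
	uint16_t len;
	uint16_t type;
};

/* Attributes are padded to a multiple of 4 bytes. */
#define ATTR_ALIGN(n) (((n) + 3u) & ~3u)

/* Hypothetical attribute numbers, not the values used by the patch. */
enum { ATTR_QTYPE = 1, ATTR_CAUSED_ID = 2, ATTR_MAX = 2 };

/* Append one u32 attribute; the caller must provide 8 bytes of room. */
static size_t put_u32(unsigned char *buf, size_t off, uint16_t type,
		      uint32_t v)
{
	struct attr_hdr h;

	h.len = (uint16_t)(sizeof(struct attr_hdr) + sizeof(v));
	h.type = type;
	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + sizeof(h), &v, sizeof(v));
	return off + ATTR_ALIGN(sizeof(h) + sizeof(v));
}

/*
 * Walk the attributes in buf, storing the u32 payload of each known
 * type in vals[type]. Unknown types are skipped. Returns a bitmask
 * of the known types seen, or -1 if the buffer is malformed.
 */
static int parse_attrs(const unsigned char *buf, size_t buflen,
		       uint32_t vals[ATTR_MAX + 1])
{
	int seen = 0;
	size_t off = 0;

	while (off < buflen) {
		struct attr_hdr h;
		size_t step;

		if (buflen - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > buflen - off)
			return -1;
		if (h.type >= 1 && h.type <= ATTR_MAX &&
		    h.len == sizeof(h) + sizeof(uint32_t)) {
			memcpy(&vals[h.type], buf + off + sizeof(h),
			       sizeof(uint32_t));
			seen |= 1 << h.type;
		}
		step = ATTR_ALIGN(h.len);
		if (step > buflen - off)	/* last attribute may lack padding */
			step = buflen - off;
		off += step;
	}
	return seen;
}
```

A message carrying an extra attribute with an unknown number parses to the same result as one without it, which is the backward compatibility property described above.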
-
