Hello, I'm sending a rediffed patch implementing sending of quota messages via the netlink interface (some rationale in patch description). I posted it to LKML some time ago and there were no objections, so I guess it's fine to put it to -mm. Andrew, would you be so kind? Thanks. A userspace daemon reading the messages from the kernel and sending them to dbus and/or user console is also written (it's part of quota-tools). The only remaining problem is that a few changes to libnl are needed for the userspace daemon. They were basically acked by the maintainer but it seems he has not merged the patches yet. So this will take a bit more time. Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs
So it's a new kernel->userspace interface. This is it. Normally netlink payloads are represented as a struct. How come this one is built-by-hand? It doesn't appear to be versioned. Should it be? Does it have (or need) reserved-set-to-zero space for expansion? Again, hard to tell. I guess it's OK to send a major and minor out of the kernel like this. What's it for? To represent a filesystem? I wonder if there's a more modern and useful way of describing the fs. Path to mountpoint or something? I suspect the namespace virtualisation guys would be interested in a new interface which is sending current->user->uid up to userspace. uids are per-namespace now. What are the implications? (cc's added) Is it worth adding a comment explaining why GFP_NOFS is used here? -
From: Andrew Morton <akpm@linux-foundation.org> He is using attributes, which is perfect and arbitrarily extensible with zero backwards compatibility concerns. If he wants to provide a new attribute, he just adds it without any issues. When new attributes are added, older apps simply ignore the attributes they don't understand. -
Oops, forgotten about it. I'll write one. Do we have some standard place where to document such interfaces? I could create some file in I use "generic netlink", which is in fact a layer built on top of netlink. As far as I've read its documentation, creating a message argument by argument is the preferred way. As David writes, this way we can add new arguments without worries about backward compatibility. We don't need a version for future additions. Also each attribute sent has its identifier (e.g. QUOTA_NL_A_CAUSED_ID) and userspace checks these identifiers and unknown attributes are ignored. But in case we would like to remove some attribute, versioning would be probably useful so that I also find major/minor pair a bit old-fashioned. But identifying it by a mountpoint is problematic - quota does not care about namespaces and such and so it works with superblocks. It's not trivial to get a mountpoint from a superblock (and generally it's frowned upon, isn't it?). Also if a filesystem is mounted on several places, we have to pick one (OK, userspace I know there's something going on in this area but I don't know any details. If somebody has some advice what should be passed into userspace Probably yes. Added. Thanks for all your comments. Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
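To illustrate the point about attributes made above, here is a minimal userspace sketch of a tagged, length-prefixed attribute stream in which the reader skips attribute types it does not know. It is not the code from the patch and it does not use libnl or the kernel's netlink helpers: the `demo_*` names, the attribute numbers, and the simplified 4-byte header are assumptions made for illustration. Only the attribute names QUOTA_NL_A_QTYPE and QUOTA_NL_A_CAUSED_ID come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute numbers, loosely modelled on QUOTA_NL_A_QTYPE and
 * QUOTA_NL_A_CAUSED_ID; the real values are not given in the thread. */
enum { DEMO_A_UNSPEC = 0, DEMO_A_QTYPE = 1, DEMO_A_CAUSED_ID = 2, DEMO_A_MAX = 2 };

/* Simplified attribute header: len covers header plus payload. */
struct demo_attr_hdr {
    uint16_t len;
    uint16_t type;
};

/* Round up to a multiple of 4 bytes. */
#define DEMO_ALIGN(n) (((size_t)(n) + 3) & ~(size_t)3)

/* Append one attribute at offset off; returns the new offset, or off
 * unchanged if the attribute does not fit into the buffer. */
static size_t demo_put(uint8_t *buf, size_t cap, size_t off,
                       uint16_t type, const void *data, uint16_t dlen)
{
    struct demo_attr_hdr h;
    size_t used = sizeof h + dlen;
    size_t total = DEMO_ALIGN(used);

    if (off > cap || total > cap - off)
        return off;
    h.len = (uint16_t)used;
    h.type = type;
    memcpy(buf + off, &h, sizeof h);
    memcpy(buf + off + sizeof h, data, dlen);
    memset(buf + off + used, 0, total - used);   /* zero the padding */
    return off + total;
}

/* Fill tb[type] with a pointer to the payload of every known attribute.
 * Unknown types (e.g. added by a newer sender) are skipped silently.
 * Returns 0 on success, -1 if an attribute header is malformed. */
static int demo_parse(const uint8_t *buf, size_t len,
                      const uint8_t *tb[DEMO_A_MAX + 1])
{
    size_t off = 0;

    for (int i = 0; i <= DEMO_A_MAX; i++)
        tb[i] = NULL;
    while (len - off >= sizeof(struct demo_attr_hdr)) {
        struct demo_attr_hdr h;
        size_t step;

        memcpy(&h, buf + off, sizeof h);
        if (h.len < sizeof h || h.len > len - off)
            return -1;
        if (h.type <= DEMO_A_MAX)
            tb[h.type] = buf + off + sizeof h;
        step = DEMO_ALIGN(h.len);
        if (step > len - off)
            break;                               /* last attribute, no padding */
        off += step;
    }
    return 0;
}
```

A sender that adds a new attribute type only changes what `demo_put` is called with; an older reader built with a smaller `DEMO_A_MAX` still finds the attributes it knows.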
It looks like other quota documentation is in Documentation/filesystems/, and that seems reasonable to me for the other quota docs & this one. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** -
For non networking stuff netlink is a pain to use in this area. Although if we are very careful we may be ok. But this requires some thinking through. In principle the uid that corresponds to a struct user depends on which user namespace you are in. Now there is a cheap trick we can play. A traditional filesystem belongs to exactly one user namespace. So we can return the uid in the filesystem's user namespace. Wait, you are returning current->user->uid? Shouldn't we return the user whose quota is exceeded? I.e. if alice owns a file and makes it world writable, and bob writes to the file, wouldn't that file still be billed to alice's quota? So shouldn't we complain about alice and not bob? Anyway if the goal is to return a user who maps to the filesystem we can just always return uids in the filesystem's uid namespace. Although if filesystems start supporting multiple user namespaces natively we might have a challenge on our hands. Let me see if I can think of a concrete example here. We have an nfs server with quotas. We have clients who mount the nfs filesystem without synchronizing their /etc/passwd files, so we have separate user namespaces. What are the ways to make this work?
- Everyone who has write access to the NFS mount on all machines must have their uid synchronized across all machines (the easiest case).
- Each different kernel has a mapping from its local uids to the uids of the nfs filesystem. (ick if we do much more than the root squash).
- The nfs filesystem knows about the situation and remembers the uid source (the uid namespace) as well as the uid when storing owners of files. NFSv4 allows for this by treating users as user@domain.
Generally synchronizing uid namespaces (with possibly a root squash exception) is the sanest and simplest thing to do in a case like this, but it isn't always what is done. As long as we are returning the filesystem's idea of users we shouldn't have to worry much about uid namespaces. ...
Yes, the quota will still be billed to Alice and originally we complained only about Alice. Now, we are actually passing identities of two users: The one who actually caused the quota to be exceeded and the one whose quota is exceeded. Userspace app can then decide what to do with the information... For example it makes sense to display the message to both Alice and Bob in OK, quota kind of works for NFSv4 - we simply enforce quotas on the server on a traditional filesystem and there are some RPC calls to get quota status. For 9p, it does not work. But we should probably design the interface generic enough so that it accommodates those untraditional I see it's a complicated matter :). What I need to somehow pass to userspace is something (and I don't really care whether it will be number, string or whatever) that userspace can read and e.g. find a terminal window or desktop the affected user has open and also translate the identity to some user-understandable name (average user Joe has to understand that he should quickly clean up his home directory ;). Thinking more about it, we could probably pass a string to userspace in the format: <namespace type>:<user identification> So for example we can have something like: unix:1000 (traditional unix UIDs) nfs4:joe@machine The problem is: Are we able to find out in which "namespace type" we are and send enough identifying information from the context of an unprivileged user? Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
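As a rough illustration of the proposed string format, the following userspace sketch splits an identity such as unix:1000 or nfs4:joe@machine at the first colon. The struct, the function name, and the buffer sizes are hypothetical; the thread only proposes the format itself and never settles on it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for a parsed identity; the field sizes are arbitrary. */
struct demo_identity {
    char ns_type[16];   /* e.g. "unix" or "nfs4" */
    char user[64];      /* e.g. "1000" or "joe@machine" */
};

/* Split "<namespace type>:<user identification>" at the first ':'.
 * Returns 0 on success, -1 if a part is missing or does not fit. */
static int demo_parse_identity(const char *s, struct demo_identity *out)
{
    const char *colon = strchr(s, ':');
    size_t ns_len, user_len;

    if (colon == NULL)
        return -1;
    ns_len = (size_t)(colon - s);
    user_len = strlen(colon + 1);
    if (ns_len == 0 || user_len == 0)
        return -1;
    if (ns_len >= sizeof out->ns_type || user_len >= sizeof out->user)
        return -1;
    memcpy(out->ns_type, s, ns_len);
    out->ns_type[ns_len] = '\0';
    memcpy(out->user, colon + 1, user_len + 1);   /* includes the trailing NUL */
    return 0;
}
```

Splitting at the first colon leaves the user part free to contain further punctuation, which matters for identities like joe@machine.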
Ok. This provides enough context to understand what you are trying to do. You do want the unix user id, not the filesystem notion. Because you are looking for the user. So we have to figure out how to do the hard thing which is look at who opened our netlink broadcast see if they are in the same user namespace as current->user. Which is a pain and we don't currently have the infrastructure for. Eric -
There can be an arbitrary number of listeners (potentially from different namespaces if I understand it correctly) listening to broadcasts. So I think we should pass some universal identifier rather than try to find out who is listening etc. I think such identifiers would be useful for other things too, won't they? BTW: Do you have some idea when the infrastructure would be clearer? Whether it makes sense to currently proceed with UIDs and later change it to something generic or whether I should wait before you sort it out :). Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
Currently that is true, but I think isolating netlink sockets is going to have to be done pretty soon. On the one hand cloning a new netlink socket ns when you unshare CLONE_NEWNET may seem 'obvious', but I think doing so when you unshare CLONE_NEWUSER makes much more sense considering netlink's use for audit Even with isolating netlink we still may want to send out an identifier. However, just as with mounts extensions we're printing out the memory address of vfsmounts, we might just want to print out the memory address of the userns. It's not universal, but should be good enough. -
Maybe before proceeding further with the discussion I'd like to understand the following: What are these user namespaces supposed to be good for? I imagine it so that you have a machine and on it several virtual machines which are sharing a filesystem (or it could be a cluster). Now you want UIDs to be independent between these virtual machines. That's it, right? Now to continue the example: Alice has UID 100 on machineA, Bob has UID 100 on machineB. These translate to UIDs 1000 and 1001 on the common filesystem. A process of Alice writes to a file and Bob goes over quota. In this situation, there would be probably two processes (from machineA and machineB) listening on the netlink socket. We want to send a message so that on Alice's desktop we can show a message: "You caused Bob to exceed his quotas" and on Bob's desktop: "Alice has caused that you are over quota.". Because there may not be a notion of Bob on machineA or of Alice on machineB, we are in trouble, right? What I like the most is to use the filesystem identities (as you suggested in some other email). I.e. because both Alice and Bob share a filesystem, identities of both have to make sense to it (for example for purposes of permission checking). So we can probably send via netlink these (in our example ids 1000 and 1001) and hope that inside machineA and machineB there will be a way to translate these identities to names "Alice" and "Bob", so that the user can understand what is happening. Does this sound plausible? If we go this route, then we only need a kernel function that will, for a pair ($filesystem, $task), return the identity of that $task used for operations on $filesystem... Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
(Please skip to the message end first, as I think you may not care about the next bit of my blathering) Right now they are only good for providing some separate accounting for uid 1000 in one user namespace versus uid 1000 in another namespace. All security enforcement must be done by actually providing separate filesystems and separate pid namespaces and, hopefully, with a selinux policy. Eventually the idea will be that uid 1000 in one user namespace and uid 1000 in another namespace will be completely separate entities. A mounted filesystem will be tied to a particular user namespace, and the kernel will provide any cross-userns access perhaps the way I described, with uid equivalence implemented through the keyring. But note that this isn't really relevant when we get to NFS. Two user namespaces on one machine should have different network namespaces and network addresses as well, and so should look to the NFS server like two separate machines. So the user namespaces are only really relevant when talking about local Since this is over NFS, you handle it the way you would any other time Right, so long as we're talking about local filesystems that's the way to go. If a file write was allowed which brought bob over quota, clearly the person responsible had some uid valid on the filesystem to Ok, now I see. This is again unrelated to user namespaces, it's an issue regardless. Is there no way to just report Alice as the guilty party to Bob on his machine as (host=nfsserver,uid=1000)? -serge -
I meant this would actually happen over a local filesystem (imagine Fine. So I'll keep UID in the quota netlink protocol with the meaning You know, in fact this contains all the information but it is quite useless for an ordinary user. The message should be understandable to average desktop user so it should contain some name rather than UID - but resolving the "filesystem" UID to some meaningful name is completely different issue and I'd probably leave that for the moment when the kernel infrastructure and use cases would be clearer... Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
Ok, then that is where I was previously suggesting that we use an api to report a uid meaningful in bob's context, where we currently (in the absence of meaningful mount uids and uid equivalence) tell Bob that root was the one who brought him over quota. From a user pov 'nobody' would make more sense, but I don't think we want the kernel to know about user nobody, right? So if the msg weren't broadcast, or netlink sockets were tied to one user namespace, we could call a int uid_in_user_ns(struct user *, struct user_ns *) sending in Alice's user struct and Bob's userns, and use the result in the netlink message. Otherwise I'm not sure what is the right answer. We just might need the equivalent of 'struct pid' to struct user, or persistent global user namespace ids (persistent after user namespace destruction, not across reboot) so we can safely send the user_ns * in a I think that's ok. Hopefully when that changes to accommodate user namespaces, we can use netlink field versioning to make that transition pretty seamless? If not, then we probably should in fact make some decision now so as not What is the ordinary user going to do about it? If the user didn't set up the nfsserver and/or the second client, the only thing he can do is report the guilty user to an admin. In which case the tuple thanks, -serge -
But what is the problem with using the filesystem ids? All virtual Yes, we'd just assign the attribute a different number and teach Maybe write him an email or go and bang him with a baseball bat ;) Seriously, if someone (like admin) is able to find a physical identity of the guilty user, then we should be able to do this in a software too, shouldn't we? Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
I don't know what you mean by filesystem ids. Do you mean the uid stored on the fs? I imagine a network fs could get fancy and store something more detailed than the unix uid, based on the user's keys. Do you mean the inode->i_uid? Nothing wrong with that. Then we just assume that either you are in the superblock or mount's user namespace (depending on how we implement it, probably superblock), or can figure Sure, and in many ways. But if working with NFS, as far as I know the most common way to solve it is to enforce a common /etc/passwd across all the valid NFS clients :) -serge -
I meant the identity the process uses to access the filesystem (to identify the user who caused the limit excess) and also the identity stored in the quota file (to identify whose quota was exceeded). Anyway, any identity more complicated than just a number needs changes in both quota file format and filesystems so at that moment, we can also Then one wonders whether user namespaces are really what users want ;). Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
Absolutely. You use nfs to share filesystems among separate machines that you want to have look similar. You use user namespaces to pretend one machine is a bunch of separate machines. So if you're just going to split up your machine into 5 vms and then have them all share disk over nfs, you may just want to keep it as one machine :) Ideally each vm would have completely separate disk space, so file access across user namespaces wouldn't happen. More realistically, file trees will be shared read-only - i.e. /lib, /usr, etc. Some of that can be handled simply using read-only bind mounts. We'd like to allow users to create vm's as well, so then we want uid 500 in the initial user namespace to be uid 0 in a newly created user namespace. So what Eric and I are worried about are corner cases and admin mistakes, not regular function. (And again I really do think we'll want to tie netlink sockets to a user namespace, not a network namespace, so there may be no issue at all so long as proper filesystem access checks are implemented so that every action on some filesystem is done with credentials valid in that filesystem's user namespace) -serge -
So internal to the kernel we have such a universal identifier: struct user. There are two practical questions. 1) How do we present that information to user space? 2) How does user space want to process this information? If we only want user space to be able to look up a user and send him a message, it probably makes sense to do the struct user to uid conversion in the proper context in the kernel because we have that information. If this is a general feature that happens to allow us to look up the user, giving the filesystem's view of what is going on would be easier in the kernel, and not require translation. But it means that we can't support 9p and nfs for now. But since we don't support quotas on the client end anyway that doesn't sound like a big deal. The problem with the filesystem view is that there will be occasions where we simply can not map a user into it, because the filesystem won't have a concept of that particular user. So we could run into the situation where alice owns the file. Bob writes to the file and pushes it over quota. But the filesystem has no concept of who bob is. So we won't be able to report that So the plan is to get to the point where all uid comparisons in the kernel are (user namespace, uid) comparisons. Or possibly struct user comparisons (depending on the context). And struct mount will contain the user namespace of whoever mounted the filesystem. Adding infrastructure to netlink to allow us to do conversions as the packets are enqueued for a specific user is something I would rather avoid, but that is a path we can go down if we have A good question. I think things are clear enough that it at least makes sense to sketch a solution to the problem even if we don't implement it at this point. I have been hoping Cedric or Serge would jump in because I think those are the guys who have been working on the implementation. Eric -
Sorry, I've lost the original patch from two separate mailboxes... The proper behavior depends on how we end up tying filesystems to user namespaces, which isn't actually decided yet. The way I was recommending doing that was: A filesystem is tied to a user namespace. If a uid in another namespace is to be allowed to access the filesystem, it will actually - through a key in its keyring (which acts like a capability) - be mapped to a uid in the filesystem's uid namespace. So in Eric's example, if Alice brings Bob over quota, Alice would have done so through some user Charlie who she is authorized to act as through her keyring. So Charlie should be the id which would be logged over netlink. Of course there is currently no support for this. So I'd recommend one of two options: either just punt on uid namespace for now and we'll fix it when we improve user namespaces - so log Alice's userid. Or we can try to do it somewhat correctly now, which might be done as follows: 1. introduce get_uid_in_userns(tsk). For now this just returns tsk->uid if current->userns == tsk->userns, else it returns 0. This way in Eric's scenario, Bob would be told that root, not an invalid user (Alice) had brought him over quota. Eventually, this would walk tsk's keychain for a uid entry in current's active user namespace. 2. Add the userns to the netlink message. Again I need to find Jan's original patch, but I'll take a look at this. -serge -
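The interim behaviour described in step 1 can be mocked up in userspace as below. This is only a sketch of the stated rule (same user namespace: report the uid, otherwise report 0); the `demo_*` structs are stand-ins, not kernel types, and since there is no implicit `current` outside the kernel the observing task is passed explicitly. The keyring walk mentioned as a later extension is not modelled.

```c
#include <assert.h>

/* Stand-ins for kernel objects; both types are hypothetical. */
struct demo_userns {
    int id;
};

struct demo_task {
    unsigned int uid;
    const struct demo_userns *userns;
};

/* Mock of the proposed get_uid_in_userns(tsk): report tsk's uid only when
 * the observer shares its user namespace, otherwise fall back to 0 (root). */
static unsigned int demo_get_uid_in_userns(const struct demo_task *observer,
                                           const struct demo_task *tsk)
{
    if (observer->userns == tsk->userns)
        return tsk->uid;
    return 0;
}
```

With this rule, a listener in Bob's namespace is told that uid 0 caused the overrun instead of a uid that means nothing (or means somebody else) in that namespace.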
Just fyi Eric, Note that given the amount of churn going on due to pid and network namespaces, I was seeing completion of user namespaces as something to be done sometime next year. In the meantime I was only going to do something with capabilities to restrict root in user namespaces (which I think will take the form of per-process non-expandable cap_bsets, which I plan to start basically right now). But I'll gladly do the userns enhancements earlier if it's actually wanted. They promise to be great fun :) -
Provision also needs to be made for things that are listening to the netlink broadcasts that don't match the user doing the operation or the owner of the file - similar to the way auditd wants events.
And could we have some description of the context under which all the message exchanges take place. When are these messages sent out -- what event One problem, we've been is losing notifications. It does not happen for us due to the cpumask interface (which allows us to have parallel sockets Have you looked at ensuring that the data structure works across 32 bit and 64 bit systems (in terms of binary compatibility)? That's usually The memory controller or VM would also be interested in notifications of OOM. At OLS this year interest was shown in getting OOM notifications and allow the user space a chance to handle the notification and take action (especially for containers). We already have containerstats for containers (which I was planning to reuse), but I was told that we would -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
The user is notified about either exceeding his quota softlimit or Generic netlink should take care of this - arguments are typed so it knows how many bits the numbers have. So this should be no issue. Are there any Generic netlink can be used to pass this information (although in OOM situation, it may be a bit hairy to get the network stack working...). But I guess it's not related to my patch. Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
Yes, but apart from that, if I remember Jamal Hadi's initial comments on taskstats, he recommended that we align everything to 64 bit so that the data is well aligned for 64 bit systems. You could also consider creating a data structure, document its members, align them and use We could have a pre-allocated buffer stored at startup and use that for OOM notification. In the case of container OOM, we are likely to have free global memory. Working towards an infrastructure so that anybody can build on top of it and sending notifications on interesting events becomes easier would be nice. We can reuse code that way and add fewer bugs :-) -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
But each attribute is just one number (either 32 or 64 bit) so there's not much to align. Also each attribute has its netlink header so alignment is anyway hard to predict. Finally, this is by no means performance critical - average system using quotas may get say 1 notification per user I don't like sending one structure - by doing that you lose the How does generic netlink support versioning? I have not found this feature. Looking into Documentation/accounting/taskstats.txt it seems that taskstats are versioning only the structure taskstats itself but not the Yes, but generic netlink itself is such an infrastructure, isn't it? It is about 70 lines of code to implement notification for quota subsystem so it's really simple... Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
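The remark about the per-attribute header can be made concrete with a little arithmetic. The sketch below assumes the usual netlink attribute layout, a 4-byte header followed by the payload, with each attribute padded to a 4-byte boundary; the constants are restated locally under `DEMO_` names rather than taken from the kernel headers, and the offsets are relative to the start of the attribute area only.

```c
#include <assert.h>
#include <stddef.h>

/* Local restatement of the netlink attribute layout (assumed here, not
 * included from <linux/netlink.h>): 4-byte header, 4-byte alignment. */
#define DEMO_NLA_ALIGNTO 4u
#define DEMO_NLA_HDRLEN  4u

/* Round len up to the attribute alignment. */
static size_t demo_nla_align(size_t len)
{
    return (len + DEMO_NLA_ALIGNTO - 1) & ~(size_t)(DEMO_NLA_ALIGNTO - 1);
}

/* Bytes one attribute occupies in the message, including header and padding. */
static size_t demo_nla_total_size(size_t payload)
{
    return demo_nla_align(DEMO_NLA_HDRLEN + payload);
}

/* Offset of the payload for an attribute starting at attr_start. */
static size_t demo_nla_payload_offset(size_t attr_start)
{
    return attr_start + DEMO_NLA_HDRLEN;
}
```

Under these assumptions a u32 attribute takes 8 bytes and a u64 attribute takes 12, so a 64-bit payload that follows a single u32 attribute starts at offset 12, which is not a multiple of 8. That is one way to read "alignment is anyway hard to predict".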
No netlink fields (unless I'm confused) are represented as a struct, Well. If it is using netlink properly each field should have a tag. So it should not need to be versioned, because each field is strictly That we definitely would be. Although the user namespaces is rather -
The access to seq is racy, isn't it? If so, that can be solved with a lock, or with atomic_add_return(). -
You're right. I've made atomic_t from seq. Thanks for spotting this. Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
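For readers outside the kernel, the same idea can be shown with C11 atomics. This is a userspace analogue, not the patch itself: `demo_seq` and `demo_next_seq` are made-up names, and `atomic_fetch_add` returns the old value, so one is added to mimic the "return the new value" behaviour of the kernel's atomic_add_return().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical message sequence counter; static storage starts at zero. */
static atomic_uint demo_seq;

/* Return the next sequence number. The increment is a single atomic
 * read-modify-write, so two concurrent callers never get the same value. */
static unsigned int demo_next_seq(void)
{
    return atomic_fetch_add(&demo_seq, 1u) + 1u;
}
```

A plain `seq++` on a shared variable is a separate load and store, which is where the race in the original code would come from.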
Attached is an incremental patch solving the issues you've spotted. Thanks for review. The result of the discussion with namespace guys is that the id used as an identity for filesystem operations should be fine. If it will ever be something different than a number, we can change the protocol which should be no problem... Also after some more reading, I've found out that we can even easily find out which attributes have been sent in the netlink message. So I don't see a real reason for some versioning of the protocol - either the message has all the attributes we are interested in and then we report it, or it does not and then we complain that tools are too old and don't understand the protocol... Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs
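The "all attributes present or complain" check described above could look roughly like this on the receiving side. It is a sketch under assumptions: attribute types are taken to be numbered 1 to 6 (the documentation quoted later in the thread mentions six attributes, but not their numbers), and the `demo_*` helpers are hypothetical, not part of quota-tools or libnl.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed numbering: six attributes with types 1..6. */
#define DEMO_ATTR_FIRST 1u
#define DEMO_ATTR_LAST  6u

/* Record that an attribute of the given type was seen in a message.
 * Types outside the known range are ignored. */
static uint32_t demo_mark_seen(uint32_t seen, unsigned int type)
{
    if (type >= DEMO_ATTR_FIRST && type <= DEMO_ATTR_LAST)
        seen |= UINT32_C(1) << type;
    return seen;
}

/* Nonzero if every attribute this tool needs is present in the message. */
static int demo_message_complete(uint32_t seen)
{
    uint32_t required = 0;

    for (unsigned int t = DEMO_ATTR_FIRST; t <= DEMO_ATTR_LAST; t++)
        required |= UINT32_C(1) << t;
    return (seen & required) == required;
}
```

A tool would call `demo_mark_seen` once per parsed attribute and report the event only when `demo_message_complete` is nonzero, printing a "tools too old" style complaint otherwise.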
Mutt knows how to send patches inline vs. attachments... :( Anyway, on to the patch. Thanks for adding the new doc file. +This command is used to send a notification about any of the above mentioned +events. Each message has six attributes. These are (type of the argument is +in braces): s/braces/parentheses/ + QUOTA_NL_A_QTYPE (u32) + - type of quota beging exceeded (one of USRQUOTA, GRPQUOTA) s/beging/being/ --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** -
Hmm, I thought Andrew either does not mind or prefers attachments. If it isn't the case, I can inline patches. Andrew? BTW: I personally prefer attachments - mutt inlines text attachments for me anyway and sometimes Thanks for reading it. Andrew, should I resend the patch or will you substitute it in the patch? Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
inlined is a bit better, mainly because one can reply to it and the email I'll sort it out, thanks. -
