Since get_user_pages_fast() made it in, I thought to give this another try. Lightly tested by disabling the private futexes and running some java proglets. --
hm, very interesting. Since this is an important futex usecase i started testing it in tip/core/futexes: cd33272: futex: cleanup fshared a135356: futex: use fast_gup() 39ce77b: futex: reduce mmap_sem usage 0d7a336: futex: rely on get_user_pages() for shared futexes Nick, it would be nice to get an Acked-by/Reviewed-by from you, before we think about whether it should go upstream. Ingo --
Yeah, these all look pretty good. It's nice to get rid of mmap sem here. Which reminds me, we need to put a might_lock mmap_sem into get_user_pages_fast... But these patches look good to me (last time we discussed them I thought there was a race with page truncate, but it looks like you've closed that by holding page lock over the whole operation...) Nice work, Peter. BTW. what kinds of things use inter-process futexes as of now? --
Just to be sure, I only hold the page lock over the get_futex_key() op, and drop it after getting a ref on the futex key. I then drop the futex key ref after the futex op is complete. This assumes the futex key ref is suffucient to guarantee whatever is needed - which is the point I'm still not quite sure about myself. The futex key ref was used between futex ops, with I assume the intent to ensure the futex backing stays valid. However, the key ref only takes a ref on either the inode or the mm, neither which avoid the specific address of the futex to get unmapped between ops. So in that respect we're not worse off than before, and any application doing: futex_wait(), munmap(), futex_wake() is going to suffer. And as far as I understand it get the waiting task stuck in D state for ever-more or somesuch. By now not holding the mmap_sem over the full futex op, but only over the get_futex_key(), that munmap() race gets larger and the actual futex could disappear while we're working on it, but in all cases I looked at that will make the futex op return -EFAULT, so we should be good there. Gah, now that I look at it, it looks like I made get_futex_key() asymetric wrt private futexes, they don't take a ref on the key, but On a regular modern Linux system, not much. But I've been told there are applications out there that do indeed make heavy use of them - as they're part of POSIX etc.. blah blah :-) Also some legacy stuff that's stuck on an ancient glibc (but somehow did manage to upgrade the kernel) might benefit. --- Subject: futex: fixup get_futex_key() for private futexes From: Peter Zijlstra <a.p.zijlstra@chello.nl> With the get_user_pages_fast() patches we made get_futex_key() obtain a reference on the returned key, but failed to do so for private futexes. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- diff --git a/kernel/futex.c b/kernel/futex.c index 197fdab..beee9af 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -227,6 ...
applied to tip/core/futexes, thanks Peter! Ingo --
It is enough to guarantee enough to function normally I guess. Actually futex syscalls are interesting in that they logically perform 2 totally different operations and actually want 2 keys but it manages to mash them into one (the user address). They need a futex identifier, on which to wait/wake/etc. For anonymous futexes, this is basically an arbitrary number which is taken from the user address (but note that if that address is subsequently mremap()ed for example, then the futex identifier does not follow the address). For shared futexes, the pagecache address is used for the futex, which is derived from the address. Then they also need a memory address which is the target of the particular operation requested. This doesn't actually logically have to be directly related to the futex identifier... I guess for all practical purposes, this is fine. "if you mremap a live mutex, you lose" and similar probably doesn't constrain too many people (although I don't think any of that is actually documented). But anyway, just to point out that we do already have constraints on how existing Yeah, I think you're fine there (the sleep is interruptible, so it should not be a DoS or anything). Just so long as you always pin things like the inode correctly while taking a ref on it... --
inter-process futexes are still used for pthread creation/join (aka clear_child_tid / CLONE_CHILD_CLEARTID) kernel/fork.c, functions mm_release() & sys_set_tid_address() I am not sure how it could be converted to private futexes, since Sorry I am lost... --
Ah, right - its a NOP, that's why it didn't show up in testing. The thing is, I changed the semantics of get_futex_key() to return a key with reference taken. And then noticed I didn't take one in the private futex path, and failed to notice the ref ops are nops for private futexes. So yeah, the below patch is basically a NOP, but we might consider --
We considered this back when but any effort seems too much. We'd either need a clone flag (a scarce resource) or replace the set_tid _address syscall. Given that the futex is woken once per thread lifetime it shouldn't be an issue. If the semaphore shows up even after this patch feel free to introduce a new set_tid_address syscall. --
great - i've added your Acked-by to the patches and have activated the tip/core/futexes branch for tip/auto-core-next linux-next integration. here are the commits: 42569c3: futex: fixup get_futex_key() for private futexes c2f9f20: futex: cleanup fshared 734b05b: futex: use fast_gup() 6127070: futex: reduce mmap_sem usage 38d47c1: futex: rely on get_user_pages() for shared futexes Thanks, Ingo --
