copy_to/from_user and all its variants (except the atomic ones) can take a
page fault and perform non-trivial work like taking mmap_sem and entering
the filesyste/pagecache.
Unfortunately, this often escapes lockdep because a common pattern is to
use it to read in some arguments just set up from userspace, or write data
back to a hot buffer. In those cases, it will be unlikely for page reclaim
to get a window in to cause copy_*_user to fault.
With the new might_lock primitives, add some annotations to x86. I don't
know if I caught all possible faulting points (it's a bit of a maze, and I
didn't really look at 32-bit). But this is a starting point.
Boots and runs OK so far.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-x86/uaccess_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess_64.h
+++ linux-2.6/include/asm-x86/uaccess_64.h
@@ -28,6 +28,10 @@ static __always_inline __must_check
int __copy_from_user(void *dst, const void __user *src, unsigned size)
{
int ret = 0;
+
+ might_sleep();
+ if (current->mm)
+ might_lock_read(&current->mm->mmap_sem);
if (!__builtin_constant_p(size))
return copy_user_generic(dst, (__force void *)src, size);
switch (size) {
@@ -70,6 +74,10 @@ static __always_inline __must_check
int __copy_to_user(void __user *dst, const void *src, unsigned size)
{
int ret = 0;
+
+ might_sleep();
+ if (current->mm)
+ might_lock_read(&current->mm->mmap_sem);
if (!__builtin_constant_p(size))
return copy_user_generic((__force void *)dst, src, size);
switch (size) {
@@ -112,6 +120,10 @@ static __always_inline __must_check
int __copy_in_user(void __user *dst, const void __user *src, unsigned size)
{
int ret = 0;
+
+ might_sleep();
+ if (current->mm)
+ might_lock_read(&current->mm->mmap_sem);
if (!__builtin_constant_p(size))
return copy_user_generic((__force void *)dst,
(__force void *)src, size);
Index: ...i dont think so - those have their own special __atomic user-copy primitives which Nick didnt touch. but i think it should be a single primitive sprinkled around, that way it's a lot less visually intrusive as well. Ingo --
in any case i've applied the current patches to tip/core/locking: c10d38d: x86: some lock annotations for user copy paths 76b189e: lockdep: add might_lock() / might_lock_read() to see the fallout. The cleanup can be done after that. ( Nick, i've added your Acked-by to 76b189e - let me know if you dont want that. ) Ingo --
hm, but the prefetch bits were touched, and -tip testing stumbled upon this false positive: [ 66.187448] PM: Adding info for No Bus:vcsa12 [ 67.138685] evbug.c: Event. Dev: input0, Type: 20, Code: 0, Value: 500 [ 80.149087] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:122 [ 80.157344] in_atomic(): 1, irqs_disabled(): 0, pid: 6811, name: gam_server [ 80.164315] no locks held by gam_server/6811. [ 80.168685] Pid: 6811, comm: gam_server Not tainted 2.6.27-rc6-tip-00187-gbfd4ed7-dirty #2 [ 80.177013] Call Trace: [ 80.179454] [<ffffffff8026c1e8>] ? __debug_show_held_locks+0x22/0x24 [ 80.185876] [<ffffffff80242c4d>] __might_sleep+0x104/0x109 [ 80.191430] [<ffffffff8022fd25>] is_prefetch+0xe8/0x228 [ 80.196760] [<ffffffff8023050b>] do_page_fault+0x5d8/0x9a9 [ 80.202361] [<ffffffff802c8218>] ? sys_newstat+0x36/0x41 [ 80.207774] [<ffffffff80c3b2fa>] error_exit+0x0/0xb9 so i've excluded these commits for now. (once there's a fix i can continue testing it) Ingo --
here's another one found on another box:
[ 574.380319] =======================================================
[ 574.380381] [ INFO: possible circular locking dependency detected ]
[ 574.380413] 2.6.27-rc6-tip #30918
[ 574.380440] -------------------------------------------------------
[ 574.380472] sshd/8255 is trying to acquire lock:
[ 574.380500] (&sb->s_type->i_mutex_key#9){--..}, at: [<ffffffff802d5aab>] do_truncate+0x59/0x83
[ 574.380628]
[ 574.380629] but task is already holding lock:
[ 574.380679] (&mm->mmap_sem){----}, at: [<ffffffff80249775>] sys32_mmap2+0x64/0xa7
[ 574.380781]
[ 574.380781] which lock already depends on the new lock.
[ 574.380782]
[ 574.380857]
[ 574.380857] the existing dependency chain (in reverse order) is:
[ 574.380910]
[ 574.380911] -> #1 (&mm->mmap_sem){----}:
[ 574.381025] [<ffffffff80281979>] __lock_acquire+0x9b4/0xb3b
[ 574.381081] [<ffffffff80282499>] lock_acquire+0x8d/0xba
[ 574.381219] [<ffffffff802a94ff>] iov_iter_fault_in_readable+0x84/0x169
[ 574.381359] [<ffffffff802aae18>] generic_file_buffered_write+0x119/0x5ad
[ 574.381584] [<ffffffff802ab6cb>] __generic_file_aio_write_nolock+0x260/0x294
[ 574.381809] [<ffffffff802ab768>] generic_file_aio_write+0x69/0xc5
[ 574.381948] [<ffffffff802d68e2>] do_sync_write+0xeb/0x132
[ 574.382086] [<ffffffff802d6f28>] vfs_write+0xa7/0xe1
[ 574.382222] [<ffffffff802d701c>] sys_write+0x47/0x6d
[ 574.382359] [<ffffffff802481b7>] cstar_dispatch+0x7/0x4b
[ 574.382496] [<ffffffffffffffff>] 0xffffffffffffffff
[ 574.382635]
[ 574.382636] -> #0 (&sb->s_type->i_mutex_key#9){--..}:
[ 574.382942] [<ffffffff8028181b>] __lock_acquire+0x856/0xb3b
[ 574.383080] [<ffffffff80282499>] lock_acquire+0x8d/0xba
[ 574.383218] [<ffffffff80b77600>] __mutex_lock_common+0xe4/0x333
[ 574.383358] [<ffffffff80b778e9>] mutex_lock_nested+0x30/0x35
[ 574.383495] ...Oh, nice one. It had me scratching my head until I realised you must be using tiny shmem. Patch attached (which brings closely into line with shmem) -- Index: linux-2.6/mm/tiny-shmem.c =================================================================== --- linux-2.6.orig/mm/tiny-shmem.c +++ linux-2.6/mm/tiny-shmem.c @@ -65,31 +65,25 @@ struct file *shmem_file_setup(char *name if (!dentry) goto put_memory; + error = -ENFILE; + file = get_empty_filp(); + if (!file) + goto put_dentry; + error = -ENOSPC; inode = ramfs_get_inode(root->d_sb, S_IFREG | S_IRWXUGO, 0); if (!inode) - goto put_dentry; - - d_instantiate(dentry, inode); - error = -ENFILE; - file = alloc_file(shm_mnt, dentry, FMODE_WRITE | FMODE_READ, - &ramfs_file_operations); - if (!file) - goto put_dentry; - - inode->i_nlink = 0; /* It is unlinked */ - - /* notify everyone as to the change of file size */ - error = do_truncate(dentry, size, 0, file); - if (error < 0) goto close_file; + d_instantiate(dentry, inode); + inode->i_size = size; + inode->i_nlink = 0; /* It is unlinked */ + init_file(file, shm_mnt, dentry, FMODE_WRITE | FMODE_READ, + &ramfs_file_operations); return file; close_file: put_filp(file); - return ERR_PTR(error); - put_dentry: dput(dentry); put_memory: --
Right, the prefetch stuff does get_user inside preempt_disable. Possibly
other code too, although ideally we would have a get_user_atomic, I will
give an intermediate soluation which is just what Peter suggested.
I've also moved it out of line to avoid sched.h and bad include bloating
and possible breakage.
---
copy_to/from_user and all its variants (except the atomic ones) can take a
page fault and perform non-trivial work like taking mmap_sem and entering
the filesyste/pagecache.
Unfortunately, this often escapes lockdep because a common pattern is to
use it to read in some arguments just set up from userspace, or write data
back to a hot buffer. In those cases, it will be unlikely for page reclaim
to get a window in to cause copy_*_user to fault.
With the new might_lock primitives, add some annotations to x86. I don't
know if I caught all possible faulting points (it's a bit of a maze, and I
didn't really look at 32-bit). But this is a starting point.
Boots and runs OK so far.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-x86/uaccess_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess_64.h
+++ linux-2.6/include/asm-x86/uaccess_64.h
@@ -28,6 +28,8 @@ static __always_inline __must_check
int __copy_from_user(void *dst, const void __user *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic(dst, (__force void *)src, size);
switch (size) {
@@ -70,6 +72,8 @@ static __always_inline __must_check
int __copy_to_user(void __user *dst, const void *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic((__force void *)dst, src, size);
switch (size) {
@@ -112,6 +116,8 @@ static __always_inline __must_check
int __copy_in_user(void __user *dst, const void __user *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if ...void might_fault(void)
{
if (in_atomic())
return;
might_sleep();
if (!current->mm)
might_lock_read(&current->mm->mmap_sem);
}
This forgets that in_atomic() again - possibly triggering might_sleep()
where not appropriate.
I'm not sure its worth it to out-of-line the thing though (its only big
on debug builds), and CONFIG_LOCKDEP is the wrong CONFIG_* variable, I
think CONFIG_PROVE_LOCKING would be the appropriate one.
--
Well... yeah. I find it doesn't matter unless the function is complex or OK, drat... maybe we'll just noop that path and lose a couple of might_sleep() points -- testing and developing should be done mostly It is so that we don't need to pull sched.h and lockdep.h into those low level user access functions. --
OK, last attempt. If this breaks, then I give up for the day :)
--
copy_to/from_user and all its variants (except the atomic ones) can take a
page fault and perform non-trivial work like taking mmap_sem and entering
the filesyste/pagecache.
Unfortunately, this often escapes lockdep because a common pattern is to
use it to read in some arguments just set up from userspace, or write data
back to a hot buffer. In those cases, it will be unlikely for page reclaim
to get a window in to cause copy_*_user to fault.
With the new might_lock primitives, add some annotations to x86. I don't
know if I caught all possible faulting points (it's a bit of a maze, and I
didn't really look at 32-bit). But this is a starting point.
Boots and runs OK so far.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-x86/uaccess_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess_64.h
+++ linux-2.6/include/asm-x86/uaccess_64.h
@@ -28,6 +28,8 @@ static __always_inline __must_check
int __copy_from_user(void *dst, const void __user *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic(dst, (__force void *)src, size);
switch (size) {
@@ -70,6 +72,8 @@ static __always_inline __must_check
int __copy_to_user(void __user *dst, const void *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic((__force void *)dst, src, size);
switch (size) {
@@ -112,6 +116,8 @@ static __always_inline __must_check
int __copy_in_user(void __user *dst, const void __user *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic((__force void *)dst,
(__force void *)src, size);
Index: linux-2.6/include/asm-x86/uaccess.h
===================================================================
--- ...i've tidied it up a bit:
- moved the might_sleep() check outside the in_atomic() check,
- fixed a spelling mistake
- fixed a build error on !LOCKDEP
- changed the CONFIG_LOCKDEP dependency to CONFIG_PROVE_LOCKING
and it's working fine on most boxes. One testbox found this new locking
scenario:
PM: Adding info for No Bus:vcsa7
EDAC DEBUG: MC0: i82860_check()
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.27-rc6-tip #1
-------------------------------------------------------
X/4873 is trying to acquire lock:
(&bb->mutex){--..}, at: [<c020ba20>] mmap+0x40/0xa0
but task is already holding lock:
(&mm->mmap_sem){----}, at: [<c0125a1e>] sys_mmap2+0x8e/0xc0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_sem){----}:
[<c017dc96>] validate_chain+0xa96/0xf50
[<c017ef2b>] __lock_acquire+0x2cb/0x5b0
[<c017f299>] lock_acquire+0x89/0xc0
[<c01aa8fb>] might_fault+0x6b/0x90
[<c040b618>] copy_to_user+0x38/0x60
[<c020bcfb>] read+0xfb/0x170
[<c01c09a5>] vfs_read+0x95/0x110
[<c01c1443>] sys_pread64+0x63/0x80
[<c012146f>] sysenter_do_call+0x12/0x43
[<ffffffff>] 0xffffffff
-> #0 (&bb->mutex){--..}:
[<c017d8b7>] validate_chain+0x6b7/0xf50
[<c017ef2b>] __lock_acquire+0x2cb/0x5b0
[<c017f299>] lock_acquire+0x89/0xc0
[<c0d6f2ab>] __mutex_lock_common+0xab/0x3c0
[<c0d6f698>] mutex_lock_nested+0x38/0x50
[<c020ba20>] mmap+0x40/0xa0
[<c01b111e>] mmap_region+0x14e/0x450
[<c01b170f>] do_mmap_pgoff+0x2ef/0x310
[<c0125a3d>] sys_mmap2+0xad/0xc0
[<c012146f>] sysenter_do_call+0x12/0x43
[<ffffffff>] 0xffffffff
other info that might help us debug this:
1 lock held by X/4873:
#0: (&mm->mmap_sem){----}, at: [<c0125a1e>] sys_mmap2+0x8e/0xc0
stack backtrace:
Pid: 4873, comm: X Not tainted ...Hmm... but then it has the same failure case again in the is_preempt()
code, does it not?
I guess we should just convert that guy to either use get_user_atomic,
Yes, it is a real bug by the looks. bin.c takes bb->mutex under mmap_sem
Yeah it's nice. I'm just hoping we don't come across one that is as
difficult to fix as prepare_write/commit_write were ;)
Here is a basic fix for the sysfs lor.
---
Index: linux-2.6/fs/sysfs/bin.c
===================================================================
--- linux-2.6.orig/fs/sysfs/bin.c
+++ linux-2.6/fs/sysfs/bin.c
@@ -61,6 +61,7 @@ read(struct file *file, char __user *use
int size = dentry->d_inode->i_size;
loff_t offs = *off;
int count = min_t(size_t, bytes, PAGE_SIZE);
+ char *temp;
if (size) {
if (offs > size)
@@ -69,23 +70,33 @@ read(struct file *file, char __user *use
count = size - offs;
}
+ temp = kmalloc(count, GFP_KERNEL);
+ if (!temp)
+ return -ENOMEM;
+
mutex_lock(&bb->mutex);
count = fill_read(dentry, bb->buffer, offs, count);
- if (count < 0)
- goto out_unlock;
+ if (count < 0) {
+ mutex_unlock(&bb->mutex);
+ goto out_free;
+ }
- if (copy_to_user(userbuf, bb->buffer, count)) {
+ memcpy(temp, bb->buffer, count);
+
+ mutex_unlock(&bb->mutex);
+
+ if (copy_to_user(userbuf, temp, count)) {
count = -EFAULT;
- goto out_unlock;
+ goto out_free;
}
pr_debug("offs = %lld, *off = %lld, count = %d\n", offs, *off, count);
*off = offs + count;
- out_unlock:
- mutex_unlock(&bb->mutex);
+ out_free:
+ kfree(temp);
return count;
}
@@ -118,6 +129,7 @@ static ssize_t write(struct file *file,
int size = dentry->d_inode->i_size;
loff_t offs = *off;
int count = min_t(size_t, bytes, PAGE_SIZE);
+ char *temp;
if (size) {
if (offs > size)
@@ -126,19 +138,27 @@ static ssize_t write(struct file *file,
count = size - offs;
}
- mutex_lock(&bb->mutex);
+ temp = kmalloc(count, GFP_KERNEL);
+ if (!temp)
+ return ...[ Greg, please see the sysfs fix further below. ]
i've done the v3 patch below - that seems to have passed all my testing
without any new bugs found. I've reinstated your the clear_user()
might_fault() check, plus i removed it from __[get|put]_user_size, which
the _inatomic() API variants use. That enabled me to utilize the
_inatomic() API in probe_kernel_address().
we still have the checks in put_user()/get_user() and in all the
copy_*_user() APIs, which should be strong enough. [ I havent fully
checked whether __get_user_size() might be used by some less frequent
ok - second patch attached below, Greg, could you please apply? This is
and that did the trick here - the patch with a tidied up changelog is
attached further below. [ the second patch is standalone and does not
need the first patch which is relative to tip/master ]
thanks Nick, i think this is a great addition to lockdep! It already
found two real locking bugs within a day. If you can think of any other
proactive methods to widen our lock hierarchy knowledge that would be
great to add. I think what we want is to insert knowledge about other
unlikely lock acquire events, for locks that have a historic pattern of
producing regular locking bugs.
Ingo
----------------->
From 1d18ef489509314506328b9e464dd47c24c1d68f Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 11 Sep 2008 20:53:21 +0200
Subject: [PATCH] x86: some lock annotations for user copy paths, v3
- add annotation back to clear_user()
- change probe_kernel_address() to _inatomic*() method
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/lib/usercopy_32.c | 1 +
include/asm-x86/uaccess.h | 2 --
include/linux/uaccess.h | 2 +-
3 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 7393152..fab5fab 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -148,6 +148,7 @@ do { \
...Well thanks to Peter as well. Actually I don't suppose this will throw off the lockstat statistics a bit? (although I guess serious lockstat profiling might not have prove locking turned on?). The user fault I guess is the main thing like this in the VM that I can think of. The user fault I guess is the main thing like this in the VM that I can think of. --
It will actually - /me cribs it on his todo-when-getting-back-home list. --
with the new mmap_sem lockdep annotations, -tip testing found a third
lockdep assert, see below. Config attached.
Ingo
----------------->
[ 6460.634452]
[ 6460.634465] =======================================================
[ 6460.634494] [ INFO: possible circular locking dependency detected ]
[ 6460.634517] 2.6.27-rc6-tip-00290-g8e229c3-dirty #1
[ 6460.634535] -------------------------------------------------------
[ 6460.634555] gdm-simple-gree/4778 is trying to acquire lock:
[ 6460.634574] (&mm->mmap_sem){----}, at: [<c018fe33>] might_fault+0x36/0x73
[ 6460.634639]
[ 6460.634645] but task is already holding lock:
[ 6460.634662] (&dev->ev_mutex){--..}, at: [<c01c7a76>] inotify_read+0xd8/0x16e
[ 6460.634715]
[ 6460.634721] which lock already depends on the new lock.
[ 6460.634731]
[ 6460.634745]
[ 6460.634750] the existing dependency chain (in reverse order) is:
[ 6460.634769]
[ 6460.634774] -> #3 (&dev->ev_mutex){--..}:
[ 6460.634807] [<c0165edd>] __lock_acquire+0x914/0xa8a
[ 6460.634848] [<c01660c3>] lock_acquire+0x70/0x97
[ 6460.634876] [<c0b44e00>] __mutex_lock_common+0xe1/0x341
[ 6460.634921] [<c0b450f8>] mutex_lock_nested+0x2e/0x36
[ 6460.634950] [<c01c77d5>] inotify_dev_queue_event+0x25/0x127
[ 6460.634979] [<c01c6ad3>] inotify_inode_queue_event+0x87/0xb6
[ 6460.635010] [<c01a7c64>] fsnotify_create+0x2c/0x35
[ 6460.635049] [<c01a85ff>] vfs_create+0x77/0x88
[ 6460.635077] [<c01a9dc5>] do_filp_open+0x1aa/0x5ad
[ 6460.635107] [<c019f469>] do_sys_open+0x47/0xc1
[ 6460.635145] [<c019f52f>] sys_open+0x23/0x2b
[ 6460.635171] [<c011a20a>] syscall_call+0x7/0xb
[ 6460.635209] [<ffffffff>] 0xffffffff
[ 6460.635233]
[ 6460.635238] -> #2 (&ih->mutex){--..}:
[ 6460.635271] [<c0165edd>] __lock_acquire+0x914/0xa8a
[ 6460.635301] [<c01660c3>] lock_acquire+0x70/0x97
[ 6460.635328] [<c0b44e00>] __mutex_lock_common+0xe1/0x341
[ 6460.635358] ...Yes, there's a thread in my intray called "inotify_read's ev_mutex vs do_page_fault's mmap_sem...". It's a bit flakey-looking, but there's a patch in there. --
ah, thx. I picked up the patch into tip/out-of-tree. (see below for a
tided up changelog) Please queue it up as v2.6.27 material. (i'll report
it if anything breaks due to the patch)
Ingo
--------------->
From 1eb0a42e4eb3283521ee1de99adbf567874b622f Mon Sep 17 00:00:00 2001
From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Thu, 11 Sep 2008 06:12:51 +1000
Subject: [PATCH] mm: fix locking, inotify_read's ev_mutex vs do_page_fault's mmap_sem...
Fix inotify lock order reversal with mmap_sem due to holding locks over
copy_to_user.
Reported-by: "Daniel J Blueman" <daniel.blueman@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
fs/inotify_user.c | 27 ++++++++++++++++++++-------
include/asm-x86/uaccess_64.h | 1 +
2 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/fs/inotify_user.c b/fs/inotify_user.c
index 6024942..d85c7d9 100644
--- a/fs/inotify_user.c
+++ b/fs/inotify_user.c
@@ -323,7 +323,7 @@ out:
}
/*
- * remove_kevent - cleans up and ultimately frees the given kevent
+ * remove_kevent - cleans up the given kevent
*
* Caller must hold dev->ev_mutex.
*/
@@ -334,7 +334,13 @@ static void remove_kevent(struct inotify_device *dev,
dev->event_count--;
dev->queue_size -= sizeof(struct inotify_event) + kevent->event.len;
+}
+/*
+ * free_kevent - frees the given kevent.
+ */
+static void free_kevent(struct inotify_kernel_event *kevent)
+{
kfree(kevent->name);
kmem_cache_free(event_cachep, kevent);
}
@@ -350,6 +356,7 @@ static void inotify_dev_event_dequeue(struct inotify_device *dev)
struct inotify_kernel_event *kevent;
kevent = inotify_dev_get_event(dev);
remove_kevent(dev, kevent);
+ free_kevent(kevent);
}
}
@@ -433,17 +440,15 @@ static ssize_t inotify_read(struct file *file, char __user *buf,
dev = file->private_data;
while (1) {
- int events;
prepare_to_wait(&dev->wq, &wait, TASK_INTERRUPTIBLE);
...I would call this "fs: fix inotify locking....", it's not really an mm bug if another subsystem misuses mm's APIs. But that's a nitpick. Here is the other patch I did too. tiny-shmem calls do_truncate in shmem_file_setup. do_truncate takes i_mutex, and shmem_file_setup is called with mmap_sem held. However i_mutex nests outside mmap_sem. Copy the code in shmem.c to avoid this problem. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/tiny-shmem.c =================================================================== --- linux-2.6.orig/mm/tiny-shmem.c +++ linux-2.6/mm/tiny-shmem.c @@ -65,31 +65,25 @@ struct file *shmem_file_setup(char *name if (!dentry) goto put_memory; + error = -ENFILE; + file = get_empty_filp(); + if (!file) + goto put_dentry; + error = -ENOSPC; inode = ramfs_get_inode(root->d_sb, S_IFREG | S_IRWXUGO, 0); if (!inode) - goto put_dentry; - - d_instantiate(dentry, inode); - error = -ENFILE; - file = alloc_file(shm_mnt, dentry, FMODE_WRITE | FMODE_READ, - &ramfs_file_operations); - if (!file) - goto put_dentry; - - inode->i_nlink = 0; /* It is unlinked */ - - /* notify everyone as to the change of file size */ - error = do_truncate(dentry, size, 0, file); - if (error < 0) goto close_file; + d_instantiate(dentry, inode); + inode->i_size = size; + inode->i_nlink = 0; /* It is unlinked */ + init_file(file, shm_mnt, dentry, FMODE_WRITE | FMODE_READ, + &ramfs_file_operations); return file; close_file: put_filp(file); - return ERR_PTR(error); - put_dentry: dput(dentry); put_memory: --
On Mon, 15 Sep 2008 00:12:31 +0200 It's a bit unfortunate (as in: arse-about) that we end up creating new files deep within the mmap code, but I guess we won't be changing that That's a fairly substantial change. Was it runtime tested? --
Hugh and I talked about merging it back into one file ala the x86 32/64 merge. I've put it on my todo list. -- Mathematics is the supreme nostalgia of our time. --
yes, -tip testing. I queued it up in tip/out-of-tree a week ago:
commit 20e27c7b26792dbd9af0543c4bc86b5de5653a89
Author: Nick Piggin <npiggin@suse.de>
AuthorDate: Wed Sep 10 17:12:45 2008 +0200
Commit: Ingo Molnar <mingo@elte.hu>
CommitDate: Thu Sep 11 09:13:36 2008 +0200
mm: fix tiny-shmem circular locking
in 7 days that's about 7000 random bootups, 20% of which had TINY_SHMEM
enabled, half 32-bit, half 64-bit x86. It did not blow up in any way
that would have prevented the kernel from building its next random
version from within itself and it did not produce any kernel messages
with various random kernel debug, compile and boot options.
So i think it's a candidate for v2.6.27.
Ingo
--
Does anything in that workload actually use shared memory?
J
--
[adding Dave] For the record, Hugh tracked down the history of this bug and it went something like this: - I forked shmem.c and trimmed it down, keeping the function in question intact - Dave Hansen made divergent changes to shmem and tiny-shmem for reasons that aren't immediately obvious - Al Viro fixed a resultant dput bug - Nick fixed a resultant deadlock bug by undoing the divergence As far as Hugh and I can see, there was no reason for the divergent change. Really, we should probably re-unify the files to prevent such further confusion. -- Mathematics is the supreme nostalgia of our time. --
Man, my memory sucks. All I see that 'git blame' can pin on me are some of the i_nlink helper additions. Those weren't made in tiny-shmem.c simply because it doesn't manipulate i_nlink like shmem.c does. Was there something else you had in mind? -- Dave --
Please read on, the story takes a twist...
Below is the commit I meant: prior to this, the shmem_file_read()s in
mm/shmem.c and mm/tiny-shmem.c were much the same, as Matt intended;
but in this commit you made a minor change in mm/shmem.c, but a
rearrangement in tiny-shmem.c (perhaps following your rearrangement
in fs/hugetlbfs/inode.c). Later Al Viro fixed the double dput() on
failure which came from that rearrangement (is hugetlbfs okay?).
It's not immediately obvious why two such similar functions needed
two such dissimilar patches; and we'd all (Nick, Matt and I) prefer
to restore the similarity, especially now the tiny-shmem.c variant
has shown a locking problem. Do you see any reason against that?
But now looking into it further, I see this is all a red herring,
your rearrangement is not the significant difference: before that
there was David Howells' Jan 2006 commit
b0e15190ead07056ab0c3844a499ff35e66d27cc
[PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMU
which is the one which adds do_truncate() into tiny-shmem.c's
shmem_file_setup() but not into shmem.c's - presumably because
config SHMEM depends on MMU so it was irrelevant in shmem.c.
*That* is the relevant commit, which introduced the bad i_mutex
within mmap_sem lock ordering, and it seems that Nick's current
patch is wrong just to remove that do_truncate(), a significant
change hidden inside his restoration of the original arrangement.
I confess I've made no effort to work what the right answer is
(an #ifndef CONFIG_MMU might hack it for now, but hardly satisfies),
and I'm about to vanish for a couple of days: so please,
don't wait on a reply from me...
Hugh
commit ce8d2cdf3d2b73e346c82e6f0a46da331df6364c
Author: Dave Hansen <haveblue@us.ibm.com>
Date: Tue Oct 16 23:31:13 2007 -0700
r/o bind mounts: filesystem helpers for custom 'struct file's
Why do we need r/o bind mounts?
This feature allows a read-only view into a read-write filesystem. In the
...That would break SYSV IPC SHM under CONFIG_MMU=n conditions.
The truncate is necessary as I explained in my patch:
(2) ramfs files now need resizing using do_truncate() rather than by
modifying the inode size directly (see shmem_file_setup()). This
causes ramfs to attempt to bind a block of pages of sufficient size to
the inode.
What I didn't belabour in the patch, and perhaps I should have, is that to do
SYSV IPC SHM under NOMMU conditions, it is necessary to allocate a *contiguous*
set of pages - something that ramfs has been taught to do under NOMMU when
truncating a file upwards from zero size. This makes POSIX SHM on ramfs files
viable also.
David
--
The code how it is breaks tiny-shmem under all conditions. We have lock
OK what about the following patch? Either way, the shmem_zero_setup is
somewhat of a hack in the mmap code.
What really should happen is that the shmem zero setup should happen
before the get_unmapped_area, so the correct get_unmapped_area for
the file gets called to allocate contiguous pages. This could also
lift the whole file creation out from under mmap_sem (not that you would
need to call do_truncate there *anyway* in that case, but it still makes
the code cleaner).
RFC Quick patch to fix nommu anonymous shared memory without breaking
locking...
---
Index: linux-2.6/include/linux/ramfs.h
===================================================================
--- linux-2.6.orig/include/linux/ramfs.h
+++ linux-2.6/include/linux/ramfs.h
@@ -6,6 +6,7 @@ extern int ramfs_get_sb(struct file_syst
int flags, const char *dev_name, void *data, struct vfsmount *mnt);
#ifndef CONFIG_MMU
+extern int ramfs_nommu_expand_for_mapping(struct inode *inode, size_t newsize);
extern unsigned long ramfs_nommu_get_unmapped_area(struct file *file,
unsigned long addr,
unsigned long len,
Index: linux-2.6/mm/tiny-shmem.c
===================================================================
--- linux-2.6.orig/mm/tiny-shmem.c
+++ linux-2.6/mm/tiny-shmem.c
@@ -80,6 +80,12 @@ struct file *shmem_file_setup(char *name
inode->i_nlink = 0; /* It is unlinked */
init_file(file, shm_mnt, dentry, FMODE_WRITE | FMODE_READ,
&ramfs_file_operations);
+
+#ifndef CONFIG_MMU
+ error = ramfs_nommu_expand_for_mapping(inode, size);
+ if (error)
+ goto close_file;
+#endif
return file;
close_file:
Index: linux-2.6/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.orig/fs/ramfs/file-nommu.c
+++ linux-2.6/fs/ramfs/file-nommu.c
@@ -58,7 +58,7 @@ const struct inode_operations ramfs_file
* size 0 on the ...That works: # /doshm sysv Command: sysv shmid: 0 memory: 0xc3b00000 c3060000-c306a788 r-xs 00000000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so c3100000-c315ede4 r-xs 00000000 00:0b 14582179 /lib/libuClibc-0.9.28.so c3938000-c393c000 rw-p 00000000 00:00 0 c3a68000-c3a6c000 rw-p 00008000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so c3a78000-c3a7c000 rw-p 00000000 00:0b 13763417 /doshm c3a80000-c3aa0000 rwxp 00000000 00:00 0 c3aa0000-c3aac000 rw-p 00000000 00:00 0 c3ab8000-c3abc000 r-xs 00000000 00:0b 13763417 /doshm c3b00000-c3bfa000 rw-S 00000000 00:07 0 /SYSV00000000 (deleted) c3b00000-c3bfa000 rw-S 00000000 00:07 0 /SYSV00000000 (deleted) nattch 2 Acked-by: David Howells <dhowells@redhat.com> --
Great. Is there any reason I shouldn't use truncate() in regular shmem.c if I were to unify the two files? -- Mathematics is the supreme nostalgia of our time. --
It would be nice to keep tiny-shmem.c, well, tiny for NOMMU systems that can't have swap and can't have virtual memory. Note that regular shmem does not now build on top of ramfs, so you'd have to port the ramfs internals to it. David --
As the original author of tiny-shmem.c, I will naturally try to bear this in mind. The goal would be to build both a regular and a tiny version from one C file so that they won't diverge again in the future. -- Mathematics is the supreme nostalgia of our time. --
The lock ordering bug is a good reason not to use do_truncate() in either. Once Nick's fixup is in mainline, try the unification based on that, including his #ifndef CONFIG_MMU which is good for documentation. But I don't know how ugly all the #ifdef'ing will end up: at the time you created mm/tiny-shmem.c, we had a stronger embargo on #ifdefs in *.c than is fashionable today. If we're hell-bent on #ifdefs throughout mm/shmem.c, I wouldn't mind scattering some CONFIG_SWAPs in there too, would cut out lots of overhead when swap unconfigured. But again, how ugly? Hugh --
That might be necessary: NOMMU doesn't support swap. David --
Matt would be dealing with that aspect in his unification: SHMEM depends on MMU, so !MMU gives you TINY_SHMEM - I'm sure he wouldn't have any of the SWAP stuff in the TINY_SHMEM part of his unification. But there's still value in adding CONFIG_SWAPs too (if the #ifdefs were tolerable, by no means clear). Hugh --
Looks good to me, thanks Nick. And I see David's Ack that it works, so Acked-by: Hugh Dickins <hugh@veritas.com> Please push to Andrew and/or Linus now git is broken for !MMU. --
On Wed, 24 Sep 2008 20:29:39 +0100 (BST) I'm about to vanish for a week, so please send urgent stuff straight to Linus. --
The only reason I diverged them was that I was trying to encourage the use of alloc_file() and discourage the use of init_file() due to some guidance from Christoph H. But, you're certainly right, being able to find bugs between the two implementations certainly trumps that, so I see no reason not to reunify them. -- Dave --
yes, X and a couple of other things i assume. It's the same test setup that found the lockdep assert - and lockdep needs actual use of the affected facilities. Ingo --
This is the first one I fixed up, I think? What's flakey about it? I sent 2 versions, and the first had an obvious bug but the second worked for me and reporter... --
Right, it's a bit of a maze there, but yes we are safe because except for
where I added some missing ones, they all had might_sleep() there too,
That's really the right way to do it, it makes it easier for other archs
to pick up, and it means we can add other things to it easily if needed.
Thanks,
Nick
---
copy_to/from_user and all its variants (except the atomic ones) can take a
page fault and perform non-trivial work like taking mmap_sem and entering
the filesyste/pagecache.
Unfortunately, this often escapes lockdep because a common pattern is to
use it to read in some arguments just set up from userspace, or write data
back to a hot buffer. In those cases, it will be unlikely for page reclaim
to get a window in to cause copy_*_user to fault.
With the new might_lock primitives, add some annotations to x86. I don't
know if I caught all possible faulting points (it's a bit of a maze, and I
didn't really look at 32-bit). But this is a starting point.
Boots and runs OK so far.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-x86/uaccess_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess_64.h
+++ linux-2.6/include/asm-x86/uaccess_64.h
@@ -28,6 +28,8 @@ static __always_inline __must_check
int __copy_from_user(void *dst, const void __user *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic(dst, (__force void *)src, size);
switch (size) {
@@ -70,6 +72,8 @@ static __always_inline __must_check
int __copy_to_user(void __user *dst, const void *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic((__force void *)dst, src, size);
switch (size) {
@@ -112,6 +116,8 @@ static __always_inline __must_check
int __copy_in_user(void __user *dst, const void __user *src, unsigned size)
{
int ret = 0;
+
+ might_fault();
if ...It would be nicer if you defined a standard function for this, e.g. a might_page_fault(); This would also make it easier to potentially add more locks. -Andi -- ak@linux.intel.com --
