I've seen some strange things happen USB keys connected to my laptop
since I upgraded the kernel the last time.
The system:
- Thinkpad X41 / Pentium M
- Debian/Lenny
- libc6 2.7-10
- gcc 4.2.4
The scenario:
- I insert the USB key or CF card to a USB reader
- udev mounts it for me (it's a vfat fs)
- I move a file from the key to my main ext3 disk
- the file is created on disk and all contents are copied
- the mv command never exits and cannot be killed
- after that any syscall attempting to access the USB key will lockup
Running an strace shows that the command hangs on the unlink() of the
file being moved. Here are the last two few lines...
close(4) = 0
close(3) = 0
lstat64("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fstatat64(AT_FDCWD, "/media/usb0/dcim/100ncd70/20080817-175746-361.jpg", {st_mode=S_IFREG|0660, st_size=830429, ...}, AT_SYMLINK_NOFOLLOW) = 0
unlinkat(AT_FDCWD, "/media/usb0/dcim/100ncd70/20080817-175746-361.jpg", 0
I was able to bisect it to the commit 8f5934278d1d86590244c2791b28f77d67466007
which claims to "Replace BKL with superblock lock in fat/msdos/vfat".
When I run with lock debugging I get...
=============================================
[ INFO: possible recursive locking detected ]
2.6.27-rc3-bisect-00448-ga7f5aaf #16
---------------------------------------------
mv/4020 is trying to acquire lock:
(&type->s_lock_key#9){--..}, at: [<c01a90fe>] lock_super+0x1e/0x20
but task is already holding lock:
(&type->s_lock_key#9){--..}, at: [<c01a90fe>] lock_super+0x1e/0x20
other info that might help us debug this:
3 locks held by mv/4020:
#0: (&sb->s_type->i_mutex_key#9/1){--..}, at: [<c01b2336>] do_unlinkat+0x66/0x140
#1: (&sb->s_type->i_mutex_key#9){--..}, at: [<c01b0954>] vfs_unlink+0x84/0x110
#2: (&type->s_lock_key#9){--..}, at: [<c01a90fe>] lock_super+0x1e/0x20
stack backtrace:
Pid: 4020, ...This fixes a regression introduced when BKL was removed from the
vfat driver in commit 8f5934278d1d86590244c2791b28f77d67466007.
With vfat, the unlink syscall would result in the following call chain
that caused deadlock.
- do_unlinkat
- vfs_unlink
- vfat_unlink
* lock_super
- fat_remove_entries
- fat_sync_inode
- fat_write_inode
* lock_super
This is not the ideal fix, but it should reverse some regressions
caused by BKL removal in code that depended on BKL being recursive.
Signed-off-by: Bart Trojanowski <bart@jukie.net>
---
fs/super.c | 20 ++++++++++++++++++--
include/linux/fs.h | 2 ++
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index e931ae9..8f813db 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -73,6 +73,8 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_LIST_HEAD(&s->s_dentry_lru);
init_rwsem(&s->s_umount);
mutex_init(&s->s_lock);
+ s->s_lock_cookie = NULL;
+ s->s_lock_refcnt = 0;
lockdep_set_class(&s->s_umount, &type->s_umount_key);
/*
* The locking rules for s_lock are up to the
@@ -227,14 +229,28 @@ static int grab_super(struct super_block *s) __releases(sb_lock)
*/
void lock_super(struct super_block * sb)
{
+ if (sb->s_lock_heldby == current) {
+ sb->s_lock_refcnt ++;
+ return;
+ }
+
get_fs_excl();
mutex_lock(&sb->s_lock);
+
+ sb->s_lock_heldby = current;
+ sb->s_lock_refcnt = 1;
}
void unlock_super(struct super_block * sb)
{
- put_fs_excl();
- mutex_unlock(&sb->s_lock);
+ BUG_ON(sb->s_lock_heldby != current);
+
+ if (! -- sb->s_lock_refcnt) {
+ sb->s_lock_heldby = NULL;
+
+ put_fs_excl();
+ mutex_unlock(&sb->s_lock);
+ }
}
EXPORT_SYMBOL(lock_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 580b513..d88178e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ ...I agree that it's going to almost certainly fix the regression, but could
you test the following patch instead as an alternative? I'd rather remove
the broken recursive lockign than introduce it as an acceptable concept.
Linus
---
fs/fat/inode.c | 10 +++-------
1 files changed, 3 insertions(+), 7 deletions(-)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 6d266d7..80ff338 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -562,26 +562,23 @@ static int fat_write_inode(struct inode *inode, int wait)
struct buffer_head *bh;
struct msdos_dir_entry *raw_entry;
loff_t i_pos;
- int err = 0;
+ int err;
retry:
i_pos = MSDOS_I(inode)->i_pos;
if (inode->i_ino == MSDOS_ROOT_INO || !i_pos)
return 0;
- lock_super(sb);
bh = sb_bread(sb, i_pos >> sbi->dir_per_block_bits);
if (!bh) {
printk(KERN_ERR "FAT: unable to read inode block "
"for updating (i_pos %lld)\n", i_pos);
- err = -EIO;
- goto out;
+ return -EIO;
}
spin_lock(&sbi->inode_hash_lock);
if (i_pos != MSDOS_I(inode)->i_pos) {
spin_unlock(&sbi->inode_hash_lock);
brelse(bh);
- unlock_super(sb);
goto retry;
}
@@ -607,11 +604,10 @@ retry:
}
spin_unlock(&sbi->inode_hash_lock);
mark_buffer_dirty(bh);
+ err = 0;
if (wait)
err = sync_dirty_buffer(bh);
brelse(bh);
-out:
- unlock_super(sb);
return err;
}
--
Looks good. I ran 10 parallel processes creating 1M files truncating them, writing to them again and then deleting them. This patch fixes the issue I ran into. -- WebSig: http://www.jukie.net/~bart/sig/ --
Very much so. And I think you hit this issue because you probably mounted the USB stick as a "sync" (or dirsync) mount - which is what some distros do by default, even if it is known to cause problems for some flash cards that don't do a good job at wear levelling. But it's good that you did that, because all _my_ testing (which was admittedly very deficient) had been done with a default mount without that Yeah, we had some other cases like that. It's the main source of BKL problems by far (if it wasn't for the recursion, BKL removal would generally be trivial). The other example of this was 9c20616c385ebeaa30257ef5d35e8f346db4ee32, Thanks. Btw, quite often, the right solution may be to remove one of the locks entirely. FAT should actually have been largely BKL free, and my conversion of BKL to super-lock was "overly eager" exactly because it's easier to find deadlocks (and debug things carefully and handle them as they pop up) than it is to find races (which are almost impossible to debug and pinpoint). In particular, I think fat_write_inode() really is safe. It already uses spin_lock(&sbi->inode_hash_lock); .. spin_unlock(&sbi->inode_hash_lock); to protect its internal data structures, and all that [un]lock_super() protects is really just local variables and code that is already SMP-safe (ie "sb_bread()" certainly doesn't need locking. So I'm pretty sure the right fix is to just remove [un]lock_super() entirely from fat_write_inode(). Linus --
You're right. I turned on sync a while back so I could just pull the stick out w/o the need to do a manual sync... probably not a good idea I totally agree. After spending a couple of hours (while doing other things) bisecting the tree, I found the commit that introduced the regression. In contrast, I was able to find the double-lock in 5 Ok, I will test your patch and let you know in a few minutes. -Bart -- WebSig: http://www.jukie.net/~bart/sig/ --
Btw, while you're at it, can you try doing some random writes to that thing, since the whole sync mount is quite possibly going to find a few other cases like this. IOW, _maybe_ you hit the only case that is ever going to be an issue for sync mounts, and maybe you didn't. In my eternal quest to never actually test anything myself, I'm hoping you can try some writing (and over-writing) of files, since writes to the filesystem is where the whole "sync" thing is going to show up (both metadata ie file creation and removal, and "real" data ie normal write/truncate calls). Linus --
My bisect test script moved files back and forth between the device and my hard disk. If the patch works, I'll modify the script to also do some truncate and overwrite tests, and maybe do a few things in I am glad to help you in your eternal quest. :) -Bart -- WebSig: http://www.jukie.net/~bart/sig/ --
Hmm... I was considering the costs and benefits of sync mounts. I don't care about the performance hit, as I don't move that much data around. Primarily, I like the fact that I can pull the media out w/o any special action (which sometimes requires that I unlock my screensaver first). But, on the other hand, it's nice to have the media last longer. So, maybe it would be a good idea to have a 'delaysync=60' to force a sync after 60 seconds of inactivity. Unless, there is something else that would do that for me already. Thoughts? -Bart -- WebSig: http://www.jukie.net/~bart/sig/ --
Oh, it could be even shorter. The problem with using 'sync' is that it easily ends up overwriting things like the sector that contains a particular inode thousands of times for even trivial operations. Or things like the file allocation table etc. For example, something as trivial as copying a single big file, if the copy program just copies it a few kB at a time, then a file that is a few megabytes in size will actually end up rewriting the inode block (just because the size grows) thousands of time. With any kind of half-way decent wear leveling, this isn't a problem at all, and most flash drives have that. But if they don't, then that means that the file allocation table sectors and the inode sectors get rewritten over and over and over again thousands of times. Just making it do the sync once per _second_ or something like that would already make the "thousands of times" go away. The sectors would probably be rewritten a few times per big file, and just once per couple of tens of files for small files being written. So we don't even need anything like 60 seconds, we literally would just need some trivial delays. But no, we don't have that kind of "half-sync" behavior. Right now, it's pretty much all or nothing. Either we're fully synchronous (and that really is bad for crappy flash), or we end up depending on bdflush writing things back in the background. Of course, pdflush already syncs within 60s (in fact, 30s by default, iirc), but then things like "laptop_mode" will actually make that potentially much less frequent (I think the default value for that is 5 minutes). I do think this is something we could do better, no question about it. But I don't know exactly what the timeout should be, though (although I suspect that it should involve _ignoring_ non-data writes like the atime updates, and trigger a timeout on data writes so that when you actually write a file, you'll know that the sync will happen within five seconds of ...
.. and by "finished the write" I mean "closed the file", not "end of write() system call". Ie it's one of those things where it really doesn't mostly make much sense to try to give any kinds of flush guarantees until the user has basically shown that he's all done with writing. If you have something like removable media, and you actually remove it while you have a "cp -R" in progress, it damn well won't matter whether we were synchronous or not. But if you remove it after the "cp" has actually finished, it's a lot more understandable if somebody expects it to be on disk. So one thing we could perhaps consider is to make FAT in particular consider "sync" mounts to be about open/close consistency, not about per-write-system-call consistency. So the "close()" wouldn't return until the file is on disk, but we wouldn't force a synchronous rewrite the inode or the file allocation table thousands of times just because the file was big. FAT really is kind of different. I suspect we could just change what "sync" means for it. But it would probably be good to have a VFS-level notion of open-close consistency. It is, after all, what NFS is already supposed to give you, so there is precedence for that being a useful IO serialization model. Al, what do you think? Linus --
I was reading the vfat code, and it turns out that vfat has a "flush"
mount option. Which is documented in the code (not in the manpage) as:
struct fat_mount_options {
...
unsigned
...
flush:1, /* write things quickly */
Since that was very informative I looked at the usage. It's used
in fat_file_release() to do almost what you describe. But it seems to
be a best effort thing. If my data doesn't hit the disk (or flash) in
HZ/10, then all bets are off.
-Bart
--
WebSig: http://www.jukie.net/~bart/sig/
--
Yeah, probably the one place where it was documented was http://kernelnewbies.org/Linux_2_6_19 (commit ae78bf9c4f5fde3c67e2829505f195d7347ce3e4). Apparently at least one person must have read it, because there's a lot of dmesgs in the interweb with this string in it: "FAT: Unrecognized mount option "flush" or missing value" due to HAL trying to mount vfat filesystems with that option in kernels that don't support it. Which broke USB stick support for them, BTW, and generated a lot of posts in forums asking for help. IOW: not only this option exist, it's also being used by default in all the vfat mounts on the linux systems that use HAL, ie. almost all of them. --
While debugging a sync mount regression on vfat I noticed that there were mount options parsed by the driver that were not documented. Signed-off-by: Bart Trojanowski <bart@jukie.net> Cc: trivial@kernel.org Cc: hirofumi@mail.parknet.co.jp --- Documentation/filesystems/vfat.txt | 31 +++++++++++++++++++++++++++++-- 1 files changed, 29 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index bbac4f1..4e26744 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt @@ -8,6 +8,14 @@ if you want to format from within Linux. VFAT MOUNT OPTIONS ---------------------------------------------------------------------- +uid=### -- Explicitly set ownership of all files on this + filesystem to this user ID number. Default is to use + the UID of the mounting process. + +gid=### -- Explisitly set ownership of all files on thsi + filesystem to this group ID number. Default is to use + the GID of the mounting process. + umask=### -- The permission mask (for files and directories, see umask(1)). The default is the umask of current process. @@ -36,7 +44,7 @@ codepage=### -- Sets the codepage number for converting to shortname characters on FAT filesystem. By default, FAT_DEFAULT_CODEPAGE setting is used. -iocharset=name -- Character set to use for converting between the +iocharset=<name> -- Character set to use for converting between the encoding is used for user visible filename and 16 bit Unicode characters. Long filenames are stored on disk in Unicode format, but Unix for the most part doesn't @@ -86,6 +94,10 @@ check=s|r|n -- Case sensitivity checking setting. r: relaxed, case insensitive n: normal, default setting, currently case insensitive +nocase -- If set, files and directories read from device will ...
I actually did catch it, but I didn't regenerate the patch before sending it. Thanks for reading it over, Grant. -- WebSig: http://www.jukie.net/~bart/sig/ --
While debugging a sync mount regression on vfat I noticed that there were mount options parsed by the driver that were not documented. Signed-off-by: Bart Trojanowski <bart@jukie.net> Cc: trivial@kernel.org Cc: hirofumi@mail.parknet.co.jp --- Documentation/filesystems/vfat.txt | 31 +++++++++++++++++++++++++++++-- 1 files changed, 29 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index bbac4f1..5ed7918 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt @@ -8,6 +8,14 @@ if you want to format from within Linux. VFAT MOUNT OPTIONS ---------------------------------------------------------------------- +uid=### -- Explicitly set ownership of all files on this + filesystem to this user ID number. Default is to use + the UID of the mounting process. + +gid=### -- Explisitly set ownership of all files on this + filesystem to this group ID number. Default is to use + the GID of the mounting process. + umask=### -- The permission mask (for files and directories, see umask(1)). The default is the umask of current process. @@ -36,7 +44,7 @@ codepage=### -- Sets the codepage number for converting to shortname characters on FAT filesystem. By default, FAT_DEFAULT_CODEPAGE setting is used. -iocharset=name -- Character set to use for converting between the +iocharset=<name> -- Character set to use for converting between the encoding is used for user visible filename and 16 bit Unicode characters. Long filenames are stored on disk in Unicode format, but Unix for the most part doesn't @@ -86,6 +94,10 @@ check=s|r|n -- Case sensitivity checking setting. r: relaxed, case insensitive n: normal, default setting, currently case insensitive +nocase -- If set, files and directories read from device will ...
Thanks. Fixed. Current patch is the following.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
From: Bart Trojanowski <bart@jukie.net>
While debugging a sync mount regression on vfat I noticed that there
were mount options parsed by the driver that were not documented.
[hirofumi@mail.parknet.co.jp: fix some parts]
Signed-off-by: Bart Trojanowski <bart@jukie.net>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
---
Documentation/filesystems/vfat.txt | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff -puN Documentation/filesystems/vfat.txt~fat-doc-fixes Documentation/filesystems/vfat.txt
--- linux-2.6/Documentation/filesystems/vfat.txt~fat-doc-fixes 2008-08-23 12:15:06.000000000 +0900
+++ linux-2.6-hirofumi/Documentation/filesystems/vfat.txt 2008-08-23 12:23:15.000000000 +0900
@@ -8,6 +8,14 @@ if you want to format from within Linux.
VFAT MOUNT OPTIONS
----------------------------------------------------------------------
+uid=### -- Explicitly set ownership of all files on this
+ filesystem to this user ID number. Default is to use
+ the UID of the mounting process.
+
+gid=### -- Explisitly set ownership of all files on this
+ filesystem to this group ID number. Default is to use
+ the GID of the mounting process.
+
umask=### -- The permission mask (for files and directories, see umask(1)).
The default is the umask of current process.
@@ -36,7 +44,7 @@ codepage=### -- Sets the codepage numbe
characters on FAT filesystem.
By default, FAT_DEFAULT_CODEPAGE setting is used.
-iocharset=name -- Character set to use for converting between the
+iocharset=<name> -- Character set to use for converting between the
encoding is used for user visible filename and 16 bit
Unicode characters. Long filenames are stored on disk
in Unicode format, but Unix for the most part doesn't
@@ -86,6 +94,8 @@ ...Indeed, I have made some mistakes.
... it has a 'explicit' typo, spotted by Ian Ward, and it's not setting
ownership, but rather group ownership. So how about this
gid=### -- Explicitly set group ownership of all files on this
filesystem to this group ID number. Default is to use
the GID of the mounting process.
-Bart
--
WebSig: http://www.jukie.net/~bart/sig/
--
Looks good. However, I've noticed manpage's one may be simpler and
consistent. So, the updated patch is following.
If you found something wrong, please let me know. Thanks
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
From: Bart Trojanowski <bart@jukie.net>
While debugging a sync mount regression on vfat I noticed that there
were mount options parsed by the driver that were not documented.
[hirofumi@mail.parknet.co.jp: fix some parts]
Signed-off-by: Bart Trojanowski <bart@jukie.net>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
---
Documentation/filesystems/vfat.txt | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff -puN Documentation/filesystems/vfat.txt~fat-doc-fixes Documentation/filesystems/vfat.txt
--- linux-2.6/Documentation/filesystems/vfat.txt~fat-doc-fixes 2008-08-23 12:15:06.000000000 +0900
+++ linux-2.6-hirofumi/Documentation/filesystems/vfat.txt 2008-08-23 23:38:32.000000000 +0900
@@ -8,6 +8,12 @@ if you want to format from within Linux.
VFAT MOUNT OPTIONS
----------------------------------------------------------------------
+uid=### -- Set the owner of all files on this filesystem.
+ The default is the uid of current process.
+
+gid=### -- Set the group of all files on this filesystem.
+ The default is the gid of current process.
+
umask=### -- The permission mask (for files and directories, see umask(1)).
The default is the umask of current process.
@@ -36,7 +42,7 @@ codepage=### -- Sets the codepage numbe
characters on FAT filesystem.
By default, FAT_DEFAULT_CODEPAGE setting is used.
-iocharset=name -- Character set to use for converting between the
+iocharset=<name> -- Character set to use for converting between the
encoding is used for user visible filename and 16 bit
Unicode characters. Long filenames are stored on disk
in Unicode format, but Unix for the most part doesn't
@@ -86,6 +92,8 @@ check=s|r|n -- Case ...Bart Trojanowski <bart@jukie.net> writes: showexec -- If set, the execute permission bits of the file will be allowed only if the extension part of the name is .EXE, sys_immutable -- If set, ATTR_SYS attribute on FAT is handled as flush -- If set, the filesystem will try to flush to disk more early than normal. Not set by default. -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> --
