You know the drill by now: another week, another -rc. There's a lot of small stuff in here, most people won't even notice. The most noticeable thing is for all you 32-bit x86 people who use PAE (enabled by the HIGHMEM64G config option) due to having too much memory in your machine - mprotect() was broken due to some of the PAT fix/cleanup patches, causing the NX bit to be not set correctly. So if you had PAE enabled _and_ a recent enough CPU to have NX, but not recent enough to be 64-bit (or you were just perverse and wanted to run a 32-bit kernel despite having a chip that could do 64-bit and enough memory that you _really_ should have used a 64-bit kernel), you'd get various random program failures with SIGSEGV. It ranged from X not starting up to apparently OpenOffice not working if it did. But most of the changes, as usual, are in drivers, at 60%, with some DRI changes leading the way (fixing a number of other regressions, mainly by reverting the under-cooked vblank update). Network, MMC, USB, watchdog and IDE drivers also got updates. We had CIFS and NFS updates, and some arch updates as usual. The dirstat gives the overview: 2.0% arch/blackfin/ 3.3% arch/powerpc/configs/ 3.7% arch/powerpc/ 10.4% arch/ 4.5% drivers/ata/ 15.4% drivers/char/drm/ 15.8% drivers/char/ 9.6% drivers/hwmon/ 6.4% drivers/mmc/card/ 6.5% drivers/mmc/ 2.6% drivers/net/sfc/ 8.4% drivers/net/ 5.8% drivers/usb/class/ 6.1% drivers/usb/ 3.5% drivers/watchdog/ 60.9% drivers/ 7.7% fs/cifs/ 2.5% fs/nfs/ 12.7% fs/ 4.5% include/ 2.4% net/ipv4/ 2.0% net/sunrpc/xprtrdma/ 2.3% net/sunrpc/ 6.7% net/ and I append the shortlog for a sense of the details. Nothing really hugely exciting, I think we're doing pretty ok in the release cycle, and I'm getting the feeling that things are calming down. Hopefully I'm not wrong about that "calming down" feeling, but on the whole this release cycle doesn't seem to ...
I did get this one (which I didn't on 2.6.25.2) [42949399.810959] ck804xrom ck804xrom_init_one(): Unable to register resource 0x0000000000000000-0x00000000ffffffff - kernel bug? [42949399.979924] ------------[ cut here ]------------ [42949399.979924] WARNING: at arch/x86/mm/ioremap.c:159 __ioremap_caller+0x299/0x330() [42949399.979924] Modules linked in: ck804xrom(+) mtd i2c_nforce2(+) niu(+) i2c_core serio_raw button(+) chipreg map_funcs pcspkr k8temp shpchp pci_hotplug evdev joydev ext3 jbd mbcache pata_amd sr_mod cdrom sg sd_mod usb_storage libusual pata_acpi usbhid hid mptsas ata_generic mptspi scsi_transport_sas mptscsih mptbase libata scsi_transport_spi ehci_hcd e1000 scsi_mod dock ohci_hcd usbcore dm_mirror dm_log dm_snapshot dm_mod thermal processor fan fuse [42949399.979924] Pid: 5660, comm: modprobe Not tainted 2.6.26-rc4 #1 [42949399.979924] [42949399.979924] Call Trace: [42949399.979924] [<ffffffff80236aa4>] warn_on_slowpath+0x64/0xa0 [42949399.979924] [<ffffffff8034b697>] idr_get_empty_slot+0xf7/0x280 [42949399.979924] [<ffffffff80237b4e>] printk+0x4e/0x60 [42949399.979924] [<ffffffff80465e52>] klist_iter_exit+0x12/0x20 [42949399.979924] [<ffffffff80223139>] __ioremap_caller+0x299/0x330 [42949399.979924] [<ffffffff8025c23a>] sys_init_module+0x17a/0x1d00 [42949399.979924] [<ffffffff8028c228>] vma_prio_tree_insert+0x28/0x60 [42949399.979924] [<ffffffff8020c2bb>] system_call_after_swapgs+0x7b/0x80 [42949399.979924] [42949399.979924] ---[ end trace 5bb785355abc57e6 ]--- [42949399.979924] ck804xrom: ioremap(00000000, 100000000) failed -- Jesper --
Something is trying to register a 4GB resource. That sounds unlikely (possible on a 64-bit PCI setup, but I think it's more likely to be some overflow of 0 in "unsigned int"). In fact, this seems to be due to some driver bug. It looks like we have window->size = 0xffffffffUL - window->phys + 1UL; and in order for window->size to be 0x100000000, that means that window->phys has to be 0. Which looks impossible, or at least like ent->driver_data is neither DEV_CK804 nor DEV_MCP55. Very odd. is then just a result of the driver blindly continuing and trying to "ioremap()" the resource even though it's bogus and the resource allocation failed. In other words, that driver init routine is really bad about error handling. Carl-Daniel? David? Linus --
On Mon, 26 May 2008 14:42:37 -0700 (PDT) btw this guy has shown up on kerneloops.org a lot: http://www.kerneloops.org/searchweek.php?search=__ioremap_caller where it's trying to map memory as uncachable, which is.. well nasty (it seems to map not just the piece it needs, but more, and then turns that "more" uncachable, even if the kernel is using it for "normal" things) --
On Mon, 26 May 2008 17:25:19 -0700 one thing to note: it only shows up on 64 bit kernels somehow... interesting. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
The driver needs that 'more' to reach the lock registers for the flash chip. If it's being used for other things, shouldn't the request_region() fail? On a vaguely related note, there's a lot to be said for _not_ using the standard PCI driver setup on the BIOS flash drivers, and going back to having them manually loaded rather than being automatically loaded whenever the appropriate southbridge is present. It would be nicer if the only people who have write access to their BIOS flash are the ones who _really_ wanted it. -- dwmw2 --
On Tue, 27 May 2008 06:43:51 +0100
it does fail....
... and then the driver continues anyway!
if (request_resource(&iomem_resource, &window->rsrc)) {
window->rsrc.parent = NULL;
printk(KERN_ERR MOD_NAME
" %s(): Unable to register resource"
" 0x%.016llx-0x%.016llx - kernel bug?\n",
__func__,
(unsigned long long)window->rsrc.start,
(unsigned long long)window->rsrc.end);
}
absolutely
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
Heh. That's.... naughty. There are kind of valid reasons for that kind Just dropping the MODULE_DEVICE_TABLE() should suffice. We should probably also use the code which is currently in #if 0 which uses pci_register_driver() instead of doing things for itself; I'm not entirely sure why that is commented out. -- dwmw2 --
It hurts to look at this:
static struct pci_device_id ck804xrom_pci_tbl[] = {
{ PCI_VENDOR_ID_NVIDIA, 0x0051, PCI_ANY_ID, PCI_ANY_ID, DEV_CK804 },
{ PCI_VENDOR_ID_NVIDIA, 0x0360, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
{ PCI_VENDOR_ID_NVIDIA, 0x0361, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
{ PCI_VENDOR_ID_NVIDIA, 0x0362, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
{ PCI_VENDOR_ID_NVIDIA, 0x0363, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
{ PCI_VENDOR_ID_NVIDIA, 0x0364, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
{ PCI_VENDOR_ID_NVIDIA, 0x0365, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
{ PCI_VENDOR_ID_NVIDIA, 0x0366, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
{ PCI_VENDOR_ID_NVIDIA, 0x0367, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
{ 0, }
};
considering how struct pci_device_id looks like:
struct pci_device_id {
__u32 vendor, device; /* Vendor and device ID or PCI_ANY_ID*/
__u32 subvendor, subdevice; /* Subsystem ID's or PCI_ANY_ID */
__u32 class, class_mask; /* (class,subclass,prog-if) triplet */
kernel_ulong_t driver_data; /* Data private to the driver */
};
DEV_CK804 and DEV_MCP55 actually end up in class instead of driver_data.
I'd send a patch, but I'm traveling and my only code access is gitweb.
New code should look like
static struct pci_device_id ck804xrom_pci_tbl[] = {
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data = DEV_CK804 },
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data = DEV_MCP55 },
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data = DEV_MCP55 },
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data = DEV_MCP55 },
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data = DEV_MCP55 },
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data = DEV_MCP55 },
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data = DEV_MCP55 },
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data = DEV_MCP55 },
{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
{ 0, }
};
Regards,
Carl-Daniel
--
The change is Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> in case someone makes a patch from it. Regards, Carl-Daniel --
Here you go.
-- ams
diff --git a/drivers/mtd/maps/ck804xrom.c b/drivers/mtd/maps/ck804xrom.c
index 59d8fb4..effaf7c 100644
--- a/drivers/mtd/maps/ck804xrom.c
+++ b/drivers/mtd/maps/ck804xrom.c
@@ -331,15 +331,15 @@ static void __devexit ck804xrom_remove_one (struct pci_dev *pdev)
}
static struct pci_device_id ck804xrom_pci_tbl[] = {
- { PCI_VENDOR_ID_NVIDIA, 0x0051, PCI_ANY_ID, PCI_ANY_ID, DEV_CK804 },
- { PCI_VENDOR_ID_NVIDIA, 0x0360, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0361, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0362, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0363, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0364, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0365, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0366, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0367, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data = DEV_CK804 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
{ 0, }
};
--
Abhijit Menon-Sen wrote: > At 2008-05-27 03:23:19 +0200, c-d.hailfinger.devel.2006@gmx.net wrote: >> The change is >> Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> >> in case someone makes a patch from it. > > Here you go. > > -- ams > > diff --git a/drivers/mtd/maps/ck804xrom.c b/drivers/mtd/maps/ck804xrom.c > index 59d8fb4..effaf7c 100644 Ok. Patch applied. The stacktrace goes away, but I still get these: [42949399.211630] ck804xrom ck804xrom_init_one(): Unable to register resource 0x00000000ffb00000-0x00000000ffffffff - kernel bug? [42949399.399754] CFI: Found no ck804xrom @ffc00000 device at location zero [42949399.409759] JEDEC: Found no ck804xrom @ffc00000 device at location zero [42949399.409759] CFI: Found no ck804xrom @ffc00000 device at location zero [42949399.409759] JEDEC: Found no ck804xrom @ffc00000 device at location zero [42949399.409759] CFI: Found no ck804xrom @ffc00000 device at location zero [42949399.409759] JEDEC: Found no ck804xrom @ffc00000 device at location zero [42949399.409759] CFI: Found no ck804xrom @ffc10000 device at location zero [42949399.409759] JEDEC: Found no ck804xrom @ffc10000 device at location zero [42949399.409759] CFI: Found no ck804xrom @ffc10000 device at location zero [42949399.409759] JEDEC: Found no ck804xrom @ffc10000 device at location zero [42949399.409759] CFI: Found no ck804xrom @ffc10000 device at location zero ... and many more. The system works Ok anyway. -- Jesper --
There's a reason why using C99 initialisers even in the supposedly
trivial structs is a good idea.
Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
diff --git a/drivers/mtd/maps/ck804xrom.c b/drivers/mtd/maps/ck804xrom.c
index 59d8fb4..effaf7c 100644
--- a/drivers/mtd/maps/ck804xrom.c
+++ b/drivers/mtd/maps/ck804xrom.c
@@ -331,15 +331,15 @@ static void __devexit ck804xrom_remove_one (struct pci_dev *pdev)
}
static struct pci_device_id ck804xrom_pci_tbl[] = {
- { PCI_VENDOR_ID_NVIDIA, 0x0051, PCI_ANY_ID, PCI_ANY_ID, DEV_CK804 },
- { PCI_VENDOR_ID_NVIDIA, 0x0360, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0361, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0362, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0363, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0364, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0365, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0366, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
- { PCI_VENDOR_ID_NVIDIA, 0x0367, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data = DEV_CK804 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data = DEV_MCP55 },
+ { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
{ 0, }
};
--
dwmw2
--
PREEMPT_RCU is in use, again. And die counter is 2 because of CFQ oops I haven't noticed earlier. 0xffffffff802447cb is in find_pid_ns (kernel/pid.c:297). 292 struct hlist_node *elem; 293 struct upid *pnr; 294 295 hlist_for_each_entry_rcu(pnr, elem, 296 &pid_hash[pid_hashfn(nr, ns)], pid_chain) 297 if (pnr->nr == nr && pnr->ns == ns) 298 return container_of(pnr, struct pid, 299 numbers[ns->level]); 300 301 return NULL; general protection fault: 0000 [2] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: ext2 nf_conntrack_irc xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables usblp ehci_hcd uhci_hcd usbcore sr_mod cdrom Pid: 15599, comm: profil01 Tainted: G D 2.6.26-rc4 #1 RIP: 0010:[<ffffffff802447cb>] [<ffffffff802447cb>] find_pid_ns+0x6b/0xa0 RSP: 0018:ffff810129021ea8 EFLAGS: 00010202 RAX: ffff810130580948 RBX: 0000000000003cef RCX: ffff81017d865278 RDX: 6b6b6b6b6b6b6b6b RSI: ffffffff80566760 RDI: 0000000000003cef RBP: ffff810129021ea8 R08: 0000000000000000 R09: 00007f9a93987b70 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 R13: 0000000000000011 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f9a9397c6f0(0000) GS:ffffffff805c6000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000257f2e8 CR3: 0000000102479000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process profil01 (pid: 15599, threadinfo ffff810129020000, task ffff81004bc24500) Stack: ffff810129021eb8 ffffffff8024487d ffff810129021f78 ffffffff8023f275 0000000000000011 0000000000000000 0000000000003cef ffff810129020000 ffffffff8061b140 00007f9a93989bc0 00007fff9b98a410 ffffffff8045fd63 Call Trace: ...
Is this reproducible? In theory find_pid() is not safe without rcu_read_lock() if CONFIG_PREEMPT_RCU. But we have a lot of "read_lock(tasklist_lock) + find_pid()", this was legal and documented. It was actually broken, but happened to work because read_lock() implied rcu_read_lock(). Could you look at [PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU http://marc.info/?t=120162615300012 ? I am not sure this is the actual reason though, the race is very unlikely. Oleg. --
That repeated 0x6b is POISON_FREE, and the code is cmp -0x10(%rdx),%edi which is the load of "pnr->nr". So 'pnr' has been free'd. That is a *very* unlikely race, especially as that bad_fork_free_pid case would only happen if pid_ns_prepare_proc() fails. And if it fails, it's still very unlikely to hit, I think. That said, it does smell like a bug. But I *really* would be much much happier if even SRCU at least waited for a grace period, so that it would always be safe to just disable preemption for a "rcu_read_lock()". That way, things that take spinlocks are safe even with SRCU. Paul? How hard would it be to make preemptable RCU just honor that classic RCU behavior? Linus --
SRCU does wait for all CPUs to schedule, and thus already waits for all Hmmm... Might not be too hard, I will look into this. Should just be another stage in the rcu_try_flip state machine, along with a few of the changes already in the queue for call_rcu_sched(). But this will only help until preemptible spinlocks arrive, right? Thanx, Paul --
I don't think we will ever have preemptible spinlocks. If you preempt spinlocks, you have serious issues with contention and priority inversion etc, and you basically need to turn them into sleeping mutexes. So now you also need to do interrupts as sleepable threads etc etc. And it would break the existing non-preempt RCU usage anyway. Yeah, maybe the RT people try to do that, but quite frankly, it is insane. Spinlocks are *different* from sleeping locks, for a damn good reason. Linus --
Well, I guess I never claimed to be sane... Anyway, will look at a preemptable RCU that waits for preempt-disable sections of code. Thanx, Paul --
And here is a just-now hacked up patch. Untested, probably fails to compile.
Just kicked off a light test run, will let you know how it goes.
Thanx, Paul
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
include/linux/rcupreempt.h | 15 ++++++++-
kernel/Kconfig.preempt | 15 +++++++++
kernel/rcupreempt.c | 71 ++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 99 insertions(+), 2 deletions(-)
diff -urpNa -X dontdiff linux-2.6.26-rc3/include/linux/rcupreempt.h linux-2.6.26-rc3-rcu-gcwnp/include/linux/rcupreempt.h
--- linux-2.6.26-rc3/include/linux/rcupreempt.h 2008-05-23 02:26:06.000000000 -0700
+++ linux-2.6.26-rc3-rcu-gcwnp/include/linux/rcupreempt.h 2008-05-27 21:27:35.000000000 -0700
@@ -40,7 +40,20 @@
#include <linux/cpumask.h>
#include <linux/seqlock.h>
-#define rcu_qsctr_inc(cpu)
+struct rcu_dyntick_sched {
+ int qs;
+ int rcu_qs_snap;
+};
+
+DECLARE_PER_CPU(struct rcu_dyntick_sched, rcu_dyntick_sched);
+
+static inline void rcu_qsctr_inc(int cpu)
+{
+ struct rcu_dyntick_sched *rdssp = &per_cpu(rcu_dyntick_sched, cpu);
+
+ rdssp->qs++;
+}
+
#define rcu_bh_qsctr_inc(cpu)
#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
diff -urpNa -X dontdiff linux-2.6.26-rc3/kernel/Kconfig.preempt linux-2.6.26-rc3-rcu-gcwnp/kernel/Kconfig.preempt
--- linux-2.6.26-rc3/kernel/Kconfig.preempt 2008-04-16 19:49:44.000000000 -0700
+++ linux-2.6.26-rc3-rcu-gcwnp/kernel/Kconfig.preempt 2008-05-27 21:27:39.000000000 -0700
@@ -77,3 +77,18 @@ config RCU_TRACE
Say Y here if you want to enable RCU tracing
Say N if you are unsure.
+
+config PREEMPT_RCU_WAIT_PREEMPT_DISABLE
+ bool "Cause preemptible RCU to wait for preempt_disable code"
+ depends on PREEMPT_RCU
+ default y
+ help
+ This option causes preemptible RCU's grace periods to wait
+ on preempt_disable() code sections (such as spinlock critical
+ sections in CONFIG_PREEMPT kernels) as well as for RCU
+ read-side critical sections. This ...And it passes light testing on a 4-CPU x86 box. --
To be precise, this case also happens when fork() fails due to signal_pending(). But I agree, this race is pretty much theoretical. Oleg. --
Perhaps we have the unbalanced put_pid(), in that case "struct pid" will
be freed without waiting for a grace period.
Alexey, could you re-test with the patch below?
Oleg.
Add the temporary debugging code to catch the unbalanced put_pid()'s.
At least those which can free the "live" pid.
--- MM/kernel/pid.c~ 2008-02-20 18:29:40.000000000 +0300
+++ MM/kernel/pid.c 2008-02-20 18:35:15.000000000 +0300
@@ -208,6 +208,10 @@ void put_pid(struct pid *pid)
ns = pid->numbers[pid->level].ns;
if ((atomic_read(&pid->count) == 1) ||
atomic_dec_and_test(&pid->count)) {
+ int type = PIDTYPE_MAX;
+ while (--type >= 0)
+ if (WARN_ON(!hlist_empty(&pid->tasks[type])))
+ return;
kmem_cache_free(ns->pid_cachep, pid);
put_pid_ns(ns);
}
--
I have this patchsets collected from LKML, that still apply ontop of -rc4. Are they not so urgent or are they not needed any more ? JBD[2] races http://marc.info/?l=linux-ext4&m=121141319601650&w=2 http://marc.info/?l=linux-ext4&m=121141319701660&w=2 libata EH timeout handling http://marc.info/?l=linux-ide&m=121121761530723&w=2 alignment in block DMA http://marc.info/?l=linux-ide&m=121125981930670&w=2 -- J.A. Magallon <jamagallon()ono!com> \ Software is like sex: \ It's better when it's free Mandriva Linux release 2008.1 (Cooker) for i586 Linux 2.6.23-jam05 (gcc 4.2.2 20071128 (4.2.2-2mdv2008.1)) SMP PREEMPT --
Unless I misread the thread, akpm had a raft of issues with the first one, all of which were spelling except one initialization which he thought was not needed. All of those could be fixed in less effort that it takes to mention them, perhaps, but even so I think the patch could be ready in a short time. The problem looks worth fixing to me, but probably not a must have at the rc4 level. Maybe in 2.6.27 we could get rid of the potential race? -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
Hi. I'm getting this one. The mount.nfs call is being called by the autofs-daemon. It is not directly reproducible, but happens around once a day on a single node on a 48 node cluster. I reported it to the NFS-guys on a .22 kernel, and was directed to the fsdevel list(CC'ed again) back then. Now the problem is reproduces on .26-rc4 Jesper Jun 3 09:46:58 node40 kernel: [17191699.952564] PGD 9f6f5067 PUD baf4f067 PMD 0 Jun 3 09:46:58 node40 kernel: [17191699.952564] CPU 1 Jun 3 09:46:58 node40 kernel: [17191699.952564] Modules linked in: nfs lockd sunrpc autofs4 ipv6 af_packet usbhid hid uhci_hcd ehci_hcd usbkbd parport_pc lp parport amd_rng i2c_amd756 container psmouse serio_raw button pcspkr evdev i2c_core k8temp shpchp pci_hotplug ext3 jbd mbcache sg sd_mod ide_cd_mod cdrom amd74xx ide_core floppy mptspi mptscsih mptbase scsi_transport_spi tg3 ata_generic libata scsi_mod dock ohci_hcd usbcore thermal processor fan thermal_sys fuse Jun 3 09:46:58 node40 kernel: [17191699.952564] Pid: 15602, comm: mount.nfs Not tainted 2.6.26-rc4 #1 Jun 3 09:46:58 node40 kernel: [17191700.727822] RIP: 0010:[graft_tree+77/288] [graft_tree+77/288] graft_tree+0x4d/0x120 Jun 3 09:46:58 node40 kernel: [17191700.727822] RSP: 0000:ffff810068683e08 EFLAGS: 00010246 Jun 3 09:46:58 node40 kernel: [17191700.727822] RAX: ffff8100bf836a90 RBX: 00000000ffffffec RCX: 0000000000000000 Jun 3 09:46:58 node40 kernel: [17191700.983576] RDX: ffff8100fa378500 RSI: ffff810068683e68 RDI: ffff810029252b00 Jun 3 09:46:58 node40 kernel: [17191700.983576] RBP: ffff810029252b00 R08: 0000000000000000 R09: 0000000000000001 Jun 3 09:46:58 node40 kernel: [17191700.983576] R10: 0000000000000001 R11: ffffffff803011c0 R12: ffff810068683e68 Jun 3 09:46:58 node40 kernel: [17191700.983576] R13: 0000000000000000 R14: 000000000000000b R15: 000000000000000b Jun 3 09:46:58 node40 kernel: [17191700.983576] FS: 00007f525595b6e0(0000) GS:ffff8100fbb12280(0000) knlGS:00000000cb307b90 Jun 3 09:46:58 node40 ...
This is all I got from the logs. I'll try go get more from the serial console. But since it is not directly reproducible, it is a bit hard. Suggestions are welcome? Jesper -- Jesper Krogh --
Probably the same as this one: http://www.kerneloops.org/raw.php?rawid=12419&msgid= Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode), which would be due to NFS not properly filling in its root dentry? Miklos --
On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which means it's an autofs thing. CC-ing Ian. Miklos --
It is path->dentry, all right, but the question is how'd it get that way.
Look: we got that nd.path.dentry out of path_lookup() with LOOKUP_FOLLOW
as flags. Then we'd passed it through do_new_mount() to do_add_mount()
without changes. And went through
/* Something was mounted here while we slept */
while (d_mountpoint(nd->path.dentry) &&
follow_down(&nd->path.mnt, &nd->path.dentry))
;
--
And this relates to previous in that a mount isn't done by autofs until until after the directory is created, at which time the (->mkdir()) dentry is hashed. Ian --
Look more carefully. It's path->dentry; aside of the fact that dentry pointer is fetched at offset 8 from one of the arguments (fits path->dentry, too low for mnt->mnt_root), do_add_mount() itself has just done S_ISLNK on the very same thing, so it'd die before getting to graft_tree(). No, it's either path_lookup() somehow returning a negative dentry in do_mount() (which shouldn't be possible, unless it's some crap around return_reval in __link_path_walk()) or it's follow_down() giving us a negative dentry. Which almost certainly would've exploded prior to that... --
I think it must be autofs4 doing something weird. Like this in
autofs4_lookup_unhashed():
/*
* Make the rehashed dentry negative so the VFS
* behaves as it should.
*/
if (inode) {
dentry->d_inode = NULL;
Miklos
--
Lovely. If we ever step into that with somebody else (no matter who) holding a reference to that dentry, we are certainly well and truly buggered. It's not just mount(2) - everything in the tree assumes that holding a reference to positive dentry guarantees that it remains positive. Ian? --
The intent here is that, the dentry above is unhashed at this point, and if hasn't been reclaimed by the VFS, it is made negative and replaces the unhashed negative dentry passed to ->lookup(). The reference count is incremented to account for the reference held by the path walk. What am I doing wrong here? Ian --
Uhhuh. Yeah, that's not allowed. A dentry inode can start _out_ as NULL, but it can never later become NULL Indeed. Things like regular file ops won't even test the inode, since they know that "open()" will only open a dentry with a positive entry, so they know that the dentry->inode is non-NULL. [ Although some code-paths do test - but that is just because people are What's wrong is that you can't do that "dentry->d_inode = NULL". EVER. Why would you want to? If the dentry is already unhashed, then no _new_ lookups will ever find it anyway, so it's effectively unfindable anyway. Except by people who *have* to find it, ie the people who already hold it open (because, for example, they opened it earlier, or because they chdir()'ed into a subdirectory). So why don't you just return a NULL dentry instead, for a unhashed dentry? Or do the "goto next" thing? Linus --
The code we're talking about deals with a race between expiring and mounting an autofs mount point at the same time. I'll have a closer look and see if I can make it work without turning That just won't work for the case this is meant to deal with. Ian --
Hmm. Can you walk me through this? If the dentry is unhashed, it means that it _either_ - has already been deleted (rmdir'ed) or d_invalidate()'d. Right? I don't see why you should ever return the dentry in this case.. - or it has not yet been hashed at all But then d_inode should be NULL too, no? Anyway, as far as I can tell, you should handle the race between expiring and re-mounting not by unhashing at expire time (which causes these kinds of problems), but by setting a bit in the dentry and using the dentry "revalidate()" callback to wait for the revalidate. But I don't know autofs4, so you probably have some reason. Could you explain it? Linus --
From my reading of that code looks like it's been rmdir'ed. And no, I don't understand what the hell is that code trying to do. Ian, could you describe the race you are talking about? --
BTW, this stuff is definitely broken regardless of mount - if something had the directory in question opened before that rmdir and we'd hit your lookup_unhashed while another CPU had been in the middle of getdents(2) on that opened descriptor, we'll get vfs_readdir() grabs i_mutex vfs_readdir() checks that it's dead autofs4_lookup_unhashed() calls iput() inode is freed vfs_readdir() releases i_mutex - in already freed struct inode. Hell, just getdents() right *after* dentry->d_inode = NULL will oops, plain and simple. --
Can this really happen, since autofs4_lookup_unhashed() is only called Yeah, I'll look into why I believed I needed to turn the dentry negative. I'll need to keep the dentry positive through out this process. Ian --
i_mutex on a different inode (obviously - it frees the inode in question, so if caller held i_mutex on it, you would be in trouble every time you hit that codepath). --
OK, I'll need to look at vfs_readdir(). I thought vfs_readdir() would take the containing directory mutex as does ->lookup(). --
vfs_readdir() takes i_mutex on directory it reads. I.e. on the victim in this case. lookup has i_mutex on directory it does lookup in, i.e. root in this case... --
Hmm. Looking closer, I think that code is meant to handle the
d_invalidate() that it did in autofs4_tree_busy().
However, that should never trigger for a directory entry that can be
reached some other way, because that code has done a "dget()" on the
dentry, and d_invalidate() does
if (atomic_read(&dentry->d_count) > 1) {
if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
..unlock..
return -EBUSY;
}
}
so I dunno. I still think the expire code shouldn't even use
d_invalidate() at all, and just revalidate() at lookup.
Linus
--
In the current code the only way a dentry gets onto this list is by one of the two operations ->unlink() or ->rmdir(), it is d_drop'ed and left positive by both of these operations (a carry over from 2.4 when d_lookup() returned unhashed dentrys, I missed that detail for quite a It's been a while now but the original patch description should help. "What happens is that during an expire the situation can arise that a directory is removed and another lookup is done before the expire issues a completion status to the kernel module. In this case, since the the lookup gets a new dentry, it doesn't know that there is an expire in progress and when it posts its mount request, matches the existing expire request and waits for its completion. ENOENT is then returned to user space from lookup (as the dentry passed in is now unhashed) without having performed the mount request. The solution used here is to keep track of dentrys in this unhashed state and reuse them, if possible, in order to preserve the flags. Additionally, this infrastructure will provide the framework for the reintroduction of caching of mount fails removed earlier in development." I wasn't able to do an acceptable re-implementation of the negative caching we had in 2.4 with this framework, so just ignore the last Unfortunately no, but I thought that once the dentry became unhashed (aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of course there may be descriptors open on the dentry, which I think is the Yes, that would be ideal but the reason we arrived here is that, because we must release the directory mutex before calling back to the daemon (the heart of the problem, actually having to drop the mutex) to perform the mount, we can get a deadlock. The cause of the problem was that for "create" like operations the mutex is held for ->lookup() and ->revalidate() but for a "path walks" the mutex is only held for ->lookup(), so if the mutex is held when we're in ->revalidate(), we could ...
... or we could have had a pending mount(2) sitting there with a reference I am. Oh, well... Looks like RTFS time for me for now... Additional parts of braindump would be appreciated - the last time I've seriously looked at autofs4 internal had been ~2005 or so ;-/ --
You will find other problems. The other bit to this is the patch to resolve the deadlock issue I spoke about just above. This is likely where most of the current problems started and the fact that we have always had to drop the mutex to call back the daemon. I can post the patches as well if that helps. The description accompanying that patch was (the inconsistent locking referred to here is what was described above): "Due to inconsistent locking in the VFS between calls to lookup and revalidate deadlock can occur in the automounter. The inconsistency is that the directory inode mutex is held for both lookup and revalidate calls when called via lookup_hash whereas it is held only for lookup during a path walk. Consequently, if the mutex is held during a call to revalidate autofs4 can't release the mutex to callback the daemon as it can't know whether it owns the mutex. This situation happens when a process tries to create a directory within an automount and a second process also tries to create the same directory between the lookup and the mkdir. Since the first process has dropped the mutex for the daemon callback, the second process takes it during revalidate leading to deadlock between the autofs daemon and the second process when the daemon tries to create the mount point directory. After spending quite a bit of time trying to resolve this on more than one occassion, using rather complex and ulgy approaches, it turns out that just delaying the hashing of the dentry until the create operation work fine." Ian --
commit 1864f7bd58351732593def024e73eca1f75bc352
Author: Ian Kent <raven@themaw.net>
Date: Wed Aug 22 14:01:54 2007 -0700
autofs4: deadlock during create
Due to inconsistent locking in the VFS between calls to lookup and
revalidate deadlock can occur in the automounter.
The inconsistency is that the directory inode mutex is held for both lookup
and revalidate calls when called via lookup_hash whereas it is held only
for lookup during a path walk. Consequently, if the mutex is held during a
call to revalidate autofs4 can't release the mutex to callback the daemon
as it can't know whether it owns the mutex.
This situation happens when a process tries to create a directory within an
automount and a second process also tries to create the same directory
between the lookup and the mkdir. Since the first process has dropped the
mutex for the daemon callback, the second process takes it during
revalidate leading to deadlock between the autofs daemon and the second
process when the daemon tries to create the mount point directory.
After spending quite a bit of time trying to resolve this on more than one
occassion, using rather complex and ulgy approaches, it turns out that just
Well, let me know what level of dump you'd like. I can give the 50,000
foot view, or I can give you the history of things that happened to get
us to where we are today, or anything inbetween. The more specific
your request, the quicker I can respond. A full brain-dump would take
some time!
Cheers,
Jeff
--
a) what the hell is going on in autofs4_free_ino()? It checks for
ino->dentry, when the only caller has just set it to NULL.
b) while we are at it, what's ino->inode doing there? AFAICS, it's
a write-only field...
c) what are possible states of autofs4 dentry and what's the supposed
life cycle of these beasts?
d)
/* For dentries of directories in the root dir */
static struct dentry_operations autofs4_root_dentry_operations = {
.d_revalidate = autofs4_revalidate,
.d_release = autofs4_dentry_release,
};
/* For other dentries */
static struct dentry_operations autofs4_dentry_operations = {
.d_revalidate = autofs4_revalidate,
.d_release = autofs4_dentry_release,
};
Just what is the difference?
e) in autofs4_tree_busy() we do atomic_read() on ino->count and dentry->d_count
What's going to keep these suckers consistent with each other in any useful
way?
--
The life cycle of a dentry for an indirect, non-browsable mount goes something like this: autofs4_lookup is called on behalf a process trying to walk into an automounted directory. That dentry's d_flags is set to DCACHE_AUTOFS_PENDING but not hashed. A waitqueue entry is created, indexed off of the name of the dentry. A callout is made to the automount daemon (via autofs4_wait). The daemon looks up the directory name in its configuration. If it finds a valid map entry, it will then create the directory using sys_mkdir. The autofs4_lookup call on behalf of the daemon (oz_mode == 1) will return NULL, and then the mkdir call will be made. The autofs4_mkdir function then instantiates the dentry which, by the way, is different from the original dentry passed to autofs4_lookup. (This dentry also does not get the PENDING flag set, which is a bug addressed by a patch set that Ian and I have been working on; specifically, the idea is to reuse the dentry from the original lookup, but I digress). The daemon then mounts the share on the given directory and issues an ioctl to wakeup the waiter. When awakened, the waiter clears the DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the dcache and returns that dentry if found. Later, the dentry gets expired via another ioctl. That path sets the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry. It then calls out to the daemon to perform the unmount and rmdir. The rmdir unhashes the dentry (and places it on the rehash list). The dentry is removed from the rehash list if there was a racing expire and mount or if the dentry is released. This description is valid for the tree as it stands today. Ian and I have been working on fixing some other race conditions which will change Nothing. There used to be, and I'm guessing Ian kept this around for, I'm afraid I'm not familiar enough with that part of the code to give you a good answer. Ian? Cheers, Jeff --
So what happens if new lookup hits between umount and rmdir? Another thing: would be nice to write down the expected state of dentry (positive/negative, flags, has/hasn't ->d_fsdata, flags on ->d_fsdata) for all stages. I'll go through the code and do that once I get some sleep, but if you'll have time to do it before that... FWIW, I wonder if it would be better to leave the directory alone and just have the daemon mount the sucker elsewhere and let the kernel side move the damn thing in place itself, along with making dentry positive and waking the sleepers up. Then we might get away with not unlocking anything at all... That obviously doesn't help the current systems with existing daemon, but it might be interesting for the next autofs version... Note that we don't even have to mount it anywhere - mount2() is close to the top of the pile for the next couple of cycles and it'd separate "activate fs" from "attach fully set up fs to given place", with the former resulting in a descriptor and the latter being mount2(Attach, dir_fd, fs_fd); Kernel side of autofs might receive the file descriptor in question and do the rest itself... --
It will wait for the expire to complete and then wait for a mount request to the daemon. This is an example of how I've broken the lookup by delaying the hashing of the dentry without providing a way for ->lookup() to pickup the same unhashed dentry prior the directory dentry being hashed. Currently only the first lookup after the d_drop will get this dentry. Keeping track of the dentry between the first lookup and it's subsequent hashing (or release) is what I want to do. But, as you point out, I also A dentry gets an info struct when it gets an inode and it should retain it until the dentry is released. When a dentry is selected for umount the AUTOFS_INF_EXPIRING (ino->flags) is set and cleared upon return (synchronous expire). The DCACHE_AUTOFS_PENDING (dentry->d_flags) flag should be set when a mount request is to be issued to the daemon and cleared when the request completes. I've introduced some inconsistency in setting and clearing Perhaps, if we didn't use /etc/mtab anywhere. It would make a difference if we could "mount" /proc/mounts onto a file such as /etc/mtab and everyone always did that. --
That's actually remarkably close to being possible. Loopback mounts have already been fixed not to rely on /etc/mtab. The only major piece missing is "user" mounts, and there's already a patchset for that waiting for review by the VFS maintainers. After that, it's just a "simple" issue of fixing up all the userspace pieces. Could be finished in the next decade, possibly ;) Miklos --
Actually, that explanation is a bit simple minded. It should wait for the expire in ->revalidate(). Following the expire completion d_invalidate() should return 0, since the dentry is now unhashed, which causes ->revalidate() to return 0. do_lookup() should see this and call a ->lookup(). But maybe I've missed something as I'm seeing a problem now. Ian --
Ok. Ive been running on the patch for a few days now .. and didn't see any problems. But that being said, I also turned off the --ghost option to autofs so if it actually is the patch or the different codepaths being used, I dont know. Since this is a production system, I'm a bit reluctant to just change a working setup to test it out. Jesper -- Jesper --
No need to change anything. My comment above relates to difficulties I'm having with patches that I'm working on that follow this one and the specific question that Al Viro asked "what happens if new lookup hits between umount and rmdir". But, clearly we need to know if I (autofs4) caused the specific problem you reported and if the patch resolves it. And that sounds promising from what you've seen so far. Ian --
Mmmm .. that comment might not be accurate either. It's beginning to look like my original approach, a post from back in Feb 2007, to fix a deadlock bug, wasn't right at all. But we don't really have time to determine that for sure now as it can take several days for the bug to trigger. Ian --
There is a problem with the patch I posted.
It will allow an incorrect ENOENT return in some cases.
The patch below is sufficiently different to the original patch I posted
to warrant a replacement rather than a correction.
If you can find a way to test this out that would be great.
autofs4 - don't make expiring dentry negative
From: Ian Kent <raven@themaw.net>
Correct the error of making a positive dentry negative after it has been
instantiated.
This involves removing the code in autofs4_lookup_unhashed() that makes
the dentry negative and updating autofs4_lookup() to check for an
unfinished expire and wait if needed. The dentry used for the lookup
must be negative for mounts to trigger in the required cases so the
dentry can't be re-used (which is probably for the better anyway).
Signed-off-by: Ian Kent <raven@themaw.net>
---
fs/autofs4/autofs_i.h | 6 +--
fs/autofs4/inode.c | 6 +--
fs/autofs4/root.c | 115 ++++++++++++++++++-------------------------------
3 files changed, 49 insertions(+), 78 deletions(-)
diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c3d352d..69b1497 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -52,7 +52,7 @@ struct autofs_info {
int flags;
- struct list_head rehash;
+ struct list_head expiring;
struct autofs_sb_info *sbi;
unsigned long last_used;
@@ -112,8 +112,8 @@ struct autofs_sb_info {
struct mutex wq_mutex;
spinlock_t fs_lock;
struct autofs_wait_queue *queues; /* Wait queue pointer */
- spinlock_t rehash_lock;
- struct list_head rehash_list;
+ spinlock_t lookup_lock;
+ struct list_head expiring_list;
};
static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..94bfc15 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -47,7 +47,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
ino->dentry = NULL;
ino->size = 0;
...I'll patch that in now and see what it gives. (Is it preferred to use this --ghost option or not?) What would you suggest? Jesper -- Jesper Krogh --
It is a matter of choice mostly. If you have a large number of entries in your map then don't use it. It will result in a performance hit especially during expires. Ian --
Oops, I must have not set the Preformat option in Evolution, let me try
again.
autofs4 - don't make expiring dentry negative
From: Ian Kent <raven@themaw.net>
Correct the error of making a positive dentry negative after it has been
instantiated.
This involves removing the code in autofs4_lookup_unhashed() that makes
the dentry negative and updating autofs4_lookup() to check for an
unfinished expire and wait if needed. The dentry used for the lookup
must be negative for mounts to trigger in the required cases so the
dentry can't be re-used (which is probably for the better anyway).
Signed-off-by: Ian Kent <raven@themaw.net>
---
fs/autofs4/autofs_i.h | 6 +--
fs/autofs4/inode.c | 6 +--
fs/autofs4/root.c | 115 ++++++++++++++++++-------------------------------
3 files changed, 49 insertions(+), 78 deletions(-)
diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c3d352d..69b1497 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -52,7 +52,7 @@ struct autofs_info {
int flags;
- struct list_head rehash;
+ struct list_head expiring;
struct autofs_sb_info *sbi;
unsigned long last_used;
@@ -112,8 +112,8 @@ struct autofs_sb_info {
struct mutex wq_mutex;
spinlock_t fs_lock;
struct autofs_wait_queue *queues; /* Wait queue pointer */
- spinlock_t rehash_lock;
- struct list_head rehash_list;
+ spinlock_t lookup_lock;
+ struct list_head expiring_list;
};
static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..94bfc15 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -47,7 +47,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
ino->dentry = NULL;
ino->size = 0;
- INIT_LIST_HEAD(&ino->rehash);
+ INIT_LIST_HEAD(&ino->expiring);
ino->last_used = jiffies;
atomic_set(&ino->count, 0);
@@ -338,8 +338,8 @@ int autofs4_fill_super(struct super_block *s, void *data, int ...I know. I know. And I think it has never been used anywhere either but I haven't removed There isn't any difference. There's no real reason to keep them different except that there are two The only time ino->count is changed is in ->mkdir(), ->rmdir and ->symlink() and ->unlink(). So it is supposed to represent the minimal reference count. The code in autofs4_free_ino() should go but that may be a bug, I need to check. --
Here is a patch for autofs4 to, hopefully, resolve this.
Keep in mind this doesn't address any other autofs4 issues but I it
should allow us to identify if this was in fact the root cause of the
problem Jesper reported.
autofs4 - leave rehashed dentry positive
From: Ian Kent <raven@themaw.net>
Correct the error of making a positive dentry negative after it has been
instantiated.
This involves removing the code in autofs4_lookup_unhashed() that
makes the dentry negative and updating autofs4_dir_symlink() and
autofs4_dir_mkdir() to recognise they have been given a postive
dentry (previously the dentry was always negative) and deal with
it. In addition the dentry info struct initialization, autofs4_init_ino(),
and the symlink free function, ino_lnkfree(), have been made aware
of this possible re-use. This is needed because the current case
re-uses a dentry in order to preserve it's flags as described in
commit f50b6f8691cae2e0064c499dd3ef3f31142987f0.
Signed-off-by: Ian Kent <raven@themaw.net>
---
fs/autofs4/inode.c | 23 +++++++------
fs/autofs4/root.c | 95 ++++++++++++++++++++++++++++++----------------------
2 files changed, 67 insertions(+), 51 deletions(-)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..ec9a641 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -24,8 +24,10 @@
static void ino_lnkfree(struct autofs_info *ino)
{
- kfree(ino->u.symlink);
- ino->u.symlink = NULL;
+ if (ino->u.symlink) {
+ kfree(ino->u.symlink);
+ ino->u.symlink = NULL;
+ }
}
struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
@@ -41,16 +43,17 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
if (ino == NULL)
return NULL;
- ino->flags = 0;
- ino->mode = mode;
- ino->inode = NULL;
- ino->dentry = NULL;
- ino->size = 0;
-
- INIT_LIST_HEAD(&ino->rehash);
+ if (!reinit) {
+ ino->flags = 0;
+ ino->mode = mode;
+ ino->inode = NULL;
+ ino->dentry = NULL;
+ ino->size = ...Jesper, can you test this? I think you said you could trigger this in ~24 hours or so? I'd love to have some testing of some heavy autofs user. Even if it's not a guarantee of a fix (due to reproducing the bug not being entirely trivial), at least I'd like to know that it doesn't introduce any obvious new problems either.. Linus --
I'm continuing to test this also. My initial testing was fine but late last night, with some of my other patches applied as well, I started seeing some strange problems. Ian --
On Thu, 05 Jun 2008 15:31:37 +0800 OK, so here we work out that autofs4_init_ino() had to allocate a new This all seems a bit ungainly. I assume that on entry to autofs4_dir_symlink(), ino->size is equal to strlen(symname)? If it's not, that strcpy() will overrun. But if ino->size _is_ equal to strlen(symname) then why did we just recalculate the same thing? I'm suspecting we can zap a lump of code and just do cp = kstrdup(symname, GFP_KERNEL); This all looks very similar to the code in autofs4_dir_symlink(). Some --
Ha, yeah, I think it should be. That is right, but as it is now, this will always be a new allocation. If all goes well (yeah right) I will be allocating the info struct in --
Hi. This isn't a test of the proposed patch. I just got another variatioin of the problem in the log. (I've tried running the automount daemon both with and without the --ghost option) that is the only change I can see. Still 2.6.26-rc4.. Jun 5 16:13:15 node37 kernel: [17388710.169561] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b2 Jun 5 16:13:15 node37 automount[28691]: mount(nfs): nfs: mount failure hest.nzcorp.net:/z/fx1200 on /nfs/fx1200 Jun 5 16:13:15 node37 automount[28691]: failed to mount /nfs/fx1200 Jun 5 16:13:15 node37 kernel: [17388710.217273] IP: [graft_tree+77/288] graft_tree+0x4d/0x120 Jun 5 16:13:15 node37 kernel: [17388710.217273] PGD f9e75067 PUD f681e067 PMD 0 Jun 5 16:13:15 node37 kernel: [17388710.217273] Oops: 0000 [1] SMP Jun 5 16:13:15 node37 kernel: [17388710.217273] CPU 1 Jun 5 16:13:15 node37 kernel: [17388710.217273] Modules linked in: nfs lockd sunrpc autofs4 ipv6 af_packet usbhid hid uhci_hcd ehci_hcd usbkbd fuse parport_pc lp parport i2c_amd756 serio_raw psmouse pcspkr container i2c_core shpchp k8temp button amd_rng evdev pci_hotplug ext3 jbd mbcache sg sd_mod ide_cd_mod cdrom floppy mptspi mptscsih mptbase scsi_transport_spi ohci_hcd tg3 usbcore amd74xx ide_core ata_generic libata scsi_mod dock thermal processor fan thermal_sys Jun 5 16:13:15 node37 kernel: [17388710.217273] Pid: 28693, comm: mount.nfs Not tainted 2.6.26-rc4 #1 Jun 5 16:13:15 node37 kernel: [17388710.993688] RIP: 0010:[graft_tree+77/288] [graft_tree+77/288] graft_tree+0x4d/0x120 Jun 5 16:13:15 node37 kernel: [17388710.993688] RSP: 0000:ffff8100f9c85e08 EFLAGS: 00010246 Jun 5 16:13:15 node37 kernel: [17388710.993688] RAX: ffff8100bfbc0270 RBX: 00000000ffffffec RCX: 0000000000000000 Jun 5 16:13:15 node37 kernel: [17388711.245666] RDX: ffff8100f9ec5900 RSI: ffff8100f9c85e68 RDI: ffff8100bae1f800 Jun 5 16:13:15 node37 kernel: [17388711.245666] RBP: ffff8100bae1f800 R08: 0000000000000000 R09: 0000000000000001 Jun 5 ...
Right. Whether that would make a difference depends largely on your map configuration. If you have simple indirect or direct maps then then using the --ghost option (or just adding the "browse" option if you're using version 5) should prevent the code that turns the dentry negative from being executed at all. If you're using submounts in your map, or the "hosts" map or you have multi-mount entries in the maps then that --
btw, I'm currently testing with these additional changes.
I don't think they will result in functional differences but it's best
we keep in sync.
autofs4 - leave rehash dentry positive fix
From: Ian Kent <raven@themaw.net>
Change ENOSPC to ENOMEM.
Make autofs4_init_ino() always set mode field.
Signed-off-by: Ian Kent <raven@themaw.net>
---
fs/autofs4/inode.c | 2 +-
fs/autofs4/root.c | 10 +++++-----
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index ec9a641..3221506 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -45,7 +45,6 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
if (!reinit) {
ino->flags = 0;
- ino->mode = mode;
ino->inode = NULL;
ino->dentry = NULL;
ino->size = 0;
@@ -53,6 +52,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
atomic_set(&ino->count, 0);
}
+ ino->mode = mode;
ino->last_used = jiffies;
ino->sbi = sbi;
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index 6ce603b..f438e6b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -713,13 +713,13 @@ static int autofs4_dir_symlink(struct inode *dir,
ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
if (!ino)
- return -ENOSPC;
+ return -ENOMEM;
cp = kmalloc(ino->size + 1, GFP_KERNEL);
if (!cp) {
if (!dentry->d_fsdata)
kfree(ino);
- return -ENOSPC;
+ return -ENOMEM;
}
strcpy(cp, symname);
@@ -733,7 +733,7 @@ static int autofs4_dir_symlink(struct inode *dir,
kfree(cp);
if (!dentry->d_fsdata)
kfree(ino);
- return -ENOSPC;
+ return -ENOMEM;
}
d_add(dentry, inode);
@@ -866,7 +866,7 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
if (!ino)
- return -ENOSPC;
+ return -ENOMEM;
inode = dentry->d_inode;
if (inode)
@@ -876,7 +876,7 @@ static int autofs4_dir_mkdir(struct ...FWIW, searching for graft_tree on arjan's site: #12419: negative dentry path->dentry #17367: ditto #17463: ditto #13042: probably the same (is that earlier oops you've mentioned?) #18932: WTF is that one doing there? (graft_tree is never mentioned in it) Nuts... I really don't see how that could happen, unless it's NFS revalidation playing silly buggers with dentries and ->d_revalidate() there ends up turning dentry passed to it into negative one... Where does the mountpoint in question live and what gets passed to sys_mount()? --
Hi. Unrelated to the 2 other reports against the 2.6.26-rc4 kernel. I got this message. The system still operates fine so it is not critical in anyway. I thought I'd report it anyway. CC'ing the autofs-mailing list since it seems autofs related. [43071687.620727] INFO: task automount:23131 blocked for more than 120 seconds. [43071687.740715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [43071687.862459] automount D ffff810103a2b380 0 23131 7655 [43071687.862466] ffff8101fb5d9b28 0000000000000082 0000000000000000 0000000000000000 [43071687.862472] ffffffff8069e380 ffffffff8069e380 ffffffff8069a4c0 ffffffff8069e380 [43071687.862475] ffff8101e8d1af00 ffff8101fb5d9ae8 ffff8101fb5d9ad8 ffff8101fb5d9af4 [43071687.862480] Call Trace: [43071687.862581] [<ffffffff804764bf>] __wait_on_bit+0x4f/0x80 [43071687.862600] [<ffffffff8047656a>] out_of_line_wait_on_bit+0x7a/0xa0 [43071687.862606] [<ffffffff8024c060>] wake_bit_function+0x0/0x30 [43071687.862618] [<ffffffffa03c0e13>] :sunrpc:xprt_connect+0x83/0x170 [43071687.862633] [<ffffffffa03c4e69>] :sunrpc:__rpc_execute+0xc9/0x270 [43071687.862645] [<ffffffffa03bded4>] :sunrpc:rpc_run_task+0x34/0x70 [43071687.862657] [<ffffffffa03bdfac>] :sunrpc:rpc_call_sync+0x3c/0x60 [43071687.862667] [<ffffffff802b9379>] __follow_mount+0x29/0xa0 [43071687.862696] [<ffffffffa042220a>] :nfs:nfs3_rpc_wrapper+0x3a/0x60 [43071687.862711] [<ffffffffa04225e1>] :nfs:nfs3_proc_getattr+0x51/0x90 [43071687.862733] [<ffffffff802ca0b1>] mntput_no_expire+0x21/0x140 [43071687.862738] [<ffffffff802ca0b1>] mntput_no_expire+0x21/0x140 [43071687.862744] [<ffffffff802bc2e1>] path_walk+0xb1/0xd0 [43071687.862759] [<ffffffffa0415c9a>] :nfs:nfs_getattr+0x7a/0x120 [43071687.862765] [<ffffffff802b4d53>] vfs_lstat_fd+0x43/0x70 [43071687.862774] [<ffffffff802c3e88>] d_kill+0x48/0x70 [43071687.862780] [<ffffffff802c417f>] dput+0x1f/0xf0 [43071687.862784] [<ffffffff802d1440>] ...
