Happens sporadically; I cannot reproduce it at will.

Nov 30 15:24:55 elektra kernel: [ 6604.258610] ------------[ cut here ]------------
Nov 30 15:24:55 elektra kernel: [ 6604.258628] kernel BUG at mm/truncate.c:475!
Nov 30 15:24:55 elektra kernel: [ 6604.258633] invalid opcode: 0000 [#1] PREEMPT SMP
Nov 30 15:24:55 elektra kernel: [ 6604.258640] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
Nov 30 15:24:55 elektra kernel: [ 6604.258646] CPU 3
Nov 30 15:24:55 elektra kernel: [ 6604.258649] Modules linked in: veth fuse af_packet bridge 8021q garp stp llc vboxnetadp vboxnetflt vboxdrv nouveau ttm drm_kms_helper drm i2c_algo_bit snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ip6t_REJECT ipt_REJECT ip6t_LOG ipt_LOG xt_limit xt_recent nf_conntrack_ipv6 xt_state xt_tcpudp ip6table_mangle iptable_mangle iptable_nat ip6table_filter ip6_tables iptable_filter nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack ip_tables x_tables nls_utf8 loop arc4 ecb snd_hda_codec_nvhdmi iwlagn iwlcore snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_pcm ohci1394 ieee1394 cfg80211 snd_timer sdhci_pci snd sdhci pcmcia firewire_ohci e1000e mmc_core yenta_socket soundcore firewire_core pcmcia_rsrc crc_itu_t pcmcia_core ppdev mcs7830 dm9601 dell_laptop usbnet rfkill snd_page_alloc dell_wmi shpchp sr_mod parport_pc sg cdrom wmi dcdbas intel_ips parport intel_agp i2c_i801 pci_hotplug iTCO_wdt pcspkr iTCO_vendor_support button video battery ac ext4 jbd2 crc16 sha256_generic aesni_intel cryptd aes_x86_64 aes_generic cbc dm_crypt usbhid linear ehci_hcd usbcore sd_mod dm_snapshot dm_mod fan processor ahci libahci libata scsi_mod thermal thermal_sys
Nov 30 15:24:55 elektra kernel: [ 6604.258914]
Nov 30 15:24:55 elektra kernel: [ 6604.258918] Pid: 31399, comm: cut Not tainted 2.6.36.1 #2 0N5KHN/Latitude E6510
Nov 30 15:24:55 elektra kernel: [ ...
BUG_ON(page_mapped(page)) in invalidate_inode_pages2_range(): that's interesting, it may relate to another BUG_ON(page_mapped(page)) that's been reported recently.

This is a 2.6.36.1 kernel you're running: any idea what was the first kernel on which you started seeing such errors? And what was the last good kernel on which you ran the same kind of load but saw no problems?

Thanks,
Hugh
--
The fault handler unlocks the page if vm_ops->page_mkwrite() is
defined. That looks somewhat racy at first glance.
Quick test: does removing the page_mkwrite() callback from fuse make
the problem go away?
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c822458..a445358 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1328,7 +1328,6 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct fuse_file_vm_ops = {
 	.close = fuse_vma_close,
 	.fault = filemap_fault,
-	.page_mkwrite = fuse_page_mkwrite,
 };
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
Thanks,
Miklos
--
On Wed, 01 Dec 2010 11:25:22 +0100

At the moment I've downgraded to 2.6.36. I cannot remember having seen this there, which need not mean anything, because the workload has changed (several unshared mount/network namespaces chrooted into unionfs-fuse mounted roots - cool stuff...).

Would you expect 2.6.36 vs. 2.6.36.1 to make a difference here?

Later, once I have results from the test with 2.6.36, I'll of course try the quick test you suggested.

--
MfG,
Michael Leun
--
Okay, thanks. Miklos --
On Wed, 01 Dec 2010 18:22:33 +0100

It took until now to happen on 2.6.36, so it is there as well. I cannot really say at the moment whether it is less frequent on 2.6.36, but from

A compile of kernel 2.6.36.1 with that .page_mkwrite commented out is running now; I will reboot really soon now (TM).

--
MfG,
Michael Leun
--
On Thu, 2 Dec 2010 08:41:59 +0100

OK - that happened very quickly again in 2.6.36.1. Sorry for the tainted kernel, but I cannot afford to have graphics lockups all the time on top of this. I've shown that it also happens with an untainted kernel (yesterday's long run without a fault was also with the nvidia.ko driver).

Until I have another suggestion for what to try, I'll switch back to 2.6.36 to see whether it really happens less frequently there.

Dec 2 09:08:13 elektra kernel: [ 1376.957887] ------------[ cut here ]------------
Dec 2 09:08:13 elektra kernel: [ 1376.957894] kernel BUG at mm/truncate.c:475!
Dec 2 09:08:13 elektra kernel: [ 1376.957896] invalid opcode: 0000 [#1] PREEMPT SMP
Dec 2 09:08:13 elektra kernel: [ 1376.957899] last sysfs file: /sys/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/irq
Dec 2 09:08:13 elektra kernel: [ 1376.957901] CPU 0
Dec 2 09:08:13 elektra kernel: [ 1376.957903] Modules linked in: veth ipt_MASQUERADE af_packet iwlagn bridge 8021q garp stp llc fuse vboxnetadp vboxnetflt vboxdrv nvidia(P) snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ip6t_REJECT ipt_REJECT ip6t_LOG ipt_LOG xt_limit xt_recent nf_conntrack_ipv6 xt_state xt_tcpudp ip6table_mangle iptable_mangle iptable_nat ip6table_filter ip6_tables iptable_filter nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack ip_tables x_tables nls_utf8 loop arc4 ecb snd_hda_codec_nvhdmi iwlcore mac80211 snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_pcm cfg80211 ohci1394 ieee1394 sdhci_pci sdhci pcmcia snd_timer snd firewire_ohci mmc_core yenta_socket dell_laptop firewire_core soundcore rfkill snd_page_alloc pcmcia_rsrc pcmcia_core crc_itu_t dm9601 ppdev usbnet dell_wmi shpchp sr_mod e1000e parport_pc parport intel_agp iTCO_wdt pci_hotplug cdrom intel_ips i2c_i801 pcspkr iTCO_vendor_support sg wmi button video battery dcdbas ac ext4 jbd2 crc16 sha256_generic aesni_intel cryptd ...
Can you please describe in detail the workload that's causing this to happen? Thanks, Miklos --
On Thu, 02 Dec 2010 10:42:51 +0100

That's rather complicated, but I'll try. Basically it boils down to:

    unshare -n -m /bin/bash
    unionfs -o cow,suid,allow_other,max_files=65536 /home/netenv/user1-union=RW:/=RO /home/netenv/user1
    mount -n -t proc none /home/netenv/user1/proc
    mount -n -t sysfs none /home/netenv/user1/sys
    mount -n -t devtmpfs devtmpfs /home/netenv/user1/dev
    mount -n -t devpts devpts /home/netenv/user1/dev/pts
    chroot /home/netenv/user1 /bin/su - user1

Then I run some shell scripts in this shell as user1. Of course there is more to it, such as getting network connectivity in the new namespace and so on, but I guess that's not important for the fuse problem.

Then there are some more setups like the one above (up to 6 at the moment) with different users (user2, user3 and so on) running concurrently.

In some of these setups two or more environments share the same writable branch, so the files that differ from the real root of the machine are the same in those environments, e.g.:

    [...]
    unionfs -o cow,suid,allow_other,max_files=65536 /home/netenv/commondir=RW:/=RO /home/netenv/user1
    [...]

and another one

    [...]
    unionfs -o cow,suid,allow_other,max_files=65536 /home/netenv/commondir=RW:/=RO /home/netenv/user2
    [...]

I observed that the unionfs process uses much more CPU than usual before the fault happens.

    elektra:~ # unionfs --version
    unionfs-fuse version: 0.24
    FUSE library version: 2.8.5
    fusermount version: 2.8.5
    using FUSE kernel interface version 7.12

--
MfG,
Michael Leun
--
On Thu, 2 Dec 2010 11:57:22 +0100

Additional note: this also happens WITHOUT that "two unionfs mounts use the same branch dir" stuff.

--
MfG,
Michael Leun
--
Thanks. For you the workaround would be to use the "kernel_cache" option, which disables cache invalidation on open.

I'll try to reproduce the BUG on my machine, and if I don't succeed I'll need some more help from you.

Probably just coincidence. Sometimes the frequency with which a bug shows up depends on code layout (and hence cache layout) differences, which can vary from compile to compile and even from one boot to the next.

Thanks,
Miklos
--
On Mon, 06 Dec 2010 13:36:30 +0100

Maybe / indeed looks like.

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.36.1
# Mon Nov 22 23:25:24 2010
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_EARLY_RES=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
# CONFIG_KTIME_SCALAR is not ...
To be honest, that slid down my to-do list somewhat due to not ... so I'm very happy you found a way to reproduce it yourself.

I'll add this patch on my work machine Monday morning (it has only happened for me on that quad-core, I realize now...), turn off kernel_cache again, of course, and let you know what happens.

Thanks.
--
On Sat, 11 Dec 2010 15:14:47 +0100

I do think this was premature optimisation. The open-coded lock is hidden from lockdep so we won't find out if this introduces potential deadlocks. It would be better to add a new mutex at least temporarily, then look at replacing it with a MiklosLock later on, when the code is bedded in.

At which time, replacing mutexes with MiklosLocks becomes part of a general "shrink the address_space" exercise in which there's no reason to exclusively concentrate on that new mutex!

How hard is it to avoid adding a new lock and use an existing one instead, presumably i_mutex? Because if we can get i_mutex coverage over unmap_mapping_range() then I suspect all the vm_truncate_count/restart_addr stuff can go away?
--
Thanks a lot for working this out, Miklos.

(I don't see any explanation here for the madvise fuzzing page_mapped bug, Did you work out how it came about?

About 2.6.10, I was observing that unmap_mapping_range() is always called with i_mutex (and usually also i_alloc_sem) held; whereas around the same time you were adding calls to unmap_mapping_range() into invalidate_inode_pages2(), which has a much looser definition than truncation, and does not (necessarily) have i_mutex held. We raced.

One fix might be to take i_mutex in invalidate_inode_pages2(); but I suspect a thorough search would show some calls do already hold it. Truncation/invalidation have grown a lot more paths since those days, hard work auditing through them all. generic_error_remove_page() is also exceptional to be truncating without i_mutex, but I can never

Yes, I very much agree with you there: valiant effort by Miklos to

invalidate_inode_pages2() calls are the ones to check for that; but I

That would be lovely, but in fact no: it's guarding against operations on vmas, things like munmap and mprotect, which can shuffle the prio_tree when i_mmap_lock is dropped, without i_mutex ever being taken.

However, if we adopt Peter's preemptible mmu_gather patches, i_mmap_lock becomes a mutex, so there's then no need for any of this (I think Peter just did a straight conversion here, leaving it in, but it becomes pointless and would gladly be removed).

Hugh
--
I'm still trying to sell that series, so if you see any value in it, please reply with positive feedback ;-)

Also, the whole vm_truncate_count/restart_addr isn't entirely useless, it's still a lock break which might help with long held locks. Imagine someone trying to unmap several TB worth of pages at once (not entirely beyond the realm of possibility today, and we all know tomorrow will be huge).
--
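The "lock break" idea mentioned above can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel's vm_truncate_count/restart_addr code: it assumes a hypothetical `struct range_walker` protected by a pthread mutex, and it drops the lock after every `batch` items while remembering where to resume. All names in it are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a long range of pages to unmap. */
struct range_walker {
    pthread_mutex_t lock;   /* plays the role of the long-held lock */
    size_t restart;         /* where to resume after a lock break */
    size_t processed;       /* items handled so far, across all walks */
};

/*
 * Walk [0, total) under w->lock, releasing the lock after every
 * `batch` items so that other users of the lock get a chance to run.
 * Returns the number of lock breaks taken.
 */
static size_t walk_with_lock_breaks(struct range_walker *w,
                                    size_t total, size_t batch)
{
    size_t breaks = 0;

    assert(batch != 0);

    pthread_mutex_lock(&w->lock);
    w->restart = 0;
    while (w->restart < total) {
        size_t end = w->restart + batch;

        if (end > total)
            end = total;

        /* "Unmap" one batch while holding the lock. */
        w->processed += end - w->restart;
        w->restart = end;

        if (w->restart < total) {
            /* Lock break: drop the lock, then resume from w->restart. */
            pthread_mutex_unlock(&w->lock);
            breaks++;
            pthread_mutex_lock(&w->lock);
        }
    }
    pthread_mutex_unlock(&w->lock);
    return breaks;
}
```

In this sketch the resume position lives in the shared structure, so two walkers running at the same time would overwrite each other's position once the lock is dropped; they would have to be serialized by some outer lock. Whether that maps exactly onto the kernel code discussed here is not something the sketch tries to establish.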
Also, bit-spinlocks _suck_. They're not fair, they're expensive and, as already noted, they're hidden from lockdep.

Ideally we should be removing bit-spinlocks from the kernel, not adding more.
--
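For readers unfamiliar with the term, a bit-spinlock uses a single bit of an existing flags word as the lock. The following is a rough userspace analogue written with C11 atomics, not the kernel's own implementation; the `demo_` names and the choice of bit 0 are assumptions made for illustration only. It shows where the complaints come from: waiters simply retry a test-and-set with no queue, so there is no fairness, and there is no separate lock object that a lock-debugging tool could track.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: bit 0 of a flags word doubles as the lock. */
#define DEMO_LOCK_MASK 1UL

/* Try once to take the lock bit; returns true on success. */
static bool demo_bit_trylock(atomic_ulong *word)
{
    unsigned long old =
        atomic_fetch_or_explicit(word, DEMO_LOCK_MASK, memory_order_acquire);

    return (old & DEMO_LOCK_MASK) == 0;
}

/* Spin until the bit is ours. No queue: whoever wins the race gets it. */
static void demo_bit_lock(atomic_ulong *word)
{
    while (!demo_bit_trylock(word)) {
        /* Wait on plain loads so we do not keep writing the cache line. */
        while (atomic_load_explicit(word, memory_order_relaxed) & DEMO_LOCK_MASK)
            ;
    }
}

/* Clear the lock bit, leaving the other flag bits untouched. */
static void demo_bit_unlock(atomic_ulong *word)
{
    unsigned long old =
        atomic_fetch_and_explicit(word, ~DEMO_LOCK_MASK, memory_order_release);

    assert(old & DEMO_LOCK_MASK);   /* unlocking an unlocked word is a bug */
    (void)old;
}
```

The other bits of the word keep their meaning while the lock bit is set or cleared, which is what makes the scheme attractive for saving space and also what makes it invisible as a lock.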
One place I know it's hard to get i_mutex coverage is fuse's d_revalidate. That's because ->d_revalidate might be called with or without i_mutex at the discretion of the VFS.

You might ask, why does fuse call invalidate_inode_pages2() from d_revalidate? The answer is, fuse does lookup revalidation and attribute revalidation in one go, and if it finds that the lookup is still valid but the file contents have changed, then it will need to invalidate the page cache.

Thanks,
Miklos

---
 fs/gfs2/main.c     |    9 +--------
 fs/inode.c         |   22 +++++++++++++++-------
 fs/nilfs2/btnode.c |    5 -----
 fs/nilfs2/btnode.h |    1 -
 fs/nilfs2/mdt.c    |    4 ++--
 fs/nilfs2/page.c   |   13 -------------
 fs/nilfs2/page.h   |    1 -
 fs/nilfs2/super.c  |    2 +-
 include/linux/fs.h |    2 ++
 mm/memory.c        |    2 ++
 10 files changed, 23 insertions(+), 38 deletions(-)

Index: linux.git/mm/memory.c
===================================================================
--- linux.git.orig/mm/memory.c	2010-12-11 14:09:55.000000000 +0100
+++ linux.git/mm/memory.c	2010-12-14 11:20:47.000000000 +0100
@@ -2572,6 +2572,7 @@ void unmap_mapping_range(struct address_
 		details.last_index = ULONG_MAX;
 	details.i_mmap_lock = &mapping->i_mmap_lock;
 
+	mutex_lock(&mapping->unmap_mutex);
 	spin_lock(&mapping->i_mmap_lock);
 
 	/* Protect against endless unmapping loops */
@@ -2588,6 +2589,7 @@ void unmap_mapping_range(struct address_
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
 	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->unmap_mutex);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
Index: linux.git/fs/gfs2/main.c
===================================================================
--- linux.git.orig/fs/gfs2/main.c	2010-11-26 10:52:16.000000000 +0100
+++ linux.git/fs/gfs2/main.c	2010-12-14 11:15:53.000000000 +0100
@@ -59,14 +59,7 @@ static void gfs2_init_gl_aspace_once(voi ...
Yes, this looks to me like what is needed for now.

I'd feel rather happier about it if I thought it would also fix Robert's kernel BUG at /build/buildd/linux-2.6.35/mm/filemap.c:128! but I've still not found time to explain that one.

Robert, you said yours is usually repeatable in 12 hours - any chance you could give iknowthis a run with the patch below, to see if it makes any difference to yours? (I admit I don't see how it would.)

Thanks,

 fs/gfs2/main.c     |    9 +--------
 fs/inode.c         |   22 +++++++++++++++-------
 fs/nilfs2/btnode.c |    5 -----
 fs/nilfs2/btnode.h |    1 -
 fs/nilfs2/mdt.c    |    4 ++--
 fs/nilfs2/page.c   |   13 -------------
 fs/nilfs2/page.h   |    1 -
 fs/nilfs2/super.c  |    2 +-
 include/linux/fs.h |    2 ++
 mm/memory.c        |    2 ++
 10 files changed, 23 insertions(+), 38 deletions(-)

Index: linux.git/mm/memory.c
===================================================================
--- linux.git.orig/mm/memory.c	2010-12-11 14:09:55.000000000 +0100
+++ linux.git/mm/memory.c	2010-12-14 11:20:47.000000000 +0100
@@ -2572,6 +2572,7 @@ void unmap_mapping_range(struct address_
 		details.last_index = ULONG_MAX;
 	details.i_mmap_lock = &mapping->i_mmap_lock;
 
+	mutex_lock(&mapping->unmap_mutex);
 	spin_lock(&mapping->i_mmap_lock);
 
 	/* Protect against endless unmapping loops */
@@ -2588,6 +2589,7 @@ void unmap_mapping_range(struct address_
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
 	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->unmap_mutex);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
Index: linux.git/fs/gfs2/main.c
===================================================================
--- linux.git.orig/fs/gfs2/main.c	2010-11-26 10:52:16.000000000 +0100
+++ linux.git/fs/gfs2/main.c	2010-12-14 11:15:53.000000000 +0100
@@ -59,14 +59,7 @@ static void gfs2_init_gl_aspace_once(voi
 	struct address_space *mapping = (struct ...
Me neither, all unmap_mapping_range() calls from shmfs are either with i_mutex or from evict_inode. Hmm, is there anything preventing remap_file_pages() installing a pte at an address that unmap_mapping_range() has already processed? Thanks, Miklos --
Interesting line of thought: nothing I think, but isn't that okay? Though its zap_pte can take out present ptes pointing to actual pages, all populate_range ever installs is non-present pte_file entries: and a fault on one of those goes through the same checks as in a linear mapping. (I thought I was going to find an inconsistency with zap_pte_range there, but no: truncation does not remove pte_file entries beyond end of file, I remember now thinking that we need to keep SIGBUS-beyond-EOF on them, instead of letting truncation silently revert those offsets to linear.) Or am I missing something? (Well, we know I am, because I've not explained Robert's BUG.) Hugh --
Hi Hugh, -- Robert Święcki --
If you can spare the time, yes, please do: it will be valuable information either way. I just don't want to deceive you that we expect this to be the fix. Thanks, Hugh --
