On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:I'll let you discuss with Christoph and Robin about it. The moment I heard the schedule inside ->invalidate_page() requirement I reacted the same way you did. But I don't see any other real solution for XPMEM other than spin-looping for ages halting the scheduler for ages, while the ack is received from the network device. But mm_lock is required even without XPMEM. And srcu is also required without XPMEM to allow ->release to schedule (however downgrading srcu to rcu will result in a very small patch, srcu and rcu are about the same with a kernel supporting preempt=y like 2.6.26). I think it's a great smp scalability optimization over the global lock you're proposing below. Unfortunately the lock you're talking about would be: static spinlock_t global_lock = ... There's no way to make it more granular. So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock we'd need to take that extremely wide spinlock first (and even worse, later it would become a rwsem when XPMEM is selected making the VM even slower than it already becomes when XPMEM support is selected at compile time). mmu_notifier_register can take ages. No problem. mmu_notifier_register is fine to be hundred times slower (preempt-rt will turn all locks in spinlocks so no problem). Sure, I'll split it from the rest if the mmu-notifier-core isn't merged. My objective has been: 1) add zero overhead to the VM before anybody starts a VM with kvm and still zero overhead for all other tasks except the task where the VM runs. The only exception is the unlikely(!mm->mmu_notifier_mm) check that is optimized away too when CONFIG_KVM=n. And even for that check my invalidate_page reduces the number of branches to the absolute minimum possible. 2) avoid any new cacheline collision in the fast paths to allow numa systems not to nearly-crash (mm->mmu_notifier_mm will be shared and never written, except during the first mmu_notifier_register) 3) avoid any risk to introduce regressions in 2.6.26 (the patch must be obviously safe). Even if mm_lock would be a bad idea like you say, it's order of magnitude safer even if entirely broken then messing with the VM core locking in 2.6.26. mm_lock (or whatever name you like to give it, I admit mm_lock may not be worrysome enough for people to have an idea to call it in a fast path) is going to be the real deal for the long term to allow mmu_notifier_register to serialize against invalidate_page_start/end. If I fail in 2.6.26 I'll offer maintainership to Christoph as promised, and you'll find him pushing for mm_lock to be merged (as XPMEM/GRU aren't technologies running on cellphones where your global wide spinlocks is optimized away at compile time, and he also has to deal with XPMEM where such a spinlock would need to become a rwsem as the anon_vma->sem has to be taken after it), but let's assume you're right entirely right here that mm_lock is going to be dropped and there's a better way: it's still a fine solution for 2.6.26. And if you prefer I can move the whole mm_lock() from mmap.c/mm.h to mmu_notifier.[ch] so you don't get any pollution in the core VM, and mm_lock will be invisible to everything but anybody calling mmu_notifier_register() then and it will be trivial to remove later if you really want to add a global spinlock as there's no way to be more granular than a _global_ numa-wide spinlock taken before any i_mmap_lock/anon_vma->lock, without my mm_lock. --
| Linus Torvalds | Linux 2.6.21-rc4 |
| Jens Axboe | [PATCH 0/8] IO queuing and complete affinity |
| Nicholas A. Bellinger | Re: Integration of SCST in the mainstream Linux kernel |
| Robin Lee Powell | NFS hang + umount -f: better behaviour requested. |
git: | |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Ingo Molnar | Re: [crash] BUG: unable to handle kernel NULL pointer dereference at 0000000000000... |
| Gerrit Renker | [PATCH 19/37] dccp: Header option insertion routine for feature-negotiation |
| Gary Thomas | Marvell 88E609x switch? |
| Jamie Lokier | Re: silent semantic changes with reiser4 |
| Jan Kara | [PATCH 10/16] ext4: Remove syncing logic from ext4_file_write |
| Jack Stone | Re: Versioning file system |
| Jens Axboe | [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_pages |
