On Wed, Jan 23, 2008 at 06:32:30AM -0600, Robin Holt wrote:It wasn't immediately obvious that this was an addition but really I don't mind either ways as long as a feature workable for KVM is included ;). You want to be able to tell the mmu_notifier that you want the flush repeated without locks later? Sorry but then if you're always going to set the bitflag unconditionally, why don't you simply implement a second notifier in addition of my current ->invalidate_page (like ->after_invalidate_page). We can then implement a method in rmap.c for you to call to do the final freeing of the page (pagecache/swapcache won't be collected unless it's a truncate, as long as you keep it pinned and you certainly don't want to wait a second round of lru scan before freeing the page after you release the external reference, so you may need to call this method before returning from the ->after_invalidate_page). Infact I can call that method for you in the notifier implementation itself after all ->after_invalidate_pages have been called. (of course only if at least one of them was implemented and not-null) mmu notifiers should already cover all pte clearing cases (export notifiers definitely don't instead!). It's not as important as for quadrics, we're mostly interested in rmap.c swapping and do_wp_page for KVM page sharing. But in the longer term protection changes and other things happening in the main MMU can also be tracked through mmu notifiers (something quadrics will likely need). Tracking the asm-generic/pgtable.h is all about trivially tracking all places without cluttering mm/*.c, so the mm/*.c remains more hackable and more readable. Well as long as you send these messages somewhat serially and you don't pretend to allocate all packets at once it should be ok. Perhaps you should preallocate all packets statically and serialize the access to the pool with a lock. What I'd like to stress to be sure it's crystal clear, is that in the mm/rmap.c path GFP_KERNEL==GFP_ATOMIC, infact both are = PF_MEMALLOC = TIF_MEMDIE = if mempool is empty it will crash. The argument that you need to sleep to allocate memory with GFP_KERNEL is totally bogus. If that's the only reason, you don't need to sleep at all. alloc_pages will not invoke the VM when called inside the VM, it will grab ram from PF_MEMALLOC instead. At most it will schedule so the only benefit would be lower -rt latency in the end. As long as you keep a reference on the page too, you don't risk any corruption by flushing after. The window that you must close with that bitflag is the request coming from the remote node to map the page after the linux pte has been cleared. If you map the page in a remote node after the linux pte has been cleared ->invalidate_page won't be called again because the page will look unmapped in the linux VM. Now invalidate_page will clear the bitflag, so the map requests will block. But where exactly you know that the linux pte has been cleared so you can "unblock" the map requests? If a page is not mapped by some linux pte, mm/rmap.c will never be called and this is why any notification in mm/rmap.c should track the "address space" and not the "physical page". In effect you don't care less about the address space of the task in the master node, so IMHO you're hooking your ->invalidate_page(page) (instead of my ->invalidate_page(mm, address)) in the very wrong place. You should hook it in mm/vmscan.c shrink-list so it will be invoked regardless if the pte is mapped or not. Then your model that passes the "page" instead of an "mm+address" will start to work without any need clearing/setting PG_exported bifflags, and you will automatically close the above race window without the need of ever clearing the PG_exported bitflag etc.... So again the current design of Christoph's patch really looks flawed to me. If you work the "pages" you should stick to pages and to stay away from mm/rmap.c and ignore whatever is mapped in the master address space of the task. mm/rmap.c only deals with ptes/sptes and other _virtual-tracked_ mappings. If you work with get_user_pages for page allocation (instead of alloc_pages) and the userland "virtual" addresses are your RAM backing store, like with KVM, then you should build an rmap structure based on _virtual_ addresses because the virtual addresses of the task will be all you care about, and then you will register the notifier in the "mm" and you will need ->invalidate_page to take an "address" as parameter and not a "page". All reference counting will be automatic, the userland task will run with all virtual memory visible automatically. You should make sure this model tied to the ram mapped in the userland task can't fit your needs and you really have to care about building stuff on top of physical pages instead of letting the VM decide which physical page or swap entry is going back your memory needs. Ok. This wasn't my impression but again I'm fine either ways! ;) --
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| debian developer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Adrian Bunk | Re: LSM conversion to static interface |
git: | |
| Gerrit Renker | [PATCH 26/37] dccp: Integration of dynamic feature activation - part 1 (socket set... |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Frans Pop | svc: failed to register lockdv1 RPC service (errno 97). |
| Linus Torvalds | Re: [GIT]: Networking |
