On Tue, Jan 29, 2008 at 01:53:05PM -0800, Christoph Lameter wrote:It's taken writable due to the code being inefficient the first time, all later times remap_populate_range overwrites ptes with the mmap_sem in readonly mode (finally rightfully so). The first remap_file_pages I guess it's irrelevant to optimize, the whole point of nonlinear is to call remap_file_pages zillon of times on the same vma, overwriting present ptes the whole time, so if the first time the mutex is not readonly it probably doesn't make a difference. get_user_pages invoked by the kvm spte-fault, can happen between invalidate_range and populate_range. If it can't happen, for sure nobody pointed out a good reason why it can't happen. The kvm page faults as well rightfully only takes the mmap_sem in readonly mode, so get_user_pages is only called internally to gfn_to_page with the readonly semaphore. With my approach ptep_clear_flush was not only invalidating sptes after ptep_clear_flush, but it was also invalidating them inside the PT lock, so it was totally obvious there could be no race vs get_user_pages. Yes, but it would have been micro-optimized later if you really cared, by simply changing ptep_clear_flush to __ptep_clear_flush, no big deal. Definitely all methods must be robust about them being called multiple times, even if the rmap finds no spte mapping such host virtual address. That's a question you should answer. No, that's a different angle. But now I think there may be an issue with a third thread that may show unsafe the removal of invalidate_page from ptep_clear_flush. A third thread writing to a page through the linux-pte and the guest VM writing to the same page through the sptes, will be writing on the same physical page concurrently and using an userspace spinlock w/o ever entering the kernel. With your patch that invalidate_range after dropping the PT lock, the third thread may start writing on the new page, when the guest is still writing to the old page through the sptes. While this couldn't happen with my patch. So really at the light of the third thread, it seems your approach is smp racey and ptep_clear_flush should invalidate_page as last thing before returning. My patch was enforcing that ptep_clear_flush would stop the third thread in a linux page fault, and to drop the spte, before the new mapping could be instantiated in both the linux pte and in the sptes. The PT lock provided the needed serialization. This ensured the third thread and the guest VM would always write on the same physical page even if the first thread runs a flood of remap_file_pages on that same page moving it around the pagecache. So it seems I found a unfixable smp race in pretending to invalidate in a sleeping place. Perhaps you want to change the PT lock to a mutex instead of a spinlock, that may be your only chance to sleep while maintaining 100% memory coherency with threads. --
| Michal Piotrowski | Re: 2.6.23-rc3-mm1 |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Fred Tyler | Slow, persistent memory leak in 2.6.20 |
| Roland Dreier | Re: Integration of SCST in the mainstream Linux kernel |
git: | |
| David Miller | [GIT]: Networking |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Antonio Almeida | HTB accuracy for high speed |
