Thanks a lot for working this out, Miklos.
(I don't see any explanation here for the madvise fuzzing page_mapped bug,
but that's not your fault! I'll have to do my own thinking on that one.)
Yes, I knowingly built that assumption into it 6 years ago.
Did you work out how it came about? About 2.6.10, I was observing that
unmap_mapping_range() is always called with i_mutex (and usually also
i_alloc_sem) held; whereas around the same time you were adding calls to
unmap_mapping_range() into invalidate_inode_pages2(), which has a much
looser definition than truncation, and does not (necessarily) have
i_mutex held. We raced.
One fix might be to take i_mutex in invalidate_inode_pages2(); but
I suspect a thorough search would show some calls do already hold it.
Truncation/invalidation have grown a lot more paths since those days,
hard work auditing through them all. generic_error_remove_page() is
also exceptional to be truncating without i_mutex, but I can never
care very deeply about what might go wrong with hwpoison.
Yes, I very much agree with you there: valiant effort by Miklos to
avoid bloat, but we're better off using a known primitive for now.
invalidate_inode_pages2() calls are the ones to check for that; but I
got tired, and maybe Miklos already found problems with that approach.
That would be lovely, but in fact no: it's guarding against operations on
vmas, things like munmap and mprotect, which can shuffle the prio_tree
when i_mmap_lock is dropped, without i_mutex ever being taken.
However, if we adopt Peter's preemptible mmu_gather patches, i_mmap_lock
becomes a mutex, so there's then no need for any of this (I think Peter
just did a straight conversion here, leaving it in, but it becomes
pointless and would gladly be removed).
Hugh
--