On Wed, 13 Aug 2008, Al Viro wrote:The really sad part is that readdir() really is also the thing that should make us change locking. That i_mutex thing is fine and dandy for everything else, but for readdir() we really would be much better off with a rwsem - and only take it for reading. Right now, readdir() is one of the most serialized parts of the whole kernel. Sad. And while it's a per-directory lock, there are directories that get much more reading than others, and this has been a small scalability issue (for samba and apache) for years. The thing is, generic_file_llseek() takes i_mutex, exactly because of issues like this. Of course, you have to ask for it (the _default_ llseek does not do it), and you're right that 9p does not. Strangely enough, at least 9p _does_ use it for regular files. I'm not sure how come it decided to do that, but whatever. The reason ext2 is ok is that you long long ago fixed it to use the page cache. That got rid of a _lot_ of the crap, and made all the IO look like regular files (including read-ahead etc). Ext2 _used_ to be the same crap that ext3 is. I so wish that ext3 could do the same thing, but no. I still think it should be possible, but the whole journalling is designed for buffer heads. I don't dispute at all that the readdir() thing is one of the weakest points of the whole VFS layer. And part of it is that there is no good caching helper for it at the VFS level, so we always end up having to do everything at the low-level filesystem level, and that invariably ends up being sh*t compared to the shared VFS routines. I'm convinced that the reason we do well on most other filesystem accesses is exactly the fact that a filesystem basically has to be crazy to try to do their own version, and in many cases cannot really do it at all (eg you can't really even avoid using the dcache or the page cache and actually get any valid semantics). But readdir() is the _one_ operation where the low-level filesystem still basically does it all itself. Which is why we can't fix locking, and why even simple changes are hard because it's not just complex code, it's complex code in 50+ filesystems with almost zero shared code! Linus --
| Hiten Pandya | Re: up? (emacs docbook xml ide) |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg Kroah-Hartman | [PATCH 005/196] Chinese: add translation of SubmittingDrivers |
| James Bottomley | Re: [Ksummit-2008-discuss] Fixing the Kernel Janitors project |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | [GIT]: Networking |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
