On Fri, May 16, 2008 at 11:53:04PM +0100, Jamie Lokier wrote:

I suspect the real reason why we get away with it so much with ext3 is that the journal is usually contiguous on disk. Hence, when you write to the journal, it's highly unlikely that the commit block will be written while the blocks before the commit block have not. In addition, because we are doing physical block journalling, the journal replay repairs a huge amount of damage. So as we are writing the journal, the disk drive sees a large contiguous write stream, followed by singleton writes to the places where the blocks finally end up on disk.

The most important reason, though, is that the blocks which are dirty don't get flushed out to disk right away! They don't have to, since they are in the journal, and the journal replay will write the correct data to disk. Before the journal commit, the buffer heads are pinned, so they can't be written out. After the journal commit, the buffer heads may be written out, but they don't get written out right away; the kernel will only write them out when the periodic buffer cache flush takes place, *or* if the journal needs to wrap, at which point, if there are pending writes to an old commit that haven't been superseded by another journal commit, the jbd layer has to force them out. But the point is that this is done in an extremely lazy fashion.

As a result, it's very tough to create a situation where a hard drive will reorder write requests aggressively enough that we would see a potential problem.
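
To make the lazy writeback rule above concrete, here is a rough toy model in Python. It is only a sketch under stated assumptions: the class `ToyJournal` and its methods are hypothetical names invented for illustration, not jbd data structures or functions, each commit is assumed to cost its logged blocks plus one commit block, and the periodic buffer cache flush is not modelled at all. The one thing it tries to show is that when the journal wraps, only blocks from the oldest commit that no later commit has superseded need to be forced out.

```python
from collections import deque


class ToyJournal:
    """Toy model of a circular journal (hypothetical; not the real jbd code)."""

    def __init__(self, capacity):
        self.capacity = capacity  # journal size in blocks (assumed unit)
        self.used = 0             # journal blocks currently occupied
        self.txns = deque()       # committed transactions, oldest first
        self.latest = {}          # block number -> tid of newest commit logging it
        self.forced = []          # (tid, block) pairs forced out by a wrap
        self.next_tid = 1

    def commit(self, blocks):
        """Log a set of dirty blocks and return the new transaction id."""
        blocks = set(blocks)
        need = len(blocks) + 1    # logged blocks plus one commit block
        if need > self.capacity:
            raise ValueError("transaction larger than the journal")
        # Journal wrap: reclaim space from the oldest commits first.
        while self.used + need > self.capacity:
            self._checkpoint_oldest()
        tid = self.next_tid
        self.next_tid += 1
        self.txns.append((tid, blocks))
        self.used += need
        for b in blocks:
            self.latest[b] = tid  # a newer commit supersedes older copies
        return tid

    def _checkpoint_oldest(self):
        """Drop the oldest commit, forcing out only non-superseded blocks."""
        tid, blocks = self.txns.popleft()
        for b in sorted(blocks):
            if self.latest.get(b) == tid:
                self.forced.append((tid, b))
                del self.latest[b]
        self.used -= len(blocks) + 1
```

For example, with `j = ToyJournal(8)`, calling `j.commit([10, 11])`, then `j.commit([11, 12])`, then `j.commit([13, 14])` makes the third commit wrap the journal. Block 11 was re-logged by the second commit, so only block 10 has to be written out, and `j.forced` ends up as `[(1, 10)]`. A smaller journal makes this forced path run far more often.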
I suspect if we want to demonstrate the problem, we would need to do a number of things:

* create a highly fragmented journal inode, forcing the jbd layer to seek all over the disk while writing out the journal blocks during a commit
* make the journal small, forcing the journal to wrap very often
* run with data=journal mode, to put maximal stress on the journal
* make the workload one which creates and deletes a large number of files scattered all over the directory hierarchy, so that we limit the number of blocks which are rewritten
* forcibly crash the system while subjecting the ext3 filesystem to this torture test

Given that most ext3 filesystems have their journal created at mkfs time, so it is contiguous, and the journal is generally nice and large, in practice I suspect it's relatively difficult (I didn't say impossible) for us to trigger corruption, given how far away we are from the worst case scenario described above.

- Ted
