This is a kernel patch of NILFS2 file system which was previously announced in [1]. Since the original code did not comply with the Linux Coding Style, I've rewritten it a lot.

NILFS2 is a log-structured file system (LFS) supporting ``continuous snapshotting''. In addition to versioning capability of the entire file system, users can even restore files and namespaces mistakenly overwritten or destroyed just a few seconds ago.

NILFS2 creates a number of checkpoints every few seconds or on a per-synchronous-write basis (unless there is no change). Users can select significant versions among continuously created checkpoints, and can change them into snapshots which will be preserved until they are changed back to checkpoints. There is no limit on the number of snapshots until the volume gets full. Each snapshot is mountable as a read-only file system concurrently with its writable mount, and this feature is convenient for online backup. It will also be favorable for time-machine-like user environments or appliances.

Please see [2] for details on the project.

Other features are:

- Quick crash recovery on-mount (like conventional LFS)
- B-tree based file, inode, and other meta data management including snapshots.
- 64-bit data structures; support many files, large files and disks.
- Online disk space reclamation by userland daemon, which can maintain multiple snapshots.
- Less use of barriers while keeping reliability. The barrier is enabled by default.
- Easy and quickly performable snapshot administration

Some impressive benchmark results on SSD are shown in [3]; however, the current NILFS2 performance is sensitive to machine environment due to its immature implementation.

It has many TODO items:

- performance improvement (better block I/O submission)
- better integration of b-tree node cache with filemap and buffer code.
- cleanups, further simplification.
- atime support
- extended attributes support
- POSIX ACL support
- Quota support

The patch ...
heh. It wipes the floor with everything, including btrfs.
But a log-based fs will do that, initially. What will the performance
Needs a few fixes for recent linux-next changes.
I queued it up without looking at it, just for a bit of review and
--
On Wed, Aug 20, 2008 at 10:43 AM, Andrew Morton
(a) why does NILFS need this and (b) why aren't these patches against
generic mm/*.c?
Pekka
--
Yeah, it's bothersome part.
I'd like to eliminate this peculiar code by using the standard mm/
functions or bd_inode, but still pending.
It's mainly used to maintain pages held by struct nilfs_btnode_cache,
which is a per-inode additional page cache used to store buffers of
B-tree.
Incidentally, for data blocks, mm/ page cache is used like other
(a) I believe this is historical, but I will confirm the reason
why filemap was not adopted.
(b) Because I think it should be eliminated rather than integrated
into mm/ at this point.
Thank you for comment.
Regards,
Ryusuke Konishi
--
Lifetime information is maintained for each (virtualized) address of disk block to judge whether a given disk block is eliminable or not.

The garbage collector (GC) of NILFS2 works as follows:

1. GC does not remove snapshots, which are the checkpoints marked as snapshot. Plain checkpoints are not protected from GC except for the recent ones.

2. Disk blocks that belong to neither a snapshot nor a recent checkpoint are eliminable. For a given disk block, GC confirms the state of every checkpoint whose serial number is included in the lifetime. It judges the block not eliminable if at least one snapshot or recent checkpoint is included.

3. GC reclaims disk space in units of segments (where a segment is an equally divided disk region). For a selected segment, removable blocks are just ignored, and unremovable blocks (live blocks) are copied to a new log appended in the current segment for writing. When all the live blocks are copied into the new log, the segment becomes free and reusable.

4. To make disk blocks relocatable, NILFS2 maintains a table file (called DAT) which maps virtual disk block addresses to usual block addresses. The lifetime information is recorded in the DAT per virtual block address.

The current NILFS2 GC simply reclaims from the oldest segment, so the disk partition acts like a ring buffer. (this behaviour can be changed by

I've been using NILFS2 for my home directory for several months, but so far I don't feel notable performance degradation. Later, I'd like to try a benchmark for a server.

Sure, I will.

With regards,
Ryusuke Konishi
--
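The GC rules described above (snapshots and recent checkpoints protect blocks, live blocks are copied out, the DAT is updated) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the NILFS2 implementation: the names `DatEntry`, `is_live`, `clean_segment`, `END_OF_TIME` and the parameters `protected_from` and `latest` are hypothetical, and the lifetime is assumed to be a half-open range of checkpoint numbers.

```python
from dataclasses import dataclass

# Hypothetical sentinel: the block is still referenced by the current state.
END_OF_TIME = 2**64 - 1


@dataclass
class DatEntry:
    """One DAT record for a virtual block address (illustrative layout)."""
    blocknr: int  # current physical block address
    start: int    # first checkpoint number that contains the block
    end: int      # first checkpoint number that no longer contains it


def is_live(entry, snapshots, protected_from, latest):
    """A block is live if a snapshot or a recent checkpoint lies in its lifetime.

    snapshots      -- set of checkpoint numbers marked as snapshot
    protected_from -- oldest checkpoint number still treated as "recent"
    latest         -- newest checkpoint number
    """
    if any(entry.start <= cno < entry.end for cno in snapshots):
        return True
    # Does [start, end) overlap the recent range [protected_from, latest]?
    lo = max(entry.start, protected_from)
    hi = min(entry.end, latest + 1)
    return lo < hi


def clean_segment(vblocks, dat, snapshots, protected_from, latest, next_free):
    """Copy live blocks of one segment to new addresses; dead ones are ignored.

    Only the DAT mapping changes, so references to the virtual block
    numbers stay valid. Returns the virtual block numbers that were moved.
    """
    moved = []
    for vblocknr in vblocks:
        entry = dat[vblocknr]
        if is_live(entry, snapshots, protected_from, latest):
            entry.blocknr = next_free  # relocate via the DAT
            next_free += 1
            moved.append(vblocknr)
    return moved
```

Once every live block of the segment has been relocated this way, nothing points into the old segment any more and it can be reused.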
It seems the benchmark was done over half a year ago. It's questionable how
relevant today the performance comparison is with actively developed file
I ran compilebench on kernel 2.6.26 with freshly formatted volumes.
The behavior of NILFS2 was interesting.
Its performance rapidly degrades to the lowest ever measured level
(< 1 MB/s) but after a while it recovers and gives consistent numbers.
However it's still very far from the current unstable btrfs performance.
The results are reproducible.
                      MB/s    Runtime (s)
                      -----   -----------
btrfs unstable        17.09        572
ext3                  13.24        877
btrfs 0.16            12.33        793
nilfs2 2nd+ runs      11.29        674
ntfs-3g                8.55        865
reiserfs               8.38        966
nilfs2 1st run         4.95       3800
xfs                    1.88       3901
Szaka
--
NTFS-3G: http://ntfs-3g.org
--
On Thu, 21 Aug 2008 00:25:55 +0300 (MET DST) err, what the heck happened to xfs? Is this usual? --
vmstat typically shows that xfs does ... "nothing". It uses no CPU time and doesn't wait for I/O either. Szaka -- NTFS-3G: http://ntfs-3g.org --
No, definitely not usual. I suspect it's from an old mkfs and barriers being used. What is the output of the xfs.mkfs when you make the filesystem and what mount options being used? Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Everything is default.

% rpm -qf =mkfs.xfs
xfsprogs-2.9.8-7.1

which, according to ftp://oss.sgi.com/projects/xfs/cmd_tars, is the latest stable mkfs.xfs. Its output is

meta-data=/dev/sda8              isize=256    agcount=4, agsize=1221440 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=4885760, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

Kernel xfs log:

SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
SGI XFS Quota Management subsystem
XFS mounting filesystem sda8
Ending clean XFS mount for filesystem: sda8

Szaka
--
NTFS-3G: http://ntfs-3g.org
--
Ok, I thought it might be the tiny log, but it didn't improve anything here when I increased the log size, or the log buffer size.

Looking at the block trace, I think elevator merging is somewhat busted. I'm seeing adjacent I/Os being dispatched without having been merged. e.g:

104,48 1 2139 4.803090086 4175 Q W 18540712 + 8 [pdflush]
104,48 1 2140 4.803092492 4175 G W 18540712 + 8 [pdflush]
104,48 1 2141 4.803094875 4175 P N [pdflush]
104,48 1 2142 4.803096205 4175 I W 18540712 + 8 [pdflush]
104,48 1 2143 4.803160324 4175 Q W 18540720 + 40 [pdflush]
104,48 1 2144 4.803162724 4175 M W 18540720 + 40 [pdflush]
104,48 1 2145 4.803231701 4175 Q W 18540760 + 48 [pdflush]
104,48 1 2146 4.803234223 4175 M W 18540760 + 48 [pdflush]
.....
104,48 1 2163 4.803844214 4175 Q W 18541032 + 56 [pdflush]
104,48 1 2164 4.803846694 4175 M W 18541032 + 56 [pdflush]
104,48 1 2165 4.803932321 4175 Q W 18541088 + 48 [pdflush]
104,48 1 2166 4.803937177 4175 G W 18541088 + 48 [pdflush]
104,48 1 2167 4.803940416 4175 I W 18541088 + 48 [pdflush]
104,48 1 2168 4.804005265 4175 Q W 18541136 + 24 [pdflush]
104,48 1 2169 4.804007664 4175 M W 18541136 + 24 [pdflush]
.....
104,48 1 2183 4.804518129 4175 D W 18540712 + 376 [pdflush]
104,48 1 2184 4.804537981 4175 D W 18541088 + 248 [pdflush]

In entry 2165, a new request is made rather than merging with the existing, adjacent request that is already open. The result is we then dispatch two I/Os instead of one.

Also, CFQ appears to not be merging WRITE_SYNC bios or issuing them with any urgency. The result of this is that it stalls the XFS transaction subsystem by capturing all the log buffers in the elevator and not issuing them. e.g.:

104,48 0 149 0.107856547 4160 Q WS 35624860 + 128 [pdflush]
104,48 0 ...
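The adjacency claim about the two dispatched requests (entries 2183 and 2184 in the trace above) can be checked with simple sector arithmetic. The sketch below is only an illustration: `coalesce` is a hypothetical helper, not an elevator algorithm, and it ignores limits a real block layer applies (maximum request size, segment counts, and so on).

```python
def coalesce(requests):
    """Merge (start_sector, nr_sectors) requests that are exactly adjacent.

    A request is back-merged when it starts at the sector where the
    previous one ends. Illustrative only.
    """
    merged = []
    for start, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_len = merged[-1]
            merged[-1] = (prev_start, prev_len + length)
        else:
            merged.append((start, length))
    return merged


# The two "D" (dispatch) entries from the trace: sector + length.
dispatched = [(18540712, 376), (18541088, 248)]

# 18540712 + 376 == 18541088, so the second request begins exactly
# where the first ends and one 624-sector I/O would have covered both.
print(coalesce(dispatched))  # [(18540712, 624)]
```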
I concur with your observation, esp. w.r.t. XFS and CFQ clashing:

http://gus3.typepad.com/i_am_therefore_i_think/2008/07/finding-the-fas.html

CFQ is the default on most Linux systems AFAIK; for decent XFS performance one needs to switch to "noop" or "deadline". I wasn't sure if it was broken code, or simply base assumptions in conflict (XFS vs. CFQ). Your log output sheds light on the matter for me, thanks.
--
I'm wondering if these elevators are just getting too smart for their own good. w.r.t. the above test, deadline was about twice as slow as CFQ - it does immediate dispatch on WRITE_SYNC bios and so caused more seeks than CFQ and hence went slower. noop had similar dispatch latency problems to CFQ, so it wasn't any faster either.

I think that we need to issue explicit unplugs to get the log I/O dispatched the way we want on all elevators and stop trying to give elevators implicit hints by abusing the bio types and hoping they do the right thing....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
FWIW, my explicit plugging idea is still hanging around in one of Jens' block trees (actually he refreshed it a couple of months ago). It provides an API for VM or filesystems to plug and unplug requests coming out of the current process, and it can reduce the need to idle the queue. Needs more performance analysis and tuning though.

But existing plugging is below the level of the elevators, and should only kick in for at most tens of ms at queue idle events, so it sounds like it may not be your problem. Elevators will need some hint to give priority to specific requests -- either via the current thread's io priority, or information attached to bios.
--
We've already got plenty of explicit unplugs in XFS to get stuff

It's getting too bloody complex, IMO. What is right for one elevator is wrong for another, so as a filesystem developer I have to pick one to target. With the way the elevators have been regressing, improving and changing behaviour, I am starting to think that I should be picking the noop scheduler. Any 'advanced' scheduler that is slower than the same test on the noop scheduler needs fixing...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
I don't really see it as too complex. If you know how you want the

I disagree. On devices with no seek penalty or their own queueing, noop is often the best choice. Same for specialized apps that do their own disk scheduling.
--
That is the problem in a nutshell. Nobody can keep up with all the shiny new stuff that is being implemented, let alone the subtle behavioural differences that accumulate through such

Yet they've regularly shown performance regressions because other

A filesystem is nothing but a complex disk scheduler that has to handle vastly larger queues than an elevator. If the filesystem doesn't get its disk scheduling right, then the elevator is irrelevant because nothing will fix the I/O problems in the filesystem algorithms.....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
I'm not sure exactly what you mean.. I certainly have not been keeping up with all the changes here as I'm spending most of my time on other things lately...

But from what I see, you've got a fairly good handle on analysing the elevator behaviour (if only the end result). So if you were to tell Jens that "these blocks" need more priority, or not to contribute to a process's usage quota, etc. then I'm sure improvements could be made.

Is this rhetorical? Because I don't see how *they* could be showing regular performance regressions. Deadline literally had its last behaviour change nearly a year ago, and before that was before recorded (git) history. AS hasn't changed much more frequently, although I will grant that it and CFQ add a lot more complexity. So I would always compare results

I wouldn't say it is so black and white if you have multiple processes submitting IO. You get more opportunities to sort and merge things in the disk scheduler, and you can do things like fairness and anticipatory scheduling.

But if XFS does enough of what you need, then by all means use noop. There is an in-kernel API to change it (although it's designed more for block devices than filesystems so it might not work exactly for you).
--
Only from having to do this analysis over and over again trying to understand what has changed in the elevator that has negated the

It's exactly this sort of complexity that is the problem. When the behaviour of such things changes, filesystems that are optimised for the previous behaviour are not updated - we're not even aware that the elevator has been changed in some subtle manner that breaks the optimisations that have been done.

To keep on top of this, we keep adding new variations and types and expect the filesystems to make best use of them (without documentation) to optimise for certain situations. Example - the new(ish) BIO_META tag that only CFQ understands. I can change the way XFS issues bios to use this tag to make CFQ behave the same way it used to w.r.t. metadata I/O from XFS, but then the deadline and AS will probably regress because they don't understand that tag and still need the old optimisations that just got removed. Ditto for prioritised bio dispatch - CFQ supports it but none of the others do.

IOWs, I am left with a choice - optimise for a specific elevator (CFQ) to the detriment of all others (noop, as, deadline), or make the filesystem work best with the simple elevator (noop) and consider the smarter schedulers deficient if they are slower than

You're suggesting that I add complexity to solve the too much complexity

I get private email fairly often asking questions as to why XFS is slower going from, say, 2.6.23 to 2.6.24 and then speeds back up in 2.6.25. I've seen a number of cases where the answer to this was that elevator 'x' with XFS in 2.6.x because for some reason it is much, much slower than the others on that workload on that hardware.

As seen earlier in this thread, this can be caused by a problem with the hardware, firmware, configuration, driver bugs, etc - there are so many combinations of variables that can cause performance issues that often the only 'macro' level change that you can make to avoid them is to switch schedulers.
...
I don't know why AS or DL would regress though. What old optimizations

I don't think this is necessarily such a bad thing to do. It would be very helpful of course if you could report the workloads where one is slower than noop so that we can work out what is going wrong and

Actually, if it's too much complexity that's the problem for you, then I

Fair enough, and you're saying noop isn't so fragile to these other things changing. I would expect deadline to be pretty good too, in

Well then I don't have a good answer, sorry :P
--
There's nothing wrong with adding BIO_META (for example) and other hints in _principle_. You should be able to ignore it with no adverse effects. If it's not used by a filesystem (and there's nothing else competing to use the same disk), I would hope to see the same performance as other kernels which don't have it.

If the elevators are being changed in such a way that old filesystem code which doesn't use new hint bits is running significantly slower, surely that's blatant elevator regression, and that's where the bugs should be reported and fixed?

-- Jamie
--
Right, but it's what we need to do to make use of that optimisation that is the problem. For XFS, it needs to replace the current BIO_SYNC hints we use (even for async I/O) to get metadata dispatched quickly. i.e. CFQ looks at the sync flag first then the meta flag. Hence to take advantage of it, we need to remove the BIO_SYNC hints we currently use which will change the behaviour on all other elevators as a side effect.

This is the optimisation problem I'm referring to - the BIO_SYNC usage was done years ago to get metadata dispatched quickly because that is what all the elevators did with sync I/O. Now to optimise for CFQ we need to remove that BIO_SYNC optimisation which is still

Sure, but in reality getting ppl to go through the pain of triage is extremely rare because it only takes 10s to change elevators and make the problem go away...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
it sounds as if the various flag definitions have been evolving, would it be worthwhile to step back and try to get the various filesystem folks to brainstorm together on what types of hints they would _like_ to see supported?

it sounds like you are using 'sync' for things where you really should be saying 'metadata' (or 'journal contents'), it's happened to work well enough in the past, but it's forcing you to keep tweaking the filesystems. it may be better to try and define things from the filesystem point of view and let the elevators do the tweaking.

basically I'm proposing a complete rethink of the filesystem <-> elevator interface.

David Lang
--
Three types:

1. immediate dispatch - merge first with adjacent requests then dispatch
2. delayed dispatch - queue for a short while to allow merging of requests from above
3. bulk data - queue and merge. dispatch is completely controlled by the elevator

Basically most metadata and log writes would fall into category 2, with every logbufs/2 log writes or every log force using a category 1 to prevent log I/O from being stalled too long by other I/O.

Data writes from the filesystem would appear as category 3 (read and write) and are subject to the specific elevator scheduling. That is, things like the CFQ ionice throttling would work on the bulk data queue, but not the other queues that the filesystem is using for metadata.

Tagging the I/O as a sync I/O can still be done, but that only affects category 3 scheduling - category 1 or 2 would do the same

Right, because there was no 'metadata' tagging, and 'sync' happened

Yeah, I've been saying that for a while w.r.t. the filesystem/block layer interfaces, esp. now with discard requests, data integrity, device alignment information, barriers, etc being exposed by the layers below the filesystem, but with no interface for filesystems to be able to access that information...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
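The three dispatch categories proposed above could be expressed as a small classification step in front of the elevator. The sketch below is purely hypothetical: no such interface exists in the thread or (to my knowledge) in the kernel, and `DispatchClass`, `classify`, `log_seq` and the `kind` strings are invented names. It also assumes one reading of the message, namely that every (logbufs/2)-th log write and every log force is promoted to category 1.

```python
from enum import IntEnum


class DispatchClass(IntEnum):
    """Hypothetical hint a filesystem would attach to each I/O."""
    IMMEDIATE = 1  # merge with adjacent requests, then dispatch at once
    DELAYED = 2    # hold briefly so requests from above can merge
    BULK = 3       # queue and merge; the elevator decides when to dispatch


def classify(kind, log_seq=0, logbufs=8, log_force=False):
    """Pick a dispatch class for one I/O.

    kind      -- "data", "metadata" or "log" (illustrative labels)
    log_seq   -- running count of log writes, starting at 1
    logbufs   -- number of log buffers (assumed tunable)
    log_force -- True when the write comes from a log force
    """
    if kind == "data":
        return DispatchClass.BULK
    if kind == "metadata":
        return DispatchClass.DELAYED
    if kind == "log":
        interval = max(logbufs // 2, 1)
        # Promote periodically so log I/O cannot be stalled for long.
        if log_force or (log_seq > 0 and log_seq % interval == 0):
            return DispatchClass.IMMEDIATE
        return DispatchClass.DELAYED
    raise ValueError(f"unknown I/O kind: {kind!r}")
```

With such a split, per-process throttling (for example the CFQ ionice handling mentioned above) would only need to look at `BULK` requests.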
does this list change if you consider the fact that there may be a raid array or some more complex structure for the block device instead of a simple single disk partition? since I am suggesting re-thinking the filesystem <-> elevator interface, is there anything you need to have the elevator tell the filesystem? (I'm thinking that this may be the path for the filesystem to learn things about the block device that's under it, is it a raid array, a solid-state drive, etc) --
No. The whole point of immediate dispatch is that those I/Os are extremely latency sensitive (i.e. whole fs can stall waiting for them), so it doesn't matter what the end target is. The faster the storage subsystem, the more important it is to dispatch those

Not so much the elevator, but the block layer in general. That is:

- capability reporting
  - barriers and type
  - discard support
  - integrity support
  - maximum number of I/Os that can be in flight before congestion occurs
- geometry of the underlying storage
  - independent domains within the device (e.g. boundaries of linear concatenations)
  - stripe unit/width per domain
  - optimal I/O size per domain
  - latency characteristics per domain
- notifiers to indicate change of status due to device hotplug back up to the filesystem
  - barrier status change
  - geometry changes due to on-line volume modification (e.g. raid5/6 rebuild after adding a new disk, added another disk to a linear concat, etc)

I'm sure there's more, but that's the list quickly off the top of my head.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
I did some compilebench runs with xfs this morning, creating 30 kernel trees on the same machine I posted btrfs and xfs numbers with last week. Btrfs gets between 60 and 75MB/s average depending on the mount options used, ext4 gets around 60MB/s

This is a single sata drive that can run at 100MB/s streaming writes.

The numbers show XFS is largely log bound, and that turning off barriers makes a huge difference. I'd be happy to try another run with explicit unplugging somewhere in the transaction commit path.

I think the most relevant number is the count of MB written at the end of blkparse. I'm not sure why the 4ag XFS writes less, but the numbers do include calling sync at the end. None of the filesystems were doing barriers in these numbers:

Ext4                                  9036MiB
Btrfs metadata dup                    9190MiB
Btrfs metadata dup no inline files   10280MiB
XFS 4ag, nobarrier                   14299MiB
XFS 1ag, nobarrier                   17836MiB

This is a long way of saying the xfs log isn't optimal for these kinds of operations, which isn't really news. I'm not ripping on xfs here, this is just one tiny benchmark.

I uploaded some graphs of the IO here:

http://oss.oracle.com/~mason/seekwatcher/compilebench-30/xfs

XFS:

*** 4ag, 128m log, logbsize=256k
intial create total runs 30 avg 7.48 MB/s (user 0.52s sys 1.04s)

*** 4ag, 128m log, logbsize=256k, nobarrier
intial create total runs 30 avg 21.58 MB/s (user 0.51s sys 1.04s)
http://oss.oracle.com/~mason/seekwatcher/compilebench-30/xfs/xfs-4ag-nobarrier.png

*** 1ag, 128m log, logbsize=256k, nobarrier
intial create total runs 30 avg 26.28 MB/s (user 0.50s sys 1.15s)
http://oss.oracle.com/~mason/seekwatcher/compilebench-30/xfs/xfs-nobarrier-1ag.png

It is hard to see in the graph, but it looks like the log is in the first 128MB of the drive. If we give XFS an external log device:

*** 1ag 128m external log, logbsize=256k, nobarrier
intial create total runs 30 avg 38.44 MB/s (user 0.51s ...
One thing I just found out - my old *laptop* is 4-5x faster than the 10krpm scsi disk behind an old cciss raid controller. I'm wondering if the long delays in dispatch is caused by an interaction with CTQ but I can't change it on the cciss raid controllers. Are you using ctq/ncq on your machine? If so, can you reduce the depth to something less than 4 and see what difference that makes? Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I've been benchmarking on a cciss card, and patched the driver to control the queue depth via sysfs. Maybe you'll find it useful... The original patch was for 2.6.24, but that won't apply on git head. I fixed it for 2.6.27, and it seems to work fine. Both are attached. -- Aaron
Just to point out - this is not a new problem - I can reproduce it on 2.6.24 as well as 2.6.26. Likewise, my laptop shows XFS being faster than ext3 on both 2.6.24 and 2.6.26. So the difference is something related to the disk subsystem on the server.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Interesting. I switched from cfq to deadline some time ago, due to abysmal XFS performance on parallel IO - aptitude upgrade and doing desktop stuff. Just my subjective perception, but I have seen it crawl, even stall for 5-10 seconds easily at times. I found deadline to be way faster initially, but then it rarely happened that IO for desktop tasks is basically stalled for even longer, say 15 seconds or more, on parallel IO. However I can't remember having this problem with the last kernel 2.6.26.2.

I am now testing with cfq again. On a ThinkPad T42 internal 160 GB harddisk with barriers enabled. But you say it only happens on certain servers, so I might have seen something different.

Thus I had the rough feeling that something is wrong with at least CFQ and XFS together, but I couldn't prove it back then. I have no idea how to easily do a reproducible test case. Maybe having a script that unpacks kernel source archives while I try to use the desktop...

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
--
Okay, some numbers attached:

- On XFS: Barrier versus nobarrier makes quite a difference with compilebench, and also when rm -rf'ing the large directory tree it leaves behind. While I did not measure the first barrier-related compilebench directory deletion, I am pretty sure it took way longer. Also vmstat throughput is higher without barriers.

- On XFS: CFQ versus NOOP does not seem to make that much of a difference, at least not with barriers enabled (didn't test without). With NOOP responsiveness was even weaker than with CFQ. Opening a context menu on a webpage link displayed in Konqueror could easily take a minute or more. I think it shall never ever take that long for the OS to respond to user input.

- Ext3, NILFS, BTRFS with CFQ: Perform quite well. Especially btrfs. The nilfs test isn't complete, likely because, due to checkpoints, those 4G I dedicated to it were not enough for the compilebench test to complete.

So at least here performance degradation with XFS seems more related to barriers than to the scheduler decision, at least when it comes to the two choices CFQ and NOOP. But no, I won't switch barriers off permanently on my laptop. ;) It would be fine if the performance impact of barriers could be reduced a bit, though.

At last I appear to see something different than the I/O scheduler issue discussed here.

Anyway subjectively I am quite happy with XFS performance nonetheless. But then since I can't switch from XFS to ext3 or btrfs in a second I can't really compare subjective impressions. Maybe desktop would respond faster with ext3 or btrfs? Who knows?

I think a script which does extensive automated testing would be fine:

- have some basic settings like
  SCRATCH_DEV=/dev/sda8 (this should be a real partition in order to be able to test barriers which do not work over LVM / device mapper)
  SCRATCH_MNT=/mnt/test

- have an array of pre-pre-test setups like
  [ echo "cfq" >/sys/block/sda/queue/scheduler ]
  [ echo "deadline" >/sys/block/sda/queue/scheduler ...
It's a laptop and has NCQ. It makes no difference if NCQ is enabled or

XFS definitely stalls somewhere: stats show virtually no CPU usage and no time spent waiting for IO. No file system produces similar output.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy  id wa
 0  0      0 3146180   7848 600868    0    0     0  4128  790  549  0  2  98  0
 0  0      0 3145200   7848 601524    0    0     0  2372  766  516  0  2  98  0
 1  0      0 3144328   7848 602260    0    0     0  2924  792  542  1  2  98  0
 0  1      0 3143824   7856 602664    0    0     0  4116  732  426  0  2  53 45
 1  0      0 3143068   7856 603136    0    0     0  4676  756  534  0  3  95  1
 0  0      0 3142652   7856 603540    0    0     0  6577  756  436  0  0 100  0
 0  0      0 3141952   7856 604100    0    0     0  5840  764  498  1  3  96  0
 0  0      0 3141424   7856 604544    0    0     0  4752  761  386  0  0  99  0
 0  0      0 3140860   7856 604916    0    0     0  6477  785  495  0  1  98  0
 0  0      0 3139980   7856 605468    0    0     0  2840  743  370  1  2  97  0
 0  0      0 3138464   7856 606884    0    0     0  4902  795  421  0  4  96  0
 0  0      0 3137636   7856 607696    0    0     0  4364  739  395  0  1  99  0
 0  0      0 3136520   7856 608220    0    0     0  6160  774  566  0  2  97  0

Szaka
--
NTFS-3G: http://ntfs-3g.org
--
The 'nobarrier' mount option made a big improvement:
                      MB/s    Runtime (s)
                      -----   -----------
btrfs unstable        17.09        572
ext3                  13.24        877
btrfs 0.16            12.33        793
nilfs2 2nd+ runs      11.29        674
ntfs-3g                8.55        865
reiserfs               8.38        966
xfs nobarrier          7.89        949
nilfs2 1st run         4.95       3800
xfs                    1.88       3901
Szaka
--
NTFS-3G: http://ntfs-3g.org
--
Interesting. Barriers make only a little difference on my laptop; 10-20% slower. But yes, barriers will have this effect on XFS.

If you've got NCQ, then you'd do better to turn off write caching on the drive, turn off barriers and use NCQ to give you back the performance that the write cache used to. That is, of course, assuming the NCQ implementation doesn't suck....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
See my other post with performance numbers:
Barriers appear to make more than 50% difference on my laptop for some
operations; on some other operations they hardly make a difference at all -
I bet it goes slow mainly when creating or deleting lots of small files.
Looking at vmstat 1 during a rm -rf of a compilebench leftover directory
while switching off barriers shows a difference of even more than 50% in
metadata throughput.
It has this controller
00:1f.1 IDE interface: Intel Corporation 82801DBM (ICH4-M) IDE Controller
(rev 01)
and this drive
---------------------------------------------------------------------
shambhala:~> hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: Hitachi HTS541616J9AT00
Serial Number: SB0442SJDVDDHH
Firmware Revision: SB4OA70H
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 1
Supported: 7 6 5 4
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 312581808
device size with M = 1024*1024: 152627 MBytes
device size with M = 1000*1000: 160041 MBytes (160 GB)
Capabilities:
LBA, IORDY(can be disabled)
Standby timer values: spec'd by Vendor, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Advanced power management level: 254
Recommended acoustic management value: 128, current value: 128
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=240ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART ...

Write cache off, nobarrier and AHCI NCQ lowered the XFS result:
                              MB/s    Runtime (s)
                              -----   -----------
btrfs unstable                17.09        572
ext3                          13.24        877
btrfs 0.16                    12.33        793
ntfs-3g unstable              11.52        673
nilfs2 2nd+ runs              11.29        674
reiserfs                       8.38        966
xfs nobarrier                  7.89        949
nilfs2 1st run                 4.95       3800
xfs nobarrier, ncq, wc off     3.81       1973
xfs                            1.88       3901
Szaka
--
NTFS-3G: http://ntfs-3g.org
--
Retested with a different disk, SATA-II, NCQ, capable of 70-110 MB/s
read/write:
                              MB/s    Runtime (s)
                              -----   -----------
btrfs unstable, no dup        51.42        168
btrfs unstable                42.67        197
ext4 2.6.26                   35.63        245
nilfs2 2nd+ runs              26.43        287
ntfs-3g unstable              21.41        370
ext3                          19.92        559
xfs nobarrier                 14.17        562
reiserfs                      13.11        595
nilfs2 1st run                12.06       3719
xfs nobarrier, ncq, wc off     6.89       1070
xfs                            1.95       3786
Szaka
--
NTFS-3G: http://ntfs-3g.org
--
I don't think that's going to make a difference when using CFQ. I did some tests that showed that CFQ would never issue more than one IO at a time to a drive. This was using sixteen userspace threads, each doing a 4k direct I/O to the same location. When using noop, I would get 70k IOPS and when using CFQ I'd get around 40k IOPS.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step."
--
Not obviously the same sort of issue. The traces clearly show multiple nested dispatches and completions so CTQ is definitely active...

Anyway, after a teeth-pulling equivalent exercise of finding the latest firmware for the machine in a format I could apply, I upgraded the firmware throughout the machine (disks, raid controller, system, etc) and XFS is a *lot* faster. In fact - mostly back to +/- a small amount compared to ext3.

run complete:
==========================================================================
                                 avg MB/s         user          sys
                          runs   xfs     ext3    xfs   ext3   xfs   ext3
intial create total        30    6.36    6.29   4.48   3.79   7.03   5.22
create total                7    5.20    5.68   4.47   3.69   7.34   5.23
patch total                 6    4.53    5.87   2.26   1.96   6.27   4.86
compile total               9   16.46    9.61   1.74   1.72   9.02   9.74
clean total                 4  478.50  553.22   0.09   0.06   0.92   0.70
read tree total             2   13.07   15.62   2.39   2.19   3.68   3.44
read compiled tree          1   53.94   60.91   2.57   2.71   7.35   7.27
delete tree total           3   15.94s   6.82s  1.38   1.06   4.10   1.49
delete compiled tree        1   24.07s   8.70s  1.58   1.18   5.56   2.30
stat tree total             5    3.30s   3.22s  1.09   1.07   0.61   0.53
stat compiled tree total    3    2.93s   3.85s  1.17   1.22   0.59   0.55

The blocktrace looks very regular, too. All the big bursts of dispatch and completion are gone as are the latencies on log I/Os. It would appear that ext3 is not sensitive to concurrent I/O latency like XFS is...

At this point, I'm still interested to know if the original results had ctq/ncq enabled and, if so, whether it is introducing latencies or not.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
I'd expect that nilfs continues to win postmark. Btrfs splits data and metadata into different parts of the disk, so at best btrfs is going to produce two streams of writes into the SSD while nilfs is doing one. Most consumer ssds still benefit from huge writes, and so nilfs is pretty optimal in that case. The main benefit of the split for btrfs is being able to have different duplication policies for metadata and data, and faster fsck times because the metadata is more compact. Over time that may prove less relevant on SSD, and changing it in btrfs is just flipping a few bits during allocation. -chris --
Interesting approach. Does that mean that every block lookup involves Is this userland daemon really necessary? I do all that stuff in kernelspace and the amount of code I have is likely less than would be necessary for the userspace interface alone. Apart from creating a plethora of research papers, I never saw much use for pluggable cleaners. Did you encounter any nasty deadlocks and how did you solve them? Finding deadlocks in the vfs-interaction became a hobby of mine when testing logfs and at least one other lfs seems to have had similar problems - they exported the inode_lock in their patch. ;) Jörn -- Consensus is no proof! -- John Naisbitt --
Simply stated, yes. But the actual number of disk accesses will become fewer because the DAT is cached like regular files and read-ahead is also applied. Well, that sounds reasonable. Still I cannot say which is better for now. My colleague intends to develop other types of cleaners, and another colleague experimentally made a cleaner with a GUI. In addition, there are possibilities to integrate attractive features Yeah, it was a very tough battle :) Read is OK. But write was hard. I looked at the vfs code over and over again. We've implemented NILFS without bringing specific changes into vfs. However, if we can find a common basis for LFSes, I'm glad to cooperate with you. Though I don't know whether exporting inode_lock is the case or not ;) Regards, Ryusuke Konishi --
Well, I was looking more for something like a list of problems and solutions. Partially because I am plain curious and partially because I know those are the problem areas of any log-structured filesystem and they deserve special attention in a review.

In logfs, garbage collection may read (and write) any inode and any block from any file. And since garbage collection may be called from writepage() and write_inode(), the fun included:

P: iget() on the inode being currently written back and locked.
S: Split I_LOCK into I_LOCK and I_SYNC. Has been merged upstream.

P: iget() on an inode in I_FREEING or I_WILL_FREE state.
S: Add inodes to a list in drop_inode() and remove them again in destroy_inode(). iget() in GC context is wrapped in a method that checks said list first and returns an inode from the list when applicable. Used to hold inode_lock to prevent races, but a logfs-local lock is actually sufficient.

If either of the two problems above is solved by calling ilookup5_nowait() I bet you a fiver that a race with data corruption is lurking somewhere in the area.

P: find_get_page() or some variant on a page handed to logfs_writepage().
S: Use the one available page flag, PG_owner_priv_1, to mark pages that are waiting for the single-threaded logfs write path. If any page GC needs is locked, check for PG_owner_priv_1 and if it is set, just use the page anyway. Whoever has set the flag cannot clear it until GC has finished. If the flag is not set, the page might still be somewhere in the logfs write path - before setting the page. So simply do the check in a loop, call schedule() each time, knock on wood and keep your fingers crossed that the page will either become unlocked or get PG_owner_priv_1 set sometime soon. I'm not proud of this solution but know no better one.

So something like the above for nilfs would be useful. And maybe, just to be on the safe side, try the following testcase overnight:
- Create tiny ...
Yep. It is not a bad tradeoff. You pay with some extra seeks when the filesystem is freshly mounted but gain a lot of simplicity in garbage collection. More questions. I believe the first two answers are no, but would like to be sure. Do you support compression? Do you do wear leveling or scrubbing? How does garbage collection work? In particular, when the filesystem runs out of free space, do you depend on the userspace daemon to make some policy decisions or can the kernel make progress on its own? Jörn -- There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other is to make it so complicated that there are no obvious deficiencies. -- C. A. R. Hoare --
Hi, Jorn I'll reply from the latter mail. NILFS does not support scrubbing (as you guessed). Under the current GC daemon, it writes logs sequentially and circularly in the partition, and as you know, this leads to the wear levelling The GC of NILFS depends on the userspace daemon to make policy decisions. NILFS cannot reclaim disk space on its own though it can work (i.e. read, write, or do other operations) without the daemon. After it runs out of free space, disk full errors will be returned until GC makes new space. But, usually the GC will make enough disk space in the background before that occurs. The userland GC daemon, which runs in the background, starts to reclaim logs (to be precise, segments) if there are logs (segments) whose age is older than a certain period, which we call the ``protection period''. If no such logs are found, it goes to sleep. Regards, Ryusuke --
I don't see how that would cope with file systems that have a lot of static data. The classic problem of most cheap devices that implement wear leveling in hardware is that they never move data in an erase block that is used for read-only data. If 90% of the file system is read-only, your wear leveling will only work on 10% of the medium, wearing it down 10 times faster than it should. Can the GC daemon handle this case, e.g. by moving around aging read-only erase blocks? Arnd <>< --
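The arithmetic behind the "10 times faster" figure can be written out as a small sketch. It assumes writes are spread evenly over the part of the medium that is not pinned by static data, and that static blocks are never relocated; the function name `wear_multiplier` is made up for illustration.

```python
def wear_multiplier(static_fraction: float) -> float:
    """Return how much faster the writable region wears out compared
    with ideal leveling over the whole medium.

    Assumption: all writes land evenly on the (1 - static_fraction)
    share of the medium that does not hold read-only data.
    """
    if not 0.0 <= static_fraction < 1.0:
        raise ValueError("static_fraction must be in [0, 1)")
    return 1.0 / (1.0 - static_fraction)


if __name__ == "__main__":
    # 90% read-only data leaves 10% of the medium to absorb every write.
    print(round(wear_multiplier(0.9), 2))
```

With 90% static data the multiplier comes out at about 10, matching the example in the message; with 98% static data it is about 50.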
Yeah, exactly. Thank you for this comment. To minimize aging of the device itself, the userland GC daemon would need another cleaning policy. So, in that sense, the answer to the above question is NO. Since the primary purpose of NILFS is providing continuous snapshotting, the GC is not necessarily designed with such a requirement in mind. Regards, Ryusuke Konishi --
No shame in that. One fine day I would like to have a filesystem that combines all the neat tricks from the half-dozen new filesystems that are currently under development. Until then, people will simply have to pick which one matches their personal requirements best. Jörn -- There are three principal ways to lose money: wine, women, and engineers. While the first two are more pleasant, the third is by far the more certain. -- Baron Rothschild --
I am a bit confused here. My picture of log-structured filesystems was always that writes go round-robin _within_ a segment, but new segments can be picked in any order. So there is a good chance of some segments simply never being picked and others constantly being reused. If nilfs works in the same way, it will by design spread the writes somewhat better than ext3, to pick an example, but can still lead to local wear-out if f.e. 98% of the filesystem is full and the remaining 2% receive a high write load. True wear leveling requires a bit more work. Either some probabilistic garbage collection of any random segment, as jffs2 does, or storing some This looks problematic. In logfs I was very careful to define a "filesystem full" condition that is independent of GC. So with a single writer, -ENOSPC always means the filesystem is full and the only way to gain some free space is by deleting data again. In nilfs it appears possible that a single writer received -ENOSPC and can simply continue writing until - magically - there is space again because the GC daemon woke up and freed some more. That is unexpected, to say the least. Which is also one of the reasons why I don't like the userspace daemon approach very much. Decent behaviour now requires that you block the writes, wake up the userspace daemon and wait for it to do its job. Or you would have to implement a backup-daemon in kernelspace which gets Usually, yes. You just have to make sure that in the unusual cases the filesystem continues to behave correctly. ;) Jörn -- Homo Sapiens is a goal, not a description. -- unknown --
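The "probabilistic garbage collection of any random segment" idea mentioned above could look roughly like the following sketch. This is not code from jffs2, logfs, or nilfs: the function `pick_segment`, the `live_blocks` list, and the default probability are all assumptions chosen for illustration. Most of the time the cleaner takes the cheapest segment (fewest live blocks); occasionally it takes an arbitrary one, so that segments holding static data also get rewritten now and then.

```python
import random


def pick_segment(live_blocks, rng, random_pick_probability=0.01):
    """Choose the index of the next segment to clean.

    live_blocks[i] is the number of still-valid blocks in segment i
    (hypothetical bookkeeping; a real filesystem tracks this on disk).
    """
    if not live_blocks:
        raise ValueError("no segments to choose from")
    # Rare case: clean any segment, even a full and static one,
    # so that its erase blocks take part in wear leveling.
    if rng.random() < random_pick_probability:
        return rng.randrange(len(live_blocks))
    # Common case: greedy choice, least live data to copy.
    return min(range(len(live_blocks)), key=live_blocks.__getitem__)


if __name__ == "__main__":
    rng = random.Random(42)
    segments = [120, 3, 128, 64]
    print(pick_segment(segments, rng))
```

The trade-off is that every random pick costs extra copying of live data, so the probability has to stay small.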
As a side remark, the GC of nilfs runs in the background; it is not started after the filesystem runs out of free space. Basically the intended meaning of -ENOSPC is the same; it does not mean the GC is ongoing, but means that deletion is required. Of course this depends on the condition that the GC has been working with enough speed, so the meaning is not strictly assured. But, at least I won't return -ENOSPC so easily, and will deal with it more politely if needed. On the other hand, there are some differences in premise because nilfs is aiming at racking up past user data and makes it a top priority to keep data which is overwritten by recent updates. If users want to preserve a lot of data in nilfs, disk-full conditions become more likely than on regular file systems. Cheers, Ryusuke --
Hm, good point. With continuous snapshots the rules of the game change considerably. So maybe it is ok to depend on the userspace daemon here, because the space is unreclaimable anyway. What is the policy on deleting continuous snapshots? Or can it even be configured by the administrator (which would be cool)? Jörn -- The cheapest, fastest and most reliable components of a computer system are those that aren't there. -- Gordon Bell, DEC laboratories --
First, nilfs never deletes checkpoints marked as snapshots, nor recent checkpoints whose elapsed time since creation is smaller than the ``protection period''. These are ground rules. Based on the rules, the userland GC daemon can delete arbitrary checkpoints among removable checkpoints. But the current GC just deletes the removable checkpoints in chronological order. More sophisticated policies, for example one that detects landmark checkpoints and tries to keep them (a known policy in versioning filesystems), may be conceivable. But I feel the current policy is simple and satisfactory, so I'd like to leave others to someone who wants to implement them (e.g. one of my colleagues). Regards, Ryusuke Konishi --
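The ground rules and the chronological policy described above can be modelled in a few lines. This is only a sketch of the stated rules, not the nilfs cleaner: the `Checkpoint` class, its field names, and the `limit` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    cno: int                # checkpoint number (hypothetical field name)
    ctime: float            # creation time in seconds
    snapshot: bool = False  # True if the user marked it as a snapshot


def is_removable(cp, now, protection_period):
    """Ground rules: snapshots and checkpoints younger than the
    protection period are never deleted."""
    return not cp.snapshot and (now - cp.ctime) >= protection_period


def pick_victims(checkpoints, now, protection_period, limit):
    """Current policy as described: delete removable checkpoints in
    chronological order, oldest first, up to `limit` per pass."""
    candidates = [cp for cp in checkpoints
                  if is_removable(cp, now, protection_period)]
    candidates.sort(key=lambda cp: cp.ctime)
    return candidates[:limit]


if __name__ == "__main__":
    cps = [Checkpoint(1, 0.0), Checkpoint(2, 10.0, snapshot=True),
           Checkpoint(3, 20.0), Checkpoint(4, 95.0)]
    for cp in pick_victims(cps, now=100.0, protection_period=60.0, limit=10):
        print(cp.cno)
```

A landmark-preserving policy of the kind mentioned in the message would replace only the sort-and-slice step in `pick_victims`; the ground rules in `is_removable` would stay the same.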
How stable is the on-disk format? If the file system makes mainline your user base would likely increase significantly. Users then tend to have a reasonable expectation that they can still mount old file systems later on newer kernels (although not necessarily the other way round) -Andi --
It's almost stable. I hope not to make any major change to the on-disk format that would affect compatibility. Some unsupported features like atime, EA, and ACLs should be carefully confirmed before merging to the mainline though Yes I know. Thank you for the important advice. --
