Hello everyone,
Btrfs v0.16 is available for download, please see
http://btrfs.wiki.kernel.org/ for download links and project
information.

v0.16 has a shiny new disk format, and is not compatible with
filesystems created by older Btrfs releases. But, it should be the
fastest Btrfs yet, with a wide variety of scalability fixes and new
features.

There were quite a few contributors this time around, but big thanks to
Josef Bacik and Yan Zheng for their help on this release. Toei Rei also
helped track down an important corruption problem.

Scalability and performance:
* Fine grained btree locking. The large fs_mutex is finally gone.
There is still some work to do on the locking during extent allocation,
but the code is much more scalable than it was.

* Helper threads for checksumming and other background tasks. Most CPU
intensive operations have been pushed off to helper threads to take
advantage of SMP machines. Streaming read and write throughput now
scale to disk speed even with checksumming on.

* Improved data=ordered mode. Metadata is now updated only after all
the blocks in a data extent are on disk. This allows btrfs to provide
data=ordered semantics without waiting for all the dirty data in the FS
to flush at commit time. fsync and O_SYNC writes do not force down all
the dirty data in the FS.

* Faster cleanup of old transactions (Yan Zheng). A new cache now
dramatically reduces the amount of IO required to clean up and delete
old snapshots.

Major features (all from Josef Bacik):
* ACL support. ACLs are enabled by default, no special mount options
required.

* Orphan inode prevention, no more lost files after a crash

* New directory index format, fixing some suboptimal corner cases in the
original.

There are still more disk format changes planned, but we're making every
effort to get them out of the way as quickly as we can. You can see the
major features we have planned on the development timeline:
Can this lead to the same Priority Inversion issues as seen with
kjournald?
--
I was able to get it mostly lockdep compliant by using mutex_lock_nested
based on the level of the btree I was locking. My allocation mutex is a

Yes, although in general only the helper threads end up actually doing
the IO for writes. Unfortunately, they are almost but not quite an
elevator. It is tempting to try sorting the bios on the helper queues
etc. But I haven't done that because it gets into starvation and other
fun.

I haven't done any real single cpu testing, it may make sense in those
workloads to checksum and submit directly in the calling context. But
real single cpu boxes are harder to come by these days.

-chris
--
[just jumping in as a casual bystander with one remark]
For this purpose it seems booting up with limiting to one CPU should be
sufficient.

Tvrtko
--
They're still pretty common in the embedded/low power space. I could
see something like a settop box wanting to use btrfs with massive disks.

Chris
--
Just took a peek, seems to be slightly out of date as it still lists the
single mutex thingy.

Also, how true is the IO-error and disk-full claim?
--
Thanks, I thought I had removed all the references to it on that page,
We still don't handle disk full. The IO errors are handled most of the
time. If a checksum doesn't match or the lower layers report an IO
error, btrfs will use an alternate mirror of the block. If there is no
alternate mirror, the caller gets EIO and in the case of a failed csum,
the page is zero filled (actually filled with ones so I can find bogus
pages in an oops).

Metadata is duplicated by default even on single spindle drives, so this
means that metadata IO errors are handled as long as the other mirror is
happy.

If mirroring is off or both mirrors are bad, we currently get into
trouble.

Data pages work better, those errors bubble up to userland just like in
other filesystems.

-chris
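
A minimal sketch of the fallback behaviour described above, written in Python rather than kernel C. The `read_block` helper is hypothetical, a copy that the lower layers failed to read is represented as `None`, and `zlib.crc32` is only a stand-in for the real checksum; none of these names come from the btrfs source.

```python
import errno
import zlib


def read_block(mirrors, expected_csum, block_size=4096):
    """Return (status, data) for one block, trying each mirror in turn.

    mirrors: list of bytes objects, or None where the lower layers
             reported an IO error for that copy (illustrative model only).
    """
    saw_bad_csum = False
    for copy in mirrors:
        if copy is None:
            # IO error on this copy, move on to the alternate mirror.
            continue
        if zlib.crc32(copy) == expected_csum:
            return 0, copy
        saw_bad_csum = True

    # No good copy left: the caller gets EIO. After a checksum failure
    # the page is filled with ones so bogus pages are easy to spot.
    page = b"\xff" * block_size if saw_bad_csum else None
    return -errno.EIO, page


good = b"a" * 4096
csum = zlib.crc32(good)
print(read_block([None, good], csum)[0])   # second mirror is used
print(read_block([b"b" * 4096], csum)[0])  # no alternate mirror
```

With two copies the first call succeeds from the second mirror; with a single bad copy the caller sees a negative EIO status and a page of ones.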
--
Can you please say a bit how much that impacts performance? That sounds
costly.

-Andi
--
Most metadata is allocated in groups of 128k or 256k, and so most of the
writes are nicely sized. The mirroring code has areas of the disk
dedicated to mirror other areas. So we end up with something like this:

metadata chunk A (~1GB in size)
[ ......................... ]

mirror of chunk A (~1GB in size)
[ ......................... ]

So, the mirroring turns a single large write into two large writes.
Definitely not free, but always a fixed cost.

I started to make some numbers of this yesterday on single spindles and
discovered that my worker threads are not doing as good a job as they
should be of maintaining IO ordering. I've been using an array with a
writeback cache for benchmarking lately and hadn't noticed.

I need to fix that, but here are some numbers on a single sata drive.
The drive can do about 100MB/s streaming reads/writes. Btrfs
checksumming and inline data (tail packing) are both turned on.

Single process creating 30 kernel trees (2.6.27-rc2)

Btrfs defaults     36MB/s
Btrfs no mirror    50MB/s
Ext4 defaults      59.2MB/s (much better than ext3 here)

With /sys/block/sdb/queue/nr_requests at 8192 to hide my IO ordering
submission problems:

Btrfs defaults:    57MB/s
Btrfs no mirror:   61.51MB/s

-chris
--
I spent a bunch of time hammering on different ways to fix this without
increasing nr_requests, and it was a mixture of needing better tuning in
btrfs and needing to init mapping->writeback_index on inode allocation.

So, today's numbers for creating 30 kernel trees in sequence:

Btrfs defaults                    57.41 MB/s
Btrfs dup no csum                 74.59 MB/s
Btrfs no duplication              76.83 MB/s
Btrfs no dup no csum no inline    76.85 MB/s

Ext4 data=writeback, delalloc     60.50 MB/s

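
As a worked example of reading these figures, the short Python sketch below computes how much lower the default (duplicated, checksummed) throughput is than the other configurations. The throughput values are copied from the list above; the `slowdown_pct` helper is a hypothetical name used only for this illustration.

```python
# Throughput in MB/s, copied from the results above.
results = {
    "Btrfs defaults": 57.41,
    "Btrfs dup no csum": 74.59,
    "Btrfs no duplication": 76.83,
    "Btrfs no dup no csum no inline": 76.85,
    "Ext4 data=writeback, delalloc": 60.50,
}


def slowdown_pct(slow, fast):
    """Percentage by which `slow` falls short of `fast`."""
    return round(100.0 * (1.0 - slow / fast), 1)


defaults = results["Btrfs defaults"]

# Defaults versus the same run without metadata duplication.
dup_cost = slowdown_pct(defaults, results["Btrfs no duplication"])

# Defaults versus duplication kept but checksumming turned off.
csum_cost_with_dup = slowdown_pct(defaults, results["Btrfs dup no csum"])

print(f"defaults vs no duplication: {dup_cost}% lower")
print(f"defaults vs dup without csum: {csum_cost_with_dup}% lower")
```

On this single benchmark the defaults come out roughly a quarter below either alternative, so the cost only shows up when duplication and checksumming are both on.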
I may be able to get the duplication numbers higher by tuning metadata
writeback. My current code doesn't push metadata throughput as high in
order to give some spindle time to data writes.

This graph may give you an idea of how the duplication goes to disk:
http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-default.png
Compared with the result of mkfs.btrfs -m single (no duplication):
http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-single.png
Both on one graph is a little hard to read:
http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-dup-compare.png
Here is btrfs with duplication on, but without checksumming. Even with
inline extents on, the checksums seem to cause most of the metadata
related syncing (they are stored in the btree):
http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-dup-nosum.png
It is worth noting that with checksumming on, I go through async
kthreads to do the checksumming and they may be reordering the IO a bit
as they submit things. So, I'm not 100% sure the extra seeks aren't
coming from my async code.

And Ext4:
http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/ext4-writeback.png
This benchmark has questionable real world value, but since it includes
a number of smallish files it is a good place to look at the cost of
metadata and metadata dup.

I'll push the btrfs related changes for this out tonight after some
stress testing.

-chris
--
What sort of script are you using? Basically something like this?
for i in `seq 1 30`; do
    mkdir $i; cd $i
    tar xjf /usr/src/linux-2.6.28.tar.bz2
    cd ..
done

- Ted
--
Similar. I used compilebench -i 30 -r 0, which means create 30 initial
kernel trees and then do nothing. compilebench simulates compiles by
writing to the FS files of the same size that you would get by creating
kernel trees or compiling them.

The idea is to get all of the IO without needing to keep 2.6.28.tar.bz2
in cache or the compiler using up CPU.

http://www.oracle.com/~mason/compilebench

-chris
--
Whoops the link above is wrong, try:
http://oss.oracle.com/~mason/compilebench
It is worth noting that the end throughput doesn't matter quite as much
as the writeback pattern. Ext4 is pretty solid on this test, with very
consistent results.

-chris
--
There were two reasons why I wanted to play with compilebench. The
first is we have a fragmentation problem with delayed allocation and
small files getting forced out due to memory pressure, that we've been
working on for the past week. My intuition (which has proven to be
correct) is that compilebench is a great tool to show it off. It may
not matter so much for write throughput results, since usually the
separation distance between the first block and the rest of the file
is small, and the write elevator takes care of it, but in the long run
this kind of allocation pattern is no good:

Inode 221280: (0):887097, (1):882497
Inode 221282: (0):887098, (1-2):882498-882499
Inode 221284: (0):887099, (1):882500

The other reason why I was interested in playing with compilebench
tool is that I wanted to try tweaking the commit timers to see if this
would make a difference to the result. Not for this benchmark, it
appears, given a quick test that I did last night.

- Ted
--
Have you tried this one:
http://article.gmane.org/gmane.linux.file-systems/25560
This bug should cause fragmentation on small files getting forced out
due to memory pressure in ext4. But, I wasn't able to really
demonstrate it with ext4 on my machine.

-chris
--
I've been able to use compilebench to see the fragmentation problem
very easily.

Aneesh has been working on it, and has some fixes that he queued up.
I'll have to point him at your proposed fix, thanks. This is what he
came up with in the common code. What do you think?

- Ted

(From Aneesh, on the linux-ext4 list.)
As I explained in my previous patch the problem is due to pdflush
background_writeout. Now when pdflush does the writeout we may
have only few pages for the file and we would attempt
to write them to disk. So my attempt in the last patch was to
do the below

a) When allocating blocks, try to be close to the goal block specified
b) When we call ext4_da_writepages make sure we have minimal nr_to_write
that ensures we allocate all dirty buffer_heads in a single go.
nr_to_write is set to 1024 in pdflush background_writeout and that
would mean we may end up calling some inodes writepages() with really
small values even though we have more dirty buffer_heads.

What it doesn't handle is
1) File A has 4 dirty buffer_heads.
2) pdflush tries to write them. We get 4 contig blocks
3) File A now has 5 new dirty buffer_heads
4) File B now has 6 dirty buffer_heads
5) pdflush tries to write the 6 dirty buffer_heads of file B and allocates
them next to the earlier file A blocks
6) pdflush tries to write the 5 dirty buffer_heads of file A and allocates
them after the file B blocks, resulting in discontinuity.

I am right now testing the below patch, which makes sure new dirty inodes
are added to the tail of the dirty inode list

commit 6ad9d25595aea8efa0d45c0a2dd28b4a415e34e6
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Fri Aug 15 23:19:15 2008 +0530

move the dirty inodes to the end of the list

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 25adfc3..91f3c54 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -163,7 +163,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
if (!was_dirty) {
inode-...
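
The six-step scenario in the quoted mail can be reproduced with a toy model. The Python sketch below is an illustration only: it assumes a single allocation cursor that hands out blocks strictly in writeout order, which is far simpler than the real ext4 allocator, and the `simulate` and `is_fragmented` names are hypothetical.

```python
def simulate(writeouts):
    """Allocate blocks for each (file, nr_dirty_buffers) writeout in order.

    Returns {file: [(first_block, last_block), ...]} using one shared
    cursor, so each writeout lands right after the previous one.
    """
    next_block = 0
    extents = {}
    for name, count in writeouts:
        extents.setdefault(name, []).append((next_block, next_block + count - 1))
        next_block += count
    return extents


def is_fragmented(file_extents):
    """A file is contiguous only if each extent starts right after the last."""
    return any(b[0] != a[1] + 1 for a, b in zip(file_extents, file_extents[1:]))


# Writeout order from the scenario: A (4 buffers), then B (6), then A (5).
layout = simulate([("A", 4), ("B", 6), ("A", 5)])
print(layout["A"], is_fragmented(layout["A"]))

# If both of A's writeouts had happened before B's, A would stay contiguous.
better = simulate([("A", 4), ("A", 5), ("B", 6)])
print(better["A"], is_fragmented(better["A"]))
```

In the first ordering file B's blocks sit between the two halves of file A, which is the discontinuity the mail describes.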
It sounds like ext4 would show the writeback_index bug with
fragmentation on disk and btrfs would show it with seeks during the
benchmark. I was only watching the throughput numbers and not looking

pdflush and delalloc and raid stripe alignment and lots of other things
don't play well together. In general, I think we need one or more
pdflush threads per mounted FS so that write_cache_pages doesn't have to
bail out every time it hits congestion.

The current write_cache_pages code even misses easy chances to create
bigger bios just because a block device is congested when called by
background_writeout()

But I would hope we can deal with a single threaded small file workload
Looks like everyone who walks sb->s_io or s_dirty walks it backwards.
This should make the newly dirtied inode the first one to be processed,
which probably isn't what we want. I could be reading it backwards of
course ;)

-chris
--
I tried just the writeback_index patch and got only 4 fragmented files
on ext4 after a compilebench run. Then I tried again and got 1200.
Seems there is something timing dependent in here ;)

By default compilebench uses 256k buffers for writing (see compilebench
-b) and btrfs_file_write will lock down up to 512 pages at a time during
a single write. This means that for most small files, compilebench will
send the whole file down in one write() and btrfs_file_write will lock
down pages for the entire write() call while working on it.

So, even if pdflush tries to jump in and do the wrong thing, the pages
will be locked by btrfs_file_write and pdflush will end up skipping
them.

With the generic file write routines, pages are locked one at a time,
giving pdflush more windows to trigger delalloc while a write is still
ongoing.

-chris
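
The sizes mentioned above can be checked with a little arithmetic. This Python sketch assumes a 4 KiB page size (not stated in the mail); the constant and function names are hypothetical and exist only for the illustration.

```python
PAGE_SIZE = 4096               # assumption: 4 KiB pages
COMPILEBENCH_BUF = 256 * 1024  # compilebench default write buffer (256k)
MAX_LOCKED_PAGES = 512         # pages btrfs_file_write locks per write


def pages_needed(nbytes):
    """Number of pages needed to hold nbytes, rounding up."""
    return -(-nbytes // PAGE_SIZE)


def covered_by_one_lock(file_size):
    """True if the file goes down in one write() and that whole write
    fits inside the window of pages locked at once."""
    single_write = file_size <= COMPILEBENCH_BUF
    return single_write and pages_needed(file_size) <= MAX_LOCKED_PAGES


print(pages_needed(COMPILEBENCH_BUF))          # pages in one full buffer
print(MAX_LOCKED_PAGES * PAGE_SIZE // 1024)    # locked window in KiB
print(covered_by_one_lock(16 * 1024))          # a typical small file
```

Under the 4 KiB assumption a full 256k buffer is 64 pages, well inside the 512-page window, so any file that fits in one buffer is locked for the whole write.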
--
Yeah, the patch Aneesh sent to change where we added the inode to the
dirty list was a false lead. The right fix is in the ext4 patch queue
now. I think we have the problem licked and a quick test showed it
increased the compilebench MB/s by a very tiny amount (enough so that
I wasn't sure whether or not it was measurement error), but it does
avoid the needless fragmentation.

- Ted
--
But without duplication they are basically free here at least
in IO rate. Seems odd?

Does it compute them twice in the duplication case perhaps?

-Andi
--
Looks like I can get the btrfs defaults up to 64MB/s with some writeback

The async worker threads should be spreading the load across CPUs pretty
well, and even a single CPU could keep up with 100MB/s checksumming.
But, the async worker threads do randomize the IO somewhat because the
IO goes from pdflush -> one worker thread per CPU -> submit_bio. So,
maybe that 3rd thread is more than the drive can handle?

btrfsck tells me the total size of the btree is only 20MB larger with

The duplication happens lower down in the stack, they only get done
once.

-chris
--
Ok was just speculation. The big difference still seems odd.
-Andi
--
It was a very confusing use of the word thread. I have the same number
of kernel threads running, but the single spindle on the drive has to
deal with 3 different streams of writes. The seeks/sec portion of the
graph shows a big enough increase in seeks on the duplication run to

It does, I'll give the test a shot on other hardware too. To be honest
I'm pretty happy at matching ext4 with duplication on. The graph shows
even writeback and the times from each iteration are fairly consistent.

Ext3 and XFS score somewhere between 10-15MB/s on the same test...
-chris
--
Interesting (and cool animations).
We tried compilebench (-i 30 -r 0) just for fun using kernel 2.6.26,
freshly formatted partition, with defaults. Results:

            MB/s    Runtime (s)
            -----   -----------
ext3        13.24       877
btrfs       12.33       793
ntfs-3g      8.55       865
reiserfs     8.38       966
xfs          1.88      3901

Regards,
Szaka

--
NTFS-3G: http://ntfs-3g.org
--
Thanks for running things.
The code in the btrfs-unstable tree has all my performance fixes.
You'll need it to get good results. Also, the MB/s number doesn't
include the time to run sync at the end, which is probably why the
runtime for btrfs is shorter but MB/s is lower.

-chris
--
The numbers are indeed much better:
                  MB/s    Runtime (s)
                  -----   -----------
btrfs-unstable    17.09       572

The disk is capable of 40+ MB/s however the test partition was one of the
last ones and as I figured it out now, it can do only 26 MB/sec. Btrfs bulk
write easily sustains it. The write speed was 21 MB/s during the benchmark,
so btrfs is the closest to the possible best write speed in the test
environment.

Szaka

--
NTFS-3G: http://ntfs-3g.org
--
Thanks for the explanation and the numbers. I see that's the advantage of
copy-on-write that you can actually always cluster the metadata together and
get always batched IO this way and then afford to do more of it.

Still wondering what that will do to read seekiness.
-Andi
--
In theory, if the elevator was smart enough, it could actually help
read seekiness; there are two copies of the metadata, and it shouldn't
matter which one is fetched. So I could imagine a (hypothetical) read
request which says, "please give me the contents of block 4500 or
75000000 --- I don't care which, if the disk head is closer to one end
of the disk or another, use whichever one is most convenient". Our
elevator algorithms are currently totally unable to deal with this
sort of request, and if SSD's are going to be coming on line as
quickly as some people are claiming, maybe it's not worth it to try to
implement that kind of thing, but at least in theory it's something
that could be done....

- Ted
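
The hypothetical "either copy will do" read request can be sketched in a few lines of Python. This is purely illustrative: `pick_copy` is an invented name, and it assumes that numerically closer block numbers mean a shorter seek, which may not hold on RAID arrays or drives that remap blocks.

```python
def pick_copy(head_pos, copies):
    """Choose whichever copy of a block is numerically closest to the
    current head position (assumed to approximate seek distance)."""
    return min(copies, key=lambda block: abs(block - head_pos))


# The two copies from the example request above.
copies = [4500, 75000000]

print(pick_copy(100, copies))       # head near the start of the disk
print(pick_copy(70000000, copies))  # head near the far end of the disk
```

An elevator with this kind of request would pick block 4500 in the first case and block 75000000 in the second.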
--
That assumes the elevator actually knows what is nearby? I thought
that wasn't that easy with modern disks with multiple spindles
and invisible remapping, not even talking about RAID
arrays looking like disks.

-Andi
--
RAID is the big problem, yeah. In general, though, we are already
making an assumption in the elevator code and in filesystem code that
block numbers which are numerically closer together are "close" from
the perspective of disks. There has been talk about trying to make
filesystems smarter about allocating blocks by giving them visibility
to the RAID parameters; in theory the elevator algorithm could also be
made smarter as well using the same information. I'm really not sure
if the complexity is worth it, though....

- Ted
--
