linux-ext4 mailing list

From / Subject / Date
Dmitry Monakhov
[PATCH] ext4: random performance optimizations for ext4_ ...
If quota is not enabled, it is not necessary to start a separate transaction for uid, gid and quota credits changes. If the inode wasn't added to the orphan list, then it is not necessary to remove it from the list. This avoids locking the per-sb s_orphan_lock mutex. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext4/inode.c | 15 ++++++++++++--- 1 files changed, 12 insertions(+), 3 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 3996151..b498274 ...
Apr 8, 1:29 am 2010
Eric Sandeen
Re: [PATCH] ext4: random performance optimizations for e ...
lame spell-check review ;) "uid, gid, and ..." --
Apr 8, 8:33 am 2010
minskey
binary search in ext4_ext_binsearch_idx()
In the ext4_ext_binsearch_idx() routine, there is the following code: while (l <= r) { m = l + (r - l) / 2; if (block < le32_to_cpu(m->ei_block)) r = m - 1; else l = m + 1; } path->p_idx = l - 1; The code always runs log N iterations to get the expected extent. If a 3-way comparison is used, we can save some iterations when (block == le32_to_cpu(m->ei_block)) at the first iteration. some ...
Apr 7, 8:26 pm 2010
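The two search strategies discussed in the message above can be compared with a small userspace sketch. This is not the kernel code: the index entries are modeled as a plain array of starting block numbers, and the names find_idx_2way(), find_idx_3way(), starts and n are hypothetical. It assumes the array is strictly increasing and that starts[0] <= block, so entry 0 is always a valid answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Two-way search, same shape as the loop quoted in the message:
 * returns the index of the last entry whose start is <= block.
 * It never exits early, so it always runs about log2(n) iterations.
 */
size_t find_idx_2way(const uint32_t *starts, size_t n, uint32_t block)
{
    assert(n > 0 && starts[0] <= block);   /* sketch-only precondition */
    size_t l = 1, r = n - 1;               /* entry 0 is the fallback */

    while (l <= r) {
        size_t m = l + (r - l) / 2;
        if (block < starts[m])
            r = m - 1;                     /* m >= 1, so no underflow */
        else
            l = m + 1;
    }
    return l - 1;
}

/*
 * Three-way variant: identical, except that an exact match on
 * starts[m] returns immediately instead of narrowing further.
 */
size_t find_idx_3way(const uint32_t *starts, size_t n, uint32_t block)
{
    assert(n > 0 && starts[0] <= block);   /* sketch-only precondition */
    size_t l = 1, r = n - 1;

    while (l <= r) {
        size_t m = l + (r - l) / 2;
        if (block < starts[m])
            r = m - 1;
        else if (block > starts[m])
            l = m + 1;
        else
            return m;                      /* exact hit: stop early */
    }
    return l - 1;
}
```

Both functions return the same index for every input that satisfies the precondition. The three-way form only saves iterations when the looked-up block is exactly the start of an entry, and it pays for one extra comparison per iteration otherwise, so whether it is a net win depends on the workload.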
Eric Sandeen
Re: "data=writeback" and TRIM don't get along
Surely a bug. :) If you can provide details we'll look into it. (perhaps it's obvious on first try but still worth saying exactly what problematic behavior you saw, when reporting a bug you encountered) Thanks, -Eric --
Apr 7, 6:22 pm 2010
Eric Sandeen
Re: "data=writeback" and TRIM don't get along
Something like this probably works, but I really REALLY would not test it on an important filesystem. :) I'm not sure it's a good idea to discard it before returning it to the prealloc pool, because it may well get re-used again quickly.... not sure if that's helpful. Just a note, I think eventually we may move to more of a batch discard in the background, because these little discards are actually quite inefficient on the hardware we've tested so far. -Eric p.s. really. Don't test ...
Apr 7, 9:37 pm 2010
Nebojsa Trpkovic
"data=writeback" and TRIM don't get along
Hello. The TRIM command issued to the SSD doesn't work with these mount options: ============================ rootfs / rootfs rw 0 0 /dev/root / ext4 rw,noatime,commit=100,barrier=0,nobh,stripe=128,data=writeback,discard 0 0 proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0 rc-svcdir /lib64/rc/init.d tmpfs rw,nosuid,nodev,noexec,relatime,size=1024k,mode=755 0 0 sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0 udev /dev tmpfs rw,nosuid,relatime,size=10240k,mode=755 0 0 devpts /dev/pts devpts ...
Apr 7, 5:50 pm 2010
Eric Sandeen
Re: "data=writeback" and TRIM don't get along
I think the answer is "not yet but it's being worked on" --
Apr 8, 9:34 am 2010
Eric Sandeen
Re: "data=writeback" and TRIM don't get along
Well, you might just keep in mind that: 1) trimming these small amounts has actually looked very inefficient, and 2) data=writeback really isn't very safe in the face of a crash or power loss, and 3) hopefully we'll have a better trim solution eventually. -Eric --
Apr 8, 8:32 am 2010
Nebojsa Trpkovic
Re: "data=writeback" and TRIM don't get along
1) I understand that big TRIMs are better than small ones, but skipping some TRIMs completely would lead to slow but sure drive degradation, as the drive would have less and less spare space for wear leveling. 2) Yes, I'm aware of possible data=writeback inconsistency, but I've tried to let the I/O scheduler merge and reorganize as many writes as it can, all to avoid small writes to the SSD, which are the main cause of write amplification. 3) I'll stick with no data=writeback for the time being. I guess ...
Apr 8, 9:21 am 2010
Nebojsa Trpkovic
Re: "data=writeback" and TRIM don't get along
Well, I've done a simple test, described like: "get the used sectors for a file hdparm --fibmap filename read a sector from the file eg. with sudo hdparm --read-sector 66385920 /dev/sda delete the file and sync rm filename;sync and read the sector a second time" And I get something like this: ================================ # dd if=/dev/urandom of=tempfile count=100 bs=512k oflag=direct 100+0 records in 100+0 records out 52428800 bytes (52 MB) copied, 6.47137 s, 8.1 MB/s # ...
Apr 7, 6:37 pm 2010
Eric Sandeen
Re: "data=writeback" and TRIM don't get along
Ok, thanks, perfect test & explanation. Well the good news is, at least it's nothing like discarding the wrong block. :) Long explanation: in ext4_free_blocks(): /* * We need to make sure we don't reuse the freed block until * after the transaction is committed, which we can do by * treating the block as metadata, below. We make an * exception if the inode is to be written in writeback mode * since writeback mode has weak data ...
Apr 7, 9:10 pm 2010
Eric Sandeen
Re: "data=writeback" and TRIM don't get along
(now I'm really talking to myself, but scratch that bit - ext4_mb_return_to_preallocation is pretty much a no-op) --
Apr 7, 9:47 pm 2010
Dmitry Monakhov
Re: "data=writeback" and TRIM don't get along
Can you please provide the actual firmware version? As far as I know, X25 zeroing was disabled. Can you please post the output of your queue flags? --
Apr 8, 12:17 am 2010
Nebojsa Trpkovic
Re: "data=writeback" and TRIM don't get along
====================== cat /sys/block/sda/queue/discard_zeroes_data 1 cat /sys/block/sda/queue/discard_granularity 512 ====================== Short history (you're probably talking about Intel's firmware problems): - Intel releases X25-M (80 and 160GB) G1. G1 doesn't support TRIM. - Intel releases X25-M (80 and 160GB) G2. G2's initial firmware doesn't support TRIM. - Intel makes TRIM-enabled firmware for G2 and publishes it. - Customers hammer some of their G2 SSDs by flashing new ...
Apr 8, 4:47 am 2010
Nebojsa Trpkovic
Re: "data=writeback" and TRIM don't get along
Well, to be honest, I'm not some programmer guy, so I doubt my skills can be of any help here. Second, unfortunately, my SSD is now my root partition (just one big sda1), so I cannot experiment with it too much. I'm not sure I understood you well about this prealloc pool - re-using mechanism, but... AFAIK, modern SSDs are using very aggressive wear-leveling algorithms. Writing two times into the same filesystem sector almost never goes to the same hardware sector. Therefore, saving ...
Apr 8, 4:48 am 2010
tytso
Re: ext4 dbench performance with CONFIG_PREEMPT_RT
Hmm.... I've taken a very close look at jbd2_journal_stop(), and I don't think we need to take j_state_lock() at all except if we need to call jbd2_log_start_commit(). t_outstanding_credits, h_buffer_credits, and t_updates are all documented (and verified by me) to be protected by the t_handle_lock spinlock. So I ***think*** the following might be safe. WARNING! WARNING!! No real testing done on this patch, other than "it compiles! ship it!!". I'll let other people review it, and ...
Apr 7, 8:46 pm 2010
Theodore Tso
Re: ext4 dbench performance with CONFIG_PREEMPT_RT
BTW, it might be possible to remove the need to take t_handle_lock by converting t_outstanding_credits and t_updates to be atomic_t's, but that might have other performance impacts for other cases. This patch shouldn't cause any performance regressions because we're just removing code. As I said, I'm pretty sure it's safe but it could use more review and I should look at it again with fresh eyes, but in the meantime, it would be great if you could let us know what sort of results you get with ...
Apr 8, 3:18 am 2010
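The idea floated in the message above, replacing lock-protected counters with atomic ones, can be illustrated with a userspace analogy. This is only a sketch using C11 <stdatomic.h>, not jbd2 code and not the kernel's atomic_t API. The struct txn_counters and the functions txn_init(), txn_handle_start() and txn_handle_stop() are hypothetical names; the two fields merely stand in for t_outstanding_credits and t_updates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two per-transaction counters. */
struct txn_counters {
    atomic_int outstanding_credits;   /* credits reserved by live handles */
    atomic_int updates;               /* number of live handles */
};

void txn_init(struct txn_counters *t)
{
    atomic_init(&t->outstanding_credits, 0);
    atomic_init(&t->updates, 0);
}

/* A handle joins the transaction and reserves some credits. */
void txn_handle_start(struct txn_counters *t, int credits)
{
    assert(credits >= 0);
    atomic_fetch_add(&t->outstanding_credits, credits);
    atomic_fetch_add(&t->updates, 1);
}

/*
 * A handle leaves the transaction and gives back unused credits.
 * Returns true if this was the last live handle. No lock is taken:
 * each counter is updated with a single atomic read-modify-write.
 */
bool txn_handle_stop(struct txn_counters *t, int unused_credits)
{
    assert(unused_credits >= 0);
    atomic_fetch_sub(&t->outstanding_credits, unused_credits);
    return atomic_fetch_sub(&t->updates, 1) == 1;
}
```

Note what the sketch does not give you: the two counters are each atomic, but they are not updated together as one unit, so a reader can observe one updated and the other not. Any code that relies on seeing both in a consistent state would still need a lock or a different design, which is one reason a conversion like this calls for careful review.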
john stultz
Re: ext4 dbench performance with CONFIG_PREEMPT_RT
So this patch seems to match the performance and has similar perf log output to what I was getting with my hack. Very very cool! I'll continue to play with your patch and see if I can con some folks with more interesting storage setups to do some testing as well. Any thoughts for ways to rework the state_lock in start_this_handle? (Now that it's at the top of the contention logs? :) thanks so much! -john --
Apr 8, 1:41 pm 2010
tytso
Re: ext4 dbench performance with CONFIG_PREEMPT_RT
You might want to ask djwong to play with it with his nice big machine. (We don't need a big file system, but we want as many CPU's as possible, and to use his "mailserver" workload to really stress the journal. I'd recommend using barrier=0 for additional journal lock-level stress testing, and then try some forced sysrq-b reboots and then make sure that the filesystem is consistent after the journal replay.) I've since done basic two-CPU testing using xfstests under KVM, but That's ...
Apr 8, 2:10 pm 2010
Mingming Cao
Re: ext4 dbench performance with CONFIG_PREEMPT_RT
Seems so. I verified the code, and it looks like we could drop the j_state_lock() there. Also, I wonder if we could make journal->j_average_commit_time atomic, so we could drop the j_state_lock() more in jbd2_journal_stop()? Not sure how much this will improve the rt kernel, but it might be worth doing since j_state_lock() seems to be the hottest one. --
Apr 8, 3:37 pm 2010
Eric Sandeen
Re: [PATCH 0/3] ext4: don't use quota reservation for sp ...
Hm, if these start returning EIO then maybe my patch should be modified to treat EDQUOT differently than EIO ... assuming callers can handle the return at all. In other words, make NOFAIL really just mean "don't fail for EDQUOT" -Eric --
Apr 8, 8:28 am 2010
Dmitry Monakhov
Re: [PATCH 0/3] ext4: don't use quota reservation for sp ...
Hm.. Totally agree with the issue description. And it seems there is no other solution except yours. AFAIU alloc_nofail is called from places where it is impossible to fail an allocation even if something goes wrong. I ask because currently I'm working on EIO handling in alloc/free calls. I've found that it is useless to fail claim/free procedures because the caller is unable to handle it properly. It is impossible to fail the following operation ->writepage ->dquot_claim_space (what to do if EIO ...
Apr 8, 1:20 am 2010
Theodore Ts'o
[PATCH] ext4: don't scan/accumulate more pages than mbal ...
From: Eric Sandeen <sandeen@redhat.com> There was a bug reported on RHEL5 that a 10G dd on a 12G box had a very, very slow sync after that. At issue was the loop in write_cache_pages scanning all the way to the end of the 10G file, even though the subsequent call to mpage_da_submit_io would only actually write a smallish amt; then we went back to the write_cache_pages loop ... wasting tons of time in calling __mpage_da_writepage for thousands of pages we would just revisit (many times) ...
Apr 7, 7:10 pm 2010
Eric Sandeen Apr 7, 7:31 pm 2010
tytso
Re: [PATCH] ext4: stop issuing discards if not supported ...
Added to the ext4 patch queue, thanks. - Ted --
Apr 7, 5:58 pm 2010