| From | Subject | Date |
|---|---|---|
| Dmitry Monakhov | [PATCH] ext4: random performance optimizations for ext4_ ...
If quota is not enabled, it is not necessary to start a separate
transaction for uid, gid and quota credit changes.
If the inode wasn't added to the orphan list, then it is not necessary
to remove it from the list. This avoids locking the
per-sb s_orphan_lock mutex.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/inode.c | 15 ++++++++++++---
1 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3996151..b498274 ...
| Apr 8, 1:29 am 2010 |
| Eric Sandeen | Re: [PATCH] ext4: random performance optimizations for e ...
lame spell-check review ;)
"uid, gid, and ..."
--
| Apr 8, 8:33 am 2010 |
| minskey | binary search in ext4_ext_binsearch_idx()
In the ext4_ext_binsearch_idx() routine, there is the following code:
    while (l <= r) {
        m = l + (r - l) / 2;
        if (block < le32_to_cpu(m->ei_block))
            r = m - 1;
        else
            l = m + 1;
    }
    path->p_idx = l - 1;
The code always runs log N iterations to find the expected extent.
If a 3-way comparison is used, we can save some iterations when
(block == le32_to_cpu(m->ei_block)), e.g. at the first iteration.
some ...
| Apr 7, 8:26 pm 2010 |
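The 3-way comparison suggested above can be sketched in user-space C as follows. This is only a model under stated assumptions: `struct idx_entry`, `search_two_way()` and `search_three_way()` are hypothetical names, keys are host-endian (so `le32_to_cpu()` is omitted), keys are sorted and unique, and the search starts at the second entry so that the first entry is the fallback result. The iteration counter exists only to compare the two variants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an index entry; only the search key is modeled. */
struct idx_entry {
    uint32_t ei_block;
};

/* Two-way search with the same loop shape as the quoted code. */
static const struct idx_entry *
search_two_way(const struct idx_entry *first, size_t n, uint32_t block,
               unsigned *iters)
{
    assert(n > 0);
    const struct idx_entry *l = first + 1, *r = first + n - 1, *m;

    *iters = 0;
    while (l <= r) {
        m = l + (r - l) / 2;
        (*iters)++;
        if (block < m->ei_block)
            r = m - 1;
        else
            l = m + 1;
    }
    return l - 1;   /* last entry with ei_block <= block */
}

/* Three-way variant: stop as soon as an exact key match is found. */
static const struct idx_entry *
search_three_way(const struct idx_entry *first, size_t n, uint32_t block,
                 unsigned *iters)
{
    assert(n > 0);
    const struct idx_entry *l = first + 1, *r = first + n - 1, *m;

    *iters = 0;
    while (l <= r) {
        m = l + (r - l) / 2;
        (*iters)++;
        if (block < m->ei_block)
            r = m - 1;
        else if (block > m->ei_block)
            l = m + 1;
        else
            return m;   /* exact match: keys are unique, so m is the answer */
    }
    return l - 1;
}
```

Both variants return the same entry; the 3-way version visits the same midpoints and only stops earlier on an exact match. Whether that pays off depends on how often lookups hit a key exactly, since every non-matching iteration now costs one extra comparison.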
| Eric Sandeen | Re: "data=writeback" and TRIM don't get along
Surely a bug. :) If you can provide details we'll look into it.
(perhaps it's obvious on first try but still worth saying exactly
what problematic behavior you saw, when reporting a bug you
encountered)
Thanks,
-Eric
--
| Apr 7, 6:22 pm 2010 |
| Eric Sandeen | Re: "data=writeback" and TRIM don't get along
Something like this probably works, but I really REALLY would not test
it on an important filesystem. :)
I'm not sure it's a good idea to discard it before returning it
to the prealloc pool, because it may well get re-used again
quickly.... not sure if that's helpful.
Just a note, I think eventually we may move to more of a batch discard
in the background, because these little discards are actually quite
inefficient on the hardware we've tested so far.
-Eric
p.s. really. Don't test ...
| Apr 7, 9:37 pm 2010 |
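The "batch discard" idea mentioned above can be illustrated with a small user-space sketch: instead of issuing one discard per freed extent, freed extents are collected, sorted, and merged so that fewer and larger ranges are handed to the device. This is not how ext4 implements anything; `struct extent` and `coalesce_extents()` are hypothetical names, and the actual discard call is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical freed range: starting block and length in blocks. */
struct extent {
    uint64_t start;
    uint64_t len;
};

static int cmp_start(const void *a, const void *b)
{
    const struct extent *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/*
 * Sort the extents in place and merge ranges that touch or overlap.
 * Returns the number of merged ranges left at the front of v[].
 */
static size_t coalesce_extents(struct extent *v, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    assert(v != NULL);
    qsort(v, n, sizeof(*v), cmp_start);

    for (size_t i = 1; i < n; i++) {
        uint64_t end = v[out].start + v[out].len;

        if (v[i].start <= end) {
            /* Adjacent or overlapping: extend the current range. */
            uint64_t new_end = v[i].start + v[i].len;
            if (new_end > end)
                v[out].len = new_end - v[out].start;
        } else {
            v[++out] = v[i];   /* gap: start a new range */
        }
    }
    return out + 1;
}
```

After the call, a caller would issue one discard per remaining range. The trade-off raised in the thread still applies: blocks that are reused quickly may not be worth discarding at all.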
| Nebojsa Trpkovic | "data=writeback" and TRIM don't get along
Hello.
The TRIM command issued to the SSD doesn't work with these mount options:
============================
rootfs / rootfs rw 0 0
/dev/root / ext4 rw,noatime,commit=100,barrier=0,nobh,stripe=128,data=writeback,discard 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
rc-svcdir /lib64/rc/init.d tmpfs rw,nosuid,nodev,noexec,relatime,size=1024k,mode=755 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
udev /dev tmpfs rw,nosuid,relatime,size=10240k,mode=755 0 0
devpts /dev/pts devpts ...
| Apr 7, 5:50 pm 2010 |
| Eric Sandeen | Re: "data=writeback" and TRIM don't get along
I think the answer is "not yet but it's being worked on"
--
| Apr 8, 9:34 am 2010 |
| Eric Sandeen | Re: "data=writeback" and TRIM don't get along
Well, you might just keep in mind that:
1) trimming these small amounts has actually looked very inefficient, and
2) data=writeback really isn't very safe in the face of a crash or power loss, and
3) hopefully we'll have a better trim solution eventually.
-Eric
--
| Apr 8, 8:32 am 2010 |
| Nebojsa Trpkovic | Re: "data=writeback" and TRIM don't get along
1) I understand that big TRIMs are better than small ones, but skipping
some TRIMs completely would lead to slow but sure drive degradation, as
the drive would have less and less spare space for wear leveling.
2) Yes, I'm aware of possible data=writeback inconsistency, but I've
tried to let the IO scheduler merge and reorganize as many writes as it
can, all to avoid small writes to the SSD, which are the main cause of write
amplification.
3) I'll stick with no data=writeback for the time being. I guess ...
| Apr 8, 9:21 am 2010 |
| Nebojsa Trpkovic | Re: "data=writeback" and TRIM don't get along
Well, I've done a simple test, described like:
"get the used sectors for a file
hdparm --fibmap filename
read a sector from the file eg. with
sudo hdparm --read-sector 66385920 /dev/sda
delete the file and sync
rm filename;sync
and read the sector a second time"
And I get something like this:
================================
# dd if=/dev/urandom of=tempfile count=100 bs=512k oflag=direct
100+0 records in
100+0 records out
52428800 bytes (52 MB) copied, 6.47137 s, 8.1 MB/s
# ...
| Apr 7, 6:37 pm 2010 |
| Eric Sandeen | Re: "data=writeback" and TRIM don't get along
Ok, thanks, perfect test & explanation.
Well the good news is, at least it's nothing like discarding
the wrong block. :)
Long explanation:
in ext4_free_blocks():
/*
* We need to make sure we don't reuse the freed block until
* after the transaction is committed, which we can do by
* treating the block as metadata, below. We make an
* exception if the inode is to be written in writeback mode
* since writeback mode has weak data ...
| Apr 7, 9:10 pm 2010 |
| Eric Sandeen | Re: "data=writeback" and TRIM don't get along
(now I'm really talking to myself, but scratch that bit -
ext4_mb_return_to_preallocation is pretty much a no-op)
--
| Apr 7, 9:47 pm 2010 |
| Dmitry Monakhov | Re: "data=writeback" and TRIM don't get along
Can you please provide the actual firmware version?
As far as I know, X25 zeroing was disabled.
Can you please post the output of your queue flags?
--
| Apr 8, 12:17 am 2010 |
| Nebojsa Trpkovic | Re: "data=writeback" and TRIM don't get along
======================
cat /sys/block/sda/queue/discard_zeroes_data
1
cat /sys/block/sda/queue/discard_granularity
512
======================
Short history (you're probably talking about Intel's firmware problems):
- Intel releases X25-M (80 and 160GB) G1. G1 doesn't support TRIM.
- Intel releases X25-M (80 and 160GB) G2. G2's initial firmware doesn't
support TRIM.
- Intel makes TRIM-enabled firmware for G2 and publishes it.
- Customers hammer some of their G2 SSDs by flashing new ...
| Apr 8, 4:47 am 2010 |
| Nebojsa Trpkovic | Re: "data=writeback" and TRIM don't get along
Well, to be honest, I'm not some programmer guy, so I doubt my skills
can be of any help here.
Second, unfortunately, my SSD is now my root partition (just one big
sda1), so I cannot experiment with it too much.
I'm not sure I understood you well about this prealloc pool - re-using
mechanism, but...
AFAIK, modern SSDs are using very aggressive wear-leveling algorithms.
Writing two times into the same filesystem sector almost never goes to
the same hardware sector. Therefore, saving ...
| Apr 8, 4:48 am 2010 |
| tytso | Re: ext4 dbench performance with CONFIG_PREEMPT_RT
Hmm.... I've taken a very close look at jbd2_journal_stop(), and I
don't think we need to take j_state_lock() at all except if we need to
call jbd2_log_start_commit(). t_outstanding_credits,
h_buffer_credits, and t_updates are all documented (and verified by
me) to be protected by the t_handle_lock spinlock.
So I ***think*** the following might be safe. WARNING! WARNING!! No
real testing done on this patch, other than "it compiles! ship it!!".
I'll let other people review it, and ...
| Apr 7, 8:46 pm 2010 |
| Theodore Tso | Re: ext4 dbench performance with CONFIG_PREEMPT_RT
BTW, it might be possible to remove the need to take t_handle_lock by converting t_outstanding_credits and t_updates to be atomic_t's, but that might have other performance impacts for other cases. This patch shouldn't cause any performance regressions because we're just removing code. As I said, I'm pretty sure it's safe but it could use more review and I should look at it again with fresh eyes, but in the meantime, it would be great if you could let us know what sort of results you get with ...
| Apr 8, 3:18 am 2010 |
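The idea of turning t_updates and t_outstanding_credits into atomic counters can be modeled in user-space C11 roughly as below. This is a sketch only, not jbd2 code: `struct txn_model`, `handle_start()` and `handle_stop()` are hypothetical, and C11 `atomic_int` stands in for the kernel's `atomic_t`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the two per-transaction counters. */
struct txn_model {
    atomic_int t_updates;              /* handles currently attached */
    atomic_int t_outstanding_credits;  /* credits reserved by handles */
};

static void txn_init(struct txn_model *t)
{
    atomic_init(&t->t_updates, 0);
    atomic_init(&t->t_outstanding_credits, 0);
}

/* Attach a handle and reserve its credits, without taking a lock. */
static void handle_start(struct txn_model *t, int credits)
{
    assert(credits >= 0);
    atomic_fetch_add(&t->t_outstanding_credits, credits);
    atomic_fetch_add(&t->t_updates, 1);
}

/*
 * Detach a handle and give back the credits it did not use.
 * Returns 1 if this was the last attached handle, else 0.
 */
static int handle_stop(struct txn_model *t, int unused_credits)
{
    assert(unused_credits >= 0);
    atomic_fetch_sub(&t->t_outstanding_credits, unused_credits);
    /* fetch_sub returns the previous value, so 1 means "now zero". */
    return atomic_fetch_sub(&t->t_updates, 1) == 1;
}
```

Each counter is updated atomically on its own, but the pair is no longer updated as one unit the way it would be under a single spinlock. Any code that needs a consistent view of both values would have to be checked separately, which is one reason such a conversion needs review.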
| john stultz | Re: ext4 dbench performance with CONFIG_PREEMPT_RT
So this patch seems to match the performance and has similar perf log
output to what I was getting with my hack.
Very very cool!
I'll continue to play with your patch and see if I can con some
folks with more interesting storage setups to do some testing as well.
Any thoughts for ways to rework the state_lock in start_this_handle?
(Now that it's at the top of the contention logs? :)
thanks so much!
-john
--
| Apr 8, 1:41 pm 2010 |
| tytso | Re: ext4 dbench performance with CONFIG_PREEMPT_RT
You might want to ask djwong to play with it with his nice big
machine. (We don't need a big file system, but we want as many CPU's
as possible, and to use his "mailserver" workload to really stress the
journal. I'd recommend using barrier=0 for additional journal
lock-level stress testing, and then try some forced sysrq-b reboots
and then make sure that the filesystem is consistent after the journal
replay.)
I've since done basic two-CPU testing using xfstests under KVM, but
That's ...
| Apr 8, 2:10 pm 2010 |
| Mingming Cao | Re: ext4 dbench performance with CONFIG_PREEMPT_RT
Seems so. I verified the code, and it looks like we could drop the
j_state_lock() there.
Also, I wonder if we could make journal->j_average_commit_time
atomic, so we could drop the j_state_lock() further in jbd2_journal_stop()?
Not sure how much this will improve the rt kernel, but it might be worth
doing since j_state_lock() seems to be the hottest one.
--
| Apr 8, 3:37 pm 2010 |
| Eric Sandeen | Re: [PATCH 0/3] ext4: don't use quota reservation for sp ...
Hm, if these start returning EIO then maybe my patch should be modified
to treat EDQUOT differently than EIO ... assuming callers can handle
the return at all.
In other words, make NOFAIL really just mean "don't fail for EDQUOT"
-Eric
--
| Apr 8, 8:28 am 2010 |
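The suggestion that NOFAIL should only mean "don't fail for EDQUOT" can be sketched as follows. This is a toy user-space model, not the quota code: `struct quota_model`, `charge()` and `charge_nofail()` are hypothetical, and an I/O failure is simulated with a flag.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical quota state; io_broken simulates a failing quota file. */
struct quota_model {
    long used;
    long limit;
    int io_broken;
};

/* Normal charge: may fail with -EIO or -EDQUOT, charges nothing on error. */
static int charge(struct quota_model *q, long n)
{
    assert(n >= 0);
    if (q->io_broken)
        return -EIO;
    if (q->used + n > q->limit)
        return -EDQUOT;
    q->used += n;
    return 0;
}

/*
 * "Nofail" charge: exceeding the limit is tolerated and the space is
 * charged anyway, but a real I/O error is still reported to the caller.
 */
static int charge_nofail(struct quota_model *q, long n)
{
    int err = charge(q, n);

    if (err == -EDQUOT) {
        q->used += n;
        return 0;
    }
    return err;   /* 0 or -EIO */
}
```

As the thread notes, this only helps if callers can actually handle the remaining -EIO return.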
| Dmitry Monakhov | Re: [PATCH 0/3] ext4: don't use quota reservation for sp ...
Hm.. I totally agree with the issue description, and it seems there is no
other solution except yours.
AFAIU alloc_nofail is called from places where it is impossible to fail
an allocation even if something goes wrong.
I ask because I'm currently working on EIO handling in alloc/free calls.
I've found that it is useless to fail claim/free procedures because
the caller is unable to handle it properly.
It is impossible to fail the following operation
->writepage
->dquot_claim_space (what to do if EIO ...
| Apr 8, 1:20 am 2010 |
| Theodore Ts'o | [PATCH] ext4: don't scan/accumulate more pages than mbal ...
From: Eric Sandeen <sandeen@redhat.com>
There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.
At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amt; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) ...
| Apr 7, 7:10 pm 2010 |
| Eric Sandeen | Re: [PATCH] ext4: don't scan/accumulate more pages than ...
Seems fine, thanks.
--
| Apr 7, 7:31 pm 2010 |
| tytso | Re: [PATCH] ext4: stop issuing discards if not supported ...
Added to the ext4 patch queue, thanks.
- Ted
--
| Apr 7, 5:58 pm 2010 |
