Nick's vfs-scalability patches ported to 2.6.33-rt

From: john stultz
Date: Thursday, February 25, 2010 - 10:53 pm

Hey Thomas, Nick,
	I wanted to let you know I've just finished forward porting Nick's
patches to 2.6.33-rc8-rt2.  Luckily my forward port of Nick's patches to
2.6.33 applies on top of the -rt tree without any collisions, and I've
added a handful of possibly sketchy fixups to get it working with -rt.

You can find the patchset here:
http://sr71.net/~jstultz/dbench-scalability/patches/2.6.33-rc8-rt2/vfs-scale.33-rt.tar...

Here's a chart showing how much these patches help dbench numbers on
ramfs:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/ramfs-dbench.png

I've not done any serious stress testing with the patchset yet, but
wanted to post it for your review.

Nick: I'd appreciate any feedback on whether any of my forward porting
has gone awry. I'm still very green with respect to the vfs, so I don't
doubt there are some issues hiding here.

Thomas: Let me know if you want to start playing with this in the -rt
tree. I'm not seeing any warnings with the debugging options on, so I
think I squashed all of those issues, but let me know if you manage to
trigger anything.

thanks
-john


--

From: Nick Piggin
Date: Thursday, February 25, 2010 - 11:01 pm

BTW there are a few issues Al pointed out. We have to synchronize RCU
after unregistering a filesystem so d_ops/i_ops doesn't go away, and
mntput can sleep so we can't do it under RCU read lock.

The store-free path walk patches don't really have the required RCU
barriers in them either (which is fine for x86, but would have to be
fixed).
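
As a rough illustration of the first point (wait for readers before the
filesystem's ops can go away), here is a toy, single-threaded userspace
model. It is not kernel code and does not use the real RCU API: every
toy_* name is hypothetical, and the reader counter is only a stand-in for
a grace period. Real RCU waits only for readers that were already running
when the wait began; this sketch conservatively waits for all of them.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a filesystem type whose ops tables readers may use. */
struct toy_fs_type {
	const char *name;
	bool registered;	/* still visible to new lookups */
	bool ops_valid;		/* models d_ops/i_ops memory still being alive */
};

/* Number of readers currently inside a (toy) read-side section. */
static int toy_active_readers;

static void toy_read_lock(void)
{
	toy_active_readers++;
}

static void toy_read_unlock(void)
{
	assert(toy_active_readers > 0);
	toy_active_readers--;
}

/* Toy grace period: done once no reader can still hold a stale pointer. */
static bool toy_grace_period_done(void)
{
	return toy_active_readers == 0;
}

/* Unpublish the filesystem so that new readers can no longer find it. */
static void toy_unregister(struct toy_fs_type *fs)
{
	fs->registered = false;
}

/*
 * Release the ops only after unregistering AND after the grace period.
 * Returns true if the ops were released, false if the caller must wait.
 */
static bool toy_try_release_ops(struct toy_fs_type *fs)
{
	if (fs->registered || !toy_grace_period_done())
		return false;
	fs->ops_valid = false;
	return true;
}
```

In this model, a reader that entered before toy_unregister() keeps
ops_valid true until it calls toy_read_unlock(); only then does
toy_try_release_ops() succeed.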

--

From: john stultz
Date: Wednesday, March 3, 2010 - 4:31 pm

Does the following address this issue properly?

Signed-off-by: John Stultz <johnstul@us.ibm.com>

diff --git a/fs/filesystems.c b/fs/filesystems.c
index a24c58e..3448e7c 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -110,6 +110,7 @@ int unregister_filesystem(struct file_system_type * fs)
 			*tmp = fs->next;
 			fs->next = NULL;
 			write_unlock(&file_systems_lock);
+			synchronize_rcu();
 			return 0;
 		}
 		tmp = &(*tmp)->next;


--

From: Nick Piggin
Date: Wednesday, March 3, 2010 - 8:33 pm

As far as I could tell, yes that should solve the code reference
--

From: john stultz
Date: Wednesday, March 3, 2010 - 9:05 pm

Good to hear! Thanks for the review Nick!


Thomas:  I ran a number of kernel-bench and dbench stress tests on this
today and I've not seen any issues, so unless Nick has other issues, I
think it should be ok to pull into -rt.

You can grab the full patchset that builds on top of 2.6.33-rt4 here:
http://sr71.net/~jstultz/dbench-scalability/patches/2.6.33-rt4/vfs-scale.33-rt.tar.bz2

thanks
-john


--

From: john stultz
Date: Tuesday, March 9, 2010 - 7:51 pm

Oh, and another interesting data point!

The ext2 performance numbers with this patch set are scaling better than
the 2.6.31-rt-vfs set tested earlier!

http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/ext2-dbench.png

It's not perfect, but it's closing the gap. More interestingly, whereas
we were still seeing path lookup contention in 2.6.31, it's not showing
up in the perf logs with 2.6.33. Instead, the contention is on the ext2
group_adjust_blocks function.

And replacing the statvfs call in dbench with statfs pushes the results
past mainline:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/ext2-dbench-statfs.png
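
For readers unfamiliar with the two calls, here is a minimal sketch of
the same query made through statfs(2) and through statvfs(3). The helper
names are hypothetical and this is not the actual dbench change. The
thread does not say why statfs was faster; one possible reason, stated
here only as an assumption, is that the C library's statvfs() wrapper
can do extra work on top of the underlying statfs syscall.

```c
#include <sys/statvfs.h>	/* statvfs(), struct statvfs */
#include <sys/vfs.h>		/* statfs(), struct statfs (Linux) */

/*
 * Read the total block count of the filesystem containing 'path'
 * using statfs(2). Returns 0 on success, -1 on error.
 */
static int total_blocks_statfs(const char *path, unsigned long long *blocks)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return -1;
	*blocks = (unsigned long long)st.f_blocks;
	return 0;
}

/*
 * Same query through the POSIX statvfs(3) interface.
 * Returns 0 on success, -1 on error.
 */
static int total_blocks_statvfs(const char *path, unsigned long long *blocks)
{
	struct statvfs st;

	if (statvfs(path, &st) != 0)
		return -1;
	*blocks = (unsigned long long)st.f_blocks;
	return 0;
}
```

Both helpers report the same kind of information, so a benchmark that
only needs block counts could use either one.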


So this all means that with Nick's patch set, we're no longer getting
bogged down in the vfs (at least at 8-way) at all. All the contention is
in the actual filesystem (ext2 in group_adjust_blocks, and ext3 in the
journal and block allocation code).

So again, kudos to Nick!

thanks
-john

--

From: Christoph Hellwig
Date: Wednesday, March 10, 2010 - 2:01 am

Can you check if you're running into any fs scaling limit with xfs?

--

From: john stultz
Date: Thursday, March 11, 2010 - 8:08 pm

Here's the charts from some limited testing:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/xfs-dbench.png

They're not great.  And compared to ext3, the results are basically
flat.
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/ext3-dbench.png

Now, I've not done any real xfs work before, so if there is any tuning
needed for dbench, please let me know.

The odd bit is that perf doesn't show huge overheads in the xfs runs.
The spinlock contention is supposedly under 5%. So I'm not sure what's
causing the numbers to be so bad.

Clipped perf log below.

thanks
-john

    11.06%       dbench  [kernel]                    [k] copy_user_generic_strin

     4.82%       dbench  [kernel]                    [k] __lock_acquire
                |          
                |--94.74%-- lock_acquire
                |          |          
                |          |--38.89%-- rt_spin_lock
                |          |          |          
                |          |          |--28.57%-- _slab_irq_disable
                |          |          |          |          
                |          |          |          |--50.00%-- kmem_cache_alloc
                |          |          |          |          kmem_zone_alloc
                |          |          |          |          xfs_buf_get
                |          |          |          |          xfs_buf_read
                |          |          |          |          xfs_trans_read_buf
                |          |          |          |          xfs_btree_read_buf_b
                |          |          |          |          xfs_btree_lookup_get
                |          |          |          |          xfs_btree_lookup
                |          |          |          |          xfs_alloc_lookup_eq
                |          |          |          |          xfs_alloc_fixup_tree
                |          |          |          |          xfs_alloc_ag_vextent
                |          |          |          |      ...
--

From: Dave Chinner
Date: Thursday, March 11, 2010 - 9:41 pm

What's the X-axis? Number of clients?

If so, I have previously tested XFS to make sure throughput is flat
out to about 1000 clients, not 8. i.e. I'm not interested in peak
throughput from dbench (generally a meaningless number), I'm much
more interested in sustaining that throughput under the sorts of

Dbench does lots of transactions, which makes XFS log IO bound.
Make sure you have at least a 128MB log and are using
lazy-count=1 and perhaps even the logbsize=262144 mount option. But
in general it only takes 2-4 clients to reach maximum throughput on

It's bound by sleeping locks or IO. Call-graph based profiles
triggered on context switches are the easiest way to find the
contending lock.

Last time I did this (around 2.6.16, IIRC) it involved patching the
kernel to put the sample point in the context switch code - can we
do that now without patching the kernel?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Monday, March 15, 2010 - 9:15 am

dbench is simply one that is known bad for core vfs locks. If it is
run on top of tmpfs it gives relatively stable numbers, and on a
real filesystem on ramdisk it works OK too. Not sure if John was
running it on a ramdisk though.

It does emulate the syscall pattern coming from samba running the
netbench test, so it's not _totally_ meaningless :)

In this case, we're mostly interested in it to see if there are

Lock profiling can track sleeping locks, and profile=schedule and
profile=sleep still work OK too. Don't know if any useful tracing
stuff is there for locks yet.

--
