Hey Thomas, Nick,

I wanted to let you know I've just finished forward porting Nick's patches to 2.6.33-rc8-rt2. Luckily my forward port of Nick's patches to 2.6.33 applies on top of the -rt tree without any collisions, and I've added a handful of possibly sketchy fixups to get it working with -rt.

You can find the patchset here:
http://sr71.net/~jstultz/dbench-scalability/patches/2.6.33-rc8-rt2/vfs-scale.33-rt.tar...

Here's a chart showing how much these patches help dbench numbers on ramfs:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/ramfs-dbench.png

I've not done any serious stress testing with the patchset yet, but wanted to post it for your review.

Nick: I'd appreciate any feedback on whether any of my forward porting has gone awry. I'm still very green with respect to the vfs, so I don't doubt there are some issues hiding here.

Thomas: Let me know if you want to start playing with this in the -rt tree. I'm not seeing any warnings with the debugging options on, so I think I squashed all of those issues, but let me know if you manage to trigger anything.

thanks
-john
--
BTW there are a few issues Al pointed out. We have to synchronize RCU after unregistering a filesystem so d_ops/i_ops don't go away, and mntput can sleep so we can't do it under the RCU read lock. The store-free path walk patches don't really have the required RCU barriers in them either (which is fine for x86, but would have to be fixed). --
Does the following address this issue properly?

Signed-off-by: John Stultz <johnstul@us.ibm.com>

diff --git a/fs/filesystems.c b/fs/filesystems.c
index a24c58e..3448e7c 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -110,6 +110,7 @@ int unregister_filesystem(struct file_system_type * fs)
 			*tmp = fs->next;
 			fs->next = NULL;
 			write_unlock(&file_systems_lock);
+			synchronize_rcu();
 			return 0;
 		}
 		tmp = &(*tmp)->next;
--
As far as I could tell, yes that should solve the code reference --
Good to hear! Thanks for the review Nick!

Thomas: I ran a number of kernel-bench and dbench stress tests on this today and I've not seen any issues, so unless Nick has other concerns, I think it should be ok to pull into -rt.

You can grab the full patchset that builds on top of 2.6.33-rt4 here:
http://sr71.net/~jstultz/dbench-scalability/patches/2.6.33-rt4/vfs-scale.33-rt.tar.bz2

thanks
-john
--
Oh, and another interesting data point! The ext2 performance numbers with this patch set are scaling better than the 2.6.31-rt-vfs set tested earlier!
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/ext2-dbench.png

It's not perfect, but it's closing the gap. More interestingly, whereas we were still seeing path lookup contention in 2.6.31, it's not showing up in the perf logs with 2.6.33. Instead, the contention is on the ext2 group_adjust_blocks function.

And replacing the statvfs call in dbench with statfs pushes the results past mainline:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/ext2-dbench-statfs.png

So this all means that with Nick's patch set, we're no longer getting bogged down in the vfs (at least at 8-way) at all. All the contention is in the actual filesystem (ext2 in group_adjust_blocks, and ext3 in the journal and block allocation code).

So again, kudos to Nick!

thanks
-john
--
Can you check if you're running into any fs scaling limit with xfs? --
Here are the charts from some limited testing:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/xfs-dbench.png

They're not great. And compared to ext3, the results are basically flat.
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33/ext3-dbench.png

Now, I've not done any real xfs work before, so if there is any tuning needed for dbench, please let me know.

The odd bit is that perf doesn't show huge overheads in the xfs runs. The spinlock contention is supposedly under 5%. So I'm not sure what's causing the numbers to be so bad.

Clipped perf log below.

thanks
-john

    11.06%  dbench  [kernel]  [k] copy_user_generic_strin
     4.82%  dbench  [kernel]  [k] __lock_acquire
            |
            |--94.74%-- lock_acquire
            |          |
            |          |--38.89%-- rt_spin_lock
            |          |          |
            |          |          |--28.57%-- _slab_irq_disable
            |          |          |          |
            |          |          |          |--50.00%-- kmem_cache_alloc
            |          |          |          |          kmem_zone_alloc
            |          |          |          |          xfs_buf_get
            |          |          |          |          xfs_buf_read
            |          |          |          |          xfs_trans_read_buf
            |          |          |          |          xfs_btree_read_buf_b
            |          |          |          |          xfs_btree_lookup_get
            |          |          |          |          xfs_btree_lookup
            |          |          |          |          xfs_alloc_lookup_eq
            |          |          |          |          xfs_alloc_fixup_tree
            |          |          |          |          xfs_alloc_ag_vextent
            |          |          |          |          ...
What's the X-axis? Number of clients? If so, I have previously tested XFS to make sure throughput is flat out to about 1000 clients, not 8. i.e. I'm not interested in peak throughput from dbench (generally a meaningless number), I'm much more interested in sustaining that throughput under the sorts of

Dbench does lots of transactions which runs XFS into being log IO bound. Make sure you have at least a 128MB log and are using lazy-count=1 and perhaps even the logbsize=262144 mount option.

but in general it only takes 2-4 clients to reach maximum throughput on

It's bound by sleeping locks or IO. Call-graph based profiles triggered on context switches are the easiest way to find the contending lock. Last time I did this (around 2.6.16, IIRC) it involved patching the kernel to put the sample point in the context switch code - can we do that now without patching the kernel?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
--
dbench is simply one that is known bad for core vfs locks. If it is run on top of tmpfs it gives relatively stable numbers, and on a real filesystem on ramdisk it works OK too. Not sure if John was running it on a ramdisk though.

It does emulate the syscall pattern coming from samba running netbench test, so it's not _totally_ meaningless :)

In this case, we're mostly interested in it to see if there are

lock profiling can track sleeping locks, profile=schedule and profile=sleep still work OK too. Don't know if any useful tracing stuff is there for locks yet. --
