Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

Previous thread: Receive steering and hash and cache misses by Stephen Hemminger on Friday, April 2, 2010 - 10:26 am. (5 messages)

Next thread: Unaligned access in xfrm_user:copy_to_user_state by Jan Engelhardt on Friday, April 2, 2010 - 1:18 pm. (2 messages)
From: Sridhar Samudrala
Date: Friday, April 2, 2010 - 10:31 am

Make vhost scalable by creating a separate vhost thread per vhost
device. This provides better scaling across multiple guests and with
multiple interfaces in a guest.

I am seeing better aggregated throughput/latency when running netperf
across multiple guests or multiple interfaces in a guest in parallel
with this patch.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a6a88df..29aa80f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -339,8 +339,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		return r;
 	}
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			&n->dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			&n->dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 
 	f->private_data = n;
@@ -643,25 +645,14 @@ static struct miscdevice vhost_net_misc = {
 
 int vhost_net_init(void)
 {
-	int r = vhost_init();
-	if (r)
-		goto err_init;
-	r = misc_register(&vhost_net_misc);
-	if (r)
-		goto err_reg;
-	return 0;
-err_reg:
-	vhost_cleanup();
-err_init:
-	return r;
-
+	return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);
 
 void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
-	vhost_cleanup();
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7bd7a1e..243f4d3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -36,8 +36,6 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -56,18 +54,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	if (!((unsigned long)key & poll->mask))
 		return 0;
 ...
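
The per-device thread model described in the patch can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the patch's code: POSIX threads stand in for kernel threads, and struct demo_dev, demo_worker, demo_dev_start, demo_dev_queue and demo_dev_stop are hypothetical names invented for this illustration. Each device owns one worker that sleeps until work is queued for that device, so a backlog on one device does not hold up another device's queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device that owns its own worker thread. */
struct demo_dev {
	pthread_t worker;
	pthread_mutex_t lock;
	pthread_cond_t wake;
	int pending;	/* work items queued but not yet handled */
	int handled;	/* work items completed by this device's worker */
	bool stop;
};

/* Worker loop: sleep until work is queued, drain it, exit once stopped. */
static void *demo_worker(void *arg)
{
	struct demo_dev *d = arg;

	pthread_mutex_lock(&d->lock);
	for (;;) {
		while (d->pending == 0 && !d->stop)
			pthread_cond_wait(&d->wake, &d->lock);
		if (d->pending == 0)
			break;		/* stop requested and queue is empty */
		d->pending--;
		d->handled++;		/* a real worker would run a handler here */
	}
	pthread_mutex_unlock(&d->lock);
	return NULL;
}

/* Initialise the device and start its dedicated worker. Returns 0 on success. */
static int demo_dev_start(struct demo_dev *d)
{
	d->pending = 0;
	d->handled = 0;
	d->stop = false;
	pthread_mutex_init(&d->lock, NULL);
	pthread_cond_init(&d->wake, NULL);
	return pthread_create(&d->worker, NULL, demo_worker, d);
}

/* Queue one work item and wake only this device's worker. */
static void demo_dev_queue(struct demo_dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->pending++;
	pthread_cond_signal(&d->wake);
	pthread_mutex_unlock(&d->lock);
}

/* Ask the worker to finish the queue and exit; returns items handled. */
static int demo_dev_stop(struct demo_dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->stop = true;
	pthread_cond_signal(&d->wake);
	pthread_mutex_unlock(&d->lock);
	pthread_join(d->worker, NULL);
	assert(d->pending == 0);
	return d->handled;
}
```

The sketch counts work items under the lock for brevity; a worker doing real work would normally drop the lock while running the handler.
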
From: Michael S. Tsirkin
Date: Sunday, April 4, 2010 - 4:14 am

Thanks for looking into this. An alternative approach is
to simply replace create_singlethread_workqueue with
create_workqueue which would get us a thread per host CPU.

It seems that in theory this should be the optimal approach
wrt CPU locality, however, in practice a single thread
seems to get better numbers. I have a TODO to investigate this.

--

From: Sridhar Samudrala
Date: Monday, April 5, 2010 - 10:35 am

Yes. I tried using create_workqueue(), but the results were not good,
at least when the number of guest interfaces is less than the number
of CPUs. I didn't try more than 8 guests.
Creating a separate thread per guest interface seems to be more
scalable based on the testing I have done so far.

I will try some more tests and get some numbers to compare the following
3 options.
- single vhost thread
- vhost thread per cpu
- vhost thread per guest virtio interface

Thanks

--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 11:49 am

Thread per guest is also easier to account.  I'm worried about guests 
impacting other guests' performance outside scheduler control by 
extensive use of vhost.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Sridhar Samudrala
Date: Thursday, April 8, 2010 - 5:05 pm

Here are the results with netperf TCP_STREAM 64K guest to host on an
8-CPU Nehalem system. Each row shows cumulative bandwidth in Mbps and
host CPU utilization.

Current default single vhost thread
-----------------------------------
1 guest:  12500  37%    
2 guests: 12800  46%
3 guests: 12600  47%
4 guests: 12200  47%
5 guests: 12000  47%
6 guests: 11700  47%
7 guests: 11340  47%
8 guests: 11200  48%

vhost thread per cpu
--------------------
1 guest:   4900 25%
2 guests: 10800 49%
3 guests: 17100 67%
4 guests: 20400 84%
5 guests: 21000 90%
6 guests: 22500 92%
7 guests: 23500 96%
8 guests: 24500 99%

vhost thread per guest interface
--------------------------------
1 guest:  12500 37%
2 guests: 21000 72%
3 guests: 21600 79%
4 guests: 21600 85%
5 guests: 22500 89%
6 guests: 22800 94%
7 guests: 24500 98%
8 guests: 26400 99%

Thanks
Sridhar
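
To make the comparison between the three tables concrete, the 8-guest rows can be reduced to two ratios: bandwidth relative to the single-thread baseline, and bandwidth per percentage point of host CPU. This is only an arithmetic sketch over the figures posted above; struct vhost_result and the helper names are hypothetical and are not part of netperf or vhost.

```c
#include <assert.h>

/* One row of a results table: aggregate bandwidth and host CPU use. */
struct vhost_result {
	double mbps;
	double cpu_pct;
};

/* 8-guest rows copied from the three tables above. */
static const struct vhost_result single_thread_8 = { 11200.0, 48.0 };
static const struct vhost_result per_cpu_8       = { 24500.0, 99.0 };
static const struct vhost_result per_interface_8 = { 26400.0, 99.0 };

/* Bandwidth of r relative to base (1.0 means no change). */
static double speedup(struct vhost_result base, struct vhost_result r)
{
	assert(base.mbps > 0.0);
	return r.mbps / base.mbps;
}

/* Bandwidth delivered per percentage point of host CPU. */
static double mbps_per_cpu_pct(struct vhost_result r)
{
	assert(r.cpu_pct > 0.0);
	return r.mbps / r.cpu_pct;
}
```

With these rows, a thread per guest interface delivers about 2.36 times the single-thread bandwidth at 8 guests, and a thread per CPU about 2.19 times. Bandwidth per percentage point of host CPU works out to roughly 233 Mbps (single thread), 247 Mbps (per CPU) and 267 Mbps (per interface). These are single data points without error bars, so small differences should be read with caution.
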


--

From: Rick Jones
Date: Thursday, April 8, 2010 - 5:14 pm

I presume you mean 8 core Nehalem-EP, or did you mean 8 processor Nehalem-EX?

Don't get me wrong, I *like* the netperf 64K TCP_STREAM test, I like it a lot!-) 
but I find it incomplete and also like to run things like single-instance TCP_RR 
and multiple-instance, multiple "transaction" (./configure --enable-burst) 
TCP_RR tests, particularly when concerned with "scaling" issues.

happy benchmarking,


--

From: Sridhar Samudrala
Date: Friday, April 9, 2010 - 8:39 am

Yes. It is a 2-socket quad-core Nehalem, so I guess it is 8 cores.

Can we run multiple instance and multiple transaction tests with a
single netperf commandline?

Is there any easy way to get consolidated throughput when a netserver on
the host is servicing netperf clients from multiple guests?

Thanks

--

From: Rick Jones
Date: Friday, April 9, 2010 - 10:13 am

I tend to use a script such as:

ftp://ftp.netperf.org/netperf/misc/runemomniagg2.sh

which presumes that netperf/netserver have been built with:

./configure --enable-omni --enable-burst ...

and uses the CSV output format of the omni tests.  When I want sums I then turn 
to a spreadsheet, or I suppose I could turn to awk etc.

The TCP_RR test can be flipped around, request size for response size etc., so 
when I have a single system under test, I initiate the netperf commands on it, 
targeting netservers on the clients.  If I want inbound bulk throughput I use 
the TCP_MAERTS test rather than the TCP_STREAM test.

happy benchmarking,

rick jones
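
As an alternative to a spreadsheet or awk for the summing step, a small helper can total one column of CSV text. This is a sketch under stated assumptions: the column layout of the omni CSV output is not shown in this thread, so the caller must supply the zero-based index of the throughput column, and csv_sum_column is a hypothetical name rather than anything shipped with netperf.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum one column across all lines of CSV text. `col` is zero-based.
 * Each field is parsed with strtod(); a field is skipped when strtod()
 * consumes nothing (for example a header word), and a line is skipped
 * when it has too few columns. Quoted fields containing commas are not
 * handled.
 */
static double csv_sum_column(const char *text, int col)
{
	double total = 0.0;
	const char *line = text;

	assert(text != NULL && col >= 0);
	while (*line != '\0') {
		const char *eol = strchr(line, '\n');
		const char *end = eol ? eol : line + strlen(line);
		const char *field = line;
		int i;

		/* Advance to the start of the requested field on this line. */
		for (i = 0; i < col && field != NULL; i++) {
			const char *comma = memchr(field, ',', (size_t)(end - field));
			field = comma ? comma + 1 : NULL;
		}
		if (field != NULL) {
			const char *comma = memchr(field, ',', (size_t)(end - field));
			size_t len = (size_t)((comma ? comma : end) - field);
			char buf[64];
			char *stop;
			double v;

			/* Copy the field so parsing cannot run into the next line. */
			if (len >= sizeof(buf))
				len = sizeof(buf) - 1;
			memcpy(buf, field, len);
			buf[len] = '\0';
			v = strtod(buf, &stop);
			if (stop != buf)
				total += v;
		}
		line = eol ? eol + 1 : end;
	}
	return total;
}
```

Concatenating the CSV output of every instance and passing the throughput column index would then give the aggregate figure for one run.
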
--

From: Michael S. Tsirkin
Date: Sunday, April 11, 2010 - 8:47 am

Consider using Ingo's perf tool to get error bars, but looks good
overall. One thing I note though is that we seem to be able to
consume up to 99% CPU now. So I think with this approach
we can no longer claim that we are just like some other parts of
networking stack, doing work outside any cgroup, and we should
make the vhost thread inherit the cgroup and cpu mask
from the process calling SET_OWNER.

-- 
MST
--

From: Sridhar Samudrala
Date: Monday, April 12, 2010 - 10:35 am

Yes. I am not sure what the right interface for this is, but it
should also allow binding qemu to a set of CPUs and automatically having
the vhost thread inherit the same CPU mask.

Thanks
Sridhar
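
The behaviour being asked for has a familiar userspace counterpart: on Linux, a thread started with pthread_create() begins with a copy of its creator's CPU affinity mask. The sketch below demonstrates that userspace behaviour only. It is not the vhost mechanism, and the right kernel interface is left open in this thread. record_mask and child_inherits_cpu_mask are hypothetical names used for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Thread body: record the CPU mask this thread is running with. */
static void *record_mask(void *arg)
{
	cpu_set_t *out = arg;

	assert(out != NULL);
	CPU_ZERO(out);
	if (sched_getaffinity(0, sizeof(*out), out) != 0)
		CPU_ZERO(out);	/* leave an empty mask on failure */
	return NULL;
}

/*
 * Pin the calling thread to one CPU it is already allowed to use, start
 * a child thread, and compare the child's mask with the pinned mask.
 * Returns 1 if the masks match, 0 if they differ, -1 if setup failed.
 * The caller's original mask is restored before returning.
 */
static int child_inherits_cpu_mask(void)
{
	cpu_set_t orig, pinned, child;
	pthread_t t;
	int cpu, same;

	if (sched_getaffinity(0, sizeof(orig), &orig) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE && !CPU_ISSET(cpu, &orig); cpu++)
		;
	if (cpu == CPU_SETSIZE)
		return -1;

	CPU_ZERO(&pinned);
	CPU_SET(cpu, &pinned);
	if (sched_setaffinity(0, sizeof(pinned), &pinned) != 0)
		return -1;

	if (pthread_create(&t, NULL, record_mask, &child) != 0) {
		sched_setaffinity(0, sizeof(orig), &orig);
		return -1;
	}
	pthread_join(t, NULL);
	same = CPU_EQUAL(&pinned, &child);
	sched_setaffinity(0, sizeof(orig), &orig);
	return same ? 1 : 0;
}
```
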

--

From: Michael S. Tsirkin
Date: Monday, April 12, 2010 - 10:42 am

How noisy are the numbers?



-- 
MST
--

From: Rick Jones
Date: Monday, April 12, 2010 - 10:50 am

In netperf terms that would mean adding the confidence interval calculations to 
the results, which is done by the "runemomniagg2.sh" script I mentioned. 
When running multiple-instance tests, it is very important to set the min and 
max iterations to the same value so that no instance finishes early.  The 
script does that; I just want to make sure that those leveraging it do the same.

happy benchmarking,

rick jones
--

From: Michael S. Tsirkin
Date: Monday, April 12, 2010 - 9:27 am
