BGL-free net stack

Previous thread: Re: altq spinlock and if_start dispatch by Matthew Dillon on Monday, April 14, 2008 - 6:19 pm. (1 message)

Next thread: Re: BGL-free net stack by Matthew Dillon on Tuesday, April 15, 2008 - 7:42 pm. (1 message)
To: <kernel@...>
Date: Tuesday, April 15, 2008 - 7:17 pm

Hello all,

for my diploma thesis, I've arranged to work on getting the DragonFlyBSD
network stack[*] to run without the BGL. Now, it would be great if the
code changes would become part of the project. So, my plan is to study
the net code over easter vacation (it's been a long while since I last looked
at it and I've never been a network guy in any case) and then post a
preliminary roadmap on kernel@. At that point, everyone can give an opinion
before any code gets written.

Needless to say, I'm interested in any insight from the network guys and any
references to past discussions (perhaps my searches have missed an important
thread). You needn't refer me to stuff that's on the website.

TIA,
Aggelos

[*] At least, most of the paths that matter.

To: <kernel@...>
Date: Monday, May 5, 2008 - 1:36 pm

On second thought, let me ask for input sooner rather than later. I'm leaning
towards Matt's suggestion for a semi-lockless ring buffer for the sockbuf

http://leaf.dragonflybsd.org/mailarchive/kernel/2007-06/msg00122.html

One of the issues is that we can no longer give a stable character count for
the sockbuf, since data can arrive (if we're a reader) or be removed (if
we're a writer) at any time. But we can give some weaker guarantees: if we're
a reader, we can get a lower bound for cc and if we're a writer we have an
upper bound. Now, for references to the sendbuf in e.g. tcp_{in,out}put() I
think we can get away with taking a snapshot of the cc; the alternative would
be to try and make the code very very smart, but for now I'll settle for
obvious correctness. From a superficial first look, I think the handling of
so_oobmark will require some changes. At this point I'm inclined to start
going through all users and updating them to use the new sockbuf and see if
any real problems crop up. If anyone can see a fundamental problem or has a
better approach to suggest, please speak up now, so that I won't waste time
with a suboptimal/flawed approach.

I should also mention that I'm only interested in IPv4 TCP and UDP. The other
protocols can stay under the BGL for now.

Of course the sockbuf isn't the only issue I've busied myself with the past
couple of weeks, but it is one of the more interesting shared data
structures. Hopefully I'll get around to starting a discussion on inpcbs and
tcpcbs soon.

Aggelos

To: <kernel@...>
Date: Thursday, June 5, 2008 - 5:35 pm

[this email was supposed to be much more organized, but this will have to do]

OK, same thing, but now it's the pcbs. TCP is "easy". The inpcb is inserted on
a per-cpu hash table so that the corresponding protocol thread runs on the
same cpu. Some tcpcb fields however, are accessed directly from user-thread
context. The interesting fields are:

t_flags:
need to do early copyin / delayed copyout in so_pr_ctloutput

snd_una, snd_recover, snd_next, t_rxtcur, t_maxopd, t_rtttime, t_rtseq,
t_softerror:

Those fields are touched by tcp_mtudisc() or tcp_notify(). AFAICT, these two
are only reached along the path transport_processing_oncpu->icmp_input->
tcp_ctlinput [resume writing email 10 days later, hopefully my notes are
complete] which shouldn't be a problem.

snd_cwnd, snd_wacked, t_rxtcur:
Are touched by tcp_quench() which again is called only by tcp_ctlinput().

In other news, the new sockbuf code is operational and after I squash an nfs
bug and simplify it a bit I'm going to ask for testers. I expected it to be at
this stage at least a couple of weeks ago but life and a few silly bugs
intervened. My plan is to start a discussion on the more interesting in_pcb
situation on kernel@ this weekend.

Aggelos

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 7:43 am

On Friday 06 June 2008, Aggelos Economopoulos wrote:

So, I was thinking something like the following:

diff --git a/sys/kern/uipc_msg.c b/sys/kern/uipc_msg.c
index fde4c93..e786e02 100644
--- a/sys/kern/uipc_msg.c
+++ b/sys/kern/uipc_msg.c
@@ -35,6 +35,7 @@

#include <sys/param.h>
#include <sys/systm.h>
+#include <sys/kernel.h>
#include <sys/msgport.h>
#include <sys/protosw.h>
#include <sys/socket.h>
@@ -379,22 +380,38 @@ so_pru_sopoll(struct socket *so, int events, struct ucred *cred)
return (error);
}

+MALLOC_DEFINE(M_SOPT, "sopt", "sopt temp storage");
+
int
so_pr_ctloutput(struct socket *so, struct sockopt *sopt)
{
- return ((*so->so_proto->pr_ctloutput)(so, sopt));
#ifdef gag /* does copyin and copyout deep inside stack XXX JH */
+ return ((*so->so_proto->pr_ctloutput)(so, sopt));
+#else
struct netmsg_pr_ctloutput msg;
lwkt_port_t port;
int error;
+ void *uval;
+
+ uval = sopt->sopt_val;

- port = so->so_proto->pr_mport(so, NULL);
+ /* we keep duplicate copies, but for option {s,g}etting who cares? */
+ sopt->sopt_val = kmalloc(sopt->sopt_valsize, M_SOPT, M_WAITOK);
+ error = copyin(uval, sopt->sopt_val, sopt->sopt_valsize);
+ if (error)
+ goto out;
+ port = so->so_proto->pr_mport(so, NULL, NULL, XXX);
netmsg_init(&msg.nm_netmsg, &curthread->td_msgport, 0,
netmsg_pru_ctloutput);
msg.nm_prfn = so->so_proto->pr_ctloutput;
msg.nm_so = so;
msg.nm_sopt = sopt;
error = lwkt_domsg(port, &msg.nm_netmsg.nm_lmsg, 0);
+ if (error)
+ goto out;
+ error = copyout(sopt->sopt_val, uval, sopt->sopt_valsize);
+out:
+ kfree(sopt->sopt_val, M_SOPT);
return (error);
#endif
}

But before I update all callees to remove copy{in,out}(), does anybody have
any objections? Also, what should be the value of XXX? Perhaps ctloutput
belongs to pr_usrreqs...

Thanks,
Aggelos

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 8:14 am

I don't think above change is enough. You need to change
sooptcopy{in,out}() or at least set sopt->sopt_td to NULL. Please
take a look at ip_ctloutput() for what the above patch may break.

Best Regards,
sephe

--
Live Free or Die

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 8:28 am

Yes, I know. I'm in the middle of updating the callees. This is how I'm doing
it:

diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c
index 9278ffa..549352f 100644
--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -1418,8 +1418,7 @@ ip_ctloutput(struct socket *so, struct sockopt *sopt)
break;
}
m->m_len = sopt->sopt_valsize;
- error = sooptcopyin(sopt, mtod(m, char *), m->m_len,
- m->m_len);
+ bcopy(sopt->sopt_val, mtod(m, void *), m->m_len);

return (ip_pcbopts(sopt->sopt_name, &inp->inp_options,
m));
@@ -1434,10 +1433,11 @@ ip_ctloutput(struct socket *so, struct sockopt *sopt)
case IP_RECVIF:
case IP_RECVTTL:
case IP_FAITH:
- error = sooptcopyin(sopt, &optval, sizeof optval,
- sizeof optval);
- if (error)
+ if (sopt->sopt_valsize != sizeof optval) {
+ error = EINVAL;
break;
+ }
+ optval = *(int *)sopt->sopt_val;

switch (sopt->sopt_name) {
case IP_TOS:
@@ -1826,6 +1826,13 @@ ip_setmoptions(struct sockopt *sopt, struct ip_moptions **imop)
imo->imo_num_memberships = 0;
}

+#define getval(var) \
+ if (sopt->sopt_valsize != sizeof var) { \
+ error = EINVAL; \
+ break; \
+ } \
+ bcopy(sopt->sopt_val, &var, sizeof var) \
+
switch (sopt->sopt_name) {
/* store an index number for the vif you wanna use in the send */
case IP_MULTICAST_VIF:
@@ -1833,9 +1840,7 @@ ip_setmoptions(struct sockopt *sopt, struct ip_moptions **imop)
error = EOPNOTSUPP;
break;
}
- error = sooptcopyin(sopt, &i, sizeof i, sizeof i);
- if (error)
- break;
+ getval(i);
if (!legal_vif_num(i) && (i != -1)) {
error = EINVAL;
break;
@@ -1847,9 +1852,7 @@ ip_setmoptions(struct sockopt *sopt, struct ip_moptions **imop)
/*
* Select the interface for outgoing multicast packets.
*/
- error = sooptcopyin(sopt, &addr, sizeof addr, sizeof addr);
- if (error)
...

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 8:46 am

Ah, I see. Sorry, I misread the previous mail. I think you could use
-1 in soport()

--
Live Free or Die

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 8:57 am

Any unused value would do, right? Isn't this a hint that we should just move
pr_ctloutput to pru_ctloutput and add a PRU_CTLOUTPUT value?

Aggelos

To: <kernel@...>
Date: Monday, June 16, 2008 - 1:16 pm

What I've got seems to work (I can't easily test the ip_mroute changes), even
though I'm not completely happy with the implementation. You (I don't mean
just Sephe here) can get the changes by pulling the tcpcb branch from the git
repo at git://repo.or.cz/dragonfly/netmp.git or by fetching
http://leaf.dragonflybsd.org/~aggelos/B1-sopt.patch and applying it to HEAD.
Please test in your regular setup and report any errors ("works for me" is
interesting too). This patch does NOT contain any sockbuf changes and should
be quite safe. That said, don't run it on your production boxes yet.

I'd also like to know what the developers think about

int
kva_p(const void *addr)
{
/* XXX: mapped? */
return ((unsigned long)KvaStart <= (unsigned long)addr) &&
((unsigned long)addr < (unsigned long)KvaEnd);
}

do we need such a function? Should it be quick-n-silly like above or
should it check if there is a page mapped on addr as well?

diffstat for the patch is

kern/uipc_msg.c | 45 +++++++++++++++++----
kern/uipc_socket.c | 82 ++++++++++++++++++++++++++++++++++++++--
net/dummynet/ip_dummynet_glue.c | 8 +--
net/ip_mroute/ip_mroute.c | 32 +++++++--------
net/ipfw/ip_fw2.c | 61 ++++++++++++++---------------
net/netisr.h | 9 ----
net/netmsg.h | 7 +++
netinet/ip_output.c | 56 ++++++++++++---------------
netinet/tcp_usrreq.c | 21 ++--------
netinet6/ip6_output.c | 34 ++++++----------
sys/protosw.h | 5 +-
sys/socketops.h | 2
sys/socketvar.h | 5 ++
13 files changed, 228 insertions(+), 139 deletions(-)

Aggelos

To: <kernel@...>
Date: Tuesday, June 17, 2008 - 2:39 pm

On Monday 16 June 2008, Aggelos Economopoulos wrote:

I'm going to take this silence as unanimous approval and commit this later
today :P

Aggelos

To: <kernel@...>
Date: Sunday, June 8, 2008 - 1:35 pm

On Friday 06 June 2008, Aggelos Economopoulos wrote:

Well, quite a few know issues remain (see
http://wiki.dragonflybsd.org/index.cgi/NetMP) and the sb_*_cc_*() interface is
butt-ugly and has to go, but at this point some testing would be helpful. So,
anybody feeling adventurous please pull the sockbuf branch from
git://repo.or.cz/dragonfly/netmp.git, build the kernel with -DNO_MODULES and
try booting with it and do what you normally do. Let me know what breaks.
Needless to say, this is not code you should run on any machine with data you
care about that is not backed up. That said, I haven't lost any data running
on it yet.

If you are new to git, the necessary commands are:

git clone git://repo.or.cz/dragonfly/netmp.git
cd netmp
git branch sockbuf origin/sockbuf
git checkout sockbuf
make -DNO_MODULES KERNCONF=$KC nativekernel
su
cp /usr/obj/`pwd`/sys/$KC/kernel.debug /kernel-newsb
reboot

and in the boot loader

unload kernel
load kernel-newsb
boot

TIA,
Aggelos

To: <kernel@...>
Date: Sunday, June 8, 2008 - 1:58 pm

Oh and if you just want to review the changes you can get the patch from

http://leaf.dragonflybsd.org/~aggelos/A1-sockbuf.patch

This won't apply to current HEAD of course.

kern/sys_socket.c | 8
kern/uipc_mbuf.c | 8
kern/uipc_sockbuf.c | 741 ++++++++++++++++++++------------------------
kern/uipc_socket.c | 194 ++++++-----
kern/uipc_socket2.c | 46 --
kern/uipc_syscalls.c | 2
kern/uipc_usrreq.c | 25 -
net/accf_http/accf_http.c | 9
net/ip_mroute/ip_mroute.c | 4
net/raw_usrreq.c | 5
netbt/bluetooth.h | 4
netbt/hci_socket.c | 6
netbt/l2cap_socket.c | 6
netbt/rfcomm_socket.c | 6
netbt/sco_socket.c | 8
netgraph/socket/ng_socket.c | 4
netinet/ip_divert.c | 2
netinet/raw_ip.c | 5
netinet/tcp_input.c | 60 ++-
netinet/tcp_output.c | 41 +-
netinet/tcp_sack.c | 5
netinet/tcp_usrreq.c | 11
netinet/tcp_var.h | 2
netinet/udp_usrreq.c | 5
netinet6/icmp6.c | 5
netinet6/ip6_mroute.c | 3
netinet6/raw_ip6.c | 5
netinet6/sctp6_usrreq.c | 2
netinet6/udp6_usrreq.c | 9
netproto/atalk/ddp_input.c | 2
netproto/atm/atm_aal5.c | 2
netproto/ipsec/keysock.c | 4
netproto/ipx/ipx_usrreq.c | 2
netproto/ipx/spx_usrreq.c | 14
netproto/key/keysock.c | 5
netproto/natm/natm.c | 5
netproto/ncp/ncp_ncp.c | 2
netproto/ns/idp_usrreq.c | 2
netproto/ns/spp_usrreq.c | 14
sys/mbuf.h | 26 +
sys/protosw.h | 1
sys/sockbuf.h | 188 ++++++++---
sys/socketops.h | 6
sys/socketvar.h | 158 +++++----
sys/socketvar2.h | 152 +++++++--
sys/unpcb.h | 1
vfs/fifofs/fifo_vnops.c | 5
vfs/nfs/krpc_subr.c ...

To: <kernel@...>
Date: Sunday, June 8, 2008 - 6:40 pm

On Sunday 08 June 2008, Aggelos Economopoulos wrote:

And the reason for that can be found at
http://wiki.dragonflybsd.org/index.cgi/NetMP

One issue I forgot to mention is that I haven't updated some obscure protocols
to the new interface. Do we care about ncp/ipx and friends? I very much doubt
anyone is using them on DragonFly, so there's a good chance they're broken in
HEAD too. Also, I don't know how they're supposed to work or how to set them
up and test them. In any case, I think it would be a waste of time. At this
point they are mostly a maintainance burden. Is there any objection to
removing them?

SCTP is another protocol I've broken, but this is different because SCTP may
become relevant. Does anyone use SCTP on DragonFly now? Updating it may not
be exactly trivial; it seems to be very intimate with the sockbuf internals.
If anyone wants to give it a try, please step forward! I think my time would
be better spent elsewere (it's not as if I can easily test SCTP anyway).

Aggelos

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 9:04 am

So I guess SCTP can be declared an orphan that nobody wants to adopt. Life is
hard for code that nobody cares about...

Aggelos

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 10:42 am

On Tue, Jun 10, 2008 at 9:04 PM, Aggelos Economopoulos

I suggest to move out netproto/{ns, ipx, atalk, ncp} like what we did
for the alpha port, so if someone interested to pick it again, they
still have it there.

Best Regards,
sephe

--
Live Free or Die

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 11:49 am

At least AppleTalk still has an active userbase in the real world.
XNS support should be safe to remove, even NetBSD did that.
I'm not sure about the practical use of NCP.
IPX was mentioned already.

Joerg

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 11:48 am

Hello,

I object to removing of IPX from DragonflyBSD.
If we want to be competitive with other operating systems (say, for instance, Linux) which do support older protocols (Like IPX/SPX) than we might as well
keep what we can.

I, myself am somewhat of an expert of IPX (I wrote several DOS TSR/Interrupt driven IPX drivers).
I personally volunteer to maintain the IPX code, but I'll only be able to do it on weekends and sometimes during weeknights, as work has been very busy lately. I'll keep the list updated on progress.

As for the other protocols, I know nothing about them, and thus cannot maintain them, but shouldn't we leave the code in the tree while we look for someone can maintain those parts?

Regards,

Jon

jcs@vmware.com (This is my preferred e-mail address, the one you received this from was an alias because I am using Outlook Web Access).

Sent: Tuesday, June 10, 2008 7:42 AM
To: kernel@crater.dragonflybsd.org
Subject: Re: tcpcb (was Re: sockbuf (was Re: BGL-free net stack))

On Tue, Jun 10, 2008 at 9:04 PM, Aggelos Economopoulos

I suggest to move out netproto/{ns, ipx, atalk, ncp} like what we did
for the alpha port, so if someone interested to pick it again, they
still have it there.

Best Regards,
sephe

--
Live Free or Die

To: <kernel@...>
Date: Tuesday, June 10, 2008 - 5:58 pm

The reason to remove IPX was not "it's old" but "nobody cares about it". Since
you do, the other reason remaining is "there are no users". If you want to
maintain it anyway, great! To begin with, you probably want to verify that the

Good, although, if you don't intend to use it, I think we could use you

Er, the number of people who want to maintain obscure protocols (for
compatibility needs I assume) is only going to decrease with time. If
nobody cares now...

Aggelos

To: <kernel@...>
Date: Friday, June 6, 2008 - 3:16 pm

Currently, inpcb's for UDP are all in a global hash table which is
only protected by (you guessed it) the BGL. The straightforward
way to avoid this would be to break the table a la TCP. This presents
two problems.

First, UDP sockets can issue connect() multiple times (thus changing
faddr, fport) and call bind to go from wildcard laddr to a specific one.
When this happens, the inpcb must be moved to another cpu. This shouldn't
be too hard to handle; just mark the old inpcb as BEING_DELETED and only
delete it after inserting the new inpcb. Dropping a few packets is expected
for UDP and this shouldn't happen very often anyway.

Then there is the more interesting issue of how to hash. As described above,
the lport is the only field we can be sure is not a wildcard. Now consider
a UDP (say DNS) server; such a server does not normally connect() so whatever
hash function we choose, the inpcb is going to end up on one cpu. This is the
cpu we would normally dispatch an incoming UDP packet to. The thing is, all
datagrams for our UDP server will end up going through the same cpu. So our
busy DNS server just can't scale: using only one protocol thread is going to
be a bottleneck.

And if we decide to allow multiple UDP protocol threads to access the socket
then we may have to lock around accesses to the socket, but AFAICT that
won't be necessary. UDP does not mess with most socket fields in the
input/output paths and it seems to me the code can survive some socket
option changing under it. However, our sockbuf can't handle concurrent
accesses, so we'd have to have multiple sockbufs (one per cpu) and then
the socket layer would have to pull data from all of them (probably in a
round-robin fashion). UDP does not guarantee in-order delivery but, since
in-order is typically the case, I'm not sure how well the apps can handle
it. On top of that we'd need to decide what to do about buffer size limits
and whether the sockbufs should stay in struct socket.

OK, this should get the discussion star...

To: <kernel@...>
Date: Friday, June 6, 2008 - 10:56 pm

On Sat, Jun 7, 2008 at 3:16 AM, Aggelos Economopoulos

Followings are my vague thoughts:
Add per-CPU state in UDP inpcb. There are something I think the
per-CPU state should have: input sockbuf, addr/port pair, valid bit.
Valid bit is on if the UDP socket is not connected, or the UDP socket
is connected and {lf}{port,addr} hash value equals to the state's
owner CPU. Mainly for input packet validation.
Hash should be quite straightforward for connected UDP socket. For
unconnected UDP socket, f{addr.port} is from user. The problem is the
lport upon the first sending and laddr for unbound UDP socket. laddr
for unbound UDP socket could possibly be handled by having per-CPU
route cache. We could also let lport determination happens only on
CPU0, this is assumed to be time consuming, but under worst case it
could only happens for the first several packets in the life time of
the socket.
There is no need to add output sockbuf in per-CPU state, and input

This belongs {wildcard,bound} laddr and bound lport, use th per-CPU
UDP states I mentioned above, then in a normal use case, work load
should be able to be distributed.

Best Regards,
sephe

--
Live Free or Die

To: <kernel@...>
Date: Sunday, June 8, 2008 - 10:18 am

That would be one way to do it, especially if we want to go in the per-cpu

You mean sending an IPI? That would add significant latency to UDP
ping-pongs which would affect DNS queries. I think it would be measurable.

Perhaps we can again partition the port space per cpu and only use

Right.

Thanks for your insight,
Aggelos

Previous thread: Re: altq spinlock and if_start dispatch by Matthew Dillon on Monday, April 14, 2008 - 6:19 pm. (1 message)

Next thread: Re: BGL-free net stack by Matthew Dillon on Tuesday, April 15, 2008 - 7:42 pm. (1 message)