[PATCH] netlink: add NETLINK_NO_ENOBUFS socket flag

Previous thread: [GIT]: Networking by David Miller on Monday, March 23, 2009 - 1:40 am. (1 message)

Next thread: [PATCH] ucc_geth: Convert to net_device_ops by Joakim Tjernlund on Monday, March 23, 2009 - 3:17 am. (11 messages)
From: Pablo Neira Ayuso
Date: Monday, March 23, 2009 - 2:33 am

This patch adds the NETLINK_NO_ENOBUFS socket flag. This flag can
be used by unicast and broadcast listeners to avoid receiving
ENOBUFS errors.

Generally speaking, ENOBUFS errors are useful to tell the listener
two things:

a) You may increase the receiver buffer size via setsockopt().
b) You have lost messages, you may be out of sync.
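
The two reactions listed above can be sketched in a receive loop roughly as follows. This is only an illustration: the enum nl_rx_action and both helper names are hypothetical, and what "resync" means depends on the subsystem the listener talks to.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical outcomes a listener might distinguish after recv(). */
enum nl_rx_action {
    NL_RX_PROCESS,  /* got data, parse the netlink messages */
    NL_RX_RESYNC,   /* ENOBUFS: messages were lost, state may be stale */
    NL_RX_RETRY,    /* transient condition, call recv() again */
    NL_RX_FAIL      /* any other error */
};

/* Map the return value of recv() and the saved errno to an action. */
static enum nl_rx_action nl_classify_recv(ssize_t n, int err)
{
    if (n >= 0)
        return NL_RX_PROCESS;
    if (err == ENOBUFS)
        return NL_RX_RESYNC;
    if (err == EINTR || err == EAGAIN)
        return NL_RX_RETRY;
    return NL_RX_FAIL;
}

/* Case a): ask for a larger receive buffer on the socket. */
static int nl_grow_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```

Note that the kernel caps SO_RCVBUF requests at net.core.rmem_max; a process with CAP_NET_ADMIN can use SO_RCVBUFFORCE to exceed that limit.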

In some cases, ignoring ENOBUFS errors can be useful. For example:

a) nfnetlink_queue: this subsystem does not have any sort of resync
method and you can decide to ignore ENOBUFS once you have set a
given buffer size.

b) ctnetlink: you can use this together with the socket flag
NETLINK_BROADCAST_SEND_ERROR to stop getting ENOBUFS errors as
you do not need to resync (packets whose events are not delivered
are dropped to provide reliable logging and state-synchronization).

Moreover, the use of NETLINK_NO_ENOBUFS also reduces a "go up, go down"
effect in terms of performance which is due to the netlink congestion
control when the listener cannot back off. The effect is the following:

1) throughput rate goes up and netlink messages are inserted in the
receiver buffer.
2) Then, the netlink buffer fills and overruns (bit 0 of nlk->state is set).
3) While the listener empties the receiver buffer, netlink keeps
dropping messages. Thus, throughput goes dramatically down.
4) Then, once the listener has emptied the buffer (bit 0 of
nlk->state is cleared), go to step 1.
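
The four steps above can be mimicked with a toy user-space model. This is a sketch under simplifying assumptions, not kernel code: messages are counted rather than sized in bytes, time advances in ticks, the "congested" field stands in for bit 0 of nlk->state, and all names (rx_model, model_run, and so on) are hypothetical.

```c
#include <stdbool.h>

struct rx_model {
    int  queued;              /* messages sitting in the receive buffer */
    int  capacity;            /* buffer size, counted in messages */
    bool congestion_control;  /* false models NETLINK_NO_ENOBUFS */
    bool congested;           /* stands in for nlk->state bit 0 */
    long accepted;
    long dropped;
    long drop_run;            /* current run of consecutive drops */
    long longest_drop_run;
};

/* Sender side: try to queue one message (steps 1 to 3). */
static void model_enqueue(struct rx_model *m)
{
    bool full = m->queued >= m->capacity;

    if (full && m->congestion_control)
        m->congested = true;            /* step 2: overrun */

    if (full || m->congested) {         /* step 3: keep dropping */
        m->dropped++;
        if (++m->drop_run > m->longest_drop_run)
            m->longest_drop_run = m->drop_run;
        return;
    }
    m->queued++;
    m->accepted++;
    m->drop_run = 0;
}

/* Listener side: read one message; an empty queue ends congestion (step 4). */
static void model_dequeue(struct rx_model *m)
{
    if (m->queued > 0)
        m->queued--;
    if (m->queued == 0)
        m->congested = false;
}

/* Run the model with fixed per-tick producer and consumer rates. */
static struct rx_model model_run(bool congestion_control, int capacity,
                                 int ticks, int in_per_tick, int out_per_tick)
{
    struct rx_model m = {
        .capacity = capacity,
        .congestion_control = congestion_control,
    };

    for (int t = 0; t < ticks; t++) {
        for (int i = 0; i < in_per_tick; i++)
            model_enqueue(&m);
        for (int i = 0; i < out_per_tick; i++)
            model_dequeue(&m);
    }
    return m;
}
```

Comparing model_run(true, ...) with model_run(false, ...) for a producer that is slightly faster than the consumer shows where the two modes differ in this model: with congestion control, losses arrive as one long run while the buffer drains; without it, they are spread out.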

This effect is easier to trigger with netlink broadcast under heavy
load, and it is more noticeable when using a big receiver buffer.
You can find some results in [1] that show this problem.

[1] http://1984.lsi.us.es/linux/netlink/

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---

 include/linux/netlink.h  |    1 +
 net/netlink/af_netlink.c |   30 +++++++++++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 1e6bf99..5ba398e 100644
--- a/include/linux/netlink.h
+++ ...
From: Patrick McHardy
Date: Monday, March 23, 2009 - 4:58 am

I agree that not having netlink drop new messages after congestion
might be useful. Two suggestions though:

- NETLINK_NO_CONGESTION_CONTROL seems a bit more descriptive than
   "NO_ENOBUFS"

- The ENOBUFS error itself is actually not the problem, but the
   congestion handling. It still makes sense to notify userspace
   of congestion. I'd suggest to deliver the error, but avoid setting
   the congestion bit.

--

From: Pablo Neira Ayuso
Date: Monday, March 23, 2009 - 5:11 am

I thought about this choice but I see one problem with this. The ENOBUFS
error is attached to the congestion control. If we keep reporting
ENOBUFS errors to userspace with no congestion control, the listener may
keep receiving ENOBUFS indefinitely. In other words, the congestion
control seems to me like a way to avoid spamming ENOBUFS errors to
userspace.

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers
--

From: Patrick McHardy
Date: Monday, March 23, 2009 - 5:14 am

What do you mean by "attached to"? Congestion control is done by

The error will be cleared by the next call to recvmsg().
--

From: Pablo Neira Ayuso
Date: Monday, March 23, 2009 - 5:25 am

Yes, but once we set that bit to 1, we stop sending ENOBUFS to
userspace. So I think that congestion also applies to error reporting,

Yes, but think about this scenario:

1) We hit ENOBUFS, you call recvmsg() you get the error, and error is
cleared.
2) You're going to call recvmsg() again but before doing so, we hit
ENOBUFS again. So you call recvmsg() and you get the error again.

I think that this may lead to indefinitely getting ENOBUFS without
retrieving data under very heavy load.
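
To make the scenario concrete, here is a toy model of a socket that holds a single pending error which is reported, and cleared, before any queued data is returned. It is a sketch under that assumption only; the names err_model, model_overrun(), model_recv() and model_count_data() are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

struct err_model {
    int pending_err;  /* at most one outstanding error */
    int queued;       /* data messages waiting to be read */
};

/* An overrun only records the error; a second one overwrites the first. */
static void model_overrun(struct err_model *m)
{
    m->pending_err = ENOBUFS;
}

/*
 * Returns a negative errno if an error was pending (and clears it),
 * 1 if a data message was read, 0 if nothing was available.
 */
static int model_recv(struct err_model *m)
{
    if (m->pending_err) {
        int err = m->pending_err;

        m->pending_err = 0;
        return -err;
    }
    if (m->queued > 0) {
        m->queued--;
        return 1;
    }
    return 0;
}

/* Count how many of `calls` reads return data. */
static int model_count_data(int calls, bool overrun_between_calls)
{
    struct err_model m = { .pending_err = 0, .queued = calls };
    int got = 0;

    for (int i = 0; i < calls; i++) {
        if (overrun_between_calls)
            model_overrun(&m);
        if (model_recv(&m) > 0)
            got++;
    }
    return got;
}
```

If an overrun happens before every read, each call returns the error and none returns data, which is the starvation case described in steps 1) and 2).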

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers
--

From: Patrick McHardy
Date: Monday, March 23, 2009 - 5:41 am

That's correct, there can only be a single outstanding error at any

I'm not sure that this would be a bad thing under the circumstances
you describe. We drop packets, we notify userspace.

I agree though that my proposed way isn't ideal either: since we can't
queue errors, they will be delivered sporadically (not reflecting the
true amount of dropped messages), and without stopping to queue new
messages, it can't be determined at which "position" the error occurred.

But I think some notification or other way to notice what's happening
is needed for userspace, otherwise it can neither report nor handle
this in any way.

--

From: Pablo Neira Ayuso
Date: Monday, March 23, 2009 - 6:05 am

Hm, I see. I think that we can increment sk_drops, like in the UDP code,
when the NETLINK_NO_ENOBUFS flag is set. We can display it in the
netlink /proc entry. Would you be OK with this?
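
For illustration, a userspace tool could then pick the counter out of the /proc listing along these lines. This is a sketch only: it assumes the listing is a header line followed by whitespace-separated rows and that the new column is named "Drops", neither of which is fixed by this thread. All helper names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the idx-th whitespace-separated field of line into out.
 * Returns 0 on success, -1 if the field is missing or too long. */
static int nth_field(const char *line, int idx, char *out, size_t outlen)
{
    const char *p = line;

    for (int i = 0; ; i++) {
        size_t len;

        p += strspn(p, " \t\n");        /* skip leading separators */
        if (*p == '\0')
            return -1;
        len = strcspn(p, " \t\n");      /* length of this field */
        if (i == idx) {
            if (len >= outlen)
                return -1;
            memcpy(out, p, len);
            out[len] = '\0';
            return 0;
        }
        p += len;
    }
}

/* Return the zero-based position of column `name` in the header, or -1. */
static int find_column(const char *header, const char *name)
{
    char field[64];

    for (int i = 0; nth_field(header, i, field, sizeof(field)) == 0; i++)
        if (strcmp(field, name) == 0)
            return i;
    return -1;
}

/* Return the drop counter of one row, or -1 if it cannot be found.
 * The column name "Drops" is an assumption. */
static long row_drops(const char *header, const char *row)
{
    char field[64];
    int idx = find_column(header, "Drops");

    if (idx < 0 || nth_field(row, idx, field, sizeof(field)) != 0)
        return -1;
    return strtol(field, NULL, 10);
}
```

Looking the column up by name rather than by fixed position keeps the parser working if other columns are added before or after it.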

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers
--

From: Patrick McHardy
Date: Monday, March 23, 2009 - 6:09 am

Yes, something like that seems OK.

--
