I agree with all this Ingo.
The netfilter case is real simple Ingo, (note I did not use "obvious" here ;) )
netfilter in 2.6.2[0-9] used :
CPU 1
sofirq handles one packet from a NIC
ipt_do_table() /* to handle INPUT table for example, or FORWARD */
read_lock_bh(&a_central_and_damn_rwlock)
... parse rules
-> calling some netfilter sub-function
re-entering network IP stack to send some packet (say a RST packet)
...
ipt_do_table() /* to handle OUPUT table rules for example */
read_lock_bh() ; /* WE RECURSE here, but once in a while (if ever) */
This is one of the case, but other can happens with virtual networks, tunnels, ...
and so on. (Stephen had some cases with KVM if I remember well)
If this could be done without recursion, I am pretty sure netfilter
and network guys would have done it. I found Linus reaction quite
shocking IMHO, considering hard work done by all people on this.
I was pleased by your locking schem, that was *very* interesting, even
if not yet ready.
1) We can discuss of how bad recursion is.
We know loopback_xmit() could be way faster if we could avoid queeing packet
to softirq handler.
(Remember you and I suggested this patch some months ago ? Please remember
David rejected this because of recursion and possibility to overflow stack)
So yes, people are aware of recursion problems.
2) We can discuss how bad rwlocks are
We did lot of work on last months to delete some of rwlocks we had in kernel.
UDP stack for example dont use them anymore.
We tried to delete them on x_tables, but we must take care of ipt_do_table() nesting,
that is legal on 2.6.30.
Maybe netfilter guys can work to avoid this nesting on 2.6.31,
I dont know how hard it is, definitly not 2.6.30 material.
Solutions were discussed many times and Stephen provided 13 versions of the patch.
If this was that obvious, one or two iterations would have been OK.
3) About last patch (v13)
Stephen did not agreed with you (well... maybe after all ..),
he only submitted again a previous version.
With linux-2.6.2[0-9], "iptables -L" used to block all cpus from entering netfilter
because we did a write_lock_bh() on the central rwlock while folding counters.
On v13, we dont try to freeze whole x_table context, but cpu per cpu.
Thats a minor change against previous versions, and should not lead to strange
application behavior. Its so much scalable that we should accept this change.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html