Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().

Previous thread: Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem by David Witbrodt on Monday, August 11, 2008 - 12:05 pm. (1 message)

Next thread: [PATCH] pkt_sched: Add BH protection for qdisc_stab_lock. by Jarek Poplawski on Monday, August 11, 2008 - 2:42 pm. (3 messages)
From: Jarek Poplawski
Date: Monday, August 11, 2008 - 1:53 pm

pkt_sched: Destroy gen estimators under rtnl_lock().

gen_kill_estimator() requires rtnl_lock() protection, and since it is
called in qdisc ->destroy() too, this has to go back from RCU callback
to qdisc_destroy().


Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 net/sched/sch_generic.c |   17 +++++++++--------
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7cf83b3..336dc88 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -524,18 +524,10 @@ EXPORT_SYMBOL(qdisc_reset);
 static void __qdisc_destroy(struct rcu_head *head)
 {
 	struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
-	const struct Qdisc_ops  *ops = qdisc->ops;
 
 #ifdef CONFIG_NET_SCHED
 	qdisc_put_stab(qdisc->stab);
 #endif
-	gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
-	if (ops->reset)
-		ops->reset(qdisc);
-	if (ops->destroy)
-		ops->destroy(qdisc);
-
-	module_put(ops->owner);
 	dev_put(qdisc_dev(qdisc));
 
 	kfree_skb(qdisc->gso_skb);
@@ -547,6 +539,8 @@ static void __qdisc_destroy(struct rcu_head *head)
 
 void qdisc_destroy(struct Qdisc *qdisc)
 {
+	const struct Qdisc_ops *ops = qdisc->ops;
+
 	if (qdisc->flags & TCQ_F_BUILTIN ||
 	    !atomic_dec_and_test(&qdisc->refcnt))
 		return;
@@ -554,6 +548,13 @@ void qdisc_destroy(struct Qdisc *qdisc)
 	if (qdisc->parent)
 		list_del(&qdisc->list);
 
+	gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
+	if (ops->reset)
+		ops->reset(qdisc);
+	if (ops->destroy)
+		ops->destroy(qdisc);
+
+	module_put(ops->owner);
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
 }
 EXPORT_SYMBOL(qdisc_destroy);
--

From: David Miller
Date: Monday, August 11, 2008 - 6:12 pm

From: Jarek Poplawski <jarkao2@gmail.com>

We can't do this.  And at a minimum, the final ->reset() must
occur in the RCU callback, otherwise asynchronous threads of
execution could queue packets into this dying qdisc and
such packets would leak forever.
--

From: Jarek Poplawski
Date: Monday, August 11, 2008 - 10:20 pm

Could you explain this more? I've thought this synchronize_rcu() is
just to prevent this (and what these comments talk about?):

void dev_deactivate(struct net_device *dev)
{
        bool running;

        netdev_for_each_tx_queue(dev, dev_deactivate_queue, &noop_qdisc);
        dev_deactivate_queue(dev, &dev->rx_queue, &noop_qdisc);

        dev_watchdog_down(dev);

        /* Wait for outstanding qdisc-less dev_queue_xmit calls. */
        synchronize_rcu();

        do {
                while (some_qdisc_is_running(dev, 0))
                        yield();

                /*
                 * Double-check inside queue lock to ensure that all effects
                 * of the queue run are visible when we return.
                 */
                running = some_qdisc_is_running(dev, 1);

                /*
                 * The running flag should never be set at this point because
                 * we've already set dev->qdisc to noop_qdisc *inside* the same
                 * pair of spin locks.  That is, if any qdisc_run starts after
                 * our initial test it should see the noop_qdisc and then
                 * clear the RUNNING bit before dropping the queue lock.  So
                 * if it is set here then we've found a bug.
                 */
        } while (WARN_ON_ONCE(running));
}

Jarek P.
--

From: David Miller
Date: Monday, August 11, 2008 - 10:40 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Those comments are out of date and I need to update them.
In fact this whole loop is now largely pointless.

The rcu_dereference() on dev_queue->qdisc happens before the
QDISC_RUNNING bit is set.

We no longer resample the qdisc under any kind of lock.  Because we no
longer have a top-level lock that synchronizes the setting of
dev_queue->qdisc

Rather, the lock we use for calling ->enqueue() and ->dequeue() is
inside of the root qdisc itself.

That's why all of the real destruction has to occur in the RCU handler.

Anyways, this is part of the problem I think is causing the crash the
Intel folks are triggering.

We sample the qdisc in dev_queue_xmit() or wherever, then we attach
that to the per-cpu ->output_queue to process it via qdisc_run()
in the software interrupt handler.

The RCU quiesce period extends to the next scheduling point and this
is enough if we do normal direct softirq processing of this qdisc.

But if it gets postponed into ksoftirqd... the RCU will pass too
early.

I'm still thinking about how to fix this without avoiding RCU
and without adding new synchronization primitives.
--

From: Jarek Poplawski
Date: Tuesday, August 12, 2008 - 12:00 am

On Mon, Aug 11, 2008 at 10:40:47PM -0700, David Miller wrote:

Of course I've to miss something, but I still don't get it: after
synchronize_rcu() in dev_deactivate() we are sure anyone in
dev_queue_xmit() rcu block has to see the change to noop_qdisc(),
so it can only lose packets and not really enqueue(). IMHO the
only problem is this __netif_schedule(), which could be done with
dev_queues instead of Qdiscs with proper dereferencing there.
(BTW, I think we need rcu_read_lock() instead of the _bh() version in
dev_queue_xmit() to match this with rcu_call() or synchronize_rcu().)

Thanks,
Jarek P.
--

From: David Miller
Date: Tuesday, August 12, 2008 - 1:15 am

From: Jarek Poplawski <jarkao2@gmail.com>

The qdisc pointer traverses to the softirq handler, which can be run
in a process context (via ksoftirqd), and this pointer gets there

I didn't see it possible to keep scheduling the netdev_queues, as the
qdiscs can be shared with multiple queues.

Qdisc "are we running?" and other state pieces are now inside of the
Qdisc itself.  And all of the qdisc_run() and netif_schedule logic is,
as a result, Qdisc centric.

The synchronization object is the qdisc.  So we can't resample the
qdisc after scheduling it, because then the qdisc attached to the
netdev_queue can change and we'd be holding the root lock for
the wrong qdisc object.
--

From: Jarek Poplawski
Date: Tuesday, August 12, 2008 - 3:38 am

On Tue, Aug 12, 2008 at 01:15:10AM -0700, David Miller wrote:

If you mean net_tx_action() this looks like we would get a root lock
of a current qdisc, just like seen in dev_queue_xmit() at the moment,
so I'm still looking for a clue, what could be wrong with this...

Jarek P.
--

From: Herbert Xu
Date: Tuesday, August 12, 2008 - 9:30 pm

Here are two possible solutions:

1) The active way: smp_call_function and forcibly remove the qdiscs
in question from each output_queue.

2) The passive way: Make dev_deactive call yield() until no qdisc's
are on an output_queue.  This assumes there is some sort of dead
flag detection on the output_queue side so it doesn't keep going
forever.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Tuesday, August 12, 2008 - 10:11 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

Yes, we'll need some kind of dead flag it seems.

Another thing we can do is, in the yield loop, grabbing the
__QDISC_STATE_RUNNING bit.

But actually, I think Jarek has a point.

The existing loop there in dev_deactivate() should work _iff_ we make
it look at the proper qdisc.

This is another case where I didn't transform things correctly.  The
old code worked on dev->state since that's where we kept what used
to be __LINK_STATE_QDISC_RUNNING.

But now that state is kept in the qdisc itself.  But we just zapped
the active qdisc, so the old one is in ->qdisc_sleeping not ->qdisc.

So, just like one of Jarek's patches, we should simply change
dev_queue->qdisc into dev_queue->qdisc_sleeping and that should
take care of the bulk of the issues.

Shouldn't it?

Hmmm... maybe we have to sample __QDISC_STATE_SCHED too.


--

From: Herbert Xu
Date: Tuesday, August 12, 2008 - 10:31 pm

You need this too for the ones which aren't running but sitting
on output_queue.  Either you'll have to forcibly remove them as
I outlined earlier, or just wait for them to expire naturally.

The latter would seem more suitable since we're waiting anyway.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Wednesday, August 13, 2008 - 2:30 am

From: Herbert Xu <herbert@gondor.apana.org.au>

This is what I do in the patch I just posted.
--

From: Jarek Poplawski
Date: Tuesday, August 12, 2008 - 11:13 pm

If we don't change anything in __netif_schedule() I doubt it's enough.
And if this old way of waiting for "outstanding qdisc-less" calls was
really needed we should probably wait for both qdisc and qdisc_sleeping

We could probably even think of using this flag napi_disable() way.

Jarek P.
--

From: David Miller
Date: Tuesday, August 12, 2008 - 11:16 pm

From: Jarek Poplawski <jarkao2@gmail.com>


Here (in dev_deactivate), ->qdisc is going to now be &noop_qdisc or
similar.  Asynchronous contexts can run into that thing as much as
they want :-)
--

From: Jarek Poplawski
Date: Tuesday, August 12, 2008 - 11:53 pm

Thats why I still think a "common" RCU with rcu_dereference() (from
dev_queue pointer) in net_tx_action() should be enough: after
synchronize_rcu() in dev_deactivate() we are sure any qdisc_run(),
from dev_queue_xmit() or net_tx_action() can only see and lock
noop_qdisc. Any activities on qdisc_sleeping can't happen so no
need to wait for this. There could be some skbs enqueued just before
synchronize, and they could be ->reset() and ->destroy() just after,
even without rcu_call().

Otherwise, I think you would better send some code example with these
flags, so we could be sure there is no misunderstanding around this.

Jarek P.
--

From: Jarek Poplawski
Date: Wednesday, August 13, 2008 - 12:31 am

On Wed, Aug 13, 2008 at 06:53:02AM +0000, Jarek Poplawski wrote:

I hope nobody read this literally: I mean using dev_queue pointer

BTW, this all is easy to verify: simply adding debugging waiting loops
with checking the state of qdisc_sleeping after synchronize_rcu.

Jarek P. 
--

From: David Miller
Date: Wednesday, August 13, 2008 - 2:25 am

From: Jarek Poplawski <jarkao2@gmail.com>

Here it my concrete proposal for a fix.

pkt_sched: Fix queue quiescence testing in dev_deactivate().

Based upon discussions with Jarek P. and Herbert Xu.

First, we're testing the wrong qdisc.  We just reset the device
queue qdiscs to &noop_qdisc and checking it's state is completely
pointless here.

We want to wait until the previous qdisc that was sitting at
the ->qdisc pointer is not busy any more.  And that would be
->qdisc_sleeping.

Because of how we propagate the samples qdisc pointer down into
qdisc_run and friends via per-cpu ->output_queue and netif_schedule,
we have to wait also for the __QDISC_STATE_SCHED bit to clear as
well.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7cf83b3..4685746 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -647,7 +647,7 @@ static void dev_deactivate_queue(struct net_device *dev,
 	}
 }
 
-static bool some_qdisc_is_running(struct net_device *dev, int lock)
+static bool some_qdisc_is_busy(struct net_device *dev, int lock)
 {
 	unsigned int i;
 
@@ -658,13 +658,14 @@ static bool some_qdisc_is_running(struct net_device *dev, int lock)
 		int val;
 
 		dev_queue = netdev_get_tx_queue(dev, i);
-		q = dev_queue->qdisc;
+		q = dev_queue->qdisc_sleeping;
 		root_lock = qdisc_lock(q);
 
 		if (lock)
 			spin_lock_bh(root_lock);
 
-		val = test_bit(__QDISC_STATE_RUNNING, &q->state);
+		val = (test_bit(__QDISC_STATE_RUNNING, &q->state) ||
+		       test_bit(__QDISC_STATE_SCHED, &q->state));
 
 		if (lock)
 			spin_unlock_bh(root_lock);
@@ -689,14 +690,14 @@ void dev_deactivate(struct net_device *dev)
 
 	/* Wait for outstanding qdisc_run calls. */
 	do {
-		while (some_qdisc_is_running(dev, 0))
+		while (some_qdisc_is_busy(dev, 0))
 			yield();
 
 		/*
 		 * Double-check inside queue lock to ensure that all effects
 		 * of the queue run are visible when we return.
 		 ...
From: Herbert Xu
Date: Wednesday, August 13, 2008 - 2:58 am

Looks good to me.  Thanks!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Wednesday, August 13, 2008 - 3:27 am

Of course, checking this needs more time, but it looks like it could
work, only two little doubts:

- in net_tx_action() we can hit a place just after clear_bit() where
none of these bits is set. Of course, hitting this 2 times in a row
seems to be very unprobable, yet possible, and a lock isn't helpful
here, so probably some change around this would make this nicer.

- isn't there possible some longer ping-pong between qdic_run() and
net_tx_action() when dev_requeue_skb() would get it back to
__netif_schedule() and so on (with NETDEV_TX_BUSY)?

Otherwise, this patch looks OK to me.

--

From: Jarek Poplawski
Date: Wednesday, August 13, 2008 - 3:42 am

On Wed, Aug 13, 2008 at 10:27:01AM +0000, Jarek Poplawski wrote:

Hmm... I see, after qdisc_reset() it doesn't seem possible to happen.

Jarek P.
--

From: Herbert Xu
Date: Wednesday, August 13, 2008 - 3:42 am

Good point.  I think we should add an aliveness check in both
net_tx_action and qdisc_run.  In fact the net_tx_action problem
existed previously as well.  But it is pretty darn unlikely.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Wednesday, August 13, 2008 - 3:50 am

Yes, it seems qdisc_reset() doesn't have to help with this, so
probably there is needed some requeue counter or something...

Jarek P.
--

From: David Miller
Date: Wednesday, August 13, 2008 - 3:19 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Ok, so what I'm going to do is check in my patch and then try
to figure out how to resolve this "both bits clear" scenerio.

Thanks.
--

From: Jarek Poplawski
Date: Thursday, August 14, 2008 - 12:59 am

Here is my proposal.

Jarek P.

----------->

net: Change handling of the __QDISC_STATE_SCHED flag in net_tx_action().

Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to
enable proper control in dev_deactivate_queue(). Now, if this flag is
seen as unset under root_lock means a qdisc can't be netif_scheduled.


Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 net/core/dev.c |   34 +++++++++++++++++++---------------
 1 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..f67581b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1339,19 +1339,23 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 }
 
 
-void __netif_schedule(struct Qdisc *q)
+static inline void __netif_reschedule(struct Qdisc *q)
 {
-	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
-		struct softnet_data *sd;
-		unsigned long flags;
+	struct softnet_data *sd;
+	unsigned long flags;
 
-		local_irq_save(flags);
-		sd = &__get_cpu_var(softnet_data);
-		q->next_sched = sd->output_queue;
-		sd->output_queue = q;
-		raise_softirq_irqoff(NET_TX_SOFTIRQ);
-		local_irq_restore(flags);
-	}
+	local_irq_save(flags);
+	sd = &__get_cpu_var(softnet_data);
+	q->next_sched = sd->output_queue;
+	sd->output_queue = q;
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
+}
+
+void __netif_schedule(struct Qdisc *q)
+{
+	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
+		__netif_reschedule(q);
 }
 EXPORT_SYMBOL(__netif_schedule);
 
@@ -1974,15 +1978,15 @@ static void net_tx_action(struct softirq_action *h)
 
 			head = head->next_sched;
 
-			smp_mb__before_clear_bit();
-			clear_bit(__QDISC_STATE_SCHED, &q->state);
-
 			root_lock = qdisc_lock(q);
 			if (spin_trylock(root_lock)) {
+				smp_mb__before_clear_bit();
+				clear_bit(__QDISC_STATE_SCHED,
+					  &q->state);
 				qdisc_run(q);
 				spin_unlock(root_lock);
 			} else ...
From: Herbert Xu
Date: Thursday, August 14, 2008 - 1:16 am

Well this probably works in practice but at least on paper it
is vulnerable to live-lock if the net_tx_action side always gets
to the trylock stage and loses to the waiting side.

An aliveness flag would be the safest.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, August 14, 2008 - 1:31 am

I'm not sure of your point... This patch is only to fix my yesterday's
doubt #1, and it doesn't introduce, I hope, any new live-lock
vulnerabity. So, if you mean doubt #2, there is needed a separate
patch, but I'm not sure there is a need to add a flag. I've thougt
about a counter in a Qdisc for consecutive requeues with
netif_schedule, so we could break after some limit. Of course, your
idea could be simpler and better, but if I could only see some code...

Cheers,
Jarek P.
--

From: Herbert Xu
Date: Thursday, August 14, 2008 - 1:33 am

What I mean is the extremely unlikely scenario of net_tx_action
always failing on trylock because dev_deactivate has grabbed the
lock to check whether net_tx_action has completed.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, August 14, 2008 - 1:44 am

Of course! I got it myself after responding and re-reading, sorry. So,
this is yet another doubt, but I still wonder why you don't attach
any code... (I'm currently trying to re-think this.)

Jarek P.
--

From: Jarek Poplawski
Date: Thursday, August 14, 2008 - 1:52 am

On the other hand... such a flag would be probably for one thing only.
And if we would have a "netif_scheduled_without_xmitting" counter this
could probably make 2 in 1?

Jarek P.
--

From: David Miller
Date: Sunday, August 17, 2008 - 3:57 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Here is how I propose to plug this hole.  The issue is that there is
this potential gap where both bits are clear while there is a context
that can reschedule the qdisc.

And I'm pretty sure the most desirable fix is to get rid of that gap
if it can easily be done.

It seems like it can be.  What we do in the patch below is something
like this:

1) The one code path that can leave both bits clear then reschedule
   is changed to only leave that state visible while holding the
   root qdisc's lock.

   All other code paths will not reschedule once they clear the bits.

2) dev_deactivate() unconditionally takes the root qdisc spinlock
   and just waits for both bits to clear.  This is now entirely
   sufficient to ensure that no pending runs of the qdisc remain.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..896393d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1338,20 +1338,27 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 	rcu_read_unlock();
 }
 
+/* We own the __QDISC_STATE_SCHED state bit, and know that
+ * we are the only entity which can schedule this qdisc on
+ * the output queue.
+ */
+static void __qdisc_schedule(struct Qdisc *q)
+{
+	struct softnet_data *sd;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sd = &__get_cpu_var(softnet_data);
+	q->next_sched = sd->output_queue;
+	sd->output_queue = q;
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
+}
 
 void __netif_schedule(struct Qdisc *q)
 {
-	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
-		struct softnet_data *sd;
-		unsigned long flags;
-
-		local_irq_save(flags);
-		sd = &__get_cpu_var(softnet_data);
-		q->next_sched = sd->output_queue;
-		sd->output_queue = q;
-		raise_softirq_irqoff(NET_TX_SOFTIRQ);
-		local_irq_restore(flags);
-	}
+	if (!test_and_set_bit(__QDISC_STATE_SCHED, ...
From: David Miller
Date: Sunday, August 17, 2008 - 4:03 pm

From: David Miller <davem@davemloft.net>

My apologies Jarek, this is largely identical to a patch you posted
already.  Sorry :(

I thought about it some more, but there is still the dev_queue_xmit()
context that can have a reference to the old qdisc, be about to take
the qdisc lock and feed packets into it's ->enqueue().

That's why we need the RCU destruction still.  dev_deactivate() really
doesn't clear out all references to the qdisc fully.

It might be a better approach to work on making that simply not matter,
and just rely on RCU or similar to defer destruction and any real state
changes internally to the qdisc tree will simply use the root qdisc's
lock to block out enqueue/dequeue/requeue calls.

In that kind of scheme dev_deactivate() just sets the RCU pointer and
doesn't sit around waiting for anything.
--

From: Herbert Xu
Date: Sunday, August 17, 2008 - 6:25 pm

Well one of the advantages of it being synchronous is that the
driver may be relying on this so that it knows all transmissions
have ceased in dev_close.  If we stick with RCU then drivers would
have to implement their own synchronisation anyway.

So what's the issue with dev_queue_xmit? That should be taken care
of by something like rcu_barrier_bh, no?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Sunday, August 17, 2008 - 6:35 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

The code in dev_queue_xmit() used to resample the pointer.  It
relied upon the fact that we always used the same top-level
spinlock to set the qdisc.

But now that the lock is in the qdisc itself instead of the
netdevice or netdev_queue, that no longer works.

That's why I got rid of the "resample" code in these places
and tried to move everything into RCU.

I think I see another way out of this:

1) Add __QDISC_STATE_DEACTIVATE.

2) Set it right before dev_deactivate() swaps resets the qdisc
   pointer.

3) Test it in dev_queue_xmit() et al. once the qdisc root lock is
   acquired, and drop lock and resample ->qdisc if
   __QDISC_STATE_DEACTIVATE is set.
--

From: Herbert Xu
Date: Sunday, August 17, 2008 - 6:36 pm

Yep this sounds good to me.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Sunday, August 17, 2008 - 6:49 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

Here it is as a patch below.

The test being added to __netif_schedule() is just a reminder that
we have to address the "both bits clear" case somehow, likely with
Jarko's patch which I unintentionally reimplemented :)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a7abfda..757ab08 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -27,6 +27,7 @@ enum qdisc_state_t
 {
 	__QDISC_STATE_RUNNING,
 	__QDISC_STATE_SCHED,
+	__QDISC_STATE_DEACTIVATED,
 };
 
 struct qdisc_size_table {
diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..b88f669 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1341,6 +1341,9 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 
 void __netif_schedule(struct Qdisc *q)
 {
+	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
+		return;
+
 	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
 		struct softnet_data *sd;
 		unsigned long flags;
@@ -1790,6 +1793,8 @@ gso:
 	rcu_read_lock_bh();
 
 	txq = dev_pick_tx(dev, skb);
+
+resample_qdisc:
 	q = rcu_dereference(txq->qdisc);
 
 #ifdef CONFIG_NET_CLS_ACT
@@ -1800,6 +1805,11 @@ gso:
 
 		spin_lock(root_lock);
 
+		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+			spin_unlock(root_lock);
+			goto resample_qdisc;
+		}
+
 		rc = qdisc_enqueue_root(skb, q);
 		qdisc_run(q);
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 4685746..ff1c455 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -597,6 +597,9 @@ static void transition_one_qdisc(struct net_device *dev,
 	struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
 	int *need_watchdog_p = _need_watchdog;
 
+	if (!(new_qdisc->flags & TCQ_F_BUILTIN))
+		clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
+
 	rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
 	if (need_watchdog_p && new_qdisc != &noqueue_qdisc)
 ...
From: Herbert Xu
Date: Sunday, August 17, 2008 - 9:27 pm

We could just drop the packet.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Sunday, August 17, 2008 - 9:31 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

True, that would be simpler and have the same effect.

I just noticed that ingress qdisc needs similar code.

Ok, here is what I'll do as a rough plan:

1) Integrate this patch with your drop suggestion and ingress
   fixed up.

2) Integrate Jarek's patch that closes the "both bits clear"
   hole.

3) Add my bit that makes dev_deactivate() always take the spinlock
   so there is just one simple yield() loop.

And, finally, we can be at the point where we can get rid of the
RCU deferral of qdisc_destroy().

Should be fun evening :-)
--

From: Herbert Xu
Date: Sunday, August 17, 2008 - 9:36 pm

Worthy of a gold medal :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Sunday, August 17, 2008 - 10:13 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

All of this is now pushed to net-2.6 on kernel.org
--

From: Denys Fedoryshchenko
Date: Sunday, August 17, 2008 - 11:08 pm

So i can build & test net-2.6 from git? :-)
--

From: David Miller
Date: Sunday, August 17, 2008 - 11:13 pm

From: Denys Fedoryshchenko <denys@visp.net.lb>

Should be able to, yes :-)
--

From: Jarek Poplawski
Date: Sunday, August 17, 2008 - 11:27 pm

You shouldn't bother with this at all. I'm really pleased if I
sometimes think similarly to you, and this wasn't the most complex
idea to found, btw.

But, there is probably something other to bother here. I didn't get
the final version of this patch nor I can see this on the list, but
in your git there is this change to "goto out_kfree_skb", which
seems to skip rcu_read_unlock_bh().

Otherwise, hmm.. I'm half asleep yet, and only after 1 coffee, so
maybe I'll change my mind soon, but now I've some doubts about these
last changes.

--

From: David Miller
Date: Sunday, August 17, 2008 - 11:38 pm

From: Jarek Poplawski <jarkao2@gmail.com>

That's a bug I added when I implemented Herber't suggestion
to just drop the packet.  Good spotting.

I've just pushed the following fix, thanks!

pkt_sched: Fix missed RCU unlock in dev_queue_xmit()

Noticed by Jarek Poplawski.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/dev.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 819f017..8d13380 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1805,14 +1805,12 @@ gso:
 		spin_lock(root_lock);
 
 		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
-			spin_unlock(root_lock);
+			kfree_skb(skb);
 			rc = NET_XMIT_DROP;
-			goto out_kfree_skb;
+		} else {
+			rc = qdisc_enqueue_root(skb, q);
+			qdisc_run(q);
 		}
-
-		rc = qdisc_enqueue_root(skb, q);
-		qdisc_run(q);
-
 		spin_unlock(root_lock);
 
 		goto out;
-- 
1.5.6.5.GIT

--

From: Jarek Poplawski
Date: Monday, August 18, 2008 - 2:29 pm

Why I can't see this code in net-2.6? BTW, I guess it should be now

OK, we now have this kfree_skb() with NET_XMIT_DROP here, but how is it
better than qdisc_enque_root() on noop_qdisc? Or how can we have here
anything else under both rcu lock and spin_lock() while this
__QDISC_STATE_DEACTIVATED bit is set?

...
--

From: David Miller
Date: Monday, August 18, 2008 - 4:47 pm

From: Jarek Poplawski <jarkao2@gmail.com>

I deleted it, it's unnecessary with your "both bits clear" fix

Both operations are equivalent, the choice is arbitrary in my
opinion.  It's not like the user can even see this in noop_qdisc's
stats or something like that.

--

From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 3:31 am

Herbert was concerned earlier with this:
"What I mean is the extremely unlikely scenario of net_tx_action
always failing on trylock because dev_deactivate has grabbed the
lock to check whether net_tx_action has completed."

So, I guess this could help here.

Jarek P.
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 3:51 am

Right.  Even though the live-lock is an extremely unlikely event,
as the aliveness flag check isn't on the fast path anyway I think
we should keep it.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Tuesday, August 19, 2008 - 3:54 am

From: Herbert Xu <herbert@gondor.apana.org.au>

Every qdisc_run() will invoke __netif_schedule() so it is
a fast path I think :-)

But anyways, all of these paths didle with these state bits
anyways, so it's in the cache and a cheap test.

I can add it back once we're all sure we're talking about
the same thing.  So you're saying that we should add the
__QDISC_STATE_DEACTIVATED test to __netif_schedule(), right?

I was confused because you say "we should keep it", it was never
in the net-2.6 tree and only existed in my RFC patch posting, so
I'm trying to figure out what you meant. :-)

--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 3:55 am

Argh, I meant __netif_reschedule which shouldn't be the fast path.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 3:58 am

Nevermind, both paths call __netif_reschedule :)

OK, how about just moving it to the else clause in net_tx_action?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Tuesday, August 19, 2008 - 4:02 am

From: Herbert Xu <herbert@gondor.apana.org.au>

I just checked the following into net-2.6

pkt_sched: Prevent livelock in TX queue running.

If dev_deactivate() is trying to quiesce the queue, it
is theoretically possible for another cpu to livelock
trying to process that queue.  This happens because
dev_deactivate() grabs the queue spinlock as it checks
the queue state, whereas net_tx_action() does a trylock
and reschedules the qdisc if it hits the lock.

This breaks the livelock by adding a check on
__QDISC_STATE_DEACTIVATED to net_tx_action() when
the trylock fails.

Based upon feedback from Herbert Xu and Jarek Poplawski.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/dev.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8d13380..60c51f7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1990,7 +1990,9 @@ static void net_tx_action(struct softirq_action *h)
 				qdisc_run(q);
 				spin_unlock(root_lock);
 			} else {
-				__netif_reschedule(q);
+				if (!test_bit(__QDISC_STATE_DEACTIVATED,
+					      &q->state))
+					__netif_reschedule(q);
 			}
 		}
 	}
-- 
1.5.6.5.GIT

--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 4:11 am

Thanks Dave!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 9:48 am

I can miss something, but probably it's needed in __netif_reschedule()
yet. Here is a scenario:

cpu1				cpu2
dev_deactivate()
dev_deactivate_queue()
qdisc_reset()
				qdisc_run()
				qdisc_watchdog_schedule()
				(or hrtimer_restart in cbq)
while (some_qdisc_is_busy())
return (qdisc not busy)
				hrtimer triggered
				__netif_schedule()
qdisc_destroy()			qdisc_run()
		
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 3:23 pm

No this is a genuine bug.  However, it's also an old bug :) Even

We should add an aliveness check before scheduling the timer.  This
is the slow path so adding a check shouldn't hurt.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Wednesday, August 20, 2008 - 4:56 am

Thanks,
Jarek P.

--------------->

pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race

dev_deactivate() can skip rescheduling of a qdisc by qdisc_watchdog()
or other timer calling netif_schedule() after dev_queue_deactivate().
We prevent this checking aliveness before scheduling the timer.

With feedback from Herbert Xu <herbert@gondor.apana.org.au>

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 net/sched/sch_api.c |    3 +++
 net/sched/sch_cbq.c |    3 +++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ef0efec..6f2bc7f 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -444,6 +444,9 @@ void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires)
 {
 	ktime_t time;
 
+	if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc_root(wd->qdisc)->state))
+		return;
+
 	wd->qdisc->flags |= TCQ_F_THROTTLED;
 	time = ktime_set(0, 0);
 	time = ktime_add_ns(time, PSCHED_US2NS(expires));
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 47ef492..c04d335 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -521,6 +521,9 @@ static void cbq_ovl_delay(struct cbq_class *cl)
 	struct cbq_sched_data *q = qdisc_priv(cl->qdisc);
 	psched_tdiff_t delay = cl->undertime - q->now;
 
+	if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc_root(cl->qdisc)->state))
+		return;
+
 	if (!cl->delayed) {
 		psched_time_t sched = q->now;
 		ktime_t expires;
--

From: Herbert Xu
Date: Wednesday, August 20, 2008 - 5:16 am

Looks good to me.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Wednesday, August 20, 2008 - 10:17 pm

On Wed, Aug 20, 2008 at 11:56:56AM +0000, Jarek Poplawski wrote:

Actually, this patch is "no good" (wrong root qdisc).

David, please don't apply - I'll send it redone.

Thanks,
--

From: Jarek Poplawski
Date: Wednesday, August 20, 2008 - 10:49 pm

---------------> (take 2)

pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race

dev_deactivate() can skip rescheduling of a qdisc by qdisc_watchdog()
or other timer calling netif_schedule() after dev_queue_deactivate().
We prevent this checking aliveness before scheduling the timer. Since
during deactivation the root qdisc is available only as qdisc_sleeping
additional accessor qdisc_root_sleeping() is created.

With feedback from Herbert Xu <herbert@gondor.apana.org.au>

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 include/net/sch_generic.h |    5 +++++
 net/sched/sch_api.c       |    4 ++++
 net/sched/sch_cbq.c       |    4 ++++
 3 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 84d25f2..b1d2cfe 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -193,6 +193,11 @@ static inline struct Qdisc *qdisc_root(struct Qdisc *qdisc)
 	return qdisc->dev_queue->qdisc;
 }
 
+static inline struct Qdisc *qdisc_root_sleeping(struct Qdisc *qdisc)
+{
+	return qdisc->dev_queue->qdisc_sleeping;
+}
+
 /* The qdisc root lock is a mechanism by which to top level
  * of a qdisc tree can be locked from any qdisc node in the
  * forest.  This allows changing the configuration of some
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ef0efec..45f442d 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -444,6 +444,10 @@ void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires)
 {
 	ktime_t time;
 
+	if (test_bit(__QDISC_STATE_DEACTIVATED,
+		     &qdisc_root_sleeping(wd->qdisc)->state))
+		return;
+
 	wd->qdisc->flags |= TCQ_F_THROTTLED;
 	time = ktime_set(0, 0);
 	time = ktime_add_ns(time, PSCHED_US2NS(expires));
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 47ef492..8fa90d6 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -521,6 +521,10 @@ static void cbq_ovl_delay(struct cbq_class *cl)
 	struct ...
From: Herbert Xu
Date: Wednesday, August 20, 2008 - 11:10 pm

When would we actually want the non-sleeping variant?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Wednesday, August 20, 2008 - 11:49 pm

On Thu, Aug 21, 2008 at 04:10:24PM +1000, Herbert Xu wrote:

We need to check if something depends on &noop_qdisc returned in the
similar state. Otherwise, there is a bit too much possibilities here,
so it would be nice to simplify this all.

Thanks,
Jarek P.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 12:16 am

Actually, why do we even keep a netdev_queue pointer in a qdisc?
A given qdisc can be used by multiple queues (which is why the
lock was moved into the qdisc in the first place).

How about keeping a pointer directly to the root qdisc plus a
pointer to the netdev (which seems to be the only other use for
qdisc->dev_queue)? That way there won't be any confusion as to
whether we want the sleeping or non-sleeping qdisc.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Thursday, August 21, 2008 - 12:52 am

From: Herbert Xu <herbert@gondor.apana.org.au>

Not a bad idea at all.

The reason it's there is a left-over from earlier designs of my
multiqueue stuff, I thought we'd always multiplex the qdiscs to be
per-queue.  But once Patrick showed me we couldn't do that, we now
have shared qdiscs.

If I get a chance I'll work on this.  But to be honest given
Linus's temperment right now we're only going to be able to
merge one-liners to fix these kinds of problems and anything
more serious is going to be for net-next-2.6 only.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 1:00 am

Oh yeah we don't need this right away.  Jarek's patch should be
quite safe for the time being.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 1:27 am

Probably it's too early in the morning, so I would prefer to see some
code... It seems exchanging the qdisc to &noop_qdisc needs such a
direct root dev_queue access, e.g. we want to get in dev_queue_xmit()
&noop_qdisc as fast as possible during deactivation. On the other hand
we need to check for __QDISC_STATE_DEACTIVATED to know the state of
a "real" qdisc.

BTW, it looks like currently we do this wrong: if it's really

I think definitely some changes are needed here for the future,
but of course no need to experiment with our stable -rc now.

Jarek P.
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 1:35 am

On Thu, Aug 21, 2008 at 08:27:21AM +0000, Jarek Poplawski wrote:


...and it's OK, yet.

Sorry,
Jarek P.
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 1:47 am

Anyway, it's tricky: we have 2 deactivation cases to care for:

1) a "real" (sleeping) qdisc and __QDISC_STATE_DEACTIVATED
2) a &noop_qdisc and not __QDISC_STATE_DEACTIVATED

Jarek P.
--

From: David Miller
Date: Thursday, September 11, 2008 - 3:39 am

From: David Miller <davem@davemloft.net>

I got to looking into this and we do need the qdisc->dev_queue member,
see qdisc_run().  So it's not like we can get rid of it if we replace
it with ->netdevdev and add a ->root_qdisc backpointer as well.

--

From: Herbert Xu
Date: Thursday, September 11, 2008 - 3:45 am

That can't be right.  Let's I've got a single qdisc shared by
n queues.  It makes no sense for qdisc_run to decide to whether
it should process the qdisc depending on the status of a single
one out of the n queues.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Thursday, September 11, 2008 - 3:49 am

From: Herbert Xu <herbert@gondor.apana.org.au>

Well some kind of check has to be there.

I _did_ remove it during my initial implementation, and that
turned into a reported performance regression.

See:

commit 83f36f3f35f4f83fa346bfff58a5deabc78370e5
Author: David S. Miller <davem@davemloft.net>
Date:   Wed Aug 13 02:13:34 2008 -0700

    pkt_sched: Add queue stopped test back to qdisc_run().
    
    Based upon a bug report by Andrew Gallatin on netdev
    with subject "CPU utilization increased in 2.6.27rc"
    
    In commit 37437bb2e1ae8af470dfcd5b4ff454110894ccaf
    ("pkt_sched: Schedule qdiscs instead of netdev_queue.")
    the test of the queue being stopped was erroneously
    removed from qdisc_run().
    
    When the TX queue of the device fills up, this omission
    causes lots of extraneous useless work to be queued up
    to softirq context, where we'll just return immediately
    because the device is still stuffed up.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
--

From: Herbert Xu
Date: Thursday, September 11, 2008 - 4:00 am

I see.  How about looking at the queue that the head-of-qdisc
packet maps to? That should be fairly cheap to compute.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Thursday, September 11, 2008 - 4:42 am

From: Herbert Xu <herbert@gondor.apana.org.au>

This gets us back to the whole qdisc->ops->peek() discussion :)

And we don't have the qdisc lock here, taking it is undesirable,
and if we do take it we have to transfer that lock down into
__qdisc_run() which means adjusting all the other __qdisc_run()
callers.

It's very clumsy at best.

I therefore don't think it's wise peeking into the qdisc here.

But I do realize we have to do something about this, hmmm...
--

From: Herbert Xu
Date: Thursday, September 11, 2008 - 4:45 am

Last I checked qdisc_run did run under the root lock...

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Thursday, September 11, 2008 - 4:47 am

From: Herbert Xu <herbert@gondor.apana.org.au>

That certainly changes things :)

Ok, so implementing ->peek() is the first step in dealing
with this.
--

From: David Miller
Date: Thursday, September 11, 2008 - 9:49 pm

From: David Miller <davem@davemloft.net>

Ok, here's a first cut at this.

Most of it is simple and straightforward.

As usual, though, CBQ, HFSC, and HTB are complicated.

Most of the peek implementations just skb_peek() in their
downstream queue or iterate over their prio array doing
the same looking for a non-empty list.

But CBQ, HFSC, and HTB have complicated class iterators and
internal time state machine things.  I tried to do my best
in these cases.

I didn't want these things firing off class watchdog timers and
stuff like this.  Just see if there is any packet ready now
and return it.

The one exception is that I allow CBQ to advance it's time
state machine.

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e556962..c4eb6e5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -41,6 +41,7 @@ struct Qdisc
 {
 	int 			(*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
 	struct sk_buff *	(*dequeue)(struct Qdisc *dev);
+	struct sk_buff *	(*peek)(struct Qdisc *dev);
 	unsigned		flags;
 #define TCQ_F_BUILTIN	1
 #define TCQ_F_THROTTLED	2
@@ -110,6 +111,7 @@ struct Qdisc_ops
 
 	int 			(*enqueue)(struct sk_buff *, struct Qdisc *);
 	struct sk_buff *	(*dequeue)(struct Qdisc *);
+	struct sk_buff *	(*peek)(struct Qdisc *);
 	int 			(*requeue)(struct sk_buff *, struct Qdisc *);
 	unsigned int		(*drop)(struct Qdisc *);
 
@@ -431,6 +433,28 @@ static inline struct sk_buff *qdisc_dequeue_tail(struct Qdisc *sch)
 	return __qdisc_dequeue_tail(sch, &sch->q);
 }
 
+static inline struct sk_buff *__qdisc_peek_head(struct Qdisc *sch,
+						struct sk_buff_head *list)
+{
+	return skb_peek(list);
+}
+
+static inline struct sk_buff *qdisc_peek_head(struct Qdisc *sch)
+{
+	return __qdisc_peek_head(sch, &sch->q);
+}
+
+static inline struct sk_buff *__qdisc_peek_tail(struct Qdisc *sch,
+						struct sk_buff_head *list)
+{
+	return skb_peek_tail(list);
+}
+
+static inline struct sk_buff *qdisc_peek_tail(struct Qdisc ...
From: Jarek Poplawski
Date: Friday, September 12, 2008 - 1:02 am

Alas I'm still not sure how this whole peek idea is going to be
implemented (dequeuing after peeking doesn't have to give us the
same skb since in the meantime e.g. in HTB some other class with
higher prio can get enough tokens etc., and if there is a break
for transmit in the meantime with possible enqueuing, or we can
deal with something like sch_multiq, which depends on the current
state of the tx_queues, this all looks even more interesting).

So, until there is some example, even in pseudocode, for qdisc_run()
vs. some_classful_sched interaction, I think, I'm not able to give
more feedback now, but mabe only one doubt if this wrapper below is
really needed, since skb_peek() is more readable and it's used

Thanks,
Jarek P.
--

From: David Miller
Date: Friday, September 12, 2008 - 4:10 pm

From: Jarek Poplawski <jarkao2@gmail.com>

That's a good point.

Well, once that is discussed and resolved, at least the patch I
posted can be used as a base for implementation :-)
--

From: Herbert Xu
Date: Friday, September 12, 2008 - 6:10 pm

Well we should remember the skb returned by peek, and then return
it on the next dequeue.  If peek hasn't been called since the
last dequeue then it's equivalent to the current dequeue.

This works for the two scenarios where we're planning to use
peek since they'll both call dequeue immediately.

It'd even work if there was a large gap (i.e., a sleep) after
the peek since we'd just peek again after waking up.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Friday, September 12, 2008 - 6:22 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

This requires state, for the unlink.  And that unlink point could be
several levels deep into the tree.

There is also currently no restriction on how a packet scheduler
maintains it's queue of SKBs.

So, if the idea is to keep track of the skb_queue_head pointer to
unlink from at the root, I don't think that's a good idea.

And then there are all of these complicated pieces of state the
classful qdiscs modify when a SKB is removed from their visibility.
So, the unlink isn't just a simple list delete operation.  There
are queue lengths that need to be updated, watchdog timers to
manage, etc.
--

From: Herbert Xu
Date: Friday, September 12, 2008 - 6:27 pm

No non-leaf qdiscs would remember remember the child qdisc where

The decision really comes down to whether it's harder to requeue
a seemingly arbitrary packet or whether it's harder to dequeue the
packet that was last peeked.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Friday, September 12, 2008 - 6:40 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

My current opinion is that both operations are equally difficult.
With the slight advantage for ->requeue() because all the complicated
logic is already implemented :-)
--

From: Herbert Xu
Date: Friday, September 12, 2008 - 6:48 pm

I'd agree with you if you haven't written the peek stuff :)

Now that peek exists, the dequeue stuff would be a lot simpler
than requeue because the only non-trivial logic would be in the
leaf qdiscs.  All the complex/classful qdiscs would be trivial
as they'd just write down the child qdisc that was peeked and
then call dequeue on that child.

Compare that to requeue where the classful qdiscs have to do
loads of work to figure out which child the packet should be
sent to.

In fact it looks like CBQ has taken the easy way out by remembering
the class the last packet was dequeued from so it's essentially
doing what I'm proposing here :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Saturday, September 13, 2008 - 1:54 pm

If I get it right peek + dequeue should do all current dequeue logic
plus additionally write down the child qdisc or skb (leaves) info,
plus, probably, some ifs btw., which looks like a bit of overhead,
if we consider requeuing as something exceptional. Unless we don't -
then of course something like this could be useful.

Jarek P.
--

From: Herbert Xu
Date: Saturday, September 13, 2008 - 11:16 pm

I don't see the overhead in writing down something that we alrady
have.  In any case, do you have an alternative solution to the
current problem that qdisc_run looks at an arbitrary queue's
status to decide whether it should process a qdisc that empties
into n queues?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Alexander Duyck
Date: Sunday, September 14, 2008 - 3:31 am

On Sat, Sep 13, 2008 at 11:16 PM, Herbert Xu <herbert@gondor.apana.org.au>

What if we were to push the check for netif_tx_queue_stopped further down into
the qdisc layer so that the basic qdiscs such as pfifo_fast did their own peek
and in the event that a queue is stopped it just returned NULL and set a flag
bit?  Basically this would mimic how we are currently handling throttled
queues (TCQ_F_THROTTLED).  Then in turn each parent could do a check on skb ==
NULL and set the same flag, or they could act like multiq and just skip over
that leaf and move onto the next because it is stopped.

I might try putting together a patch for this on Monday but in the meantime
here are a couple code snippets to demonstrate what I am thinking.  It seems
like this would  provide most of what you are looking for because the first
thing that happens in qdisc_restart() is a packet is dequeued and if that
fails the routine exits.

Thanks,

Alex

static inline struct sk_buff *__qdisc_dequeue_head(struct Qdisc *sch,
						   struct sk_buff_head *list)
{
	struct sk_buff *skb = skb_peek(list);
	struct netdev_queue *txq;

	if (skb == NULL)
		return NULL;
		
	txq = netdev_get_tx_queue(sch->dev, skb_get_queue_mapping(skb));
	if (netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(txq)) {
		sch->flags |= TCQ_F_STOPPED;
		return NULL;
	}

	__skb_unlink(skb, list);
	sch->qstats.backlog -= qdisc_pkt_len(skb);
	sch->flags &= ~TCQ_F_STOPPED;

	return skb;
}

static struct sk_buff *prio_dequeue(struct Qdisc* sch)
{
	struct prio_sched_data *q = qdisc_priv(sch);
	int prio;

	for (prio = 0; prio < q->bands; prio++) {
		struct Qdisc *qdisc = q->queues[prio];
		struct sk_buff *skb = qdisc->dequeue(qdisc);
		if (skb) {
			sch->q.qlen--;
			sch->flags &= ~TCQ_F_STOPPED;
			return skb;
		}
		if (qdisc->flags & TCQ_F_STOPPED) {
			sch->flags |= TCQ_F_STOPPED;
			return NULL;
		}
	}
	sch->flags &= ~TCQ_F_STOPPED;
	return NULL;

}
--

From: Jarek Poplawski
Date: Sunday, September 14, 2008 - 2:43 pm

On Sun, Sep 14, 2008 at 03:31:35AM -0700, Alexander Duyck wrote:

IMHO it's a very interesting idea. Probably it could be better
evaluated if we have any stats how much this reason of requeing is a
problem with mq devices.

On the other hand, I wondered about a possibility of rehashing to other
queues in some cases during requeuing, which would be impossible after
such change.

Thanks,
Jarek P.
--

From: Herbert Xu
Date: Sunday, September 14, 2008 - 3:13 pm

Why would you want to do that? Just because people have abused
requeue in the psat doesn't mean that we need to support such
abuses for perpetuity.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Sunday, September 14, 2008 - 11:07 pm

Well, it was only wondering, and probably you are right this is wrong.
On the other hand, simple_tx_hash() choices are "probabilistic": user
doesn't care if it goes through tx_queue #1 or #11. And here, in some
cases, some tx_queues could be always full while other always empty,
so some dynamic rehashing could be thought of, but I understand it's
not trivial.
 
Cheers,
Jarek P.
--

From: Herbert Xu
Date: Sunday, September 14, 2008 - 11:19 pm

No that would be totally wrong.  One of the important constraints
on a TX hashing mechanism is to preserve packet ordering within
a flow.  If simple_tx_hash started placing the same packet in
different queues then it would do the same thing to flows which
is unacceptable.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Monday, September 15, 2008 - 12:20 am

Of course preserving a flow consistency is must-be here, but I think
there are rehashing algorithms used in similar cases (sch_sfq) which
take care for this. As a matter of fact, I've thought of requeuing as
a best place to detect possible problems, but now I see that
Alexander's proposal let's to do this simply by observing this
TCQ_F_STOPPED flag, so I withdraw my objection.

Cheers,
Jarek P.
--

From: Jarek Poplawski
Date: Monday, September 15, 2008 - 12:45 am

On Mon, Sep 15, 2008 at 07:20:08AM +0000, Jarek Poplawski wrote:

Hmm... or maybe it doesn't? Since this is qdisc flag we don't know at
the top which tx_queue is a problem at the bottom...

Jarek P.
--

From: Duyck, Alexander H
Date: Monday, September 15, 2008 - 4:44 pm

The problem is with the nature of a qdisc.  For example let's say we have a
prio qdisc with a packet in lowest priority that fails to be transmitted due
to a stopped subqueue.  If we add an skb for a non-stopped queue to a higher
prio then the qdisc should be no longer stopped since we can dequeue from the
ring and transmit.  Thus, keeping a memory of which queue is stopped may not be
useful in a situation such as this.

The only thing I really prefer about my solution as opposed to the solution
Dave implemented was that it would mean only one dequeue instead of a peek
followed by a dequeue.  I figure the important thing is to push the
discovery of us being stopped to as soon as possible in the process.

It will probably be a few days before I have a patch with my approach ready.
I didn't realize how complex it would be to resolve this issue for CBQ, HTB,
HFSC, etc.  Also it is starting to look like I will probably need to implement
another function to support this since it seems like the dequeue operations
would need to be split into a multiqueue safe version, and a standard version
to support some workarounds like those found in qdisc_peek_len() for HFSC.

Thanks,

Alex

--

From: Jarek Poplawski
Date: Tuesday, September 16, 2008 - 3:47 am

On Mon, Sep 15, 2008 at 04:44:08PM -0700, Duyck, Alexander H wrote:

Actually, looking at this HFSC now I start to doubt we need to
complicate these things so much. If HFSC is OK with its simple
hfsc_requeue() I doubt other qdiscs need much more, and we should
reconsider David's idea to do the same on top, in dev_requeue_skb().
Qdiscs like multiq would probably never use this, and these above
mentioned (not mq-optimized) qdiscs could be used with multiq if
needed. Then, it seems, it would be enough to improve multiq as a
"leaf" adding these dedicated operations and/or flags.

Thanks,
Jarek P.
--

From: Alexander Duyck
Date: Tuesday, September 16, 2008 - 7:31 pm

I am just not convinced that the requeue approach will work very well.  I am
just starting to test my patch today and the cpu savings were pretty significant
against the current configuration when using just the standard prio qdisc on a
multiqueue device.

I setup a simple test running a neterf UDP_STREAM test from my test system to
one of my clients sending 1460 byte UDP messages at line rate on an 82575
with 4 tx queues enabled.  The current dequeue/requeue approach used 2.5% cpu
whenever the test was run through queue 0, but if I ended up with packets going
out one of the other queues the cpu utilization would jump to ~12.5%.  The same
test done using my patch showed ~2.5% for every queue I tested.

I will hopefully have the patch ready to submit for comments tomorrow.  I just
need to run a few tests with the patch on versus the patch off to
verify that I didn't
break any of the qdiscs and that there isn't any negative performance impact.

Thanks,

Alex
--

From: jamal
Date: Sunday, September 14, 2008 - 4:56 am

What about something that would be cheaper than a peek - just check some
driver per-hwq variable? variable is set at netif_stop and unset at
netif_wake and should tell you if the driver infact can send if you gave
it a packet (way before you dequeue from qdisc, assuming you know which
hardware queue it is going to).

cheers,
jamal

--

From: Jarek Poplawski
Date: Sunday, September 14, 2008 - 1:27 pm

If it's only for this initial check I still think my earlier proposal
should be enough:
http://marc.info/?l=linux-netdev&m=122113717013988&w=2

Anyway, the main problem here was a high cpu load despite stopped
queue. Are you sure this peek, which is almost full dequeue, can
really help for this? BTW, since after current fix there were no
later complains I guess it's just about full netif_stop or non-mq
device.

Cheers,
Jarek P.
--

From: David Miller
Date: Saturday, September 20, 2008 - 12:21 am

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 14 Sep 2008 22:27:15 +0200


I think we are overengineering this situation.

Let's look at what actually matters for cpu utilization.  These
__qdisc_run() things are invoked in two situations where we might
block on the hw queue being stopped:

1) When feeding packets into the qdisc in dev_queue_xmit().

   Guess what?  We _know_ the queue this packet is going to
   hit.

   The only new thing we can possible trigger and be interested
   in at this specific point is if _this_ packet can be sent at
   this time.

   And we can check that queue mapping after the qdisc_enqueue_root()
   call, so that multiq aware qdiscs can have made their changes.

2) When waking up a queue.  And here we should schedule the qdisc_run
   _unconditionally_.

   If the queue was full, it is extremely likely that new packets
   are bound for that device queue.  There is no real savings to
   be had by doing this peek/requeue/dequeue stuff.

The cpu utilization savings exist for case #1 only, and we can
implement the bypass logic _perfectly_ as described above.

For #2 there is nothing to check, just do it and see what comes
out of the qdisc.

I would suggest adding an skb pointer argument to qdisc_run().
If it's NULL, unconditionally schedule __qdisc_run().  Else,
only schedule if the TX queue indicated by skb_queue_mapping()
is not stopped.

dev_queue_xmit() will use the "pass the skb" case, but only if
qdisc_enqueue_root()'s return value doesn't indicate that there
is a potential drop.  On potential drop, we'll pass NULL to
make sure we don't potentially reference a free'd SKB.

The other case in net_tx_action() can always pass NULL to qdisc_run().
--

From: Herbert Xu
Date: Saturday, September 20, 2008 - 12:25 am

Your analysis sounds perfect to me.  And I'm sure happy to see
this thread die as it's starting to take up a significant amount
of space in my mailbox :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Saturday, September 20, 2008 - 12:28 am

From: Herbert Xu <herbert@gondor.apana.org.au>

It's not really dead until someone writes a patch :-)
--

From: Jarek Poplawski
Date: Saturday, September 20, 2008 - 4:48 pm

On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:

Right, unless __netif_schedule() wasn't done when waking up. I've
thought about this because of another thread/patch around this
problem, and got misled by dev_requeue_skb() scheduling. Now, I think
this could be the main reason for this high load. Anyway, if we want
to skip this check for #2 I think something like the patch below is

I'm not convinced this #1 is useful for us: this could be an skb #1000
in a queue; the tx status could change many times before this packet
would be #1; why worry? This adds additional checks on the fast path
for something which is unlikely even if this skb would be #1, but for
any later skbs it's only a guess. IMHO, if we can't check for the next
skb to be xmitted it's better to skip this test entirely (which seems
to be safe with the patch below).

Jarek P.

--------------->
pkt_sched: dev_requeue_skb: Don't schedule if a queue is stopped

Doing __netif_schedule() while requeuing because of a stopped tx queue
and skipping such a test in qdisc_run() can cause a requeuing loop with
high cpu use until the queue is awaken.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/sched/sch_generic.c |   23 +++++++++++++++--------
 1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..bae2eb8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -42,14 +42,17 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
-static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q,
+				  bool stopped)
 {
 	if (unlikely(skb->next))
 		q->gso_skb = skb;
 	else
 		q->ops->requeue(skb, q);
 
-	__netif_schedule(q);
+	if (!stopped)
+		__netif_schedule(q);
+
 	return 0;
 }
 
@@ -89,7 +92,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 		 * some time.
 		 */
 ...
From: David Miller
Date: Saturday, September 20, 2008 - 10:35 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Hmmm, looking at your patch....

It's only doing something new when the driver returns NETDEV_TX_BUSY
from ->hard_start_xmit().

That _never_ happens in any sane driver.  That case is for buggy
devices that do not maintain their TX queue state properly.  And
in fact it's a case for which I advocate we just drop the packet
instead of requeueing.  :-)

Oh I see, you're concerned about that cases where qdisc_restart() ends
up using the default initialization of the 'ret' variable.

Really, for the case where the driver actually returns NETDEV_TX_BUSY
we _do_ want to unconditionally __netif_schedule(), since the device
doesn't maintain it's queue state in the normal way.

Therefore it seems logical that what really needs to happen is that
we simply pick some new local special token value for 'ret' so that
we can handle that case.  "-1" would probably work fine.

So I'm dropping your patch.

I also think the qdisc_run() test needs to be there.  When the TX
queue fills up, we will doing tons of completely useless work going:

1) ->dequeue
2) qdisc unlock
3) TXQ lock
4) test state
5) TXQ unlock
6) qdisc lock
7) ->requeue

for EVERY SINGLE packet that is generated towards that device.

That has to be expensive, and I am still very much convinced that
this was the original regression cause that made me put that TXQ
state test back into qdisc_run().
--

From: David Miller
Date: Saturday, September 20, 2008 - 10:50 pm

From: David Miller <davem@davemloft.net>

Ok, here is the kind of thing I'm suggesting in all of this.

It gets rid of bouncing unnecessarily into __qdisc_run() when
dev_queue_xmit()'s finally selected TXQ is stopped.

It also gets rid of the dev_requeue_skb() looping case Jarek
discovered.

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b786a5b..4082f39 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	struct netdev_queue *txq = q->dev_queue;
-
-	if (!netif_tx_queue_stopped(txq) &&
-	    !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+	if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
 		__qdisc_run(q);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fdfc4b6..4654127 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1809,7 +1809,15 @@ gso:
 			rc = NET_XMIT_DROP;
 		} else {
 			rc = qdisc_enqueue_root(skb, q);
-			qdisc_run(q);
+
+			txq = NULL;
+			if (rc == NET_XMIT_SUCCESS) {
+				int map = skb_get_queue_mapping(skb);
+				txq = netdev_get_tx_queue(dev, map);
+			}
+
+			if (!txq || !netif_tx_queue_stopped(txq))
+				qdisc_run(q);
 		}
 		spin_unlock(root_lock);
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..b6e6926 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -117,7 +117,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 static inline int qdisc_restart(struct Qdisc *q)
 {
 	struct netdev_queue *txq;
-	int ret = NETDEV_TX_BUSY;
+	int ret = -2;
 	struct net_device *dev;
 	spinlock_t *root_lock;
 	struct sk_buff *skb;
@@ -158,7 +158,7 @@ static inline int qdisc_restart(struct Qdisc *q)
 		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
-
+	case -2:
 		ret = dev_requeue_skb(skb, q);
 ...
From: Herbert Xu
Date: Saturday, September 20, 2008 - 11:38 pm

Looks good to me.

However, I've changed my mind about letting this thread die :)

Let's go back to the basic requirements.  I claim that there
are exactly two different ways for which multiple TX queues are
useful:  SMP scaling and QOS.  In the first case we stuff different
flows into different queues to reduce contention between CPUs.
In the latter case we put packets of different priorities into
different queues in order to prevent a storm of lower priority
packets from starving higher priority ones that arrive later.

Despite the different motivations, these two scenarios have
one thing in common, we can structure it such that there is
a one-to-one correspondence between the qdisc/software queues
and the hardware queues.  I know that this isn't currently the
case for prio but I'll get to that later in the message.

What I'm trying to say is that we shouldn't ever support cases
where a single qdisc empties into multiple TX queues.  It just
doesn't make sense.

For example, if you were using a qdisc like TBF, multiple queues
buy you absolutely nothing.  All it does is give you a longer queue
to stuff packets into but you can already get that in software.

Why am I saying all this? It's because a lot of the complexity
in the current code comes from supporting the case of one qdisc
queue mapping onto n hardware queues.  If we didn't do that then
handling stopped queues becomes trivial (or at least as easy as
it was before).

Put it another way, it makes absolutely no sense to map packets
onto different queues after you've dequeued them from a single
qdisc queue.  The mapping by hashing is for SMP scalability only
and if you've already gone through a single qdisc queue you can
stop worrying about it because it will suck on SMP :)

Going back to the case of prio, I think what we should do is to
create a separate qdisc queue for each band.  The qdisc selection
should be done before the packet is queued, just as we do in the
TX hashing case.

Cheers,
-- 
Visit ...
From: David Miller
Date: Sunday, September 21, 2008 - 12:03 am

From: Herbert Xu <herbert@gondor.apana.org.au>

That's a very interesting idea.

This works if you want it at the root, but what if you only wanted to
prio at a leaf?  I think that case has value too.

I tend to also disagree with another mentioned assertion.  The one
where having a shared qdisc sucks on SMP.  It doesn't.  The TX queue
lock is held much longer than the qdisc lock.

The ->hard_start_xmit() TXQ lock has to be held while:

1) DMA mapping the SKB
2) building TX descriptors
3) doing at least one PIO to the hardware

These operations, each, can be on the order of few thousands of
cycles.

Whereas a qdisc dequeue is perhaps a few hundred, maybe on the order
of a thousand, except in very elaborate classful qdisc configs.

--

From: Herbert Xu
Date: Monday, September 22, 2008 - 11:23 pm

Good question :)

I think what we should do is to pass some token that rerepsents
the TX queue that's being run down into the dequeue function.

Then each qdisc can decide which child to recursively dequeue
based on that token (or ignore it for non-prio qdiscs such as
HTB).  When the token reaches the leaf then we have two cases:

1) A prio-like qdisc that has separate queues based on priorities.
In this case we dequeue the respective queue based on the token.

2) Any other qdisc.  We dequeue the first packet that hashes
into the queue given by the token.  Ideally these qdiscs should

Yes I was exaggerating :)

However, after answering your question above I'm even more convinced
that we should be separating the traffic at the point of enqueueing,
and not after we dequeue it in qdisc_run.

The only reason to do the separation after dequeueing would be to
allow the TX queue selction to change in the time being.  However,
since I see absolutely no reason why we'd need that, it's just so
much simpler to separate them at qdisc_enqueue, and actually have
the same number of software queues as there are hardware queues.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Wednesday, September 24, 2008 - 12:15 am

As matter of fact I can't figure out this idea of a prio at the root
or leaf either. Could you explain in which point do you expect the
gain? If it's about the locks, what kind of synchronization would be
used to assure packets from lower prio queues (or qdiscs?)  aren't
sent to free tx queues, while higher prio wait on stopped ones?

Thanks,
Jarek P.
--

From: Herbert Xu
Date: Wednesday, September 24, 2008 - 1:04 am

It is non-prio in the sense that it has other criteria for deciding

It's very simple really.  For a non-leaf prio you determine which
child qdisc to enqueue into using the priority.  For a leaf prio
you determine which software queue to enqueue into based on the
priority.

To put it another way, what I'm saying is that instead of duplicating
the qdiscs as we do now for pfifo_fast, we should make the leaf
qdiscs duplicate its software queues to match the hardware queues
it's feeding into.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Wednesday, September 24, 2008 - 1:28 am

OK, it's too simple then. Could you make this more complex and

It looks like sch_multiq, but you probably mean something else...

Cheers,
Jarek P.
--

From: Jarek Poplawski
Date: Sunday, September 21, 2008 - 8:25 am

Hmm... I really can't see any difference - except getting rid of this

Jarek P.
--

From: Jarek Poplawski
Date: Sunday, September 21, 2008 - 2:57 am

So, do you advocate both to drop the packet and unconditionally

I agree this useless work should be avoided, but only with a reliable
(and not too expensive) test. Your test might be done for the last
packet in the queue, while all the previous packets (and especially
the first one) have a different state of the queue. This should work
well for uniqueue devs and multiqueues with dedicated qdiscs, but is
doubtful for multiqueues with one qdisc, where it actually should be
most needed, because of potentially complex multiclass configs with
this new problem of blocking at the head-of-line (Alexander's main
concern).

BTW, since this problem is strongly conected with the requeuing
policy, I wonder why you seemingly lost interest in this. I tried to
advocate for your simple, one level requeuing, but also Herbert's
peek, and Alexander's early detection, after some polish(!), should

I doubt this: I've just looked at this Andrew Gallatin's report, and
there is really a lot of net_tx_action, __netif_schedule, and guess
what: pfifo_fast_requeue in this oprofile...

Jarek P.
--

From: David Miller
Date: Sunday, September 21, 2008 - 3:18 am

From: Jarek Poplawski <jarkao2@gmail.com>

I'm trying to address one thing at a time.  I really want to
also encourage an audit of the drivers that trigger that condition,

Yes, thanks for reminding me about the the multiq qdisc head of line
blocking thing.

I really don't like the requeue/peek patches, because they resulted in
so much code duplication in the CBQ and other classful qdiscs.

Alexander's patch has similar code duplication issues.

Since I've seen the code duplication happen twice, I begin to suspect
we're attacking the implementation (not the idea) from the wrong
angle.

It might make review easier if we first attack the classful qdiscs and
restructure their internal implementation into seperate "pick" and a
"remove" operations.  Of course, initially it'll just be that
->dequeue is implemented as pick+remove.

On a similar note I think all of the ->requeue() uses can die

I see.
--

From: Jarek Poplawski
Date: Sunday, September 21, 2008 - 4:15 am

That's why I think you should reconsider this simple solution for now,
until somebody proves this is wrong or something else is better.

Jarek P.

[RESEND]
From: Jarek Poplawski <jarkao2@gmail.com>
Newsgroups: gmane.linux.network
Subject: Re: [RFC PATCH] sched: only dequeue if packet can be queued to
	hardware queue.
Date: Fri, 19 Sep 2008 09:17:53 +0000
Archived-At: <http://permalink.gmane.org/gmane.linux.network/106324>

...

Actually, here is my suggestion: I think, that your obviously more
complex solution should be compared with something simple which IMHO
has similar feature, i.e. limited overhead of requeuing.

My proposal is to use partly David's idea of killing ->ops->requeue(),
for now only the first two patches, so don't requeue back to the
qdiscs, but leave qdisc->ops->requeue() for internal usage (like in
HFSC's qdisc_peek_len() hack). Additionaly I use your idea of early
checking the state of tx queue to make this even lighter after
possibly removing the entry check from qdisc_run().

I attach my patch at the end, after original David's two patches.

Thanks,
Jarek P.

---------------> patch 1/3
Subject: [PATCH 1/9]: pkt_sched: Make qdisc->gso_skb a list.
Date: Mon, 18 Aug 2008 01:36:47 -0700 (PDT)
From: David Miller <davem@davemloft.net>
To: netdev@vger.kernel.org
CC: jarkao2@gmail.com
Newsgroups: gmane.linux.network


pkt_sched: Make qdisc->gso_skb a list.

The idea is that we can use this to get rid of
->requeue().

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/sch_generic.h |    2 +-
 net/sched/sch_generic.c   |   12 +++++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 84d25f2..140c48b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -52,7 +52,7 @@ struct Qdisc
 	u32			parent;
 	atomic_t		refcnt;
 	unsigned long		state;
-	struct sk_buff		*gso_skb;
+	struct sk_buff_head	requeue;
 	struct ...
From: David Miller
Date: Monday, September 22, 2008 - 10:16 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Ok, that sounds reasonable.  I've added those three patches to net-next-2.6
and will push those out after some build tests.
--

From: Jarek Poplawski
Date: Tuesday, September 23, 2008 - 1:02 am

OK, then we have to say B and try this all. BTW, I guess, after this
change we could have similar effect as reported by Alexander Duyck
while testing his solution for this problem, namely the higher drop
rate in some cases, which I can only explain as: less time in
requeuing more time for new enqueuing. Of course, if I'm right, this
"bug" should be rather "fixed" with longer queues or some other
throttle mechanism.

Thanks,
Jarek P.

-------------------->

pkt_sched: Remove the tx queue state check in qdisc_run()

The current check wrongly uses the state of one (currently the first)
tx queue for all tx queues in case of non-default qdiscs. This check
mainly prevented requeuing loop with __netif_schedule(), but now it's
controlled inside __qdisc_run(), while dequeuing. The wrongness of
this check was first noticed by Herbert Xu.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 include/net/pkt_sched.h |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b786a5b..4082f39 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	struct netdev_queue *txq = q->dev_queue;
-
-	if (!netif_tx_queue_stopped(txq) &&
-	    !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+	if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
 		__qdisc_run(q);
 }
 
--

From: David Miller
Date: Tuesday, September 23, 2008 - 1:06 am

From: Jarek Poplawski <jarkao2@gmail.com>

Agreed and applied, thanks Jarek.
--

From: Jarek Poplawski
Date: Thursday, September 11, 2008 - 4:51 am

Very good point: "That can't be right"! But I'm not sure I got the
above idea: e.g. this new sch_multiq can have different packets
depending just on some unstopped tx_queue.

IMHO, the most reasonable test here is for all tx_queues of a qdisc
beeing stopped, but since this is quite heavy, probably we need an
additional qdisc flag for such an occasion.

Jarek P.
--

From: Herbert Xu
Date: Thursday, September 11, 2008 - 4:54 am

Yes we could do that too.  Although the head-of-qdisc approach
will eventually lead to the same result.  That is, as you pop
things off the head eventually you'll hit a packet that belongs
to the stopped queue and that'll then block the whole qdisc.

So I don't think there's anything inherently advantageous in
checking all the queues.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, September 11, 2008 - 5:10 am

Yes, but this is only because this current behaviour of blocking all
transmit by one stopped tx_queue is wrong (IMHO), and with sch_multiq
there should be a real advantage.

Cheers,
Jarek P.
--

From: Jarek Poplawski
Date: Thursday, September 11, 2008 - 5:34 am

BTW, I hope I wasn't misunderstood here. I don't mean any additional
checking of all the queues anywhere. It's only about setting such
a flag at the end of netif_stop_all_queues() and netif_stop_queue().

Jarek P.
--

From: David Miller
Date: Thursday, August 21, 2008 - 5:11 am

From: Jarek Poplawski <jarkao2@gmail.com>

Applied, thanks a lot Jarek.
--

From: Jarek Poplawski
Date: Thursday, August 14, 2008 - 1:17 am

On Wed, Aug 13, 2008 at 03:19:18PM -0700, David Miller wrote:

Here is my proposal again...

Jarek P.

-----------> (resend with changelog fixed only)

net: Change handling of the __QDISC_STATE_SCHED flag in net_tx_action().

Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to
enable proper control in dev_deactivate(). Now, if this flag is seen
as unset under root_lock means a qdisc can't be netif_scheduled.


Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 net/core/dev.c |   34 +++++++++++++++++++---------------
 1 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..f67581b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1339,19 +1339,23 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 }
 
 
-void __netif_schedule(struct Qdisc *q)
+static inline void __netif_reschedule(struct Qdisc *q)
 {
-	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
-		struct softnet_data *sd;
-		unsigned long flags;
+	struct softnet_data *sd;
+	unsigned long flags;
 
-		local_irq_save(flags);
-		sd = &__get_cpu_var(softnet_data);
-		q->next_sched = sd->output_queue;
-		sd->output_queue = q;
-		raise_softirq_irqoff(NET_TX_SOFTIRQ);
-		local_irq_restore(flags);
-	}
+	local_irq_save(flags);
+	sd = &__get_cpu_var(softnet_data);
+	q->next_sched = sd->output_queue;
+	sd->output_queue = q;
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
+}
+
+void __netif_schedule(struct Qdisc *q)
+{
+	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
+		__netif_reschedule(q);
 }
 EXPORT_SYMBOL(__netif_schedule);
 
@@ -1974,15 +1978,15 @@ static void net_tx_action(struct softirq_action *h)
 
 			head = head->next_sched;
 
-			smp_mb__before_clear_bit();
-			clear_bit(__QDISC_STATE_SCHED, &q->state);
-
 			root_lock = qdisc_lock(q);
 			if (spin_trylock(root_lock)) ...
From: Jarek Poplawski
Date: Thursday, August 14, 2008 - 4:24 am

On Wed, Aug 13, 2008 at 03:19:18PM -0700, David Miller wrote:

BTW, here is my older doubt revisited, where I hope to be re-considered/
re-convinced, if possible...

Thanks,
Jarek P.

---------------->

pkt_sched: Destroy qdiscs under rtnl_lock again.

We don't need to trigger __qdisc_destroy() as an RCU callback because
the use of qdisc isn't controlled by RCU alone: after querying RCU
with synchronize_rcu() in dev_deactivate() we additionaly wait in a
loop checking some flags. After the loop is done there could be no
outstanding use of the qdisc, so call_rcu() doesn't make any sense.

On the other hand, current calling Qdisc's ->destroy() from a softirq
context without locking (rtnl) can break various things like:
qdisc_put_rtab(), tcf_destroy_chain() (e.g. u32_destroy()), and
probably more.


Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 net/sched/sch_generic.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 4685746..e7379d2 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -518,12 +518,8 @@ void qdisc_reset(struct Qdisc *qdisc)
 }
 EXPORT_SYMBOL(qdisc_reset);
 
-/* this is the rcu callback function to clean up a qdisc when there
- * are no further references to it */
-
-static void __qdisc_destroy(struct rcu_head *head)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
-	struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
 	const struct Qdisc_ops  *ops = qdisc->ops;
 
 #ifdef CONFIG_NET_SCHED
@@ -554,7 +550,7 @@ void qdisc_destroy(struct Qdisc *qdisc)
 	if (qdisc->parent)
 		list_del(&qdisc->list);
 
-	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_destroy);
 
--

From: Jarek Poplawski
Date: Sunday, August 17, 2008 - 6:42 am

After problems while testing this by Denys in another thread
I withdraw this patch.

Thanks,


--

From: David Miller
Date: Sunday, August 17, 2008 - 2:34 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Well, I knew it was completely wrong from the beginning, sorry
to say :-)

This stuff can't be done outside of RCU, period.  I moved all of this
work into RCU for a reason, I really meant it, and none of the reasons
for that move have changed :-)

If we want to do it under RTNL we have to do something like schedule a
workqueue from the RCU handler and then take the RTNL there.
--

From: Jarek Poplawski
Date: Sunday, August 17, 2008 - 3:22 pm

Actually, I've only asked you to withdraw this patch for now, but I'm
still not convinced you're right. You should better show me first the
place where this can make a difference. (I think this test broke for
some other non RCU reason.) So, maybe you're right, but I've to check
this more.

BTW, I guess you've seen this other thread: "panic 2.6.27-rc3-git2,
qdisc_dequeue_head" where Denys and I fight with this new locking.
Alas, it looks to me as a real mess, and I currently try with this
previous idea of netdev_queue->qdisc_lock, which you didn't like too.
But, after looking at the current bugs shown by debugging I really 
think we'll have bugs here all the time without simplifying this.
I think my concept should work soon, but if you don't agree with
this at all we can stop and wait for better ideas.

Thanks,
Jarek P.



--

From: David Miller
Date: Sunday, August 17, 2008 - 3:32 pm

From: Jarek Poplawski <jarkao2@gmail.com>

It _might_ be ok once we are done sorting out the synchronization

I can't even follow your flurry of patches, and neither can the
tester :-)  I deleted the entire thread to be honest, hoping you
would come back with a simple analysis once you've worked things
out with the tester.

What is the real problem besides the correct notify_and_destroy()
issue you discovered?

The locking we have now is very simple:

1) Only under RTNL can qdisc roots change.

2) Therefore, sch_tree_lock() and tcf_tree_lock() are fully valid
   and lock the entire qdisc tree state, if and only if used under
   RTNL lock.

3) Before modifying a qdisc, we dev_deactivate(), which synchronizes
   with asynchronous TX/RX packet handling contexts.

4) The qdisc root and all children are protected by the root qdiscs
   lock, which is taken when asynchonous contexts need to blocked
   while modifying some root or inner qdisc's state.

Yes, of course, if you apply a hammer and add a bit lock at the
top of all of this it will fix whatever bugs remain, but as you
know I don't think that's the solution.

The only substance I've seen is that you've found a violation of #4 in
notify_and_destroy(), so great let's test the fix for that.
--

From: Jarek Poplawski
Date: Monday, August 18, 2008 - 1:12 pm

David Miller wrote, On 08/18/2008 12:32 AM:

Probably simple for you, but I'm not even sure of this. If it were
true there would be no this and a few other threads concerned with

I don't think it's true. What matters here is the qdisc lock. And
within the qdisc's tree only one such lock can matter because current
code doesn't use qdisc locks on lower levels. So we need to take this
one top lock. But as we have seen in notify_and_destroy() or
qdisc_watchdog() it's easy to do this wrong because you've always
analyze the tree plus activated/deactivated state. My proposal is to
simply have still this one lock but easy accessible at some well

Actually, after partly fixing this we have currently messed this up
again. We can't destroy a qdisc without RCU or some other delayed
method while we're holding a lock inside this - and this is a case
of root qdiscs... It's quite simple too, but why we've missed this
so easily?

As mentioned there was also this tricky qdisc_watchdog (or other
direct __netif_scheduling), but this is not all. Try to look at
qdisc_create() and gen_new_estimator() call. It passes
qdisc_root_lock(sch) somewhere. This is similar to qdisc_watchdog()
problem, but are you really sure we always save the right spinlock
here? I don't think so.

So, of course, if you think such problems are exceptional, and will
not return after current fixes, then we can continue, no problem
for me, but I'm not sure how about Denys or other testers and users.

Jarek P.
--

From: David Miller
Date: Monday, August 18, 2008 - 4:54 pm

From: Jarek Poplawski <jarkao2@gmail.com>

It sounds good in theory, but you cannot make this place be netdev_queue,
because multiple netdev_queue objects can point to the same qdisc.

That's why the lock isn't in netdev_queue any more, there is no longer

Yep, and that's the lock debugging thing which is triggering now.
Likely what I'll do is simply reinstate RCU but only for the freeing
of the memory, nothing more.

This keeps everything doing destruction under RTNL as desired,

If RTNL is held, we must be saving the correct lock.

Root qdisc and other aspects of qdisc configuration cannot be changing
when RTNL is held.  That is why I put RTNL assertion in
qdisc_root_lock() as this is the only valid situation where it may be
used.

-------------------- sch_generic.h --------------------
static inline spinlock_t *qdisc_root_lock(struct Qdisc *qdisc)
{
	struct Qdisc *root = qdisc_root(qdisc);

	ASSERT_RTNL();
	return qdisc_lock(root);
}
--

From: Herbert Xu
Date: Monday, August 18, 2008 - 5:05 pm

Is this the problem with dev_shutdown freeing the qdisc while
holding its lock?

Couldn't we just drop the lock before calling qdisc_destroy?

dev_shutdown can only be called after dev_deactivate, as such
all qdisc users should have disappeared by now (and if they
haven't, our waiting loop in dev_deactivate didn't work and
we should fix that).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Monday, August 18, 2008 - 5:11 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

Yep, that's another possible approach.

Especially since we've bullet-proof'd dev_deactivate(), there
should be absolutely no more references to the qdisc any
longer.

I'll look into doing it this way, thanks.
--

From: David Miller
Date: Monday, August 18, 2008 - 9:07 pm

From: David Miller <davem@davemloft.net>

Ok, here it is.  I'll push this out to net-2.6 after I do
some testing here.

pkt_sched: Don't hold qdisc lock over qdisc_destroy().

Based upon reports by Denys Fedoryshchenko, and feedback
and help from Jarek Poplawski and Herbert Xu.

We always either:

1) Never made an external reference to this qdisc.

or

2) Did a dev_deactivate() which purged all asynchronous
   references.

So do not lock the qdisc when we call qdisc_destroy(),
it's illegal anyways as when we drop the lock this is
free'd memory.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/sched/sch_api.c     |   13 ++-----------
 net/sched/sch_generic.c |    6 ------
 2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 7d7070b..d91a233 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -638,11 +638,8 @@ static void notify_and_destroy(struct sk_buff *skb, struct nlmsghdr *n, u32 clid
 	if (new || old)
 		qdisc_notify(skb, n, clid, old, new);
 
-	if (old) {
-		sch_tree_lock(old);
+	if (old)
 		qdisc_destroy(old);
-		sch_tree_unlock(old);
-	}
 }
 
 /* Graft qdisc "new" to class "classid" of qdisc "parent" or
@@ -1092,16 +1089,10 @@ create_n_graft:
 
 graft:
 	if (1) {
-		spinlock_t *root_lock;
-
 		err = qdisc_graft(dev, p, skb, n, clid, q, NULL);
 		if (err) {
-			if (q) {
-				root_lock = qdisc_root_lock(q);
-				spin_lock_bh(root_lock);
+			if (q)
 				qdisc_destroy(q);
-				spin_unlock_bh(root_lock);
-			}
 			return err;
 		}
 	}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6f96b7b..c3ed4d4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -518,8 +518,6 @@ void qdisc_reset(struct Qdisc *qdisc)
 }
 EXPORT_SYMBOL(qdisc_reset);
 
-/* Under qdisc_lock(qdisc) and BH! */
-
 void qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
@@ -712,14 +710,10 @@ static void ...
From: Ilpo Järvinen
Date: Monday, August 18, 2008 - 10:27 pm

After this the block became unnecessary...

-- 
 i.


--
[PATCH] pkt_sched: remove bogus block (cleanup)

...Last block local var got just deleted.

Signed-off-by: Ilpo J
From: David Miller
Date: Monday, August 18, 2008 - 10:30 pm

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

Applied, thanks!
--

From: Jarek Poplawski
Date: Monday, August 18, 2008 - 11:46 pm

On Mon, Aug 18, 2008 at 09:07:01PM -0700, David Miller wrote:


Actually, I, and earlier Herbert, have written about destroying root
qdiscs without sch_tree_lock(). I don't know how Herbert, but I'd
prefer to leave here this lock for child qdiscs: they can remove some
common structures, so this needs more checking, and even if they don't
do this currently, there is no need to remove this possibility here.
Similarly, I'm not sure if removing BH protection is really needed
here.

And, btw., this is neither 1) nor 2) case according to the changelog,


??
+ /* Under BH for all, and qdisc_lock(qdisc) for child qdiscs only */

--

From: David Miller
Date: Tuesday, August 19, 2008 - 12:03 am

From: Jarek Poplawski <jarkao2@gmail.com>

Well you don't really know if this happens or not for sure
do you? :-)

Why don't you go make sure of this and report back what you
find?  I see no reason to account for something that cannot
happen.

It's better to have a consistent rule for qdisc_destroy()
rather than a bunch of special cases that are hard to audit.
--

From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 12:23 am

Sure, this can be done, but needs some time. And removing such old
locking could be treacherous, so I'd say otherwise, let's leave it
where we are not sure (and where it's not necessary) until more
checking. Currently I think mostly of something like cls_u32(). And

I don't agree with this: there is a difference between doing total
destruction, when you are sure proper order doesn't matter, and
nobody will ever read after this.

Jarek P.
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 12:23 am

[Empty message]
From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 12:35 am

By convention, there was always this comment that destroy is under
sch_tree_lock(), so it was legal to depend on this. I'm not afraid
of somebody using such an under destroy qdisc - it's about a code

Sure, I'll try to look for such problems.

Cheers,
Jarek P.
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 12:46 am

No no no, it's not about qdisc_destroy at all.  If you're relying
on the lock around qdisc_destroy, then you're already too late.
The qdisc should have been removed before we get to qdisc_destroy.

It's the act of removal that's protected by the root lock, and
still is.  For example, in htb_graft we do sch_tree_lock before

Of course if you find any classful qdisc that does not hold the
tree lock when killing children, then please send patches.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 12:56 am

I'm not sure I can understand you: could you look at htb_destroy()
instead and think of this as a child qdisc of prio or another htb?
Having a top level "queue" lock guarantees there is no activity at
the whole tree at the moment.

Thanks,
Jarek P.
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 1:05 am

htb_destroy can either be called by qdisc_destroy or when a brand
new HTB qdisc fails construction.  The latter case is trivial since
the qdisc has never been used.

In the first case, as you have seen from my previous email, the
entire branch containing the HTB qdisc (that is, either the HTB
qdisc itself if it's being deleted directly, or the branch stemming
from its ancestor that's being deleted) must no longer have any
references to it at all apart from this thread of execution.

As such we can do whatever we want with it, including freeing it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 1:17 am

As I've written before I'm mainly concerned with things like
tcf_destroy_chain(), especially wrt. cls_u32, but I can be wrong with
this. So, if you don't have such concerns, let's forget it for now,
and after I look at this more maybe we'll get back to this discussion.

Jarek P.
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 1:23 am

Well I can't vouch for every single qdisc in the tree.  However,
what I can say is that as long as they respect the rules I outlined
earlier with regards to holding the root qdisc lock when deleting
or using children, then they'll work as expected.

You're definitely welcome to audit the qdiscs to make sure that
they are obeying the rules.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Tuesday, August 19, 2008 - 1:32 am

From: Herbert Xu <herbert@gondor.apana.org.au>

Jarek may have a point about the u32 classifier.  So we
should think about it.

The hash tables and tp_u_common objects are shared, and
it does non-atomic refcounting during destruction, see
u32_destroy().

However, this all might be OK because all of this management
is performed only under the RTNL semaphore.
--

From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 1:41 am

Sure, this all should be write protected. I'm concerned only about
the read side here.

Jarek P.
--

From: David Miller
Date: Tuesday, August 19, 2008 - 1:48 am

From: Jarek Poplawski <jarkao2@gmail.com>

If reference count hits zero, nobody can reference the object.
Reference counts only change under RTNL, which is my point.

The old qdisc object is taken away from global visibility inside of
the cops->graft() call done by qdisc_graft().  This handler already
must do whatever locking is necessary while removing the qdisc from
visibility from the packet path.

And by virtue of that, the old qdisc and anything it references will
no longer be visible to the packet processing path after cops->graft()
returns.
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 1:50 am

I had a look and it seems to be OK to me.  Essentially we have
two sides in all this, the read side which is the path that
transmits packets, and the write side which is the control path.

As with a qdisc_destroy, u32_destroy can only be called once all
read sides have exited so we won't be racing against that.  In
fact if we were able to race against it then holding the lock is
no good anyway because this implies the u32 object is still
referenced by a qdisc tree and as such once we release the lock

Right, all other writers have been excluded by RTNL so we should
be the only thread with a reference to the u32 and we can decrement
the ref count non-atomically.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 1:39 am

That's my point - is there really a reason do this change without such
an audit if we are not forced at the moment? (I'd remind this way of
doing things was entirely legal according to comments.) I doubt, I'm
the right person for auditing this but as I said I'll have a look,
especially when there will be lack of those fascinating oopses and
warnings around.

Cheers,
Jarek P.
--

From: Herbert Xu
Date: Tuesday, August 19, 2008 - 1:55 am

No you misunderstood my point.  I wasn't saying that I'm not confident
that our qdiscs obey the rules, but rather that if any of them didn't,
then they're buggy and should be fixed.

In fact we're not really adding anything new here, the qdiscs were
not accessed under RCU uniformly.  If you go back in the tree prior
to the multi-qdisc stuff, you'll find that only dev_queue_xmit works
under RCU.  qdisc_restart does not and therefore deferring the
destruction to RCU is pointless anyway.

So in fact we've already been relying on the fact that by the time
qdisc_destroy comes about nobody on the read side (i.e., the packet
transmission path) should have a reference to it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Tuesday, August 19, 2008 - 2:16 am

What difference does it make? You're not sure thinks will not break

Let's not discuss using such a qdisc by others but a possibility
that some common lists could be broken for readers from upper qdiscs.
(They were not deactivated.) Of course, if it's done properly with
some references before qdisc_destroy then it's all right. I'd prefer
to check this later yet.

Jarek P.
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 3:01 am

So, what I was most suspicious of, cls_u32, looks like safe wrt. this.
Congratulations for good estimation of this.

But how about this part in qdisc_destroy(): 
        if (qdisc->parent)
                list_del(&qdisc->list);

If we do this with child qdisc from qdisc_graft() it's without
deactivation. The rest of the tree can be dequeued in the meantime
and call qdisc_tree_decrease_qlen() (like hfsc, tbf, netem), which
uses qdisc_lookup() to access this list. We list_del() under rtnl
lock only, they lookup under sch_tree_lock(). Is it a bit unsafe
or I miss something?

Thanks,
Jarek P.
--

From: David Miller
Date: Thursday, August 21, 2008 - 3:05 am

From: Jarek Poplawski <jarkao2@gmail.com>

They hold RTNL as well.

Remember, sch_tree_lock() uses qdisc_root_lock() which as I've
told you at least twice now asserts that RTNL is held :-)
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 3:11 am

Actually, I've called it wrong, they hold qdisc root lock, but they
certainly can't have rtnl_lock() at the moment!

Jarek P.
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 3:18 am

I mean here: hfsc_dequeue(), netem_dequeue() and tbf_dequeue().

Jarek P.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 3:21 am

Where do they use qdisc->list?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 3:23 am

For the next development tree we could even add an ASSERT_RTNL to
qdisc_lookup and co to ensure that nobody abuses this interface.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 3:33 am

But if I get above right, you acknowledge the problem currently exists?

Jarek P.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 3:51 am

Why don't you just point out the problems that you see rather
than hypothesising :)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 4:20 am

I thought I described the problem quite clearly, except this error in
the lock name. I certainly can miss much more things than you. And,
after your answer I'm simply not sure if the question is still valid.

Jarek P.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 4:26 am

OK I've lost track of what you were trying to say.  Could you please
just restate the problem you saw?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 4:55 am

Sure, here is a scenario:

cpu1					cpu2
rtnl_lock()
qdisc_graft()
// parent != NULL
->cops-graft()
notify_and_destroy()			qdisc_run()
					spin_lock(root_lock)
qdisc_destroy(old)			dequeue_skb()
					tbf_dequeue()
					qdisc_tree_decrease_qlen()	
					qdisc_lookup()
//deleting from qdisc_sleeping->list	//walking qdisc_sleeping->list
//under rtnl_lock() only		//under qdisc root_lock only
list_del(qdisc->list)			list_for_each_entry(txq_root)


Jarek P.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 5:01 am

Good catch.  Longer term we should fix it so that it doesn't do
the silly lookup at run-time.  In fact we'll be getting rid of
requeue so there will be no need to do this in TBF at all.

However, for now please create a patch to put a pair of root locks
around the list_del in qdisc_destroy.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 5:19 am

Since qdisc_destroy() is used for destroying root qdisc too, isn't it
better to get this lock back in notify_and_destroy() like this:

if (old) {
	if (parent)
		sch_tree_lock()
	qdisc_destroy()
	if (parent)
		sch_tree_unlock()
}
	

Thanks,
Jarek P.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 5:22 am

Ugly as it may look, I think this is probably the best fix for now.

For 2.6.28 we can fix it properly and remove this eyesore :)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Thursday, August 21, 2008 - 5:27 am

From: Herbert Xu <herbert@gondor.apana.org.au>

This looks even worse, actually.

If we just unlinked this thing, we don't want anyone finding
it, even before we grab this lock, to adjust queue counts.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 5:35 am

You're right, this doesn't work at all.  In fact it's been broken
even before we removed the root lock.  The problem is that we used
to have one big linked list for each device.  That was protected
by the device qdisc lock.  Now we have one list for each txq and
qdisc_lookup walks every single txq.  This means that no single
qdisc root lock can protect this anymore.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 5:48 am

How about going back to a single list per-device again? This list
is only used on the slow path (well anything that tries to walk
a potentially unbounded linked list is slow :), and qdisc_lookup
walks through everything anyway.

We'll need to then add a new lock to protect this list, until we
remove requeue.

Actually just doing the locking will be sufficient.  Something like
this totally untested patch (I've abused your tx global lock):

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ef0efec..3f5f9b9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -202,16 +202,25 @@ struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
 struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
 {
 	unsigned int i;
+	struct Qdisc *q;
+
+	spin_lock_bh(&dev->tx_global_lock);
 
 	for (i = 0; i < dev->num_tx_queues; i++) {
 		struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
-		struct Qdisc *q, *txq_root = txq->qdisc_sleeping;
+		struct Qdisc *txq_root = txq->qdisc_sleeping;
 
 		q = qdisc_match_from_root(txq_root, handle);
 		if (q)
-			return q;
+			goto unlock;
 	}
-	return qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
+
+	q = qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
+
+unlock:
+	spin_unlock_bh(&dev->tx_global_lock);
+
+	return q;
 }
 
 static struct Qdisc *qdisc_leaf(struct Qdisc *p, u32 classid)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index c3ed4d4..292a373 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -526,8 +526,10 @@ void qdisc_destroy(struct Qdisc *qdisc)
 	    !atomic_dec_and_test(&qdisc->refcnt))
 		return;
 
+	spin_lock_bh(&dev->tx_global_lock);
 	if (qdisc->parent)
 		list_del(&qdisc->list);
+	spin_unlock_bh(&dev->tx_global_lock);
 
 #ifdef CONFIG_NET_SCHED
 	qdisc_put_stab(qdisc->stab);

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: ...
From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 5:55 am

Alas I've to have a break now, anyway I think for now "my" proposal
should be safer since it worked like this before... But of course
after good checking your patch should be better.

--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 6:12 am

Well your patch doesn't work and my patch doesn't compile :)

The problem is that qdisc_lookup walks through every txq's list while
the sch_tree_lock protects a single txq only.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 11:58 am

I don't think there is such a problem. I thought you and David were
concerned with something trying to find and use this qdisc: that's why
I wrote it's nobody ancestor at the moment. sch_tree_lock() should be
enough for now because in the current implementation we have only one
root qdisc with pointers copied to every dev_queue. At least I can't
see nothing more in qdisc_create() and qdisc_graft(). So,
qdisc_lookup() seems to be designed for the future (or to do this
lookup more exactly with additional loops...). 

Cheers,
Jarek P.
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 2:14 pm

On Thu, Aug 21, 2008 at 08:58:57PM +0200, Jarek Poplawski wrote:

Of course, except the state before qdisc_create() and qdisc_graft()
with default qdiscs, but it doesn't seem to matter here.

Jarek P.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 3:23 pm

We've got at least the RX and TX queues.  That makes two locks and
two lists.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Friday, August 22, 2008 - 1:49 am

RX currently doesn't add anything to the list.

Jarek P.
--

From: David Miller
Date: Friday, August 22, 2008 - 1:55 am

From: Jarek Poplawski <jarkao2@gmail.com>

That's correct, currently ingress qdiscs only support a hierarchy of
single root and that's it.
--

From: Herbert Xu
Date: Friday, August 22, 2008 - 3:07 am

We seem to be talking about different things.

Yes the ingress hierachy has a single root, i.e., it's a tree.  But
that has nothing to do with what I was talking about.  I'm talking
about the list at dev->rx_queue.qdisc_sleeping->list which is
certainly not guaranteed to be empty.

If you look at qdisc_create you'll find that every time we create
a non-root ingress qdisc we add it to that list (we have to,
otherwise qdisc_lookup doesn't work at all for ingress qdiscs).
So when somebody on the TX side does a qdisc_lookup they may be
walking the RX list without any protection.  Similarly, if somebody
on the ingress side does qdisc_lookup they may walk the TX lists
without protection.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Friday, August 22, 2008 - 3:27 am

From: Herbert Xu <herbert@gondor.apana.org.au>

It is guarenteed to be empty.

Only root qdiscs go to rx_queue.qdisc_sleeping, and such qdiscs

We don't allow non-root ingress qdiscs.  All ingress qdiscs
always take the device graft path, and always have a parent

Ingress data paths do not do qdisc_lookup().  There is only root
allowed for ingress and thus rx_queue's non-default qdiscs.

Add some assertions and run some test tc commands if you don't believe
me :)
--

From: Herbert Xu
Date: Friday, August 22, 2008 - 4:02 am

Ah yes, I forgot that the ingress qdisc is special, and that's
why we have the ifb device.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Friday, August 22, 2008 - 4:38 am

As a matter of fact your doubts around this enlightened me only now
there is something "wrong" here... This qdisc_lookup(), even if
there were all these multi RX and TX things implemented, still
shouldn't matter because what qdisc_tree_decrease_qlen() could be
interested in is only one qdisc tree. So it looks like current
implementation of qdisc_lookup() is an overkill for this anyway.

Jarek P.
--

From: David Miller
Date: Friday, August 22, 2008 - 4:42 am

From: Jarek Poplawski <jarkao2@gmail.com>

Yes, that is true.

We could add true parent backpointers for this.  Speaking of which,
look at the existing __parent hack that's there in struct Qdisc for
CBQ :-)
--

From: Jarek Poplawski
Date: Friday, August 22, 2008 - 5:09 am

I guess, we need to establish first how much it's needed on fast
paths. But anyway, it seems something like qdisc_match_from_root()
should be enough in this qdisc_tree_decrease_qlen().

Jarek P.
--

From: Herbert Xu
Date: Friday, August 22, 2008 - 5:11 am

Or we could just replace requeue with a peek interface.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Friday, August 22, 2008 - 5:18 am

From: Herbert Xu <herbert@gondor.apana.org.au>

Yes, and as per discussions over the past week or so, that would
allow us to eliminate basically everything except the netem usage.

Netem is trying to create packet reordering by requeueing (which is
logically like a head insert) instead of enqueueing (which is
logically like a tail insert).
--

From: Herbert Xu
Date: Friday, August 22, 2008 - 5:45 am

Why does it use requeue when tfifo's enqueue will insert this
at the front of the queue anyway based on time_to_send anyway?

Perhaps it's trying to force reordering when the user replaces
tfifo with some other qidsc? But that doesn't make sense since
netem relies on tfifo to implement all the other features such
as delay and jitter.

So how about

1) Change netem_graft to disallow the replacement of tfifo;
2) Simply use enqueue instead of requeue in netem_enqueue?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Stephen Hemminger
Date: Sunday, August 24, 2008 - 4:26 pm

On Fri, 22 Aug 2008 22:45:57 +1000


Tfifo is there only to add the jitter based reordering. Netem has other
better kinds of reordering as well.

Netem has to be able to put TBF in as a child qdisc. This is how loss
plus rate control is done.

Requeue was the natural way to do this based on the API's available
at the time.
--

From: Herbert Xu
Date: Sunday, August 24, 2008 - 4:49 pm

What about making TBF a child of tfifo instead? Alternatively
we could merge tfifo into netem.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Stephen Hemminger
Date: Sunday, August 24, 2008 - 5:29 pm

On Mon, 25 Aug 2008 09:49:47 +1000

But then you couldn't replace tfifo with pfifo or tbf??
--

From: Herbert Xu
Date: Tuesday, August 26, 2008 - 12:35 am

Couldn't the user install TBF as a parent of netem instead?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Tuesday, August 26, 2008 - 12:47 am

In fact, having tfifo there all the time gives us something
that we couldn't do before.  Conceptually, whether TBF is above
or below netem corresponds to a network topology where the shaping
occurs after or before the segment that netem is simulating,
respectively.

If you actually replaced tfifo with TBF (as we do now), then the
shaping always occurs after the segment simulated by netem.  That
is, this is pretty much the same as having TBF as the parent.

BTW, the use of the CB area conflicts with the new pkt_len stuff
so either netem or the pkt_len code needs to be fixed.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Stephen Hemminger
Date: Tuesday, August 26, 2008 - 5:24 am

On Tue, 26 Aug 2008 17:47:02 +1000

The problem with netem as child of TBF is that TBF counts the number
of packets in the queue to determine the rate. Therefore TBF gets confused
about the rate because of the large number of packets that are held in
netem when delaying. 

In an earlier version, I did rate control in netem but jamal thought
doing layering was better and it has worked until the redesign.
--

From: Herbert Xu
Date: Tuesday, August 26, 2008 - 5:41 am

(Adding Jamal to the cc)


OK I'm probably missing something.  I can't find any code in TBF
that looks at the number of packets held in the queue.  All it
does is look at the dequeued packet and whether we have enough
tokens to send it right now.

In any case, looking at the number of packets in the queue sounds
broken for TBF as the packets could be held in upper-level queues
which are invisible.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Stephen Hemminger
Date: Tuesday, August 26, 2008 - 5:50 am

On Tue, 26 Aug 2008 22:41:53 +1000

Last time I tried TBF(100kbit) { netem(+100ms) } it gave different answers
than netem(+100ms) { TBF(100kbit) }.  

I would prefer a peek() to the current dequeue/requeue.
An alternative would be to have netem keep a parallel data structure with
the time to send for all packets, but that would be assuming the underlying
qdisc's were work conserving.
--

From: Herbert Xu
Date: Tuesday, August 26, 2008 - 5:56 am

The peek() interface isn't really appliacable for netem since the
packet that it's requeueing wasn't dequeued in the first place.

In any case, what I'm trying to say is that netem should really
have its own queue (e.g., just fold tfifo in) to implement the
reordering and delays.

This does not prevent the user from creating children of netem
such as TBF to simulate a network environment where you have
loss/delay/jitter after traffic goes through a shaper.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Bastian Bloessl
Date: Wednesday, August 27, 2008 - 5:17 am

This might be because in the 2nd case you waste tokens when netem 
requeues to TBF.
But TBFs requeue() can't restore tokens because as David said netem uses 
it to enqueue reordered packets.
--

From: David Miller
Date: Wednesday, August 27, 2008 - 2:32 am

From: Herbert Xu <herbert@gondor.apana.org.au>

Netem should be just fine, it's using the opaque ->data[]
blob at the end of qdisc_skb_cb which was created exactly
for this purpose I imagine. :)

Let me know if I missed something.

--

From: Herbert Xu
Date: Wednesday, August 27, 2008 - 2:56 am

You're right I missed that bit :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Friday, August 22, 2008 - 5:25 am

...Or with a kfree_skb() interface ;) That's just to establish.

Jarek P.
--

From: David Miller
Date: Saturday, August 23, 2008 - 5:15 am

From: Jarek Poplawski <jarkao2@gmail.com>

Yes, I designed it for the future.

Nothing really uses this capability currently.
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 1:40 pm

[Empty message]
From: Herbert Xu
Date: Thursday, August 21, 2008 - 3:24 pm

Sure, that was just an illustration of what I meant.  It doesn't
even compile :)

Feel free to fix this up into a real patch with a new lock.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Friday, August 22, 2008 - 1:41 am

On Fri, Aug 22, 2008 at 08:24:47AM +1000, Herbert Xu wrote:

It looks like adding to the list needs similar protection, but if I
exagerated here let me know.

Thanks,
Jarek P.

--------------->

pkt_sched: Fix qdisc list locking

Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
without rtnl_lock(), adding and deleting from a qdisc list needs
additional locking. This patch adds global spinlock qdisc_list_lock
and wrapper functions for modifying the list. It is considered as a
temporary solution until hfsc_dequeue(), netem_dequeue() and
tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.

With feedback from Herbert Xu and David S. Miller.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 include/net/pkt_sched.h |    1 +
 net/sched/sch_api.c     |   44 +++++++++++++++++++++++++++++++++++++++-----
 net/sched/sch_generic.c |    5 ++---
 3 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 853fe83..b786a5b 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -78,6 +78,7 @@ extern struct Qdisc *fifo_create_dflt(struct Qdisc *sch, struct Qdisc_ops *ops,
 
 extern int register_qdisc(struct Qdisc_ops *qops);
 extern int unregister_qdisc(struct Qdisc_ops *qops);
+extern void qdisc_list_del(struct Qdisc *q);
 extern struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle);
 extern struct Qdisc *qdisc_lookup_class(struct net_device *dev, u32 handle);
 extern struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 45f442d..e35b8d8 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -199,19 +199,53 @@ struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
 	return NULL;
 }
 
+/*
+ * This lock is needed until some qdiscs stop calling qdisc_tree_decrease_qlen()
+ * without rtnl_lock(); currently hfsc_dequeue(), netem_dequeue(), tbf_dequeue()
+ */
+static ...
From: Herbert Xu
Date: Friday, August 22, 2008 - 3:14 am

Yes I agree.  List addition is certainly not safe for walkers
without memory barriers.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Friday, August 22, 2008 - 2:27 am

I made an error in the name of this new lock in the changlog,
so I decided to fix this in ...the patch.

Sorry,
Jarek P.

---------------> (take 2)

pkt_sched: Fix qdisc list locking

Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
without rtnl_lock(), adding and deleting from a qdisc list needs
additional locking. This patch adds global spinlock qdisc_list_lock
and wrapper functions for modifying the list. It is considered as a
temporary solution until hfsc_dequeue(), netem_dequeue() and
tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.

With feedback from Herbert Xu and David S. Miller.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 include/net/pkt_sched.h |    1 +
 net/sched/sch_api.c     |   44 +++++++++++++++++++++++++++++++++++++++-----
 net/sched/sch_generic.c |    5 ++---
 3 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 853fe83..b786a5b 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -78,6 +78,7 @@ extern struct Qdisc *fifo_create_dflt(struct Qdisc *sch, struct Qdisc_ops *ops,
 
 extern int register_qdisc(struct Qdisc_ops *qops);
 extern int unregister_qdisc(struct Qdisc_ops *qops);
+extern void qdisc_list_del(struct Qdisc *q);
 extern struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle);
 extern struct Qdisc *qdisc_lookup_class(struct net_device *dev, u32 handle);
 extern struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 45f442d..e7fb9e0 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -199,19 +199,53 @@ struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
 	return NULL;
 }
 
+/*
+ * This lock is needed until some qdiscs stop calling qdisc_tree_decrease_qlen()
+ * without rtnl_lock(); currently hfsc_dequeue(), netem_dequeue(), tbf_dequeue()
+ */
+static DEFINE_SPINLOCK(qdisc_list_lock);
+
+static void ...
From: Herbert Xu
Date: Friday, August 22, 2008 - 3:15 am

Good catch! I'm not sure whether this would actually break but
it certainly makes me feel a lot better :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Friday, August 22, 2008 - 3:28 am

From: Herbert Xu <herbert@gondor.apana.org.au>


Thankfully list_del() on an empty list works or we'd have tons of
reports :)
--

From: David Miller
Date: Friday, August 22, 2008 - 3:23 am

From: Jarek Poplawski <jarkao2@gmail.com>

Looks good Jarek, thanks!

Since these requeue failure cases are slow paths, this shouldn't have
any performance impact either.

I'll apply this, thanks!
--

From: Jarek Poplawski
Date: Thursday, August 21, 2008 - 5:49 am

I don't think there could be such a problem, since nobody should look
for such a destroyed qdisc: they look for their ancestors only.
Anyway, I can't do this patch before evening, so I wait for
suggestions or you could simply do it as you wish, no problem.

Cheers,
Jarek P.
--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 5:51 am

It's not about looking for the destroyed qdisc, it's about walking
the list and accidentally hitting the destroyed qdisc and wandering
into lalaland.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Thursday, August 21, 2008 - 5:06 am

From: Jarek Poplawski <jarkao2@gmail.com>

Grrr... :-)

Note that this only happens when my arch nemesis, ->requeue(), fails.
Same applies to the netem case, and hfsc's "peek".

All other qdisc_tree_decrease_qlen() users hold RTNL.

Really, it proves ->requeue() should die, and be replaced with "peek"
and "unlink" methods.

--

From: Herbert Xu
Date: Thursday, August 21, 2008 - 3:18 am

This list should only be used by things like qdisc_lookup which
occurs under the RTNL.  The actual packet processing certainly
should not be walking a linked list.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Tuesday, August 12, 2008 - 3:02 pm

So, since it's currently impossible, here is an alternative solution.

Jarek P.

------------>

pkt_sched: Protect gen estimators under est_lock.

gen_kill_estimator() required rtnl_lock() protection, but since it is
moved to an RCU callback __qdisc_destroy() let's use est_lock instead.


Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 net/core/gen_estimator.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 57abe82..a89f32f 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -99,7 +99,7 @@ struct gen_estimator_head
 
 static struct gen_estimator_head elist[EST_MAX_INTERVAL+1];
 
-/* Protects against NULL dereference */
+/* Protects against NULL dereference and RCU write-side */
 static DEFINE_RWLOCK(est_lock);
 
 static void est_timer(unsigned long arg)
@@ -185,6 +185,7 @@ int gen_new_estimator(struct gnet_stats_basic *bstats,
 	est->last_packets = bstats->packets;
 	est->avpps = rate_est->pps<<10;
 
+	write_lock_bh(&est_lock);
 	if (!elist[idx].timer.function) {
 		INIT_LIST_HEAD(&elist[idx].list);
 		setup_timer(&elist[idx].timer, est_timer, idx);
@@ -194,6 +195,7 @@ int gen_new_estimator(struct gnet_stats_basic *bstats,
 		mod_timer(&elist[idx].timer, jiffies + ((HZ/4) << idx));
 
 	list_add_rcu(&est->list, &elist[idx].list);
+	write_unlock_bh(&est_lock);
 	return 0;
 }
 
@@ -212,7 +214,6 @@ static void __gen_kill_estimator(struct rcu_head *head)
  * Removes the rate estimator specified by &bstats and &rate_est
  * and deletes the timer.
  *
- * NOTE: Called under rtnl_mutex
  */
 void gen_kill_estimator(struct gnet_stats_basic *bstats,
 	struct gnet_stats_rate_est *rate_est)
@@ -226,17 +227,17 @@ void gen_kill_estimator(struct gnet_stats_basic *bstats,
 		if (!elist[idx].timer.function)
 			continue;
 
+		write_lock_bh(&est_lock);
 		list_for_each_entry_safe(e, n, &elist[idx].list, list) {
 			if (e->rate_est != rate_est || ...
From: David Miller
Date: Wednesday, August 13, 2008 - 3:20 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Looks good, applied, thanks Jarek.
--

Previous thread: Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem by David Witbrodt on Monday, August 11, 2008 - 12:05 pm. (1 message)

Next thread: [PATCH] pkt_sched: Add BH protection for qdisc_stab_lock. by Jarek Poplawski on Monday, August 11, 2008 - 2:42 pm. (3 messages)