Hi
Unfortunately this is only panic report.
I used cgroup net_cls subsystem, then kernel panic occured.
I attach panic message and kernel config in this mail's last paragraph.
If my operation is wrong, could you tell me how to use net_cls and
where the documentation is.
# But I think panic is very bad even if my operation is wrong.
System Environment:
kernel:
* 2.6.29.1 on x86_64
* 2.6.29.3 on x86_64
* 2.6.30-rc6 on x86_64
(panic occurred same sequence but I couldn't confirm if it was
same problem.
Because crash couldn't analyze kdump of 2.6.30-rc6.
This is just the crash utility(analysis tool) problem. X-<)
tc:
* iproute2-2.6.29-1
(download from http://devresources.linux-foundation.org/dev/iproute2/download/)
How to reproduce:
1. mount net_cls subsystem
# mkdir /cgroup
# mount -t cgroup -onet_cls none /cgroup
2. set qdisc, class by tc
# tc qdisc add dev bond0 root handle 1: htb default 30
# tc class add dev bond0 parent 1:1 classid 1:10 htb rate 10mbit
# tc class add dev bond0 parent 1:1 classid 1:20 htb rate 20mbit
# tc class add dev bond0 parent 1:1 classid 1:30 htb rate 30mbit
3. set net_cls.classid (classify to classid 1:10)
# echo 0x1000a > /cgroup/net_cls.classid
4. set filter in order to use net_cls
# tc filter add dev bond0 protocol ip parent 1: prio 1 cgroup
-> panic occured!
Step 3 and 4, I referred to the following mails, because I couldn't find any other useful documentation.
http://kerneltrap.org/mailarchive/linux-netdev/2008/10/14/3653914/thread
[Panic message]
------------------------------------------------------------------
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff80447650>] cls_cgroup_classify+0x4d/0x9c
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.0/0000:24:00.0/local_cpus
CPU 2
Modules linked in: sch_sfq ...Hi,
On Thu, 21 May 2009 09:22:56 +0900
I investigated this problem, and I found a bug in tc_ctl_tfilter() in net/sched/cls_api.c.
When 'tc filter add' command is executed and proto-tcf does not exist,
tcf_ctl_tfilter() allocates, initializes and chanins proto-tcf(tp) to
cops->tcf_chain()'s chain before calling tp->ops->change().
If tp->ops->change() returns an error, tcf_ctl_tfilter() returns an error
too, but proto-tcf(tp) is not unchained yet.
I think tcf_ctl_tfilter() shouldn't chain proto-tcf to the chain before it returns
an error.
Signed-off-by: Minoru Usui <usui@mxm.nes.nec.co.jp>
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0759f32..756148b 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -266,11 +266,6 @@ replay:
goto errout;
}
- spin_lock_bh(root_lock);
- tp->next = *back;
- *back = tp;
- spin_unlock_bh(root_lock);
-
} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind))
goto errout;
@@ -314,8 +309,17 @@ replay:
}
err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
- if (err == 0)
- tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);
+ if (err) {
+ tcf_destroy(tp);
+ goto errout;
+ }
+
+ spin_lock_bh(root_lock);
+ tp->next = *back;
+ *back = tp;
+ spin_unlock_bh(root_lock);
+
+ tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);
errout:
if (cl)
--
Minoru Usui <usui@mxm.nes.nec.co.jp>
--
I didn't verify this too much, so I might be wrong, but it looks like cls_cgroup_classify() does things a bit different than others (doesn't check the 'head' etc. for NULL), so maybe you should consider fixing it instead? (Btw., the tc classifier maintainer added to Cc). Cheers, --
OOPS! Others mostly don't check this either, so my suggestion was wrong. Sorry, Jarek P. --
Hmm... Or maybe I wasn't so wrong; it seems classifiers which don't assign the head during init do this check for NULL later. Jarek P. --
Thanks for the CC Jarek. It is good practise to initialize things in init() and use attributes in change() to set them. I will read the thread and check out the code path then respond more intelligently. cheers, jamal --
Hi there, Sorry - cant help you on earlier question in the thread on syntax of cls_group config (but we can revisit after resolving this). You should always copy the maintainer if you want quick answers (for cls_group case Thomas Graf). On your patch: I think you have found a real issue (I have a strong feeling it has everything to do with your config process) Comments below: This is incorrect. tp may already exist and you dont want to destroy for failure to change its parameters. You also dont want to reattach an existing tp because it succeeded in parameter change. So the soln is to check if this is a new tp and then do what you did above... Did that make sense? cheers, jamal --
Hi Minoru, I hate to do this to you after i made suggestions on how to make the changes.... (as an adult i hate it when people do it to me;->) But sometimes code helps. So what i meant is the attached patch. I havent even compiled it yet. If it works, please submit it and add yourself as the author (and a signed-off from me). Then we can revisit the init() issue in cls_group.. You should also cc tgraf in your cls_grp config questions. cheers, jamal
...
} else {
switch (n->nlmsg_type) {
case RTM_NEWTFILTER:
err = -EEXIST;
if (n->nlmsg_flags & NLM_F_EXCL)
goto errout;
break;
Probably this case needs tcf_destroy() too.
Cheers,
--
No - that if stmnt will fail if this is a new filter being created. cheers, jamal --
If somebody runs tc add filter with a new priority but existing handle a newly created (and not linked now) tc would be handled with goto errout and would leak, I guess. Cheers, Jarek P. --
This would imply the classifier is buggy. I will stare at the different
Excellent point - there could be buggy user space apps that will do
that. Minoru change the check to:
+ if (n->nlmsg_type == RTM_NEWTFILTER &&
+ (n->nlmsg_flags&NLM_F_CREATE &&
+ n->nlmsg_flags&NLM_F_EXCL)) {
cheers,
jamal
--
I couldnt find any case where this is possible in the current code. If you have a specific example, we need to fix it. cheers, jamal --
I'm almost sure these commands (right or wrong) can trigger such a leak: $ sudo tc qdisc add dev lo root handle 1: htb $ sudo tc filter add dev lo proto ip pref 1 handle 800::1 u32 match u32 0 0 classid 1:1 $ sudo tc filter add dev lo proto ip pref 2 handle 800::1 u32 match u32 0 0 classid 1:1 RTNETLINK answers: File exists Cheers, Jarek P. --
But then, there could be "tc filter replace" with only this (n->nlmsg_type == RTM_NEWTFILTER && (n->nlmsg_flags&(NLM_F_CREATE))) which can't get here with a newly created tp, I guess. Cheers, Jarek P. --
Right - because in that case tp would already exist and would You are good - A tip of my hat and a wag of my finger at you;-> Ok, now i looked a lot closer at all 8 classifiers and u32 is the only one that can fault. It just needs to be fixed. I will wait until Minoru's patch and then i will submit a fix for it. Of course the commands from user space are a little rude ;-> cheers, jamal --
But how about that (of course extremely rude) case "tc filter replace" is run with a new prio? Thanks, Jarek P. --
Jarek, sir, handyman extraordinaire, handsome devil, and lover of kittens I humbly opine that we need to handle that case. How about going back to your original idea of defining tp_created? With apologies to Minoru (he must be thinking we are lunatics by now), how does the attached changed patch look to you? Before you throw another rock, there is another issue which will be caused by this rude misconfig: "replace" really means "get rid of the old and add this new one". But for the last 50 years we do not "get rid of the old". I cant think of a clean way to do it sans shaving one of the kittens. One simple thing to do is to printk a warning when detecting this error. I think one needs to draw a line where bad config affects your life - in this case i dont think it is worth big changes.. cheers, jamal
It looks good to me, because it's simple and really understandable. I think below issue should separate from above patch, because it's easy to come to understand a patch. How about that? --
On Mon, Jun 01, 2009 at 09:03:30AM -0400, jamal wrote:
Actually, I'd insist with the old rock and handling that other rude
u32 case, at least until it's fixed in place. So I attach my version
of your patch (additionally I removed a pair of braces because of
checkpatch warning).
Alas, I still think we don't need to change so much in -stable to
fix the cls_cgroup oops, so I attach a patch which I think is
enough for -stable and probably -net too. It could be "reverted"
in -net-next just after applying cls_api patch. Of course, treat
it only as my humble proposal, and feel free to recommend to David
Of course I'm against any printk on shaving the kitten...
Cheers,
Jarek P.
-----------------------> patch #1
net/sched/cls_api.c | 23 +++++++++++++++++------
1 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0759f32..09cdcdf 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -135,6 +135,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
unsigned long cl;
unsigned long fh;
int err;
+ int tp_created = 0;
if (net != &init_net)
return -EINVAL;
@@ -266,10 +267,7 @@ replay:
goto errout;
}
- spin_lock_bh(root_lock);
- tp->next = *back;
- *back = tp;
- spin_unlock_bh(root_lock);
+ tp_created = 1;
} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind))
goto errout;
@@ -296,8 +294,11 @@ replay:
switch (n->nlmsg_type) {
case RTM_NEWTFILTER:
err = -EEXIST;
- if (n->nlmsg_flags & NLM_F_EXCL)
+ if (n->nlmsg_flags & NLM_F_EXCL) {
+ if (tp_created)
+ tcf_destroy(tp);
goto errout;
+ }
break;
case RTM_DELTFILTER:
err = tp->ops->delete(tp, fh);
@@ -314,8 +315,18 @@ replay:
}
err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
- if (err == 0)
+ if (err == 0) {
+ if (tp_created) {
+ spin_lock_bh(root_lock);
+ tp->next = *back;
+ *back = ...On Mon, Jun 01, 2009 at 10:49:57PM +0200, Jarek Poplawski wrote:
Since there are possible other oopses in cls_cgroup, here is an update
of patch #2.
Jarek P.
--------------------------> patch #2 (take 2)
net/sched/cls_cgroup.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 1ab4542..edb71e1 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -101,6 +101,8 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
struct cgroup_cls_state *cs;
int ret = 0;
+ if (!head)
+ return -1;
/*
* Due to the nature of the classifier it is required to ignore all
* packets originating from softirq context as accessing `current'
@@ -218,6 +220,9 @@ static void cls_cgroup_walk(struct tcf_proto *tp, struct tcf_walker *arg)
{
struct cls_cgroup_head *head = tp->root;
+ if (!head)
+ return;
+
if (arg->count < arg->skip)
goto skip;
@@ -236,6 +241,9 @@ static int cls_cgroup_dump(struct tcf_proto *tp, unsigned long fh,
unsigned char *b = skb_tail_pointer(skb);
struct nlattr *nest;
+ if (!head)
+ return -1;
+
t->tcm_handle = head->handle;
nest = nla_nest_start(skb, TCA_OPTIONS);
--
Sure, it doesnt complicate anything - Minoru, this is the version to go My view is the same - that the second patch doesnt fix the root cause; and it is not that complicated to fix the root cause. So I humbly disagree with you. The issue is that a classifer (cls_group in this example) is being misconfigured. It rejects the config - but the tp has already been added. It then tries to use the tp in the fast and fails. If you look as closely as you did with the patch i posted, youd find ways to construct similar hostile misconfigs for other classifiers. You just need to create the scenario where the attributes will fail to validate. I actually suspect the most common scenario for such a failure is not that head is null (I doubt in Minoru case that allocation will fail); rather it is some reference to head->something. cheers, jamal --
The patch #2 is obviously worse and fixes less (of course it still needs testing for Minoru's case), but I'm 100% confident it can't introduce any regression (neither take 1 nor 2), which is much harder to say about patch #1, considering various "rude" configs we could miss (but we could give them some time to show off). But, as I've written before, I'm really (really) OK with your decision. Cheers, Jarek P. --
Thanks for the courtesy. I note Dave already swallowed Minoru's patch; so lets move from there. Yes, there's a possibility of a regression - I (and so are you) are only recently evolved humans; we are not perfect... yet ;-> So i would agree with Minoru testing your patch as plan B in case the applied one starts causing trouble. BTW, ok - here's a quick untested, uncompiled fix to the u32 classifier to fix the first rock (which you already worked around in your changes to the included patch). No rush to submit for now.. On the second rock you threw so violently, after some reflection, I think it is ok to send a replace twice with different priorities. The second one will be added and the old not deleted, but if the user has chosen the correct priority, then things will work out just fine. And if they want they have to explicitly delete the one they dont want. It is also not illegal to do a "replace" for installing instead of "add". So the only other things left to do from this exercise are (no rush in any of them): a) remove all "buckets" from underneath other classifiers b) get consistency across all classifiers in usage of setup API If you want to do this - go ahead; else i plan on tackling it probably when stable 2.6.31 kicks in. cheers, jamal
On Tue, Jun 02, 2009 at 09:16:27AM -0400, jamal wrote: Thanks for your courtesy as well. Alas, I'm not sure I can fully understand the current patch. You seem to redefine the ->get() method usage, so it looks for handle only for configured tp's. It might be I definitely prefer you doing this and me asking "rude" questions.;-) Cheers, --
After the second look I have some questions: - if it's really aimed to skip checking by ->get() tp's before they're configured in ->change(), maybe instead of using tp_c to check this it would be simpler to generally skip calling ->get() for newly created tp's? - otherwise the current method probably needs adding a tp_c check for NULL in u32_destroy()? - it seems this method would also need adding a 'handle' lookup to the u32_change(); otherwise its 'handle' parameter isn't controlled for for uniqueness, unless I miss something? Cheers, --
Apologies for the latency. Lets just ignore the need for these changes since the patch fixes them for now. I would still like to make the changes i suggested but later with more thought put into them. cheers, jamal --
On Mon, 01 Jun 2009 18:06:03 -0400 I've tested Jarek's patch #1(for cls_api.c). It works fine to me. Thanks Jarek. Tested-by: Minoru Usui <usui@mxm.nes.nec.co.jp> If we have to test Jarek's patch #2, I'll test it tomorrow. What do you think Jamal? -- Minoru Usui <usui@mxm.nes.nec.co.jp> --
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Actually, as far as I can remember, it's your patch with slight modifications by Jamal and me. So please Minoru, resend this patch as Jamal previously suggested: with your changelog and Jamal's, plus maybe mine (if you don't mind) signed-off-by's. Thanks, Jarek P. --
On Tue, 2 Jun 2009 07:24:36 +0000 OK. Thank you for your advise. I'll resend the patch as a new thread. -- Minoru Usui <usui@mxm.nes.nec.co.jp> --
Yes please - even if it is for reasons of givein Jarek some peace of mind;-> If something goes seriously wrong with other classifiers because of the patch, and Jareks patch works, we can back out the current patch and fix it with that approach. cheers, jamal --
On Tue, 02 Jun 2009 09:19:25 -0400 I've tested Jarek's patch #2 without patch #1. It works fine, too. Thanks Jarek. Tested-by: Minoru Usui <usui@mxm.nes.nec.co.jp> -- Minoru Usui <usui@mxm.nes.nec.co.jp> --
Thank you very much. So, as Jamal wrote, let's wait - hoping it won't be needed... Jarek P. Btw., it seems your box needs time update: Subject: Re: [BUG] net_cls: Panic occured when net_cls subsystem use Date: Thu, 4 Jun 2009 13:41:33 +0900 ... Original-Received: from mail03.kamome.nec.co.jp (mail03.kamome.nec.co.jp [10.25.43.7]) by mailsv.nec.co.jp (8.13.8/8.13.4) with ESMTP id n5455gP4006057; Thu, 4 Jun 2009 14:06:27 +0900 (JST) --
On Thu, Jun 04, 2009 at 06:34:45AM +0000, Jarek Poplawski wrote: Or rather there was some other delay. I shouldn't bother... Sorry, Jarek P. --
Hi, jamal On Fri, 29 May 2009 09:46:36 -0400 Thank you for your advice. I'll ask Thomas about how to use cls_cgroup. -- Minoru Usui <usui@mxm.nes.nec.co.jp> --
Hi Thomas Original mail sent to netdev and containers before. (http://www.spinics.net/lists/netdev/msg97745.html) I have a question about how to use cls_cgroup. On Thu, 21 May 2009 09:22:56 +0900 The cause of panic will fix soon. (Now I'll make a patch and will test it.) I want to know how to use cls_cgroup correctly, so could you tell me how to use it if my operation is wrong. Now, I think I might have to specify 'handle' with tc command line, is this true? But when I specified 'handle', I faced oops. X-P ---------------------------------------------------------------------------------------------------- Jun 1 16:16:58 StingerG kernel: BUG: unable to handle kernel NULL pointer dereference at (null) Jun 1 16:16:58 StingerG kernel: IP: [<ffffffff8044753b>] cls_cgroup_change+0xf7/0x1f5 Jun 1 16:16:58 StingerG kernel: PGD 21181f067 PUD 20c40e067 PMD 0 Jun 1 16:16:58 StingerG kernel: Oops: 0000 [#1] SMP Jun 1 16:16:58 StingerG kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:0c:00.1/irq Jun 1 16:16:58 StingerG kernel: CPU 7 Jun 1 16:16:58 StingerG kernel: Modules linked in: sch_htb ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp autofs4 hidp rfcomm l2cap bluetooth sunrpc bonding dm_mirror dm_region_hash dm_log dm_multipath dm_mod sbs sbshc battery acpi_memhotplug ac ipv6 parport_pc lp parport e1000e ide_cd_mod sg rtc_cmos cdrom i5000_edac rtc_core button i2c_i801 serio_raw edac_core shpchp i2c_core rtc_lib pcspkr ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode] Jun 1 16:16:58 StingerG kernel: Pid: 9202, comm: tc Not tainted 2.6.29.4.local #5 Express5800/120Rg-1 [N8100-1241] Jun 1 16:16:58 StingerG kernel: RIP: 0010:[<ffffffff8044753b>] [<ffffffff8044753b>] cls_cgroup_change+0xf7/0x1f5 Jun 1 16:16:58 StingerG kernel: RSP: 0018:ffff88021259d9b8 EFLAGS: 00010246 Jun 1 16:16:58 StingerG kernel: RAX: 00000000fffffffe RBX: ffff88020edd3b80 RCX: 0000000000000000 Jun ...
Hi Thomas
On Mon, 1 Jun 2009 18:12:01 +0900
I investigated this problem, and I found a bug in cls_cgroup_change() in cls_cgroup.c.
cls_cgroup_change() expected tca[TCA_OPTIONS] was set from user space properly,
but tc in iproute2-2.6.29-1 (which I used) didn't set it.
I saw the current source code of tc in git, I found the code for cls_cgroup in iproute2/tc/f_cgroup.c.
It set tca[TCA_OPTIONS].
git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git
# Unfortunately, current source coude of iproute2 couldn't compile, because it occured compile error. X-P
# But this is another issue.
If we always use a newest iproute2 in git when we use cls_cgroup, we don't face this oops probably.
But I think, kernel shouldn't panic regardless of use program's behaviour.
Signed-off-by: Minoru Usui <usui@mxm.nes.nec.co.jp>
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index cc29b44..2ec0adc 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -167,6 +167,9 @@ static int cls_cgroup_change(struct tcf_proto *tp, unsigned long base,
struct tcf_exts e;
int err;
+ if (!tca[TCA_OPTIONS])
+ return -EINVAL;
+
if (head == NULL) {
if (!handle)
return -EINVAL;
--
Minoru Usui <usui@mxm.nes.nec.co.jp>
--
This patch looks OK to me, but I guess you should add some title (plus "[PATCH]" in the subject), and a changelog (with lines width <= 80). Cheers, --
Hi, Jarek On Mon, 8 Jun 2009 07:48:14 +0000 Thank you for your comments. I'll resend my patch as a start of a new thread. -- Minoru Usui <usui@mxm.nes.nec.co.jp> --
