This message has been generated automatically as a part of a report of recent regressions.

The following bug entry is on the current list of known regressions from 2.6.24. Please verify if it still should be listed.

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=10391
Subject    : 2.6.25-rc7/8: Another resume regression
Submitter  : Mark Lord <lkml@rtr.ca>
Date       : 2008-04-03 15:06 (6 days old)
References : http://lkml.org/lkml/2008/4/3/283
--
Today I've been using 2.6.25-rc8 with an old embedded build system here for my empegs. One shell script calls out to /usr/bin/ftp to transfer an image to a remote system, and then read it back again and compare. The compare is failing, most (but not all) of the time, but only on 2.6.25-rc8, not on 2.6.24. Verified by switching back and forth between kernel versions for a short spell. The ftp client is netkit-ftp 0.17-16 on Kubuntu feisty. Switching to ncftpput/ncftpget avoids it on 2.6.25, but I wonder where the problem is. Too many things in the chain to easily debug. -ml --
.. Now verified that the data loss occurs in the outbound direction. The readback data is the same, regardless of which client s/w is used. So something in 2.6.25 is incompatible with the ftp client binary, or libs, that are installed here. Or some other problem. ?? --
Or maybe it uses sendfile, and that is broken? Also, try using ethtool to turn off TSO and/or checksumming on your NIC (if it is not wireless), and see if behavior changes... Jeff --
.. No, it uses read()/write() calls (from the strace). .. The failing FTP client software issues a close() on the socket after the final data write(). This close seems to be propagated to the other end before the data is fully received. I suppose a wireshark capture is next, once I dig out my ancient hub so we can sniff it from an independent box. -ml --
..
Meanwhile, here is the strace of the FTP client from the host side.
Nothing strange -- it opens the socket, the file, and does read()/write()
pairs to move all of the data down the line to the remote.
It then close()s both of them, and gets the "426 Connection failed"
response after the remote end sees a premature -EPIPE from sock_recvmsg().
Cheers
execve("/usr/bin/ftp", ["ftp", "10.0.0.26"], [/* 39 vars */]) = 0
brk(0) = 0x9b36000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f70000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=104744, ...}) = 0
mmap2(NULL, 104744, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f56000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libreadline.so.5", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0\317\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=196560, ...}) = 0
mmap2(NULL, 199764, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7f25000
mmap2(0xb7f51000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2c) = 0xb7f51000
mmap2(0xb7f55000, 3156, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7f55000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libncurses.so.5", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\240\362"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=268600, ...}) = 0
mmap2(NULL, 273860, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7ee2000
mmap2(0xb7f1c000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x39) = 0xb7f1c000
close(3) ...
Or, you could do "git-bisect" if it is reproducible. --yoshfuji --
.. If I had the time right now, maybe. But it would be far more useful for whoever has been working on the stack to suggest some possible/likely commits to look at instead. -ml --
'git bisect run <cmd>' will automatically find your problem for you, if it's reproducible, scriptable, and you have a second box. Jeff --
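For readers who have not scripted a bisect before, here is a minimal sketch of the kind of `<cmd>` that `git bisect run` expects, written in Python. It only covers the last step of the test described earlier in the thread (comparing the sent image against the read-back copy) and maps the outcome to the exit codes git documents for `bisect run`: 0 for good, 1 to 127 (except 125) for bad, 125 for "cannot test this revision". The function name `verdict` and the two file paths are hypothetical, and building, booting and driving the kernel under test on the second box is assumed to happen elsewhere.

```python
import filecmp

# Exit codes that `git bisect run` interprets.
GOOD, BAD, SKIP = 0, 1, 125


def verdict(sent_path: str, readback_path: str) -> int:
    """Map the transfer-and-compare outcome to a bisect exit code.

    sent_path and readback_path are hypothetical names for the image
    that was uploaded and the copy that was read back afterwards.
    """
    try:
        # shallow=False forces a byte-by-byte comparison.
        same = filecmp.cmp(sent_path, readback_path, shallow=False)
    except OSError:
        # A file is missing or unreadable: this revision could not be
        # tested, so ask git to skip it rather than mark it bad.
        return SKIP
    return GOOD if same else BAD
```

A wrapper script would call `sys.exit(verdict(sent, readback))` after the transfer has run, and the bisect itself would then be started with something like `git bisect run ./wrapper.py`.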
From: Mark Lord <lkml@rtr.ca> Personally all I see is that one side closes the socket before all data packets received have been read into the application, resulting in a (correct) reset going out. I can't think of any change we've made over the course of this release that would change behavior in that area. So you will likely need to bisect. --
.. Or I can ignore it, like the net developers, since I have a workaround. And then we'll see what other apps are broken upon 2.6.25 final release. Really, folks. Bug reports are intended to *help* the developers, not something to be thrown back in their faces. There do seem to have been a *lot* of changes around the tcp closing/close code (as I see from diff'ing 2.6.24 against latest -git). *Somebody* is responsible for those changes. That particular *somebody* ought to volunteer some help here, reducing the mountain of commits to a big handful or two. Cheers --
Sure, if you count in all whitespace/indentation/code moving changes to
It might help if you would add netdev on cc list in case you really want to
Those touching fin/close are mostly whitespace/move things, so I doubt
that you find these useful but in case you insist, here's the list:
056834d9f6f6eaf4cc7268569e53acab957aac27 [TCP]: cleanup tcp_{in,out}put.c style
058dc3342b71ffb3531c4f9df7c35f943f392b8d [TCP]: reduce tcp_output's indentation levels a bit
490d5046930276aae50dd16942649bfc626056f7 [TCP]: Uninline tcp_set_state
In addition, there's this one (...though I have read it a number of times
through and still cannot catch something that would cause the wrongness
you're seeing):
e870a8efcddaaa3da7e180b6ae21239fb96aa2bb [TCP]: Perform setting of common
control fields in one place
There's very little really on the interesting side I can think of, mostly
things are congestion control related changes... ...maybe either one of
these could cause something unpleasant in some corner case:
bd515c3e48ececd774eb3128e81b669dbbd32637 [TCP]: Fix TSO deferring
0e3a4803aa06cd7bc2cfc1d04289df4f6027640a [TCP]: Force TSO splits to MSS boundaries
...e.g., if the latter causes a return with zero limit under some
conditions, tso_fragment might generate, well, interesting packets and
never finish if the condition persists but.
--
i.
--
.. Oh.. I didn't know about that list. How does that differ from linux-net ? .. That matches my own assessment there, too: lots of whitespace changes, and not much real code difference on most paths. Bummer. :) -ml --
On Thu, 10 Apr 2008, Mark Lord wrote: > Ilpo J
From: Mark Lord <lkml@rtr.ca> It's a two way street, we asked for a bisect which helps us a lot. In fact, lately I notice a strong unwillingness to bisect on your part, in particular. --
Bisecting is a time-consuming process. If unwillingness to bisect is unacceptable in a bug reporter then people who don't have the time to bisect must stop reporting the problems they encounter. -- Tilman Schmidt E-Mail: tilman@imap.cc Bonn, Germany This message consists of 100% recycled bits. Unopened, best before: (see reverse side)
On 10/04/2008, Tilman Schmidt <tilman@imap.cc> wrote: I hope that was a joke and that I just don't get it. Are you really saying that if somebody finds a bug they shouldn't bother reporting it unless they are willing to spend hours and hours of work to get it fixed? The way I see it, the burden of debugging and fixing bugs is mainly on the developers of the code that breaks. You can't blame users for using the code, triggering bugs and then reporting the breakage. Users who report bugs are doing us all a great service regardless of their ability or willingness to do more work than just the initial report. If bugs don't get reported they'll never get fixed. Even a bad bug report with no follow up at all still allows us to use it to gauge how often a specific bug is being hit and thus how important it may be to fix it. You can't expect users to know how to debug a problem or even bisect it. A user may not even be able to compile a custom kernel but she may still hit a bug and do us the favour of reporting it. It should be the job of the developer of the code to investigate the bug following a user's report. Sure it's great when users can bisect, provide test cases, debug the problem completely themselves or even provide a patch, but you can't expect that. And in my opinion you certainly can't just hide behind "the user doesn't want to bisect so I won't fix this" and use that as an excuse for the code being buggy. I hope most people take bug reports more seriously than that. When people discover bugs in my own code I thank them and feel a bit ashamed that I didn't do my work properly and it then becomes very important to me to make sure I squash the bug. The more the user can help the better, but if they cannot help beyond telling me what broke and how, then that's fine too. I still want to nail the bug and I'll just have to do more work myself, but it becomes a matter of personal and professional pride to hunt down the bug. We need to be grateful to users who r...
From: "Jesper Juhl" <jesper.juhl@gmail.com> [ The person you are replying to was being sarcastic, BTW. ] That's not the case we're talking about in this specific instance. In this particular case the user is more than capable of bisecting, he just isn't willing to invest the time. And I'm supposed to be willing to invest the time to analyze the TCP dumps or whatever to diagnose the problem? And I guess I should do this for every single networking bug report or issue? Who is going to clone me and the rest of the core networking developers so that this is actually tenable? That's ludicrous, I don't have a reproducer, this person does. And if they bisect, we'll know _exactly_ what change introduced the problem. Then I can use my brain to figure out the correct way to resolve the problem. Bisecting is a mindless activity that saves developers tons of time. What people don't get is that this is a situation where the "end node principle" applies. When you have limited resources (here: developers) you don't push the bulk of the burden upon them. Instead you push things out to the resource you have a lot of, the end nodes (here: users), so that the situation actually scales. --
.. Duh.. more like, "If I take 5-8 hours to attempt a bisect (which may not even work), then that's 5-8 hours I do not get paid for." Gotta eat, dude. Anyways, here's five hours of free consulting for you:

git-bisect start
# bad: [7180c4c9e09888db0a188f729c96c6d7bd61fa83] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6
git-bisect bad 7180c4c9e09888db0a188f729c96c6d7bd61fa83
# good: [49914084e797530d9baaf51df9eda77babc98fa8] Linux 2.6.24
git-bisect good 49914084e797530d9baaf51df9eda77babc98fa8
# bad: [e5dfb815181fcb186d6080ac3a091eadff2d98fe] [NET_SCHED]: Add flow classifier
git-bisect bad e5dfb815181fcb186d6080ac3a091eadff2d98fe
# good: [00e0b8cb74ed7c16b2bc41eb33a16eae5b6e2d5c] b43: reinit on too many PHY TX errors
git-bisect good 00e0b8cb74ed7c16b2bc41eb33a16eae5b6e2d5c
# good: [42d545c9a4c0d3faeab658a40165c3da2dda91b2] x86: remove depends on X86_32 from PARAVIRT & PARAVIRT_GUEST
git-bisect good 42d545c9a4c0d3faeab658a40165c3da2dda91b2
# good: [6232665040f9a23fafd9d94d4ae8d5a2dc850f65] Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86
git-bisect good 6232665040f9a23fafd9d94d4ae8d5a2dc850f65
# good: [e5723b41abe559bafc52591dcf8ee19cc131d3a1] [ALSA] Remove sequencer instrument layer
git-bisect good e5723b41abe559bafc52591dcf8ee19cc131d3a1
# good: [461e2c78b153e38f284d09721c50c0cd3c47e073] [ALSA] hda-codec - Add Conexant 5051 codec support
git-bisect good 461e2c78b153e38f284d09721c50c0cd3c47e073
# good: [1987e7b4855fcb6a866d3279ee9f2890491bc34d] [AX25]: Kill ax25_bind() user triggable printk.
git-bisect good 1987e7b4855fcb6a866d3279ee9f2890491bc34d
# good: [58a3c9bb0c69f8517c2243cd0912b3f87b4f868c] [NETFILTER]: nf_conntrack: use RCU for conntrack helpers
git-bisect good 58a3c9bb0c69f8517c2243cd0912b3f87b4f868c
# good: [32948588ac4ec54300bae1037e839277fd4536e2] [NETFILTER]: nf_conntrack: annotate l3protos with const
git-bisect good 32948588ac4ec54300bae1037e839277fd4536e2
# bad: [e83a2ea850bf0c0c81c6754440809...
From: Mark Lord <lkml@rtr.ca> Thanks Mark. Pavel can you take a look? I suspect that the namespace changes or gets NULL'd out somehow and this leads to the resets because the socket can no longer be found. Perhaps it's even a problem with time-wait socket namespace propagation. --
.. My system here is now set up for quick/easy retest, if you have any suggestions or patches to try out. Thanks guys. --
Please try this, from net-2.6.26 tree.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

----
From 8d9f1744cab50acb0c6c9553be533621e01f178b Mon Sep 17 00:00:00 2001
From: Daniel Lezcano <dlezcano@fr.ibm.com>
Date: Fri, 21 Mar 2008 04:12:54 -0700
Subject: [PATCH] [NETNS][IPV6] tcp - assign the netns for timewait sockets

Copy the network namespace from the socket to the timewait socket.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/inet_timewait_sock.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 876169f..717c411 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -124,6 +124,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
 tw->tw_hash = sk->sk_hash;
 tw->tw_ipv6only = 0;
 tw->tw_prot = sk->sk_prot_creator;
+ tw->tw_net = sk->sk_net;
 atomic_set(&tw->tw_refcnt, 1);
 inet_twsk_dead_node_init(tw);
 __module_get(tw->tw_prot->owner);
--
1.4.4.4

--
YOSHIFUJI Hideaki @ USAGI Project <yoshfuji@linux-ipv6.org>
GPG-FP : 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA
--
Too late, but still Acked-by: Pavel Emelyanov <xemul@openvz.org> Sorry, guys, but my timezone does not allow me to react in time to found bugs :( So, when I wake up in the morning I usually just find out that someone has caught a BUG made by me and someone --
.. Works perfectly, thanks. Looks obvious, too. Push it out to Linus now for 2.6.25. Thanks! --
From: Mark Lord <lkml@rtr.ca> Will do, thanks for testing. --
From: Mark Lord <lkml@rtr.ca> And if I invest my spare time on your bug how does this statement apply to me? Or does it only apply to you? Every single argument you make that supports why you should not be investing the necessary time into the bug applies equally to the very developers you are so quickly to quip at and want help from. --
I think you got it backwards. Mark and other bug reporters (including, at times, yours truly) are helping you and other developers to make Linux better. Most of the times I report a bug, I am not asking for help - I have no personal need to get it fixed, as I can easily avoid it, and I only report it to give developers like you a chance to fix it before it really hurts someone - and I gather that Mark has been in a similar position wrt to the bug in question. So what would you have us do? Not report the bugs we find so that you don't have to invest your spare time on "our" bugs? Report them and accept a rebuke for our "unwillingness" to do even more benevolent work than we already did? Report only those for which we really need a fix, and are consequently willing to invest additional time? Thanks, Tilman
From: Tilman Schmidt <tilman@imap.cc> I appreciate the bug reports, believe me. The issue is which of the limited developer resources get put onto which bugs. A developer who does this for fun is going to prioritize to things that are pleasant and interesting to work on, and also a good effective use of their time. So people prioritize. Therefore, my point is, the net result is that users have a direct influence on which bugs get worked on with the highest priority and thus get fixed faster. And those are the ones that have the most information available, and in particular bisect results when appropriate. --
.. It's not "my bug". I'm just the first person to notice, take time to report it, and even hand it to you on a platter (bisect). It's *your* bug -- you signed off on the commit. Cheers --
From: Mark Lord <lkml@rtr.ca> I sign off on basically every networking commit, does that mean I have to fix every networking bug and every networking bug is "mine"? Of course not, that doesn't scale at all. What does scale is a combination of good fully formed bug reports from users combined with the efforts of the global developer pool. Linus signs off on every patch from Andrew Morton he puts into the tree, which is a lot, but does Linus work on every bug introduced by one of those patches and are such bugs "his" bugs? Of course he doesn't, and of course not. They get pushed up to the person who wrote the patch once identified as such, and the patch is reverted if the developer is unresponsive and this will have consequences for patches they submit in the future. I still think you have a very self-centered attitude about things. This is about distributing effort, not forcing it upon individuals or a constrained resource. If I get hit by a bus, networking bugs would still get fixed if handled properly. And it's a win-win situation. The incentive for a capable user to do a bisect or whatever else is that if they do it their bug gets fixed quickly. That is the free market economy of Linux kernel bug reporting. It addresses the issue that in reality we'll never fix all bugs, and therefore we prioritize. And therefore if there is a bisected bug report and also another one from a user who refuses to do that, guess which bug gets worked on with a higher priority and which bug gets fixed first? --
this argument is a fallacy because it assumes that the Linux kernel is a closed ecosystem and i'm really surprised to see you advance this economic argument. i remind you: Linux is very much not a closed ecosystem. ... and hence, your "free market economy of bugs" that in essence strongly suggests users to do bisections when they find bugs in networking, works exactly the way you did not intend it to work: it pushes users towards other OSs. It pushes them towards Solaris, FreeBSD, MacOS and even Windows. That happens because the barrier to getting bugs fixed is _increased_ - and users might find it easier to participate in the ecosystem of other OSs - instead of having to compete with "each other" for the attention of the head honcho (you). You have a unique position within Linux: through a decade of hard and excellent work you have built a quasi-monopoly to all things networking commits: if you say about something that it should go into networking it will, if you say that it should stay out, it wont go in. So it is fundamentally _you_ who determines the feature/fix ratio in the networking code, and it is _you_ who determines the amount of bugs users have to find! There's no real competition for your position - it would take years for anyone to replace you. (and it would be a shame and a loss - you do your job so well) No doubt about it: bisection is very nice, it's one of the best things that happened to Linux debuggability in the past 2 years, i use it heavily myself, but please do _not_ require it from testers and users. They dont have nice 32-way Niagara's to build a kernel in 1 minute. They dont have nice virtualization to do easy bisection. Take bisection as an additional gift/tool but dont make it a semi-required aspect of your subsystem. Pretty please. And _PLEASE_ realize that the networking bug-count has been created primarily by _you_, because it is you who throttles the amount of new code in new kernel releases. If you cannot cop...
From: Ingo Molnar <mingo@elte.hu> I don't. I ask for a bisection when it is appropriate and I think other avenues will not bear fruit in a reasonable amount of time. Thanks for the arbitrary diatribe about my contributions over the years and accusations that I have some kind of monopoly over the networking code and fixes to it. I really appreciate that. --
i'm glad i misunderstood you. My impression from reading this thread was that you preferred reporters who do bisection (which is fine so far), to you certainly do have a fair amount of exclusivity in determining the dosage of networking commits. Dont get me wrong, you earned it and you deserve it - not the least because you do it best. Ingo --
.. Absolutely, though to a varying degree. That's the responsibility that goes with the role of a subsystem maintainer. I once had such a role, and gave it up when I felt I could no longer keep up. You still keep referring to it as "your (my) bug". It's not. I had nothing to do with it, other than stumbling over it. When people stumble over a libata bug, I look hard to see if my code could possibly cause it. Jeff looks even harder, because he's the current subsystem dude for libata. I never suggest a user search through a mountain of unrelated commits for something I've screwed up on. I give more directed help, patches to collect more relevant information, and patches to try and resolve it. The last thing I'd ever do, is diss the reporter. Regards. --
Like it or not, when you're the owner of the only box that can reliably reproduce an error condition, it's your bug. Been there, done that, plenty of times.
Thanks for the advice. I'll keep it in mind next time I have to decide whether to report a bug I'm stumbling over. T.
Well, the fact is, reporting bugs is always welcome. However, it may not be immediately obvious what causes the bug to appear as well as the bug need not be readily reproducible on any other system than yours, at least at the moment. In which case whether or not the bug will be fixed depends on the reporter. Namely, if the reporter wants and has the time to provide developers with additional information, the bug has a good chance to be fixed. Otherwise, it'll probably stay there until there's a more persistent reporter or it's fixed as a result of a related change. So, if people ask you to do a bisection, they probably mean "we don't see what the problem is and can't reproduce it, so please get us more information, otherwise we won't know how to fix it". In that case, you could provide them with a reproducible test case just as well. That said, there may be some developers who just don't want to spend time on analysing code and put the burden of finding the offending change on the reporter, but I don't think it's common practice. Thanks, Rafael --
Very true. One other thing which might get confusing/frustrating on the user side is that currently, Linux is the *only* product which requires the bug reporter to find the fault change (yes, I know, it's scalable). All other products the reporter uses work differently: the reporter contacts the editor/author/support/... and briefly describes his problem. Support asks him for a bit more details, remains silent for some time, then comes up with a patched version to confirm that the bug is fixed. So it is understandable from the user's standpoint that Linux appears quite complex to report bugs. But we should remind users that LKML is *not* a place to get free kernel support, but it's a *development* mailing list, and that it is somewhat expected that developers ask reporters for more development related contribution. But if the reporter does not want to/cannot do much more, we should not aggress him, and point it to other places instead (eg: at least create an entry in bugzilla so that their report is not lost, and they have a chance to get contacted when the fix is known). Regards, Willy --
It's a pretty common procedure for compilers (gcc, llvm) too, although they have the advantage that given a test case usually someone else can run the bisect procedure because they do not depend on the underlying hardware. That's unfortunately not the case for most kernel bugs, although sometimes it is possible given a hardware independent test case. And while most of the kernel code is drivers and arch, a lot of it is still pretty hardware independent, so at least in some cases it is possible to submit test cases and then let someone else (like a bug master) do the bisect. Of course it is unclear if producing a submittable test case will be actually any faster than just running bisect for the user. That said I agree it's a big burden to run bisect for everything because it can take very long (especially if the problem is not trivially reproducible). It would be fair at least if maintainers always gave some candidate commit ids when asking for bisect for likely changes that could have matched the bug. Then those could be checked quickly first before doing the full run. While that will not always work it would be still a useful short cut and save a lot of time for the reporter. -Andi --
And most of all, the reporter would not feel like the bisection is Willy --
Well it is proportional to the quality of the bug report. If it is vague enough, often there is no other good answer. If it already comes with some debugging or good logs or a good test case etc., I agree just saying "please bisect" is not very nice (but sometimes it might be still needed if code review doesn't find anything). Perhaps there should be a document somewhere explaining this which can be easily pointed to. -Andi --
That's not true, for several regressions I reported to the Wine Bugzilla
I had been asked to git bisect for the commit that broke it.
And I'd actually assume that it's quite common for git using open source
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
hm, who does this - i've seen networking folks do it but does anyone else do it? Such cases are _clear_ abuse of users and they'll do the obvious thing: vote with their feet. I only ask people to bisect it when all other avenues fail - and even then i try to make it clear that bisection is just something they can _optionally_ do to speed things up (it's never required), and that it's a pure opt-in. doing _kernel_ bisection is totally hard at the moment - it disrupts the user way too much and causes many hours of work for most users. [ Requiring bisection for userspace projects might be more doable. (but even there it's wrong when it's not automated completely and where a failure pattern is not deterministic.) ] Ingo --
It depends. Sometimes the bisection can be done in qemu/kvm/xen or similar tools. At least if the problem is not too hardware dependent. And more and more people actually run in such environments. I can also do it faster with autoboot or nfs root/powerswitch, but admittedly that's a very specialized setup most people don't have. Still I agree with your basic point that it should be only last resort. -Andi --
Hi. Bugs are bugs, they either depend on hardware or do not. There is no perfect world where after reporting subtle bug it will be fixed. It is not Linux, it is everywhere. Bugs are only fixed when they have major impact. Only. Either by having exploit, or crash, or good testcase. Or bisect result. This just a tool to help both parties. And a huge help for regressions. Yeah, spent two weeks kicking all possible stuff around and eventually drop that namespace patch at all to find where the problem was. We started to move further. Bisect is just a tool. It is not something developers throw into user when they do not want to work. This _is_ a help, which allows both to solve problem in the fastest way. If the same would be done on developers machine and huge patches would be sent to jump between changesets, that would be a real 'work closely with the reporter working out why the reporter's failure was occurring'? You pointed it yourself: several days of back-and-forth. With this helping automation tool called bisect bug was resolved in 15 There is also global warming tendency. IIRC. Bugs _are_ fixed, Andrew. And developers did not change suddenly to selfish bastards who do not care for users. They just developed a tool, which greatly helps to both and saves lots of users time, since regression gets fixed with this tool really quickly. Bisect is not asked to be performed without a reason. For subtle bug it is the fastest way, but otherwise there might be a long conversation. And even in this really subtle case there was a dialog. Bisect automation does not add kind relations though, but we can ask Linus to add couple of smiles into the output. -- Evgeniy Polyakov --
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> In fact, this is what Andrew's so-called "back and forth with the bug reporter" used to mainly consist of. Asking the user to try this patch or that patch, which most of the time were reverts of suspect changes. Which, surprise surprise, means we were spending lots of time bisecting things by hand. We're able to automate this now and it's not a bad thing. --
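The "revert this, try that" loop described above is a binary search done by hand. As a rough illustration only (not how git implements it, since real history is a graph rather than a list), the following Python sketch assumes a linear list of commits where the first entry is known good, the last is known bad, and every commit after the culprit is also bad. The names `first_bad` and `is_bad` are hypothetical; `is_bad` stands in for "build, boot and run the reproducer".

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of test runs needed).

    Assumes commits[0] is known good and commits[-1] is known bad,
    and that the history is linear (an idealisation of a real tree).
    """
    lo, hi = 0, len(commits) - 1   # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                 # one build/boot/test cycle
        if is_bad(commits[mid]):
            hi = mid               # culprit is at mid or earlier
        else:
            lo = mid               # culprit is after mid
    return commits[hi], tests
```

Each pass through the loop is one cycle for the reporter, which is why the cost grows with the logarithm of the number of commits between the good and bad kernels rather than with the number itself.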
To be honest, at least in one case no one reacted to my report(s) until I ran a bisection and then it turned up an obviously broken patch. The breakage was so obvious that if anyone had actually looked at the code in question, he would have seen it immediately. Things like this are very disappointing and have a very negative impact on bug reporters. We should do our best to avoid them. Thanks, Rafael --
Shit happens. This is a matter of either bug report or those who were in the copy list. There are different people and different situations, in which they do not reply. -- Evgeniy Polyakov --
Well less shit would happen if developers would take the time to at least test their patches before they were submitted. It's like we will just have the poor user do our testing for us. What kind of testing do developers do? I've been a Linux user and have followed the LKML for a number of years and have yet to see any test plans for any submitted patches. My $.02 Steve Clark -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) --
You haven't looked closely then. While it's not very common there is a non trivial number of patches that describe how they got tested in the patch description. -Andi --
cross-posted to git for the suggestion at the bottom I've been reading LKML for 11 years now, I've tested kernels and reported a few bugs along the way. the expectation is that the submitter should have tested the patches before submitting them (where hardware allows). but that "where hardware allows" is a big problem. so many issues are dependent on hardware that it's not possible to test everything. there are people who download, compile and test the tree nightly (with farms of machines to test different configs), but they can't catch everything. expecting the patches to be tested to the point where there are no bugs is unreasonable. bisecting is a very powerful tool, but I do think that sometimes developers lean on it a bit much. taking the attitude (as some have) that 'if the reporter can't be bothered to do a bisection I can't be bothered to deal with the bug' is going way too far. if a bug can be reproduced reliably on a test system then bisecting it may reveal the patch that introduced or unmasked the bug (assuming that there aren't other problems along the way), but if the bug takes a long time to show up after a boot, or only happens under production loads, bisecting it may not be possible. that doesn't mean that the bug isn't real, it just means that the user is going to have to stick with an old version until there is a solution or work-around. even in the hard-to-test situations, the reporter is usually able to test a few fixes, but there's a big difference between going to management and saying "the kernel gurus think that this will help, can we test it this weekend" 2-3 times and doing a bisection that will take 10-15 cycles to find the problem. it's very reasonable to ask the reporter if they can bisect the problem, but if they say that they can't, declaring that they are out of luck is not reasonable, it just means that it's going to take more thinking to find the problem instead of being able to let the mechanical bisect pr...
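The "10-15 cycles" figure follows from the halving that bisection does: the worst case is roughly the base-2 logarithm of the number of candidate commits, rounded up. A small Python sketch of that arithmetic, assuming an idealised linear history (the function name `bisect_cycles` is made up for this note):

```python
def bisect_cycles(n_commits: int) -> int:
    """Worst-case number of build/test cycles to isolate one bad commit
    among n_commits candidates, i.e. ceil(log2(n_commits)).

    Uses integer arithmetic to avoid floating-point rounding.
    """
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (n_commits - 1).bit_length()
```

Under this idealisation, about a thousand candidate commits cost 10 cycles and about thirty-two thousand cost 15, so the range quoted above corresponds to a window of that order between the last good and first bad kernel.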
[...] Agreed. The difficulty is that only the developer knows how confident he is in his code. Even the subsystem maintainer does not know, which is the real issue since as long as the code is not identified, he does not know whom to ping. And I think that it might help if we could add a "Trust" rating to the patches we submit, similarly to "Tested-By" or "Signed-off-by". We could use 1 to 5. Basically, when the patch was completed at 3am and just builds, it's more likely 1/5. When it has been stressed for 1 week, it would be 4/5. 5/5 would only be used in backports of known working code, for some widely used external patches, or for trivial patches (eg: doc/whitespace fixes). The goal would clearly not be to just trust patches with a high rate (since they might break when associated with others), but for the subsystem maintainer to quickly check if there are some of them the author does not 100% trust, in which case he could ping the author to check if his patch *may* cause the reported problem. What makes this rating system delicate is that the rate cannot be changed afterwards. But after all, that's not much of a problem. A bug may very well reveal itself one year after the code was merged, so it's really the developer's estimation which matters. For this to be efficiently used, we would need git-commit to accept a new "-T <rating>" argument with the following possible values :
0: untested (default)
1: builds
2: seems to be working
3: passed basic non-regression tests
4: survived stress testing at the developer's
5: known to be working for a long time somewhere else
I'm sure many people would find this useless (or in fact reject the idea because it would show that most code will be rated 1 or 2), but I really think it can help subsystem maintainers relate a reported bug to a possible submitter. Willy --
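For illustration only, here is a rough sketch of how a maintainer-side script might read such a rating back out of commit messages. Nothing like this exists in git: the "Trust: <n>" trailer spelling and the helper names trust_rating() and low_trust() are assumptions made up for this example; only the 0 to 5 scale comes from the proposal above.

```python
import re

# Rating scale as listed in the proposal above.
TRUST_LEVELS = {
    0: "untested (default)",
    1: "builds",
    2: "seems to be working",
    3: "passed basic non-regression tests",
    4: "survived stress testing at the developer's",
    5: "known to be working for a long time somewhere else",
}

# ASSUMPTION: a "Trust: <n>" trailer line; the proposal does not define a format.
TRUST_RE = re.compile(r"^Trust:[ \t]*([0-5])[ \t]*$", re.MULTILINE)


def trust_rating(commit_message):
    """Return the rating from the last 'Trust:' trailer, or 0 if there is none."""
    matches = TRUST_RE.findall(commit_message)
    return int(matches[-1]) if matches else 0


def low_trust(commits, threshold=2):
    """Given (commit_id, message) pairs, return (commit_id, rating) pairs
    rated at or below the threshold, lowest rating first."""
    rated = [(cid, trust_rating(msg)) for cid, msg in commits]
    return sorted((p for p in rated if p[1] <= threshold), key=lambda p: p[1])
```

The point of low_trust() is the one made in the proposal: not to trust highly rated patches blindly, but to let a maintainer quickly see which patches the author himself was unsure about, and ping that author first.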
On Mon, Apr 14, 2008 at 06:39:39AM +0200, Willy Tarreau wrote:
I have a related proposal: let us require all patches to be stamped
with Discordian *and* Eternal September dates. In triplicate. While
we are at it, why don't we introduce new mandatory headers like, say
it,
X-checkpatch: {Yes,No}
X-checkpatch-why-not: <string>
X-pointless: <number from 1 to 69, going from "1: does something useful" all
the way to "68: aligns right ends of lines in comments">
X-arbitrary-rules-added-to-CodingStyle: <number> (should be present if
and only if X-pointless: 69 is present).
Come to think of that, we clearly need a new file in Documentation/*,
documenting such headers. Why don't we organize a subcommittee^Wnew maillist
devoted to that? That would provide another entry route for contributors,
lowering the overall entry barriers even further...
Seriously, looks like Andi is right - we've got ourselves a developing
bureaucracy. As in "more and more ways of generating activity without
doing anything even remotely useful". Complete with tendency to operate in
the ways that make sense only to bureaucracy in question and an ever-growing
set of bylaws...
--
No. The problem we're discussing here is the apparently-large number of bugs which are in the kernel, the apparently-large number of new bugs which we're adding to the kernel, and our apparent tardiness in addressing them. Do you agree with these impressions, or not? If you do agree, what would you propose we do about it? --
Does that mean you're not going to take patches that align the right end of lines in comments? :-( Rene. --
On Mon, 14 Apr 2008 21:13:41 +0200
erm, was that ":-(" supposed to be a ":-)"?
I don't like to merge patches which fix typos and spellos and grammaros
in comments, simply because I'd be buried in the things. I do take such
fixes for user-visible text (Documentation/, kerneldoc comments and
printks).
Right-justification of comments would fall rather a long way below spelling
fixes.
--
The ":-(" was supposed to add to the implicitly obvious ":-)". That is, was
You, particularly, seem to be very good at picking up trivia. I've posted
completely trivial patches from time to time for small things I encounter
while looking at something else. Things at the "are people going to look
funny at me for even bothering or..." level but you picking them up means
it's still useful to post, so I sometimes do.
Now, in fact, Linux as a _whole_ doesn't seem bad at accepting that kind of
small janitorial stuff but I have been noticing some backlash to it as well.
I'm not sure it's worse or better than historically, but the "checkpatch
syndrome" certainly triggers more of it.
Al specifically wanted more new eyes but the way to reward those new eyes is
accepting their small changes. Al also specifically doesn't like those small
changes when they are at the automated and semi-brainless checkpatch level.
I believe the janitorial work has been over-organized, both through the
kernel-janitors and checkpatch since while these are very useful in guiding
a newbie in _what_ to do they cause "automated" huge tree-wide trivia storms
which people then don't react overly favourably to and the new eyes who did
all that work of generating it all dim again...
Frankly, the kernel really is fairly complex these days when starting at 0.
Much more complex certainly than, say, back in 2.0 or 2.2 days and while
Al's scenario of per-subsystem reviews might be good, I don't believe it's
very realistic. Companies don't pay to have those done and for newbies it's
generally too complex since understanding most parts of the kernel fully
requires understanding most of the rest of the kernel rather well also.
So you get the really promising newbies? Yeah, that, or you don't get anyone
and if some promising newbies are building up 137 part checkpatch inspired
patchsets that don't help none.
So, what am I saying (what _am_ I saying?!?) ...
I seemed to observe somewhat of an interna... --
In addition to the obvious "we need testing and something better than bugzilla to keep track of bugs"? Real review of code in tree and patches getting into the tree. And the latter part _must_ be done on each entry point. Any git tree that acts as injection point really needs a working mechanism of some sort that would do that; afterwards it's too late, since review of the stuff getting into mainline on a massive merge is sadly impractical. I don't know any formal mechanism that could take care of that; no more than making sure that no backdoors are injected into the tree. It really has to be a matter of trust for tree maintainers and community around the subsystem. Git is damn good at killing the merge bottleneck. Too good, since it hides the review bottleneck. And we get equivalents of self-selected communities that had been a problem for "here's our CVS, here's monthly dump from it, apply" kind of setups. It _is_ better, since one can get to commit history (modulo interesting issues with merge nodes and conflict resolution). But in practice it's not good enough - the patches going in during a merge (especially for a tree that collects from secondaries) are not visible enough. And it's too late at that point, since one has to do something monumentally ugly to get Linus to revert a large merge. On the scale of Great IDE Mess in 2.5... linux-next might help with the last part, but I don't think it really deals with the first one. It certainly helps to some extent, but... We need higher S/N on l-k. We need people looking into the subsystem trees as those grow and causing a stench when bad things are found, with design issues getting brought to l-k if nothing else helps. We need tree maintainers understanding that review, including out-of-community one, is needed (the need of testing is generally better understood - I _hope_). We need more people reading the fscking source. Subsystem by subsystem.
Without assumption that code is not broken. With mechanism collating ...
There is currently little incentive for developers to perform review. It's difficult work, and is generally not rewarded or recognized, except in often quite negative ways. There is a small handful of people who do a lot of review, but they are exceptional in various ways.
OTOH, writing code is relatively simple, and is much more highly rewarded:
- People tend to get paid to write kernel code, but not so much to review it.
- Things like "who made the kernel" statistics and related articles ignore code review.
- Creating new features is perceived as the highest form of contribution for general developers, and likely important as career currency (similar to the publish or perish model in the academic world).
I don't know how to solve this, but suspect that encouraging the use of reviewed-by and also including it in things like analysis of who is contributing, selection for kernel summit invitations etc. would be a start. At least, better than nothing.
- James
--
James Morris <jmorris@namei.org> --
Would it be hard to keep count of the number of errors introduced by author and reviewer? --
I'm not subscribed to the kernel mailing list, so please include me in the cc if you don't reply to the git list (which I am subscribed to). Git is participating in Google Summer of Code this year and I've proposed to write a 'git statistics' command. This command would allow the user to gather data about a repository, ranging from "how active is dev x" to "what did x work on in the last 3 weeks". Its main feature, however, would be an algorithm that ranks commits as being either 'buggy', 'bugfix' or 'enhancement'. (There are several clues that can aid in determining this, a commit msg along the lines of "fixes ..." being the most obvious.) In the light of this recent discussion, especially the part on "keeping count of the number of errors introduced by author and reviewer?
