A thread on the Linux Kernel mailing list discussed the process in place for reporting, bisecting and fixing bugs. In response to a suggestion that some of the issues could be solved by introducing new procedures, Al Viro retorted, "we've got ourselves a developing beaurocracy. As in 'more and more ways of generating activity without doing anything even remotely useful'. Complete with tendency to operate in the ways that make sense only to bureaucracy in question and an ever-growing set of bylaws..." Later in the thread, David Miller agreed and noted that "the resulting 'bureaucracy' or whatever you want to call it is perceived to undercut the very thing that makes the Linux kernel fun to work on. It's still largely free form, loose, and flexible. And that's a notable accomplishment considering how much things have changed. That feeling is why I got involved in the first place, and I know it's what gets other new people in and addicted too."
Andrew Morton tried to return the discussion to its original topic: "the problem we're discussing here is the apparently-large number of bugs which are in the kernel, the apparently-large number of new bugs which we're adding to the kernel, and our apparent tardiness in addressing them." Al noted that part of the problem is that git is so efficient at merging code that it hides the review bottleneck: "the patches going in during a merge (especially for a tree that collects from secondaries) are not visible enough. And it's too late at that point, since one has to do something monumentally ugly to get Linus revert a large merge. On the scale of Great IDE Mess in 2.5..." When it was suggested that Bugzilla be replaced with something better, Andrew replied, "swapping out bugzilla for something else wouldn't help. We'd end up with lots of people ignoring a good bug tracking system just like they were ignoring a bad one."
From: David Miller <davem@...>
Subject: Re: 2.6.25-rc8: FTP transfer errors
Date: Apr 10, 8:24 pm 2008

From: Mark Lord <lkml@rtr.ca>
Date: Thu, 10 Apr 2008 20:16:11 -0400

> Duh.. more like, "If I take 5-8 hours to attempt a bisect (which may not
> even work), then that's 5-8 hours I do not get paid for."

And if I invest my spare time on your bug
how does this statement apply to me?

Or does it only apply to you?

Every single argument you make that supports why you should not be investing the necessary time into the bug applies equally to the very developers you are so quickly to quip at and want help from.
--
From: Mark Lord <lkml@...>
Subject: Re: 2.6.25-rc8: FTP transfer errors
Date: Apr 10, 8:27 pm 2008

David Miller wrote:
> From: Mark Lord <lkml@rtr.ca>
> Date: Thu, 10 Apr 2008 20:16:11 -0400
>
>> Duh.. more like, "If I take 5-8 hours to attempt a bisect (which may not
>> even work), then that's 5-8 hours I do not get paid for."
>
> And if I invest my spare time on your bug
> how does this statement apply to me?
..

It's not "my bug". I'm just the first person to notice, take time to report it, and even hand it to you on a platter (bisect).

It's *your* bug -- you signed off on the commit.

Cheers
--
From: David Miller <davem@...>
Subject: Re: 2.6.25-rc8: FTP transfer errors
Date: Apr 10, 8:39 pm 2008

From: Mark Lord <lkml@rtr.ca>
Date: Thu, 10 Apr 2008 20:27:14 -0400

> It's *your* bug -- you signed off on the commit.

I sign off on basically every networking commit, does that mean I have
to fix every networking bug and every networking bug is "mine"?

Of course not, that doesn't scale at all. What does scale is a combination of good fully formed bug reports from users combined with the efforts of the global developer pool.

Linus signs off on every patch from Andrew Morton he puts into the tree, which is a lot, but does Linus work on every bug introduced by one of those patches and are such bugs "his" bugs?

Of course he doesn't, and of course not. They get pushed up to the person who wrote the patch once identified as such, and the patch is reverted if the developer is unresponsive and this will have consequences for patches they submit in the future.

I still think you have a very self-centered attitude about things. This is about distributing effort, not forcing it upon individuals or a constrained resource. If I get hit by a bus, networking bugs would still get fixed if handled properly.

And it's a win-win situation. The incentive for a capable user to do a bisect or whatever else is that if they do it their bug gets fixed quickly.

That is the free market economy of Linux kernel bug reporting. It addresses the issue that in reality we'll never fix all bugs, and therefore we prioritize.

And therefore if there is a bisected bug report and also another one from a user who refuses to do that, guess which bug gets worked on with a higher priority and which bug gets fixed first?
--
From: Mark Lord <lkml@...>
Subject: Re: 2.6.25-rc8: FTP transfer errors
Date: Apr 10, 9:23 pm 2008

David Miller wrote:
> From: Mark Lord <lkml@rtr.ca>
> Date: Thu, 10 Apr 2008 20:27:14 -0400
>
>> It's *your* bug -- you signed off on the commit.
>
> I sign off on basically every networking commit, does that mean I have
> to fix every networking bug and every networking bug is "mine"?
..

Absolutely, though to a varying degree. That's the responsibility that goes with the role of a subsystem maintainer. I once had such a role, and gave it up when I felt I could no longer keep up.

You still keep refering to it as "your (my) bug".
It's not. I had nothing to do with it, other than stumbling over it.

When people stumble over a libata bug, I look hard to see if my code could possibly cause it. Jeff looks even harder, because he's the current subsystem dude for libata.

I never suggest a user search through a mountain of unrelated commits for something I've screwed up on. I give more directed help, patches to collect more relevant information, and patches to try and resolve it.

The last thing I'd ever do, is diss the reporter.

Regards.
--
From: <Valdis.Kletnieks@...>
Subject: Re: 2.6.25-rc8: FTP transfer errors
Date: Apr 11, 3:58 pm 2008

On Thu, 10 Apr 2008 21:23:54 EDT, Mark Lord said:

> You still keep refering to it as "your (my) bug".
> It's not. I had nothing to do with it, other than stumbling over it.

Like it or not, when you're the owner of the only box that can reliably reproduce an error condition, it's your bug. Been there, done that, plenty of times.
From: Tilman Schmidt <tilman@...>
Subject: Re: 2.6.25-rc8: FTP transfer errors
Date: Apr 11, 6:27 pm 2008

On Fri, 11 Apr 2008 15:58:42 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Thu, 10 Apr 2008 21:23:54 EDT, Mark Lord said:
>
>> You still keep refering to it as "your (my) bug".
>> It's not. I had nothing to do with it, other than stumbling over it.
>
> Like it or not, when you're the owner of the only box that can reliably
> reproduce an error condition, it's your bug.

Thanks for the advice. I'll keep it in mind next time I have to decide whether to report a bug I'm stumbling over.

T.
From: Rafael J. Wysocki <rjw@...> Subject: Reporting bugs and bisection (was: Re: 2.6.25-rc8: FTP transfer errors) [1]Date: Apr 13, 2:40 pm 2008 On Saturday, 12 of April 2008, Tilman Schmidt wrote: > On Fri, 11 Apr 2008 15:58:42 -0400, Valdis.Kletnieks@vt.edu [2] wrote: > > On Thu, 10 Apr 2008 21:23:54 EDT, Mark Lord said: > > > >> You still keep refering to it as "your (my) bug". > >> It's not. I had nothing to do with it, other than stumbling over it. > > > > Like it or not, when you're the owner of the only box that can reliably > > reproduce an error condition, it's your bug. > > Thanks for the advice. I'll keep it in mind next time I have to decide > whether to report a bug I'm stumbling over. Well, the fact is, reporting bugs is always welcome. However, it may not be immediately obvious what causes the bug to appear as well as the bug need not be readily reproducible on any other system than yours, at least at the moment. In which case whether or not the bug will be fixed depends on the reporter. Namely, if the reporter wants and has the time to provide developers with additional information, the bug has a good chance to be fixed. Otherwise, it'll probably stay there until there's a more persistent reporter or it's fixed as a result of a related change. So, if people ask you to do a bisection, they probably mean "we don't see what the problem is and can't reproduce it, so please get us more information, otherwise we won't know how to fix it". In that case, you could provide them with a reproducible test case just as well. That said, there may be some developers who just don't want to spend time on analysing code and put the burden of finding the offending change on the reporter, but I don't think it's common practice. Thanks, Rafael --
From: Andrew Morton <akpm@...>
Subject: Re: Reporting bugs and bisection (was: Re: 2.6.25-rc8: FTP transfer errors)
Date: Apr 13, 3:18 pm 2008

On Sun, 13 Apr 2008 20:47:30 +0200 Willy Tarreau <w@1wt.eu> wrote:

> One other thing which might get confusing/frustrating on the
> user side is that currently, Linux is the *only* product which requires
> the bug reporter to find the fault change

That's because many (probably most) Linux bugs are dependent upon the hardware which they run on, and developers cannot reproduce the failure on their hardware. Other software products don't have that problem.

That being said.. four or five years ago, developers would often work closely with the reporter working out why the reporter's failure was occurring. Several days of back-and-forth.

We dont' do that as much nowadays - there's a tendency to

a) throw the problem back at the reporter, often asking them to bisect.
   If the reporter is running a distro kernel (eg: Fedora) then that's
   quite hard, and often isn't a think they have knowledge to do. So
   they'll just disappear. Or

b) just ignore the report altogether.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Evgeniy Polyakov <johnpol@...> Subject: Re: Reporting bugs and bisection (was: Re: 2.6.25-rc8: FTP transfer errors) [4]Date: Apr 13, 4:21 pm 2008 Hi. On Sun, Apr 13, 2008 at 12:18:31PM -0700, Andrew Morton (akpm@linux-foundation.org [5]) wrote: > > One other thing which might get confusing/frustrating on the > > user side is that currently, Linux is the *only* product which requires > > the bug reporter to find the fault change > > That's because many (probably most) Linux bugs are dependent upon the > hardware which they run on, and developers cannot reproduce the failure on > their hardware. Other software products don't have that problem. Bugs are bugs, they either depend on hardware or do not. There is no perfect world where after reporting subtle bug it will be fixed. It is not Linux, it is everywhere. Bugs are only fixed when they have major impact. Only. Either by having exploit, or crash, or good testcase. Or bisect result. This just a tool to help both parties. And a huge help for regressions. If bug would exist for years, bisection unlikely to help. > That being said.. four or five years ago, developers would often work > closely with the reporter working out why the reporter's failure was > occurring. Several days of back-and-forth. Yeah, spent two weeks kicking all possible stuff around and eventually drop that namespace patch at all to find where the problem was. We started to move further. Bisect is just a tool. It is not something developers throw into user when they do not want to work. This _is_ a help, which allows both to solve problem in the fastest way. If the same would be done on developers machine and huge patches would be sent to jump between changesets, that would be a real 'work closely with the reporter working out why the reporter's failure was occurring'? You pointed it yourself: several days of back-and-forth. With this helping automation tool called bisect bug was resolved in 15 minutes after completion. 
Completion itself took couple of hours.

> We dont' do that as much nowadays - there's a tendency to
>
> a) throw the problem back at the reporter, often asking them to bisect.
>    If the reporter is running a distro kernel (eg: Fedora) then that's
>    quite hard, and often isn't a think they have knowledge to do. So
>    they'll just disappear. Or
>
> b) just ignore the report altogether.

There is also global warming tendency. IIRC.

Bugs _are_ fixed, Andrew. And developers did not change suddenly to selfish bastards who do not care for users. They just developed a tool, which greatly helps to both and saves lots of users time, since regression gets fixed with this tool really quickly. Bisect is not asked to be performed without a reason.

For subtle bug it is the fastest way, but otherwise there might be a long conversation. And even in this really subtle case there was a dialog.

Bisect automation does not add kind relations though, but we can ask Linus to add couple of smiles into the output.

-- 
Evgeniy Polyakov
--
From: David Miller <davem@...> Subject: Re: Reporting bugs and bisection [7]Date: Apr 13, 4:35 pm 2008 From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Mon, 14 Apr 2008 00:21:18 +0400 > If the same would be done on developers machine and huge patches would > be sent to jump between changesets, that would be a real 'work closely > with the reporter working out why the reporter's failure was occurring'? In fact, this is what Andrew's so-called "back and forth with the bug reporter" used to mainly consist of. Asking the user to try this patch or that patch, which most of the time were reverts of suspect changes. Which, surprise surprise, means we were spending lots of time bisecting things by hand. We're able to automate this now and it's not a bad thing. --
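Editorial aside: the automation David refers to is, at its core, a binary search over the commit history. The sketch below is a minimal illustration in Python, not git's actual implementation. It assumes a strictly linear history in which the first commit is known good, the last is known bad, and every commit after the first bad one also fails; the names `bisect_first_bad` and `is_bad`, and the toy commit labels, are invented for the example.

```python
from typing import Callable, Sequence, Tuple


def bisect_first_bad(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of kernels built and tested).

    Assumes commits[0] is good, commits[-1] is bad, and that the history
    flips from good to bad exactly once (a hypothetical linear history).
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1  # one build-boot-test cycle for the reporter
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Toy history: 1025 commits, regression introduced by commit "c700".
commits = [f"c{i}" for i in range(1025)]
first_bad, tests = bisect_first_bad(commits, lambda c: int(c[1:]) >= 700)
print(first_bad, tests)
```

Each halving costs the reporter one build-and-test cycle, so the number of cycles grows with the logarithm of the number of suspect commits, which is why a mechanical search can replace days of hand-picked reverts.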
From: Rafael J. Wysocki <rjw@...> Subject: Re: Reporting bugs and bisection (was: Re: 2.6.25-rc8: FTP transfer errors) [7]Date: Apr 13, 4:33 pm 2008 On Sunday, 13 of April 2008, Evgeniy Polyakov wrote: > Hi. > > On Sun, Apr 13, 2008 at 12:18:31PM -0700, Andrew Morton (akpm@linux-foundation.org [8]) wrote: > > > One other thing which might get confusing/frustrating on the > > > user side is that currently, Linux is the *only* product which requires > > > the bug reporter to find the fault change > > > > That's because many (probably most) Linux bugs are dependent upon the > > hardware which they run on, and developers cannot reproduce the failure on > > their hardware. Other software products don't have that problem. > > Bugs are bugs, they either depend on hardware or do not. > There is no perfect world where after reporting subtle bug it will be > fixed. It is not Linux, it is everywhere. Bugs are only fixed when > they have major impact. Only. Either by having exploit, or crash, > or good testcase. Or bisect result. > > This just a tool to help both parties. And a huge help for regressions. > If bug would exist for years, bisection unlikely to help. > > > That being said.. four or five years ago, developers would often work > > closely with the reporter working out why the reporter's failure was > > occurring. Several days of back-and-forth. > > Yeah, spent two weeks kicking all possible stuff around and eventually > drop that namespace patch at all to find where the problem was. We > started to move further. > > Bisect is just a tool. It is not something developers throw into user > when they do not want to work. This _is_ a help, which allows both to > solve problem in the fastest way. > > If the same would be done on developers machine and huge patches would > be sent to jump between changesets, that would be a real 'work closely > with the reporter working out why the reporter's failure was occurring'? 
> > You pointed it yourself: several days of back-and-forth. > With this helping automation tool called bisect bug was resolved in 15 > minutes after completion. Completion itself took couple of hours. > > > We dont' do that as much nowadays - there's a tendency to > > > > a) throw the problem back at the reporter, often asking them to bisect. > > If the reporter is running a distro kernel (eg: Fedora) then that's > > quite hard, and often isn't a think they have knowledge to do. So > > they'll just disappear. Or > > > > b) just ignore the report altogether. > > There is also global warming tendency. IIRC. > > Bugs _are_ fixed, Andrew. And developers did not change suddenly to > selfish bastards who do not care for users. They just developed a tool, > which greatly helps to both and saves lots of users time, since > regression gets fixed with this tool really quickly. Bisect is not asked > to be performed without a reason. To be honest, at least in one case no one reacted to my report(s) until I ran a bisection and then it turned up an obviously broken patch. The breakage was so obvious that if anyone had actually looked at the code in question, he would have see it immediately. Things like this are very disappointing and have a very negative impact on bug reporters. We should do our best to avoid them. Thanks, Rafael --
From: Evgeniy Polyakov <johnpol@...> Subject: Re: Reporting bugs and bisection (was: Re: 2.6.25-rc8: FTP transfer errors) [8]Date: Apr 13, 4:54 pm 2008 On Sun, Apr 13, 2008 at 10:33:49PM +0200, Rafael J. Wysocki (rjw@sisk.pl [9]) wrote: > Things like this are very disappointing and have a very negative impact on bug > reporters. We should do our best to avoid them. Shit happens. This is a matter of either bug report or those who were in the copy list. There are different people and different situations, in which they do not reply. -- Evgeniy Polyakov --
From: Stephen Clark <sclark46@...> Subject: Re: Reporting bugs and bisection [9]Date: Apr 13, 6:24 pm 2008 Evgeniy Polyakov wrote: > On Sun, Apr 13, 2008 at 10:33:49PM +0200, Rafael J. Wysocki (rjw@sisk.pl [10]) wrote: >> Things like this are very disappointing and have a very negative impact on bug >> reporters. We should do our best to avoid them. > > Shit happens. This is a matter of either bug report or those who were in > the copy list. There are different people and different situations, in > which they do not reply. > Well less shit would happen if developers would take the time to at least test their patches before they were submitted. It like we will just have the poor user do our testing for us. What kind of testing do developers do. I been a linux user and have followed the LKML for a number of years and have yet to see any test plans for any submitted patches. My $.02 Steve Clark -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) --
From: <david@...> Subject: Re: Reporting bugs and bisection [10]Date: Apr 13, 7:51 pm 2008 cross-posted to git for the suggestion at the bottom On Sun, 13 Apr 2008, Stephen Clark wrote: > Evgeniy Polyakov wrote: >> On Sun, Apr 13, 2008 at 10:33:49PM +0200, Rafael J. Wysocki (rjw@sisk.pl [11]) >> wrote: >>> Things like this are very disappointing and have a very negative impact on >>> bug >>> reporters. We should do our best to avoid them. >> >> Shit happens. This is a matter of either bug report or those who were in >> the copy list. There are different people and different situations, in >> which they do not reply. >> > Well less shit would happen if developers would take the time to at least > test their patches before they were submitted. It like we will just have the > poor user do our testing for us. What kind of testing do developers do. I > been a linux user and have followed the LKML for a number of years and have > yet to see > any test plans for any submitted patches. I've been reading LKML for 11 years now, I've tested kernels and reported a few bugs along the way. the expectation is that the submitter should have tested the patches before submitting them (where hardware allows). but that "where hardware allows" is a big problem. so many issues are dependant on hardwre that it's not possible to test everything. there are people who download, compile and test the tree nightly (with farms of machines to test different configs), but they can't catch everything. expecting the patches to be tested to the point where there are no bugs is unreasonable. bisecting is a very powerful tool, but I do think that sometimes developers lean on it a bit much. taking the attitude (as some have) that 'if the reporter can't be bothered to do a bisection I can't be bothered to deal with the bug' is going way too far. 
if a bug can be reproduced reliably on a test system then bisecting it may reveal the patch that introduced or unmasked the bug (assuming that there aren't other problems along the way), but if the bug takes a long time to show up after a boot, or only happens under production loads, bisecting it may not be possible. that doesn't mean that the bug isn't real, it just means that the user is going to have to stick with an old version until there is a solution or work-around. even in the hard-to-test situations, the reporter is usually able to test a few fixes, but there's a big difference between going to management and saying "the kernel guru's think that this will help, can we test it this weekend" 2-3 times and doing a bisection that will take 10-15 cycles to find the problem. it's very reasonable to ask the reporter if they can bisect the problem, but if they say that they can't, declaring that they are out of luck is not reasonable, it just means that it's going to take more thinking to find the problem instead of being able to let the mechanical bisect process narrow things down for you. it may mean that the developer will need to make a patch to instrament an old (working) kernel that has minimal impact on that kernel so that the reporter can run this to gather information about what the load is so that the developer can try to simulate it on a new (non-working) kernel in theory everyone has a test environment that lets them simulate everything in their production envrionment. in practice this is only true at the very low end (where it's easy to do) and the very high end (where it's so critical that it's done no matter how much it costs). Everyone else has a test environment that can test most things, but not everything. As such when they run into a problem they may not be able to do lots of essentially random testing. 
elsewhere in this thread someone said that the pre-git way was to do a manual bisect where the developer would send patches backing out specific changes to find the problem. one big difference between tat and bisecting the problem is that the manual process was focused on the changes in the area that is suspected of causing the problem, while the git bisect process goes after all changes. this makes it much more likely that the tester will run into unrelated problems along the way. I wonder if it would be possible to make a variation of git bisect that only looked at a subset of the tree when picking bisect points (if you are looking for a e1000 bug, testing bisect points that haven't changed that driver won't help you for example). If this can be done it would speed up the reporters efforts, but will require more assistance from the developers (who would need to tell the reporters what subtrees to test) so it's a tradeoff of efficiancy vs simplicity. David Lang --
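Editorial aside: David Lang's idea of restricting bisection to a subtree can be illustrated with a small Python sketch. It only models the effect on the number of candidate commits; the toy history, the file paths attached to each commit, and the helper names `candidates_touching` and `worst_case_tests` are assumptions made for the example, not part of git.

```python
import math
from typing import Dict, List, Set


def candidates_touching(history: Dict[str, Set[str]], prefix: str) -> List[str]:
    """Keep only the commits that changed at least one file under `prefix`."""
    return [commit for commit, files in history.items()
            if any(path.startswith(prefix) for path in files)]


def worst_case_tests(n_candidates: int) -> int:
    """Upper bound on test cycles a binary search needs for n candidates."""
    return math.ceil(math.log2(n_candidates)) if n_candidates > 1 else 0


# Toy history (hypothetical): 4096 commits, every 16th touches the e1000 driver.
history = {
    f"c{i}": ({"drivers/net/e1000/e1000_main.c"} if i % 16 == 0
              else {"fs/namei.c"})
    for i in range(4096)
}
subset = candidates_touching(history, "drivers/net/e1000/")

print(len(history), worst_case_tests(len(history)))
print(len(subset), worst_case_tests(len(subset)))
```

The tradeoff David describes shows up directly: fewer candidates means fewer test cycles, but the search is only valid if the culprit really is inside the chosen subtree. For what it is worth, `git bisect start` accepts path arguments after `--`, which limits the bisection to commits touching those paths.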
From: Willy Tarreau <w@...> Subject: Re: Reporting bugs and bisection [11]Date: Apr 14, 12:39 am 2008 On Sun, Apr 13, 2008 at 04:51:34PM -0700, david@lang.hm [12] wrote: > cross-posted to git for the suggestion at the bottom > > On Sun, 13 Apr 2008, Stephen Clark wrote: > > >Evgeniy Polyakov wrote: > >>On Sun, Apr 13, 2008 at 10:33:49PM +0200, Rafael J. Wysocki (rjw@sisk.pl [13]) > >>wrote: > >>>Things like this are very disappointing and have a very negative impact > >>>on bug > >>>reporters. We should do our best to avoid them. > >> > >>Shit happens. This is a matter of either bug report or those who were in > >>the copy list. There are different people and different situations, in > >>which they do not reply. > >> > >Well less shit would happen if developers would take the time to at least > >test their patches before they were submitted. It like we will just have > >the poor user do our testing for us. What kind of testing do developers > >do. I been a linux user and have followed the LKML for a number of years > >and have yet to see > >any test plans for any submitted patches. > > I've been reading LKML for 11 years now, I've tested kernels and reported > a few bugs along the way. > > the expectation is that the submitter should have tested the patches > before submitting them (where hardware allows). but that "where hardware > allows" is a big problem. so many issues are dependant on hardwre that > it's not possible to test everything. > > there are people who download, compile and test the tree nightly (with > farms of machines to test different configs), but they can't catch > everything. > > expecting the patches to be tested to the point where there are no bugs is > unreasonable. [...] Agreed. The difficulty is that only the developer knows how confident he is in his code. Even the subsystem maintainer does not know, which is the real issue since as long as the code is not identified, he does not know whom to ping. 
And I think that it might help if we could add a "Trust" rating to the patches we submit, similarly to "Tested-By" or "Signed-off-by". We could use 1 to 5. Basically, when the patch was completed at 3am and just builds, it's more likely 1/5. When it has been stressed for 1 week, it would be 4/5. 5/5 would only be used in backports of known working code, for some wide-used external patches, or for trivial patches (eg: doc/whitespace fixes).

The goal would clearly not be to just trust patches with a high rate (since they might break when associated with others), but for the subsystem maintainer to quickly check if there are some of them the author does not 100% trust, in which case he could ping the author to check if his patch *may* cause the reported problem.

What makes this rating system delicate is that the rate cannot be changed afterwards. But after all, that's not much of a problem. A bug may very well reveal itself one year after the code was merged, so it's really the developer's estimation which matters.

For this to be efficiently used, we would need git-commit to accept a new "-T <rating>" argument with the following possible values :

  0: untested (default)
  1: builds
  2: seems to be working
  3: passed basic non-regression tests
  4: survived stress testing at the developer's
  5: known to be working for a long time somewhere else

I'm sure many people would find this useless (or in fact reject the idea because it would show that most code will be rated 1 or 2), but I really think it can help subsystem maintainers make the relation between a reported bug and a possible submitter.

Willy
--
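Editorial aside: Willy's proposal was never a git feature, but the way a maintainer might use it can be sketched in Python. Everything here is hypothetical: the `Trust: <n>` trailer format, the `trust_rating` and `authors_to_ping` helpers, and the sample commit messages are invented to illustrate the idea. Only the 0 to 5 scale is taken from his message.

```python
import re
from typing import Dict, List

# Scale as proposed in the message above.
TRUST_LEVELS = {
    0: "untested (default)",
    1: "builds",
    2: "seems to be working",
    3: "passed basic non-regression tests",
    4: "survived stress testing at the developer's",
    5: "known to be working for a long time somewhere else",
}

# Hypothetical trailer line, modelled on "Signed-off-by:" style tags.
TRUST_RE = re.compile(r"^Trust:[ \t]*([0-5])[ \t]*$", re.MULTILINE)


def trust_rating(message: str) -> int:
    """Return the rating from a commit message, or 0 if no trailer is present."""
    match = TRUST_RE.search(message)
    return int(match.group(1)) if match else 0


def authors_to_ping(commits: Dict[str, str], threshold: int = 2) -> List[str]:
    """List authors whose commits are rated at or below `threshold`."""
    return sorted(author for author, message in commits.items()
                  if trust_rating(message) <= threshold)


# Hypothetical commit messages keyed by author.
commits = {
    "alice": "net: fix checksum\n\nTrust: 4\nSigned-off-by: alice",
    "bob": "net: rework queueing\n\nTrust: 1\nSigned-off-by: bob",
    "carol": "net: tidy comments\n\nSigned-off-by: carol",
}
print(authors_to_ping(commits))
```

Treating a missing trailer as 0 follows the "untested (default)" entry in the proposed scale, so unrated patches are the first ones a maintainer would ask about.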
From: Al Viro <viro@...>
Subject: Re: Reporting bugs and bisection
Date: Apr 14, 1:39 am 2008

On Mon, Apr 14, 2008 at 06:39:39AM +0200, Willy Tarreau wrote:

[snip]

> I'm sure many people would find this useless (or in fact reject the
> idea because it would show that most code will be rated 1 or 2),
> but I really think it can help subsystem maintainers make the relation
> between a reported bug and a possible submitter.

I have a related proposal: let us require all patches to be stamped with Discordian *and* Eternal September dates. In triplicate. While we are at it, why don't we introduce new mandatory headers like, say it,

X-checkpatch: {Yes,No}
X-checkpatch-why-not: <string>
X-pointless: <number from 1 to 69, going from "1: does something useful" all the way to "68: aligns right ends of lines in comments">
X-arbitrary-rules-added-to-CodingStyle: <number> (should be present if and only if X-pointless: 69 is present).

Come to think of that, we clearly need a new file in Documentation/*, documenting such headers. Why don't we organize a subcommittee^Wnew maillist devoted to that? That would provide another entry route for contributors, lowering the overall entry barriers even further...

Seriously, looks like Andi is right - we've got ourselves a developing beaurocracy. As in "more and more ways of generating activity without doing anything even remotely useful". Complete with tendency to operate in the ways that make sense only to beaurocracy in question and an ever-growing set of bylaws...
--
From: Andrew Morton <akpm@...> Subject: Re: Reporting bugs and bisection [13]Date: Apr 14, 2:24 am 2008 On Mon, 14 Apr 2008 06:39:43 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote: > On Mon, Apr 14, 2008 at 06:39:39AM +0200, Willy Tarreau wrote: > > [snip] > > > I'm sure many people would find this useless (or in fact reject the > > idea because it would show that most code will be rated 1 or 2), > > but I really think it can help subsystem maintainers make the relation > > between a reported bug and a possible submitter. > > I have a related proposal: let us require all patches to be stamped > with Discordian *and* Eternal September dates. In triplicate. While > we are at it, why don't we introduce new mandatory headers like, say > it, > > X-checkpatch: {Yes,No} > X-checkpatch-why-not: <string> > X-pointless: <number from 1 to 69, going from "1: does something useful" all > the way to "68: aligns right ends of lines in comments"> > X-arbitrary-rules-added-to-CodingStyle: <number> (should be present if > and only if X-pointless: 69 is present). > > Come to think of that, we clearly need a new file in Documentation/*, > documenting such headers. Why don't we organize a subcommittee^Wnew maillist > devoted to that? That would provide another entry route for contributors, > lowering the overall entry barriers even further... > None of the above was particularly useful. > > Seriously, looks like Andi is right - we've got ourselves a developing > beaurocracy. As in "more and more ways of generating activity without > doing anything even remotely useful". Complete with tendency to operate in > the ways that make sense only to beaurocracy in question and an ever-growing > set of bylaws... No. The problem we're discussing here is the apparently-large number of bugs which are in the kernel, the apparently-large number of new bugs which we're adding to the kernel, and our apparent tardiness in addressing them. Do you agree with these impressions, or not? 
If you do agree, what would you propose we do about it? --
From: Al Viro <viro@...> Subject: Re: Reporting bugs and bisection [13]Date: Apr 14, 3:23 am 2008 On Sun, Apr 13, 2008 at 11:24:41PM -0700, Andrew Morton wrote: > No. The problem we're discussing here is the apparently-large number of > bugs which are in the kernel, the apparently-large number of new bugs which > we're adding to the kernel, and our apparent tardiness in addressing them. > > Do you agree with these impressions, or not? > > If you do agree, what would you propose we do about it? In addition to obvious "we need testing and something better than bugzilla to keep track of bugs"? Real review of code in tree and patches getting into the tree. And the latter part _must_ be done on each entry point. Any git tree that acts as injection point really needs a working mechanism of some sort that would do that; afterwards it's too late, since review of the stuff getting into mainline on a massive merge is sadly impractical. I don't know any formal mechanism that could take care of that; no more than making sure that no backdoors are injected into the tree. It really has to be a matter of trust for tree maintainers and community around the subsystem. Git is damn good at killing the merge bottleneck. Too good, since it hides the review bottleneck. And we get equivalents of self-selected communities that had been problem for "here's our CVS, here's monthly dump from it, apply" kind of setups. It _is_ better, since one can get to commit history (modulo interesting issues with merge nodes and conflict resolution). But in practice it's not good enough - the patches going in during a merge (especially for a tree that collects from secondaries) are not visible enough. And it's too late at that point, since one has to do something monumentally ugly to get Linus revert a large merge. On the scale of Great IDE Mess in 2.5... linux-next might help with the last part, but I don't think it really deals with the first one. It certainly helps to some extent, but... 
We need higher S/N on l-k. We need people looking into the subsystem trees as those grow and causing a stench when bad things are found, with design issues getting brought to l-k if nothing else helps. We need tree maintainers understanding that review, including out-of-community one, is needed (the need of testing is generally better understood - I _hope_). We need more people reading the fscking source. Subsystem by subsystem. Without assumption that code is not broken. With mechanism collating the questions asked and answers given. Ideally we need growing documentation of core subsystems and data structures, with explicit goal of helping reviewers new to an area to find their way around it. And yes, I'm guilty of procrastinating on that - several half-finished pieces on VFS-related stuff are sitting locally ;-/ We need gregkh to get real and stop assuming that two Signed-off-by are equivalent to "reviewed at least twice", while we are at it ;-) We need people to realize that warnings are useful as triage tools - not as "Ug see warning. Warning bad. Ug fix that line. Warning go away. Ug changeset count grow. Ug happy.", but as machine-assisted part of finding confused areas of code. With human combining signals from different warnings to get statistically useful triage strategies (note that aforementioned making gcc/sparse/whatnot to STFU by local change has a lovely potential of distorting those signals and actually _hiding_ crap code). Maybe we need a list a-la linux-arch for tree maintainers to coordinate stuff - obviously open not only for those. We really need to get around to doing triage of remaining stuff in -mm, BTW - again, guilty for not getting through such on VFS-related stuff in there. Hopefully linux-next trees will eventually vacuum most of the pile in... As for the bug that got this thread started... I'd say that asking to bisect was reasonable in this particular case. 
The following DSW mixed into the thread very soon went the way of all DSW (OK, it hadn't godwinated yet, at least in the parts I've seen, so there's still way to go, but...) --
From: Andrew Morton <akpm@...> Subject: Re: Reporting bugs and bisection [13]Date: Apr 14, 4:04 am 2008 On Mon, 14 Apr 2008 08:23:28 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote: > On Sun, Apr 13, 2008 at 11:24:41PM -0700, Andrew Morton wrote: > > > No. The problem we're discussing here is the apparently-large number of > > bugs which are in the kernel, the apparently-large number of new bugs which > > we're adding to the kernel, and our apparent tardiness in addressing them. > > > > Do you agree with these impressions, or not? > > > > If you do agree, what would you propose we do about it? > > In addition to obvious "we need testing and something better than bugzilla > to keep track of bugs"? Swapping out bugzilla for something else wouldn't help. We'd end up with lots of people ignoring a good bug tracking system just like they were ignoring a bad one. (And I don't think developers and maintainers _should_ spend time mucking in bug-tracking systems. They should have helpers who do all the triaging/tracking/routing/closing work for them, and then provide other developers with the results, letting them know what they should be spending time on. But there's a manpower problem). > Real review of code in tree and patches getting into > the tree. > > And the latter part _must_ be done on each entry point. Any git tree > that acts as injection point really needs a working mechanism of some > sort that would do that; afterwards it's too late, since review of > the stuff getting into mainline on a massive merge is sadly impractical. > > I don't know any formal mechanism that could take care of that; no more > than making sure that no backdoors are injected into the tree. It really > has to be a matter of trust for tree maintainers and community around > the subsystem. > > Git is damn good at killing the merge bottleneck. Too good, since it > hides the review bottleneck. 
And we get equivalents of self-selected > communities that had been problem for "here's our CVS, here's monthly > dump from it, apply" kind of setups. It _is_ better, since one can > get to commit history (modulo interesting issues with merge nodes and > conflict resolution). But in practice it's not good enough - the patches > going in during a merge (especially for a tree that collects from > secondaries) are not visible enough. And it's too late at that point, > since one has to do something monumentally ugly to get Linus revert > a large merge. On the scale of Great IDE Mess in 2.5... > > linux-next might help with the last part, but I don't think it really > deals with the first one. It certainly helps to some extent, but... > > We need higher S/N on l-k. We need people looking into the subsystem > trees as those grow and causing a stench when bad things are found, > with design issues getting brought to l-k if nothing else helps. We > need tree maintainers understanding that review, including out-of-community > one, is needed (the need of testing is generally better understood - I > _hope_). > > We need more people reading the fscking source. Subsystem by subsystem. > Without assumption that code is not broken. With mechanism collating > the questions asked and answers given. Ideally we need growing documentation > of core subsystems and data structures, with explicit goal of helping > reviewers new to an area to find their way around it. And yes, I'm > guilty of procrastinating on that - several half-finished pieces on > VFS-related stuff are sitting locally ;-/ > > We need gregkh to get real and stop assuming that two Signed-off-by are > equivalent to "reviewed at least twice", while we are at it ;-) > > We need people to realize that warnings are useful as triage tools - > not as "Ug see warning. Warning bad. Ug fix that line. Warning go away. > Ug changeset count grow. Ug happy.", but as machine-assisted part of > finding confused areas of code. 
With human combining signals from > different warnings to get statistically useful triage strategies (note > that aforementioned making gcc/sparse/whatnot to STFU by local change > has a lovely potential of distorting those signals and actually _hiding_ > crap code). > > Maybe we need a list a-la linux-arch for tree maintainers to coordinate > stuff - obviously open not only for those. > > We really need to get around to doing triage of remaining stuff in -mm, > BTW - again, guilty for not getting through such on VFS-related stuff > in there. Hopefully linux-next trees will eventually vacuum most of the > pile in... That all sounds good and I expect few would disagree. But if it is to happen, it clearly won't happen by itself, automatically. We will need to force it upon ourselves and the means by which we will do that is process changes. The thing which is being disparaged as "bureaucracy". The steps to be taken are: a) agree that we have a problem b) agree that we need to address it c) identify the day-to-day work practices which will help address it (as you have done) d) identify the process changes which will force us to adopt those practices e) implement those process changes. I have thus far failed to get us past step a). --
From: Arjan van de Ven <arjan@...> Subject: Re: Reporting bugs and bisection [13]Date: Apr 14, 10:43 am 2008 On Mon, 14 Apr 2008 01:04:12 -0700 > > The steps to be taken are: > > a) agree that we have a problem > I for one do not agree that we have a problem. Based on actual data on oopses (which very clearly excludes other kinds of bugs, so I know I only see part of the story) we are doing reasonably well. Lets look at the 2.6.25 cycle. We got a total of roughly 2700 reports of oopses/warn_ons from users. (This may sound high to those of you only reading lkml, but this includes automatically collected oopses from Fedora 9 beta testers). Out of these 2700, the top 20 issues account for 75% of the total reports. Out of these 20 issues, 9 were from still out of tree drivers (wireless.git and drm.git included in F9). These were caught before they even got close to mainline. The remaining 11 issues can be split in 1) The ones we caught and fixed 2) TCP/IP warnings that DaveM and co are chasing down hard (but have trouble finding reproducers) 3) An EXT3 bug that in theory can cause data corruption, but in practice seems to happen after you yank out a USB stick with an EXT3 filesystem on (so it can't corrupt the disk data). Ted is working on this 4) A bug (double free) that hits in the skb layer, probably caused by a bug in the ipv4 code (a first analysis + potential patch was mailed to netdev this weekend) 5) sysfs "existing file added" warning, mostly in the USB stack (gregkh claims he fixed this recently, I'm not entirely sure he got all cases) And when I look beyond the first 20, the same pattern arises, we fixed the majority of the issues before -rc9. At position 25 we have less than 20 reports per bug. At position 35 we have less than 10 reports per bug. At position 50 we have less than 5 reports per bug. 
Conclusion there: the bugs people actually hit fall off dramatically; there's a core set of issues that gets hit a lot, the rest quickly gets reduced to noise levels. To me this does not sound like we have a huge quality problem because 1) The distribution of the bugs is such that there is a relatively small set of core issues that are widely hit, and then there's a near exponential drop after that 2) We are fixing the important bugs by and large before they hit a release (important as defined by the number of people actually hitting the bug) I'll be writing a report with more details about this soon with more analysis and statistics (I'll be looking at more detail around the top 25 issues, when they got introduced, when they got fixed etc) -- If you want to reach me at my work email, use arjan@linux.intel.com [14] For development, discussion and tips for power savings, visit http://www.lesswatts.org [15] --
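Editor's sketch: the concentration argument in the message above is simple arithmetic on two quoted figures (roughly 2700 reports, with the top 20 issues making up 75% of them). The Python below works that out and adds a small helper for measuring the same "top N share" on any list of per-issue counts. The helper name `top_n_share` and the sample counts in the usage note are hypothetical illustrations, not data from the thread, and since the quoted total is approximate the results are too.

```python
def top_n_share(report_counts, n):
    """Return the fraction of all reports that the n most-reported issues account for."""
    ranked = sorted(report_counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n]) / total


# Figures quoted in the message (the total is "roughly 2700", so treat as approximate).
TOTAL_REPORTS = 2700
TOP20_SHARE = 0.75

# Reports attributable to the top 20 issues, and what is left for every other issue.
top20_reports = round(TOTAL_REPORTS * TOP20_SHARE)
tail_reports = TOTAL_REPORTS - top20_reports

print(f"top 20 issues: about {top20_reports} reports")
print(f"all other issues combined: about {tail_reports} reports")
```

On made-up counts such as `[50, 30, 20]`, `top_n_share(counts, 1)` gives 0.5. As Andrew Morton's reply points out, a measure like this only covers bugs that produce an oops or warning in the first place.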
From: Andrew Morton <akpm@...> Subject: Re: Reporting bugs and bisection [15]Date: Apr 14, 1:51 pm 2008 On Mon, 14 Apr 2008 07:43:49 -0700 Arjan van de Ven <arjan@infradead.org> wrote: > On Mon, 14 Apr 2008 01:04:12 -0700 > > > > The steps to be taken are: > > > > a) agree that we have a problem > > > > > I for one do not agree that we have a problem. > > Based on actual data on oopses (which very clearly excludes other kinds of bugs, so I know I only see part of the story) > we are doing reasonably well. Lets look at the 2.6.25 cycle. > We got a total of roughly 2700 reports of oopses/warn_ons from users. (This may sound high to those of you only reading > lkml, but this includes automatically collected oopses from Fedora 9 beta testers). > Out of these 2700, the top 20 issues account for 75% of the total reports. > > Out of these 20 issues, 9 were from still out of tree drivers (wireless.git and drm.git included in F9). These were > caught before they even got close to mainline. > The remaining 11 issues can be split in > 1) The ones we caught and fixed > 2) TCP/IP warnings that DaveM and co are chasing down hard (but have trouble finding reproducers) > 3) An EXT3 bug that in theory can cause data corruption, but in practice seems to happen after you yank out a USB stick > with an EXT3 filesystem on (so it can't corrupt the disk data). Ted is working on this > 4) A bug (double free) that hits in the skb layer, probably caused by a bug in the ipv4 code > (a first analysis + potential patch was mailed to netdev this weekend) > 5) sysfs "existing file added" warning, mostly in the USB stack > (gregkh claims he fixed this recently, I'm not entirely sure he got all cases) > > And when I look beyond the first 20, the same pattern arises, we fixed the majority of the issues before -rc9. > At position 25 we have less than 20 reports per bug. At position 35 we have less than 10 reports per bug. > At position 50 we have less than 5 reports per bug. 
Conclusion there: the bugs people actually hit fall off dramatically; > there's a core set of issues that gets hit a lot, the rest quickly gets reduced to noise levels. > > > To me this does not sound like we have a huge quality problem because > 1) The distribution of the bugs is such that there is a relatively small set of core issues > that are widely hit, and then there's a near exponential drop after that > 2) We are fixing the important bugs by and large before they hit a release > (important as defined by the number of people actually hitting the bug) > > > > I'll be writing a report with more details about this soon with more analysis and statistics > (I'll be looking at more detail around the top 25 issues, when they got introduced, when they got fixed etc) Well OK. But I don't think we can generalise from oops-causing bugs all the way to all bugs. Very few bugs actually cause oopses, and oopses tend to be the thing which developers will zoom in on and pay attention to. If we had metrics on "time goes backwards" or anything containing "ASUS", things might be different. --
From: David Miller <davem@...> Subject: Re: Reporting bugs and bisection [15] Date: Apr 14, 4:30 am 2008 From: Andrew Morton <akpm@linux-foundation.org> Date: Mon, 14 Apr 2008 01:04:12 -0700 > That all sounds good and I expect few would disagree. But if it is to > happen, it clearly won't happen by itself, automatically. We will need to > force it upon ourselves and the means by which we will do that is process > changes. The thing which is being disparaged as "bureaucracy". > > The steps to be taken are: > > a) agree that we have a problem ... > I have thus far failed to get us past step a). A lot of people, myself included, subconsciously don't want to get past step a) because the resulting "bureaucracy" or whatever you want to call it is perceived to undercut the very thing that makes the Linux kernel fun to work on. It's still largely free form, loose, and flexible. And that's a notable accomplishment considering how much things have changed. That feeling is why I got involved in the first place, and I know it's what gets other new people in and addicted too. Nobody is "forced" to do anything, and I notice you used the word "force" in d) :-) And I realize this relaxed attitude goes hand in hand with reduced quality and occasionally more bugs. In many ways, I'm happy with that tradeoff at least wrt. how that works out for the subsystems I'm responsible for. We can ask more subsystem tree maintainers to run their trees more strictly, review patches more closely, etc. But, be honest, good luck getting that from the guys who do subsystem maintenance in their spare time on the weekends. The remaining cases should know better, or simply don't care. --
From: Andrew Morton <akpm@...> Subject: Re: Reporting bugs and bisection [15]Date: Apr 14, 6:15 am 2008 On Mon, 14 Apr 2008 01:30:58 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > From: Andrew Morton <akpm@linux-foundation.org> > Date: Mon, 14 Apr 2008 01:04:12 -0700 > > > That all sounds good and I expect few would disagree. But if it is to > > happen, it clearly won't happen by itself, automatically. We will need to > > force it upon ourselves and the means by which we will do that is process > > changes. The thing which is being disparaged as "bureaucracy". > > > > The steps to be taken are: > > > > a) agree that we have a problem > ... > > I have thus far failed to get us past step a). > > A lot of people, myself included, subconsciously don't want to > get past step a) because the resulting "bureaucracy" or whatever > you want to call it is perceived to undercut the very thing > that makes the Linux kernel fun to work on. > > It's still largely free form, loose, and flexible. And that's > a notable accomplishment considering how much things have changed. > That feeling is why I got involved in the first place, and I know > it's what gets other new people in and addicted too. > > Nobody is "forced" to do anything, and I notice you used the > word "force" in d) :-) OK, I was going to let this pass, but I changed my mind. You carefully deleted my text so that you could misquote it, thereby flagrantly misrepresenting everything I said. Here it is again: : The steps to be taken are: : : a) agree that we have a problem : : b) agree that we need to address it : : c) identify the day-to-day work practices which will help address it (as : you have done) : : d) identify the process changes which will force us to adopt those practices : : e) implement those process changes. Forcing a discipline upon oneself is totally different from having it forced upon you by someone else. 
Each step will need general agreement and buyin, otherwise none of it will (or should) work. --
From: David Miller <davem@...> Subject: Re: Reporting bugs and bisection [15]Date: Apr 14, 6:41 am 2008 From: Andrew Morton <akpm@linux-foundation.org> Date: Mon, 14 Apr 2008 03:15:30 -0700 > You carefully deleted my text so that you could misquote it, thereby > flagrantly misrepresenting everything I said. Not the intention, but anyways: > Here it is again: > > : The steps to be taken are: > : > : a) agree that we have a problem > : > : b) agree that we need to address it > : > : c) identify the day-to-day work practices which will help address it (as > : you have done) > : > : d) identify the process changes which will force us to adopt those practices > : > : e) implement those process changes. > > Forcing a discipline upon oneself is totally different from having it > forced upon you by someone else. > > Each step will need general agreement and buyin, otherwise none of it will > (or should) work. The "force" is to "us" which is a group. And I imagine that newcomers will be expected to adopt these "practices". So in effect, they will be "forced" into the process changes as well. I'm getting more and more sensitive to issues on this level over time, because I realize that the fundamental issue in all human group issues is getting people to "want" to do things. And "force", in any form, tends to be incompatible with "want". And in particular, people will often even shun things they "want" when it is "forced" to them. --
From: Andi Kleen <andi@...> Subject: Re: Reporting bugs and bisection [15] Date: Apr 14, 5:46 am 2008 David Miller <davem@davemloft.net> writes: > > It's still largely free form, loose, and flexible. I think Al's point was that we need far more "free form, loose and flexible" work for reviewing code. As in people going over trees and just checking it for anything suspicious and going over existing code and checking it for anything suspicious and going also over mailing list patch posts. And also maintainers who appreciate such review. And checking it for anything suspicious does not mean running only checkpatch.pl or even just sparse, but actually reading it and trying to make sense of it. I don't see that really as conflicting with your goals. It would be some more work for the maintainers to handle more such feedback because they would need to process comments from such "free form reviewers". Some of them will undoubtedly be wrong and that will take some time away from processing features (and bugs) but I suspect it would be still worth it. On the other hand it would also take some work away from processing bugs, but as Andrew mentions earlier it looks like significant parts of the boring areas of bug reports (like getting basic information from reporter etc.) could be "out-sourced" to bug masters. And I think being a bug master is an excellent way for someone who isn't a great coder to contribute in excellent ways to Linux (far more than someone e.g. running checkpatch.pl ever could) The challenging thing is also to make sure that the quality of comments stays high. That means more focus on logic and functionality than on form. If the reviewer just goes over the coding style or trivialities I don't think that will improve Linux really. I think the problem is often that people think kernel code must be very complicated and they don't even dare try to understand it. 
But frankly a lot of the kernel code is not really that complicated logic wise and also doesn't need too specialized knowledge to understand. So I am optimistic that there are a lot of people out there who would be qualified to do some logic review. Really Linux needs a better "reviewing culture" and also a better "bug processing culture" > We can ask more subsystem tree maintainers to run their trees more > strictly, review patches more closely, etc. But, be honest, good luck > getting that from the guys who do subsystem maintainence in their > spare time on the weekends. The remaining cases should know better, > or simply don't care. In my experience weekend maintainers tend to be better at sharing out work. As in they usually (ok there are exceptions) more work including review work on the mailing lists, while my impression is that paid for maintainers tend to have tendency for more centralized "cathedral" tree maintenance. That is with them trying to keep everything under control and effectively much more stuff going on the background out of public view. But the sharing out of work and less centralization is what we really want here I think. Anyways I'm not saying all paid-for maintainers are like this, but there is certainly a trend I think. I admit I personally went through both phases in several projects. When you're really focussed on something it is tempting to do the "keep things under control" central model, but in the end it is the wrong way to go. -Andi --
From: David Miller <davem@...> Subject: Re: Reporting bugs and bisection [15] Date: Apr 14, 2:39 am 2008 From: Andrew Morton <akpm@linux-foundation.org> Date: Sun, 13 Apr 2008 23:24:41 -0700 > Do you agree with these impressions, or not? I think things are improving. I wrote or merged in ~10 bugs in the last hour, for example. And I also agree with Al's point, which was embedded in his humorous and obviously sarcastic suggestions, in that adding bureaucracy isn't the answer. We already have too much and it scares developers away. Sure you don't want crap getting into the tree (for too long), but it is important to be careful to define crap properly. For example, inundating patch submitters with more requirements, especially ones involving automatons like checkpatch, is in the end bad. We can improve the quality of stuff going in and be flexible at the same time. --
From: David Miller <davem@...> Subject: Re: Reporting bugs and bisection [15] Date: Apr 14, 2:43 am 2008 From: David Miller <davem@davemloft.net> Date: Sun, 13 Apr 2008 23:39:59 -0700 (PDT) > I wrote or merged in ~10 bugs in the last hour, for example. Bug fixes! I meant "fixes" I swear! That's quite a Freudian slip if I ever saw one. --
Related links:
- Archive of above thread [15]