This is starting to get beyond frustrating for me. Yesterday, I spent the whole day bisecting boot failures on my system due to the totally untested linux/bitops.h optimization, which I fully analyzed and debugged. Today, I had hoped that I could get some work of my own done, but that's not the case. Yet another bootup regression got added within the last 24 hours. I don't mind fixing a regression or two during the merge window but THIS IS ABSOLUTELY, FUCKING, RIDICULOUS! The tree breaks every day, and it's becoming an extremely non-fun environment to work in. We need to slow down the merging, we need to review things more, we need people to test their fucking changes! --
Well, I must say I second that. I'm not seeing regressions myself this time (well, except for the one that Jiri fixed), but I did find a few of them during the post-2.6.24 merge window and I wouldn't like to repeat that experience, so to speak. IMO, the merge window is way too short for actually testing anything. I rebuild the kernel once or even twice a day and there's no way I can really test it. I can only check if it breaks right away. And if it does, there's no time to find out what broke it before the next few hundreds of commits land on top of that. Thanks, Rafael --
On Wed, 30 Apr 2008 21:36:57 +0200 <jumps up and down> There should be nothing in 2.6.x-rc1 which wasn't in 2.6.x-mm1! _anything_ which appears in 2.6.x-rc1 and which wasn't in 2.6.x-mm1 was snuck in too late (OK, apart from trivia and bugfixes). If we decide that we need to fix the oh-shit-lets-slam-this-in-and-hope problem then I expect we can do so, via fairly reliable means. But the first attempt at solving it should be to ask people to not do that. --
The problem I see with both -mm and linux-next is that they tend to be better at finding the "physical conflict" kind of issues (ie the merge itself fails) than the "code looks ok but doesn't actually work" kind of issue. Why? The tester base is simply too small. Now, if *that* could be improved, that would be wonderful, but I'm not seeing it as very likely. I think we have fairly good penetration these days with the regular -git tree, but I think that one is quite frankly a *lot* less scary than -mm or -next are, and there it has been an absolutely huge boon to get the kernel into the Fedora test-builds etc (and I _think_ Ubuntu and SuSE also started something like that). So I'm very pessimistic about getting a lot of test coverage before -rc1. Maybe too pessimistic, who knows? Linus --
First of all:
I 100% agree with Andrew that our biggest problems are in reviewing code
and resolving bugs, not in finding bugs (we already have far too many
unresolved bugs).
But although testing mustn't replace code reviews it is a great help,
especially for identifying regressions early.
Finding testers should actually be relatively easy since it doesn't
require much knowledge from the testers.
And it could even solve a second problem:
It could be a way for getting newbies into kernel development.
We only rarely have tasks suitable as janitor tasks for
newbies, and the results of people who know neither the kernel
nor C running checkpatch on files in the kernel have already
been discussed extensively...
I'll try to do this:
- create some Wiki page
- get a mailing list at vger
- point newbies to this mailing list
- tell people there which kernels to test
- figure out and document stuff like how to bisect between -next kernels
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--On Thu, 1 May 2008 03:31:25 +0300 I would argue instead that we don't know which bugs to fix first. We're never going to fix all bugs, and to be honest, that's ok. As long as we fix the important bugs, we're doing really well. And at least for the kerneloops.org reported issues, we're doing quite ok. For me, 'important' is a combination of effect of the bug and the number of people it'll hit. A compiler warning on parisc is less important than easy to trigger filesystem corruption in ext3 that way; more people will hit it and the effect is more grave. For oopses and WARN_ON()'s we're getting the hang of this now with kerneloops.org, at least for the oopses that aren't really hard fatal. One thing I learned at least is that lkml is a poor representation of what people actually hit; it's a very very selective audience. oopses/warnons are only a subset of the bugs of course... but still. So there's a few things we (and you / janitors) can do over time to get better data on what issues people hit: 1) Get automated collection of issues more widespread. The wider our net the better we know which issues get hit a lot, and plainly the more data we have on when things start, when they stop, etc etc. Especially if you get a lot of testers in your project, I'd like them to install the client for easy reporting of issues. 2) We should add more WARN_ON()s on "known bad" conditions. If it WARN_ON()'s, we can learn about it via the automated collection. And we can then do the statistics to figure out which ones happen a lot. 3) We need to get persistent-across-reboot oops saving going; there's some avenues for this --
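The weighting described in this message (effect of the bug combined with the number of people it hits) can be sketched as a simple score. This is only an illustration: the `Bug` record, the 1-to-5 severity scale, and the user counts are all made-up assumptions for the sketch, not data from kerneloops.org or the kernel Bugzilla.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Bug:
    """Hypothetical bug record; these fields do not come from a real tracker."""
    title: str
    severity: int        # assumed scale: 1 = cosmetic ... 5 = data loss
    affected_users: int  # assumed estimate of how many people hit the bug


def priority(bug: Bug) -> int:
    """Combine effect and reach into one number (a plain product here)."""
    return bug.severity * bug.affected_users


def rank(bugs: List[Bug]) -> List[Bug]:
    """Return the bugs ordered from highest to lowest score."""
    return sorted(bugs, key=priority, reverse=True)


# Illustrative numbers only, invented for this sketch.
EXAMPLE_BUGS = [
    Bug("compiler warning on parisc", severity=1, affected_users=50),
    Bug("easy to trigger filesystem corruption in ext3",
        severity=5, affected_users=10_000),
]

if __name__ == "__main__":
    for bug in rank(EXAMPLE_BUGS):
        print(priority(bug), bug.title)
```

A product is about the simplest way to combine the two factors. Any real triage would still involve judgement calls that a single score cannot capture.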
That might be OK.
But our current status quo is not OK:
Check Rafael's regressions lists asking yourself
"How many regressions are older than two weeks?"
The kernel Bugzilla currently knows about 212 open regression bugs.
(And many more have not made it into Bugzilla.)
We have unmaintained and de facto unmaintained parts of the kernel where
No disagreement on this, it's just a different issue than our bug fixing
problem.
cu
Adrian
--On Thu, 1 May 2008 14:30:38 +0300 "ext4 doesn't compile on m68k". YAWN. Wrong question... "How many bugs that a sizable portion of users will hit in reality are there?" And how many people are hitting those issues? If a part of the kernel is really important to enough people, there tends to be someone who stands up to either fix the issue or start de-facto maintaining that part. And yes I know there's parts where that doesn't hold. But to be honest, there's not that many of them that have active development (and thus get the biggest No it's not! Knowing earlier and better which bugs get hit is NOT different --
"Kernel oops while running kernbench and tbench on powerpc" took more
than 2 months to get resolved, and we ship 2.6.25 with this regression.
Granted that compared to x86 there's not a sizable portion of users
crazy enough to run Linux on powerpc machines...
cu
Adrian
--That was a very subtle bug that only showed up on one particular powerpc machine. I was not able to replicate it on any of the powerpc machines I have here. Nevertheless, we found it and we have a fix for it. I think that's an example of the process working. :) Paul. --
Was it even a regression in the classical sense of the word? Seemed more of a latent bug that was simply never triggered before. josh --
That's right. The bug has been there basically forever (i.e. since before 2.6.12-rc2 ;) and no-one has been able to trigger it reliably before. Paul. --
But for users this is a recent regression since 2.6.24 worked
and 2.6.25 does not.
If this problem was on x86 Linus himself and some other core developers
would most likely have debugged this issue and Linus would have delayed
the release of 2.6.25 for getting it fixed there.
And stuff that "only showed up on one particular machine" often shows up
on many machines (we only know in hindsight) and the "one particular
machine" is often due to the fact that of the many machines that might
trigger a regression only one was used for testing this -rc kernel.
This is not in any way meant against you personally, and due to the fact
that the powerpc port is among the better maintained parts of the kernel
this regression eventually got fixed, but in many other parts of the
kernel this would have been one more of the many regressions that were
cu
Adrian
--Totally and utterly immaterial. If it's a timing-related bug, as far as developers are concerned, nothing they did introduced the problem. So anybody who thinks that "process" should have caught it is just being stupid. Adrian, you're one of the absolutely *worst* in the camp of "everything should be perfect". You really need to realize that reality is messy, and things cannot be perfect. You also need to realize and *understand* that aiming for "good" is actually much BETTER than trying to aim for "perfect". Perfect is the enemy of good. Linus --
So I would like to ask you what a user should do when facing what is probably a timing-related bug, as it appears I have the bad luck of hitting one. See for example my comments after this one http://bugzilla.kernel.org/show_bug.cgi?id=10117#c11 This same problem is still present with yesterday's git, and sometimes it hangs without hpet=disable and sometimes it doesn't. (And never with hpet=disable in the boot command line) And when it hangs I can see only _one_ "Switched to high resolution mode on CPU x" message before the hang point, and when it boots fine there are always the two of them in sequence:
  Switched to high resolution mode on CPU 1
  Switched to high resolution mode on CPU 0
And using vga=6 or vga=0x0364 makes a difference in the probability of hanging. I am just waiting for -rc1 to be released to send an email with my problem again, as I am unable to debug this myself. I think this is ok on my part, right? --
Quite frankly, it will depend on the bug. If it's *reliably* timing-related (which sounds crazy, but is not at all unheard of), it can be reliably bisected down to some totally unrelated commit that doesn't actually introduce the problem at all, but that reliably turns it on or off. That can be very misleading, and can cause us to basically revert a good commit, only to not actually fix the bug (and possibly re-introduce the bug that the reverted commit tried to fix). But sometimes it gives us a clue where the timing problem is. But quite frankly, that seems to be the exception rather than the rule. There have been issues that literally seemed to depend on things like cacheline placement etc, where changing config options for code that was never actually even *run* would change timing just enough to show a bug pseudo-reliably or not at all. The good news is that those timing issues are really quite rare. The bad news is that when they happen, they are almost totally Hey, it may well be a HPET+NOHZ issue. But it could also be that HPET is .. and yeah, these kinds of really odd and obviously totally unrelated issues are a sign of a bug that is either simply hardware instability or very subtly timing-related. The reason I mention hardware instability is that there really are bugs that happen due to (for example) power supply instabilities. Brownouts under heavy load have been causes of problems, but perhaps surprisingly, so has _idle_ time thanks to sleep-states! The latter is probably due to bad power conditioning on the CPU power lines, where the huge current swings (going from high CPU power to low, and back again) not only have made some motherboards "sing" (or "hum", depending on frequency) but also cause voltage instability and then the CPU crashes. Am I saying that's the reason you see problems? Probably not. Most instabilities really are due to kernel bugs. But hardware instabilities do Yes. You've been a good bug reporter, and...
It happens a bit before that because when it hangs it doesn't print the above lines, and when it does not hang these lines are Yes you are right. When I have luck and the boot succeeds my Sony laptop A few days ago I found this message in lkml in reply to a hpet patch http://lkml.org/lkml/2007/5/7/361 in which the reporter also had a similar hang, which was cured by hpet=disable. So it is in my TODO list to try to check out if that patch is in the current -git and whether it can be reverted somehow (I added Venki to the Cc: now) Thanks a lot for the answer! --
It depends on whether HPET is being force detected based on the chipset or whether it was exported by the BIOS in ACPI table. If it was force enabled and above patch is having any effect, then you In any case, of late there seem to be quite a few breakages that are related to HPET/timer interrupts. One of them was on a system which has HPET being exported by BIOS http://bugzilla.kernel.org/show_bug.cgi?id=10409 And the other one where we are force enabling based on chipset http://bugzilla.kernel.org/show_bug.cgi?id=10561 And then we have hangs once in a while reported by you, Roman and Mark here http://bugzilla.kernel.org/show_bug.cgi?id=10377 http://bugzilla.kernel.org/show_bug.cgi?id=10117 Thanks, Venki --
.. Yeah. This particular bug first appeared when NOHZ & HPET were added. Somebody once suggested it had something to do with an SMI interrupt happening in the midst of HPET calibration or some such thing. But nobody who works on the HPET code has ever shown more than a casual interest in helping to track down and fix whatever the problem is. Cheers --
I said I was waiting for -rc1 to be released to send another email about my HPET problem, but curiously with v2.6.26-rc1-6-gafa26be my laptop did not hang after 30+ boots and counting. Somewhere between 2.6.25-07000-(something) and the above kernel something happened which changed significantly the probability of hanging during boot. I could not boot more than 3 times in a row without hanging with kernels up to 2.6.25-07000 (approximately), and now I am still booting v2.6.26-rc1-6-gafa26be a few times a day and no hangs yet. Yesterday I started a "reverse" bisection, trying to find which commit "fixed" it, but I still didn't finish (but it is past -7200). Of course I am not sure if after the 100th boot the latest -git Well, I would like to thank Venki for his effort because he even answered some private emails from me about this issue and is tracking the bugzillas about it. --
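How many clean boots it takes before a "fixed" verdict is believable can be estimated with a little arithmetic. The sketch below assumes that each boot hangs independently with a fixed probability `p_hang`. That is an assumption made only for illustration: the real hang rate is unknown and, as reported in this thread, shifts with seemingly unrelated changes. The function names are hypothetical.

```python
def prob_all_clean(p_hang: float, boots: int) -> float:
    """Chance of `boots` consecutive clean boots if each one hangs
    independently with probability p_hang."""
    return (1.0 - p_hang) ** boots


def boots_needed(p_hang: float, max_false_pass: float) -> int:
    """Smallest number of consecutive clean boots after which a bug that is
    still present (with per-boot hang probability p_hang) would have gone
    unnoticed with probability at most max_false_pass."""
    if not 0.0 < p_hang < 1.0:
        raise ValueError("p_hang must be strictly between 0 and 1")
    if not 0.0 < max_false_pass < 1.0:
        raise ValueError("max_false_pass must be strictly between 0 and 1")
    boots = 0
    while prob_all_clean(p_hang, boots) > max_false_pass:
        boots += 1
    return boots


if __name__ == "__main__":
    p = 0.25  # assumed hang rate, chosen only for illustration
    print(prob_all_clean(p, 30))   # chance of 30 clean boots at that rate
    print(boots_needed(p, 0.01))   # clean boots needed for a 1% false pass
```

Under these assumptions a long run of clean boots only shows that the hang probability dropped. It does not show which commit changed it, or that the underlying bug is gone.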
.. My experience with this bug, since 2.6.20 or so, has been that it comes and goes with even the most innocent change in the .config file, like turning frame pointers on/off. Cheers --
I never actually saw a statement to that effect (i.e. that 2.6.24 worked) from Kamalesh. I think people assumed that because he reported it against version X that version X-1 worked, but we don't If I had been able to replicate it, or if it had been seen on more than one machine, I would probably have asked Linus to wait while we fixed it. There's a risk management thing happening here. Delaying a release is a negative thing in itself, since it means that users have to wait longer for the improvements we have made. That has to be balanced against the negative of some users seeing a regression. It's not an absolute, black-and-white kind of thing. In this case, for a bug being seen on only one machine, of a somewhat unusual configuration, I considered it wasn't worth asking to delay the release. Paul. --
He reported it as
[BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
No general disagreement on this.
And my example was not in any way meant against you - it's actually
unusual and positive that a bug that once got the attention of being
on the regression lists gets fixed later.
Even worse is the situation with regressions people run into when
cu
Adrian
--Precisely. Cherry-picking a single example such as the 68k thing and then Another fallacy which Arjan is pushing (even though he doesn't appear to have realised it) is "all hardware is the same". Well, it isn't. And most of our bugs are hardware-specific. So, I'd venture, most of our bugs don't affect most people. So, over time, by Arjan's "important to enough people" observation we just get more and more and more unfixed bugs. And I believe this effect has been occurring. And please stop regaling us with this kerneloops.org stuff. It just isn't very interesting, useful or representative when considering the whole problem. Very few kernel bugs result in a trace, and when they do they are usually easy to fix and, because of this, they will get fixed, often quickly. I expect netdevwatchdogeth0transmittimedout.org would tell a different story. One thing which muddies all this up is that bug reporters vanish. Over the years I have sent thousands and thousands of ping emails to people who have reported bugs via email, three to six months after the fact. Some were solved - maybe a fifth. About the same proportion of reporters reply and give some reason why they cannot work on the bug. In the majority of cases people don't reply at all and I suspect they're in the same category of cannot-work-on-the-bug. And why can't they work on the bug? Usually, because they found a workaround. People aren't going to spend months sitting in front of a non-functional computer waiting for kernel developers to decide if their machine is important enough to fix. They will find a workaround. They will buy new hardware. They will discover "noapic" (234000 google hits and rising!). They will swap it with a different machine. They will switch to a different distro which for some reason doesn't trigger the bug. They will use an older kernel. They will switch to Solaris. Etcetera. People are clever - they will find a way to get around it. I figure that after a bug is reported w...
On Thu, 1 May 2008 08:49:19 -0700 no I'm pushing "some classes of hardware are much more popular/relevant I did not say "most people". I believe "most people" aren't hitting bugs right now (or there would be a lot more screaming). What I do believe is that *within the bugs that hit*, even the hardware specific ones, there's a clear prioritization by how many people hit now that's a fallacy of your own.. if you care about that one, it's 1) trivial to track and/or 2) could contain a WARN_ON_ONCE(), at which point it's automatically tracked. (and more useful information I suspect, since it suddenly has a full backtrace including driver info in it) By your argument we should work hard to make sure we're better at creating traces for cases we detect something goes wrong. if it's a hardware bug there's little we can do. If it's a hardware specific bug, yeah then it becomes a function of how Given that a normal PC has maybe 10 components... yes we don't want bugcreep that affects common hardware over time. At the same time, by your argument, a bug that hits a piece of hardware of which 5 are made (or left on this planet) is equally important to This statement is so ridiculous and self-contradicting to what you said before that I'm not even going to respond to it. --
"popular/relevant" is hard to define.
E.g. if we'd go after "popular" we should only keep architectures like
ARM and x86 and ditch architectures like ia64 and s390 that have puny
userbases.
If your "or have the hardware in general" is meant seriously you have to
convince people that ARM must become a very high priority.
No matter whether one supports your "there's a clear prioritization"
view or not it anyway doesn't currently work since the areas covered by
people testing -rc kernels don't even remotely map the most popular
kerneloops.org catches the easiest to solve bugs (there's a trace) and
helps in getting them fixed.
That's a very good thing.
And if we get more bugs into this easy to resolve state that would be
even better.
But it's only a small part of the complete picture of incoming bug
reports.
cu
Adrian
--So the question is if we have a thousand bugs which only affect one person each, and 70 million Linux users, how much should we beat up ourselves that 1,000 people can't use a particular version of the Linux kernel, versus the 99.9% of the people for which the kernel works just fine? Sometimes, we can't make everyone happy. At the recent Linux Collaboration Summit, we had a local user walk up to a microphone, and loosely paraphrased, said, "WHINE WHINE WHINE WHINE I have a $30 DVD drive that doesn't work with Linux. WHINE WHINE WHINE WHINE WHINE What are *you* going to do to fix my problem?" Some people like James responded very diplomatically, with "Well, you have to understand, the developer might not have your hardware, and there's a lot of broken hardware out there, etc., etc." What I wanted to tell this user was, "Ask not what the Linux development community can do for you. Ask what *you* can do for Linux?" Suppose this person had filed a kernel bugzilla bug, and it was one of the hundreds or thousands of non-handled bugs. Sure, it's a tragedy that bugs pile up. But if they pile up because of crappy hardware, that's not a major tragedy. If we can figure out how to blacklist it, and move on, Hey, in this particular case, if this user worked around the problem by buying new hardware, it was probably the right solution. As far as we know we don't have a systematic problem where huge numbers of DVD drives aren't working, so if there are a few odd ball ones that are out there, we just CAN'T self-flagellate ourselves that we're not ... and maybe we can't solve hardware bugs. Or that crappy hardware isn't worth holding back Linux development. And I'm not sure ignoring it is that horrible of a thing. And in practice, if it's a hardware bug in something which is very common, it *will* get noticed very quickly and fixed. But if it's a hardware bug in some rare piece of hardware, the user is going to have to either (a) help us fix it, or (b) decide that his time is more ...
On Thu, 1 May 2008 13:24:34 -0400 Many, many of these are regressions. If old-linux works on that hardware then new-linux can too. (still wants to know what we did 2-3 years ago which caused thousands of people to have to resort to using noapic and other apic-related boot option workarounds) --
Forcing APIC even when the BIOS didn't support them. -Andi --
Perhaps 2-3 years ago more people started using more hardware that implements APIC. ;-) -- Steve --
And actually, core kernel developers are best for writing new bugs. Really, the way I started out learning how the kernel ticks was to go and try to solve some bugs that I was seeing (this was years ago). I get people asking that they want to learn to be a kernel developer and they ask what new feature should they work on? Well, honestly, the last thing a newbie kernel developer should be doing is writing new bugs. We need to send them to a URL that lists all the known bugs and have them pick one, any one, and have them solve it. This would be the best way to learn part of the kernel. I even find that I understand my own code better when I'm in the debugging phase. People here mention different places to look at code, and besides the kerneloops.org I really don't even know where to look for bugs, because I haven't seen a URL to point me to. The next time someone asks me how to get started in kernel programming, I would love to tell them to go and look here, and solve the bugs. I'm guessing that I should just point them to: http://janitor.kernelnewbies.org/ and tell them to focus on real bugs (not just comments and such) to get fixed if they really want to learn the kernel. -- Steve --
On Thu, 1 May 2008 12:38:23 -0400 (EDT) bugzilla.kernel.org is, umm, improving. It would be an interesting exercise for someone to spend a few days seeing how many of the bugzilla reports they personally can reproduce. I'd guess "zero". There's a lesson in that. The problem with bugzilla will be that it will be hard to find reports where the reporter will be able to work with you on the fix - we've let them go cold. The most fruitful place to find fixable bugs is linux-kernel. People who report bugs there are sufficiently motivated to have actually sent the email and the bug is still recent, so they probably haven't done the Solaris install yet. --
Agreed. Thanks, Rafael --
<boggle> How about "a bug which we just added"? One which is repeatable. Repeatable by a tester who is prepared to work with us on resolving it. Those bugs. Rafael has a list of them. We release kernels when that list still has tens of unfixed regressions dating back up to a couple of months. --
On Thu, 1 May 2008 01:13:46 -0700 I know he does. But I will still argue that if that is all we work from, and treat all of those equally, we're doing the wrong thing. I'm sorry, but I really do not consider "ext4 doesn't compile on m68k" which is on that list to be as relevant as a "i915 drm driver crashes" bug which has been among us for a while and is not on that list, just based on the total user base for either of those. Does that mean nobody should fix the m68k bug? Someone who cares about m68k for sure should work on it, or if it's easy for an ext4 developer, sure. But if the ext4 person has to spend 8 hours on it figuring out cross compilers, I say we're doing something very wrong here. (no offense to the m68k people, but there's just a few of you; maybe I should have picked voyager instead) Maybe that's a "boggle" for you; but for me that's symptomatic of where we are today: We don't make (effective) prioritization decisions. Such decisions are hard, because it effectively means telling people "I'm sorry but your bug is not yet important". That's unpopular, especially if the reporter is very motivated on lkml. And it will involve a certain amount of non-quantifiable judgement calls, which also means we won't always be right. Another hard thing is that lkml is a very self-selective audience. A bug may be reported three times there, but never hit otherwise, while another bug might not be reported at all (or only once) while thousands and thousands of people are hitting it. Not that we're doing all that bad, we ARE fixing the bugs (at least the oopses/warnings) that are frequently hit. So I wouldn't blindly say we're doing a bad job at prioritizing. I would rather say that if we focus only on what is left afterwards without doing a reality check, we'll *always* have a negative view of quality, since there will *always* be bugs we don't fix. Linux has well over ten million users (much more if you count embedded devices).
A lot of them will have "standard" hardware, and a bunch of the...
On that note, I'd really like to see better binary availability of cross
compilers. While it's improved over the last few years mostly due to the
crossgcc stuff it's still a pain. Ideally, they would be available through
the distribution package manager even but failing that some dedicated place
on kernel.org with x86->lots and some of the more widely used other
combinations would quite definitely be good. Perhaps not really directly
relevant to this thread as such, but still good.
Andrew maintain{s,ed} a number of them at
http://userweb.kernel.org/~akpm/cross-compilers/
But as you see, most of the stuff there is really old again...
Rene
--You're most welcome to help out Vegard to do this: http://www.kernel.org/pub/tools/crosstool/ --
You could also use ct-ng: http://ymorin.is-a-geek.org/dokuwiki/projects/crosstool Works excellent for me :) cu -- --------------------------------------------------------------------- Enrico Weigelt == metux IT service - http://www.metux.de/ --------------------------------------------------------------------- Please visit the OpenSource QM Taskforce: http://wiki.metux.de/public/OpenSource_QM_Taskforce Patches / Fixes for a lot dozens of packages in dozens of versions: http://patches.metux.de/ --------------------------------------------------------------------- --
Ah, thanks, lovely, just new I see (and yes, I meant s/crossgcc/crosstool/). Good thing. I'll check it out and see if there's anything to add. Rene. --
It's not that clear-cut, either. Something which manifests itself as a build failure or an immediate test failure on m68k alone, might actually turn out to cause subtle data corruption on other platforms. You can't always know that it isn't important, just because it only shows up in some esoteric circumstances. You only really know how important it was _after_ you've fixed it. That obviously doesn't help us to prioritise. -- dwmw2 --
Ideally, you'd do an analysis first and then prioritize, based on the severity of the bug, its exposure, how easy it is to fix, etc. If while doing that you already have a fix at hand, you're almost done :) Recursively, there's the problem of which bugs you analyze first. I'm inclined to say that you want to analyze most if not all bug reports at a higher priority than working on fixing non-critical bugs. Benny --
On Thu, 01 May 2008 13:42:44 +0100 absolutely. I'm not going to argue that prioritization is easy. Or that we'll be able to get it right all the time. --
And leave unfixed all the regressions introduced in earlier kernel versions and known at the time of the release of that version but still present in the current version? Not to mention all the other bugs reported by users of That can be true for not-so-recently introduced bugs too. There are so many bugs out there and developers tend to focus on new ones leaving a lot of others unattended, both important and not so important ones. Which ones should someone focus on? Maybe on the ones that someone (helped) introduce him/herself. Maybe that should even sometimes be prioritized over introducing new bugs^W^W^Wdoing new development. --
<big_snip /> Hi folks, what do you think about Gentoo's "bug-wrangler" concept? Maybe we could do something similar: A tester group (which e.g. should be the entry point for newbies) is responsible for receiving bug reports from users (maybe even distro maintainers who're not directly involved in kernel dev.). They try to reproduce the bugs and find out as much as they can, then file a report to the actual kernel devs (just critical bugs are directly kicked to the devs with high priority). Maybe this group could also keep users informed about fixes and give some upgrade advice, etc. This way we can build good technical support (independent from distributors ;-P), newbies can learn on the job and the load on kernel devs is reduced, so they can better concentrate on their core competences. What do you think about this? cu -- --------------------------------------------------------------------- Enrico Weigelt == metux IT service - http://www.metux.de/ --------------------------------------------------------------------- Please visit the OpenSource QM Taskforce: http://wiki.metux.de/public/OpenSource_QM_Taskforce Patches / Fixes for a lot dozens of packages in dozens of versions: http://patches.metux.de/ --------------------------------------------------------------------- --
Andrew already does more or less this.
The problems are:
- kernel bugs tend to very quickly reach the state where you need expert
knowledge in some area, and there's definitely not much room for
newbies in bug handling
- "try to reproduce the bugs" works for much software, but in the
cu
Adrian
--From: Adrian Bunk <bunk@kernel.org> kernel-testers@vger.kernel.org has been created, feel free to use it --
Thanks :-)
Adrian
--One thing is that we keep fragmenting the tester base by adding new confidence levels: we now have -mm, -next, mainline -git, mainline -rc, mainline release, stable, distro testing, and distro release (and some distros even have aggressive versus conservative tracks.) Furthermore, thanks to craniorectal immersion on the part of graphics vendors, a lot of users have to run proprietary drivers on their "main work" systems, which means they can't even test newer releases even if they would dare. This fragmentation is largely intentional, of course -- everyone can pick a risk level appropriate for them -- but it does mean: a) The lag for a patch to ride through the pipeline is pretty long. b) The section of people who are going to use the more aggressive trees for "real work" testing is going to be small. -hpa --
And another problem is that often, it's hard to get good "real work" coverage over the whole tree. I just discovered an apparent borkage somewhere in the networking/wireless area that seems to have gotten into Linus's tree somewhere between 24-rc8 and 24-final, just because I haven't beaten on my wireless card in the last few weeks, so I didn't notice a regression in 'ip link show' related to the rfkill switch...
Since I poke my head out of the foxhole every once in a while with a relatively late-breaking bug report, I thought I should chime in... Mr. Anvin has pretty much nailed it... As the kernel development process has evolved, which "confidence level" I select has evolved as well. The thing that *hasn't* changed through the years is, I tend to pick a "confidence level" that is appropriately close to "mainline" and has an update release schedule roughly compatible with my ability to keep up with it. Specifically, if it takes me several hours to download a patch set, apply it, build the new kernel, and test on multiple platforms/architectures, then the update release schedule is probably going to have to be no more often than twice a week if I'm going to be at all interested in even trying to keep up with it. In 2008, the "-rcX" updates are a good fit. In the not-too-distant past, keeping up with 2.5.X.Y was no problem. Yes, I realize I don't *have* to test every revision level in every major tree, but I don't have to think about which one to pick for testing if I can keep up with the update release schedule :-). -- ------------------------------------------------------------------------ Bob Tracy | "I was a beta tester for dirt. They never did rct@frus.com | get all the bugs out." - Steve McGrew on /. ------------------------------------------------------------------------ --
On Wed, 30 Apr 2008 13:31:08 -0700 (PDT) Well. We'll see. linux-next is more than another-tree-to-test. It is (or will be) a change in our processes and culture. For a start, subsystem maintainers can no longer whack away at their own tree as if the rest of us don't exist. They now have to be more mindful of merge issues. Secondly, linux-next is more accessible than -mm: more releases, more stable, better tested by he-who-releases it, available via git:// etc. It should be very easy for developers to do their weekly "does linux-next boot" test. Plus, of course, people who complain about merge-window breakage only to find that the breakage was already in linux-next except they didn't test it will not have a leg to stand on. I feared that linux-next wouldn't work: that Stephen would stomp off in disgust at all the crap people send at him. But in fact it seems to be going very well from that POV. I get the impression that we're seeing very little non-Stephen testing of linux-next at this stage. I hope we can ramp that up a bit, initially by having core developers doing at least some basic sanity testing. linux-next does little to address our two largest (IMO) problems: inadequate review and inadequate response to bug and regression reports. But those problems are harder to fix. --
Probably it would make sense also for distro vendors to make linux-next snapshots available in their development distro branches (redhat's rawhide, opensuse's factory, etc), to make it easier to test by those users who are willing to test if it works in their environment, but don't want to compile kernels themselves. -- Jiri Kosina --
I try to test linux-next on a few SATA test boxes, but it's definitely Agreed... any lead time on linux-next testing would be great. Jeff --
Andrew, the latter thing is a very good point. For me personally, the fact that -mm is not available via git is the major obstacle for trying your tree more frequently than just a few times per year. How difficult would it be for you to switch to git? I guess there are good reasons for still using the source code management system from the last century; please correct me if I'm wrong, but I believe that using a modern SCM system could

For busy (or lazy) people like myself, the big problem with linux-next is the frequent merge breakage, when pulling the tree stops with "you are in the middle of a merge conflict". Perhaps there is a better way to resolve this than just removing the whole repo and cloning it once again - this is what I'm doing, please flame me for stupidity or ignorance if I simply am not aware of some git feature that could be useful in such cases.

Finally, while the list is at it, I'd like to make another technical comment. My development zoo is a pretty fast 4-way Xeon server, where I keep a handful of trees, a few cross-toolchains, Qemu, etc. The network setup in our organization is such that I can use git only over http from that server. This cannot be changed, it's the company policy. In view of that, it's a pity that quite a few tree owners don't make sure that http access to their trees works (I added Ingo to the Cc: list in the hope that this will be corrected soon for the x86 tree, which I am using quite extensively), and I have to use a much slower machine (a two and a half year old laptop) for these trees. Please see this:

<<<<<<<
[dmitri.vorobiev@amber ~]$ git clone http://www.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
Initialized empty Git repository in /home/dmitri.vorobiev/linux-2.6-x86/.git/
Getting alternates list for http://www.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
Also look at http://www.kernel.org/home/ftp/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/
Getting pack list for http:...
On Thu, 01 May 2008 01:42:59 +0400 Every -mm release is available via git://, as described in the release announcements. The scripts which do this are a bit cantankerous but I believe they do work. <tests it> Fatal, I expect. A tool which manages source-code files is just the wrong Really? Doesn't Stephen handle all those problems? It should be a clean Don't know what to do about that, sorry. An off-site git->http proxy might work, but I doubt if anyone has written the code. --
Would you mind using stgit? That way you have the patch queue functionality, yet a simple git-push -f will send the whole patch stack over to a repo (without the stgit bits that is), leaving what looks like a regular tree with just lots of recent commits. Does not even need extra scripts to do a Indeed, assuming the remote is set up and you have a local branch, `git reset --hard mm/master` after a fetch is the thing. But be sure not to have any changed files. --
Andrew Morton wrote: But there is another solution, which I believe is straightforward: have the tree maintainer set up his tree properly. --
It should indeed be a clean fetch, but I wonder if Dmitri perhaps does a "git pull" - which will do the fetch, but then try to _merge_ that fetched state into whatever the last base Dmitri happened to have.

Dmitri: you cannot just "git pull" on linux-next, because each version of linux-next is independent of the next one. What you should do is basically

	# Set this up just once..
	git remote add linux-next git://git.kernel.org/pub/scm/linux/kernel/git/sfr/linux-next.git

and then after that, you keep on just doing

	git fetch linux-next
	git checkout linux-next/master

which will get you the actual objects and check out the state of that remote (and then you'll normally never be on a local branch on that tree, git will end up using a so-called "detached head" for this).

IOW, you should never need to do any merges, because Stephen did all those in linux-next already.

Linus --
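For anyone who wants the fetch-and-checkout routine above in one step, here is a minimal sketch of a shell helper. It is only an illustration under stated assumptions: the function name update_rebuilt_tree and its argument order are made up for this example, and the check for local changes is an addition prompted by the "be sure not to have any changed files" warning earlier in the thread. It uses only standard git commands (remote, status, fetch, checkout --detach).

```shell
#!/bin/sh
# Sketch: follow a tree that is rebuilt from scratch for every release
# (linux-next style). update_rebuilt_tree is a hypothetical helper name,
# not a git command.
update_rebuilt_tree() {
    remote=$1            # remote name, e.g. linux-next
    url=$2               # URL of the remote tree
    branch=${3:-master}  # branch to follow on that remote

    # Add the remote only once; later runs reuse it.
    if ! git remote get-url "$remote" >/dev/null 2>&1; then
        git remote add "$remote" "$url" || return 1
    fi

    # Stop if tracked files carry uncommitted changes (assumed safeguard).
    if [ -n "$(git status --porcelain --untracked-files=no)" ]; then
        echo "local changes present; commit or stash them first" >&2
        return 1
    fi

    # Fetch the new objects, then check out the remote state directly
    # as a detached HEAD. No merge is attempted at any point.
    git fetch "$remote" || return 1
    git checkout --detach "$remote/$branch"
}

# Example call, run from inside an existing kernel clone:
# update_rebuilt_tree linux-next \
#     git://git.kernel.org/pub/scm/linux/kernel/git/sfr/linux-next.git
```

Because the helper never merges, a non-fast-forward rebuild of the remote branch is harmless: the default fetch refspec that "git remote add" sets up is a forced one, so the remote-tracking branch is simply overwritten.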
Just to add some emphasis here - this is something that took me a long time to figure out, and since it is the pattern for dealing with the x86 trees and with the mm git tree and with linux-next, it would help if it were documented somewhere (not that I can imagine where). Once you know it, it becomes obvious, but try staring at a merge conflict for a while trying to figure out what to do, and it gets frustrating. I wonder if we can guess how many testers abandon the mm git tree or the linux-next tree because of this. It might be nice if git supported a command like git-remote-help or something that would fetch a predefined help file from a remote tree that describes the workflow for that tree. But at least with an extra reply to this mail, it might creep higher in the google search results when looking for merge conflicts with linux-next. -- Kevin Winchester --
Linus, thanks a lot for the detailed explanation. Indeed, it seems that I foolishly tried to duplicate Stephen's work. In the future I'll do as you suggest here. --
That "howto" should probably be added to the linux-next announcements... (CC'ing Stephen) --
