Re: My position on general ``RAS'' tool support infrastructure

Previous thread: Re: [linux-dvb] [PATCH] Userspace tuner by Markus Rechberger on Thursday, September 13, 2007 - 9:17 am. (2 messages)

Next thread: [GIT PATCH] USB autosuspend fixes for 2.6.23-rc6 by Greg KH on Thursday, September 13, 2007 - 9:33 am. (23 messages)
To: <pete@...>
Cc: Jason Wessel <jason.wessel@...>, Randy Dunlap <randy.dunlap@...>, Matt Mackall <mpm@...>, Amit Kale <amitkale@...>, Dave Anderson <anderson@...>, <kdb@...>, <jlan@...>, Vivek Goyal <vgoyal@...>, Andrew Morton <akpm@...>, Kexec Mailing List <kexec@...>, <linux-kernel@...>
Date: Thursday, September 13, 2007 - 9:21 am

Yes.

There is a tension here between generality of support infrastructure,
maintainability of the infrastructure, simplicity of the
infrastructure and reliability of the infrastructure.

The historical Linux perspective is that anything that compromises
the maintainability or the reliability of the kernel without the
tools is unacceptable.

There is also a historical perspective that using the single stepping
mode of a debugger to diagnose problems frequently leads to symptoms
being fixed and not the actual problems being fixed.

My initial proposal in this thread was that if kdb wanted to have
a hook point someplace where we are not comfortable adding a hook
point, it could use a breakpoint or some of the tracing
infrastructure. Somehow that suggestion seems to have gotten lost.

On the kexec on panic path the philosophy is that the kernel is
broken and as little as possible should be relied upon. So in general
I am opposed to extra code on that path, and to general hooks like
notifiers in particular, because they make adding non-paranoid code
much easier and review of the code on a particular call path much
harder.

From what I can tell the philosophy of the kdb code is that the kernel
is mostly ok except for one or two little bugs so it is reasonable to
rely on lots of kernel infrastructure.

As I understand the problem, the difference in philosophy and
maintenance overhead is why kexec on panic has been merged and why
it has a much larger success rate than previous crash dump
implementations like lkcd. I will note that in some sense it is a
harder approach to implement, as it emphasizes the challenge of
drivers that work starting from a random hardware state, and because
it draws a clear line between the broken kernel and the recovery
kernel. But those things are exactly what encourage things to work
well.

I don't mind playing well with others as long as that doesn't
compromise the implementation's reliability and maintainability.

So far it is my opinion that the current ke...

To: Eric W. Biederman <ebiederm@...>
Cc: <pete@...>, Jason Wessel <jason.wessel@...>, Matt Mackall <mpm@...>, Amit Kale <amitkale@...>, Dave Anderson <anderson@...>, <kdb@...>, <jlan@...>, Vivek Goyal <vgoyal@...>, Andrew Morton <akpm@...>, Kexec Mailing List <kexec@...>, <linux-kernel@...>
Date: Monday, September 17, 2007 - 9:38 pm

Yes, and I re-read it.

There are several things in Keith's email that make sense:

a. all RAS tools should use a common interface
b. it's not the kernel's job to decide which RAS tool runs first

Eric makes some good points too. I'm mostly of the same mind as Eric:
paranoid about trusting software/hardware after a panic (or oops).

So if someone wants to use multiple RAS tools on a panic event,
enabling an admin to set priorities is OK with me, but I'll only
trust the first one that is used, and even that one may have
problems. IOW, I don't see a big need to support multiple RAS

Ack that.

---
~Randy

To: Randy Dunlap <randy.dunlap@...>
Cc: Eric W. Biederman <ebiederm@...>, <pete@...>, Jason Wessel <jason.wessel@...>, Matt Mackall <mpm@...>, Amit Kale <amitkale@...>, Dave Anderson <anderson@...>, <kdb@...>, <jlan@...>, Andrew Morton <akpm@...>, Kexec Mailing List <kexec@...>, <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 12:28 am

It would be nice to have a kernel debugger co-exist with crash dumping.

I like Eric's idea of the debugger putting a breakpoint on panic(). This
would mean that the rest of the post-panic() actions have to be performed
by the second kernel, which can perform those actions much more reliably.

But this also brings in the additional requirement of passing all the
required context to the second kernel. For example, in the past somebody
wanted to send a message to a remote node that the system crashed so that a
standby can take over. If the same job has to be done in the second kernel,
it requires all the relevant information, like remote host IP, port, etc.,
to be passed to the second kernel, which I think makes the job a little
harder. Maybe one can pre-configure these parameters in user space and let
the job be done either from the initrd or from user space scripts in the
second kernel.
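
A minimal sketch of that last idea, assuming the standby's address is
handed to the second kernel as two extra command line parameters. The
names standby_host and standby_port are hypothetical, as are the message
text and the choice of a single UDP datagram; none of this comes from an
existing tool.

```python
import socket


def parse_cmdline(cmdline):
    """Return the key=value tokens of a kernel command line as a dict."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:  # skip bare flags such as "quiet"
            params[key] = value
    return params


def notify_standby(cmdline, message=b"primary crashed"):
    """Send one UDP datagram to the standby named on the command line.

    Returns the (host, port) used, or None if either parameter is absent.
    """
    params = parse_cmdline(cmdline)
    host = params.get("standby_host")  # hypothetical parameter name
    port = params.get("standby_port")  # hypothetical parameter name
    if host is None or port is None:
        return None
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, int(port)))
    return host, int(port)
```

A user space script in the second kernel could then call
notify_standby(open("/proc/cmdline").read()) once networking is up. The
point of the sketch is that the crashed kernel does nothing: the
parameters are fixed when the crash kernel is loaded, and the second
kernel only reads them back.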

Thanks
Vivek
