Re: [patch] log fatal signals like SIGSEGV

Previous thread: Re: Laptop shock detection and harddisk protection by Bodo Eggert on Friday, September 12, 2008 - 6:28 am. (1 message)

Next thread: ARM 2.6.26.5-rt8 by Alexander Pazdnikov on Friday, September 12, 2008 - 6:04 am. (1 message)
From: Thomas Jarosch
Date: Friday, September 12, 2008 - 6:02 am

Hello together,

we've created a small patch that helped us troubleshoot obscure hardware
faults several times. Imagine someone calls and complains his box is
freezing from time to time. We usually check the logs and find various tasks
segfaulting between the freezes. Most of the time the hardware is broken.

Attached patch logs fatal signal like SIGSEGV or SIGBUS. It also features
a log flood protection, though I'm not sure if it works with dynamic ticks.
The code is a stripped down version of grsecurity's signal logger.

The patch doesn't have Kconfig support or other fancy stuff yet as I first
wanted to ask if it would make sense to integrate something like this
upstream. For us it made remote diagnosis a lot easier.

Please CC: comments.

Best regards,
Thomas Jarosch

--------------------------------------------------------
Log fatal signals like SIGSEGV or SIGBUS
to aid debugging of obscure problems.

The code is a stripped down version
of grsecurity's signal logger.

Signed-Off-By: Gerd v. Egidy <gve@intra2net.com>
Signed-Off-By: Thomas Jarosch <thomas.jarosch@intra2net.com>

diff -u -r -p linux-2.6.22/kernel/signal.c linux.siglog/kernel/signal.c
--- linux-2.6.22/kernel/signal.c	Mon Jul  9 01:32:17 2007
+++ linux.siglog/kernel/signal.c	Wed Aug 22 11:08:58 2007
@@ -514,6 +514,64 @@ static int rm_from_queue(unsigned long m
 }
 
 /*
+ * Stuff needed for signal logger
+ */
+
+spinlock_t siglog_lock = SPIN_LOCK_UNLOCKED;
+unsigned long volatile siglog_wtime = 0;
+unsigned long volatile siglog_fyet = 0;
+
+/* time span in which flooding is measured */
+#define SIGLOG_FLOOD_SECONDS 5
+
+/* how many log entries are allowed in this time span */
+#define SIGLOG_FLOOD_BURST_LINES 20
+
+/*
+ * Log fatal signals
+ */
+void
+log_fatal_signal(const int sig, const struct task_struct *t)
+{
+	if ((sig == SIGSEGV) || (sig == SIGILL) || (sig == SIGABRT)
+        || (sig == SIGBUS) || (sig == SIGKILL) || (sig == SIGFPE)) {
+
+		/* flood protection ...
From: Marcin Slusarz
Date: Friday, September 12, 2008 - 10:11 am

Note that on current kernel when process segfaults it printks:

Please look at Documentation/SubmittingPatches.


It would be clearer if you could write this code as:

if (!sth)
	return;


Marcin
--

From: Thomas Jarosch
Date: Tuesday, September 16, 2008 - 5:59 am

Thank you for your feedback Marcin, it was really helpful.

The log message above is a good start. There's also the
"print-fatal-signals" kernel parameter which prints out
much more than just the basic info:

 =======================
sleep/3277: potentially unexpected fatal signal 11.
code at 001b7571: 89 d3 3d 01 f0 ff ff 73 01 c3 53 e8 00 00 00 00

Pid: 3277, comm: sleep Not tainted (2.6.26-2.i2nsmp #7)
EIP: 0073:[<001b7571>] EFLAGS: 00000246 CPU: 0
EIP is at 0x1b7571
EAX: fffffdfc EBX: bffae0d4 ECX: bffae0d4 EDX: 00215c80
ESI: bffae0d4 EDI: bffae1e8 EBP: bffae268 ESP: bffae0a0
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 007b
CR0: 8005003b CR2: b7fbf55c CR3: 369db000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 =======================

This is way too much information to leave it on all the time.

The new code just adds one line of information to the log
and is rate limited. The idea is to aid troubleshooting
later on by looking through the logs. I guess you'll

scripts/checkpatch.pl is happy now :-)

I've reworked the locking using CONFIG_LOCKDEP and ran  

while /bin/true; do
  sleep 10 &
  kill -9 $!
done

for over twenty minutes with no noticable issue.
The printk_ratelimit() works perfect.

Here's the new version:
-----------------------------------------------------------------
From: Thomas Jarosch <thomas.jarosch@intra2net.com>

Log the signals SIGSEGV, SIGILL, SIGABRT, SIGBUS, SIGKILL and SIGFPE
to aid debugging of obscure problems. Also logs the sender of the signal.

The log message looks like this:
"kernel: signal 9 sent to freezed[2634] uid:100,
 parent init[1] uid:0 by bash[3168] uid:0, parent sshd[3164] uid:0"

The printing code is based on grsecurity's signal logger.

Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Gerd v. Egidy <gve@intra2net.com>

diff -u -r -p linux-2.6.26.vanilla/kernel/signal.c linux-2.6.26/kernel/signal.c
--- ...
From: Marcin Slusarz
Date: Tuesday, September 16, 2008 - 10:42 am

It looks much better now. But I don't think it will go in as is.
Maybe you can disable it by default and create a sysctl switch?

Marcin
--

From: Thomas Jarosch
Date: Wednesday, September 17, 2008 - 1:12 am

The idea is to enable it by default :-) Normally you shouldn't see -any- 
output except for broken hardware/software. That's something you want to get 
notified about and be able to trace back later on.

I guess it might be a good idea to make the SIGKILL output optional via sysctl 
and disabled by default as this could trigger during normal system operation
("Sending all processes the KILL signal" during shutdown). Would that be ok?

Thomas

--

From: Thomas Jarosch
Date: Thursday, September 18, 2008 - 3:10 am

Thinking about it some more, I've added a "signal-log-level" sysctl var
including documentation. The patch applies nicely to 2.6.26 and 2.6 HEAD.

The idea is to default to log level 1 and log fatal signals only.
Log output should be close to zero during normal system operation.

There is a bit of a naming clash with "print-fatal-signals", though that
should be called "debug-fatal-signals" because of all the register dumps etc.
I don't want to rename it as it would unnecessarily cause issues
and it's debug-only (Documentation/kernel-parameters.txt) anyway.

Enjoy.

------------------------------------------------------
From: Thomas Jarosch <thomas.jarosch@intra2net.com>

Log signals like SIGSEGV, SIGILL, SIGBUS or SIGFPE to aid tracing
of obscure problems. Also logs the sender of the signal.

The log message looks like this:
"kernel: signal 9 sent to freezed[2634] uid:100,
 parent init[1] uid:0 by bash[3168] uid:0, parent sshd[3164] uid:0"

You can control the degree of logging via sysctl: "signal-log-level"
    0 - Signal logging disabled
    1 - Log SIGSEGV, SIGILL, SIGBUS and SIGPFE (default)
    2 - Log SIGKILL and SIGABRT and all signals from log level 1
    3 or higher: Log all signals

The printing code is based on grsecurity's signal logger.

Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Gerd v. Egidy <gve@intra2net.com>

diff -u -r -p linux-2.6.26.vanilla/kernel/signal.c linux-2.6.26/kernel/signal.c
--- linux-2.6.26.vanilla/kernel/signal.c	Tue Sep 16 13:45:34 2008
+++ linux-2.6.26/kernel/signal.c	Thu Sep 18 10:43:27 2008
@@ -796,6 +796,35 @@ static void complete_signal(int sig, str
 	return;
 }
 
+int signal_log_level __read_mostly = 1;
+
+static void log_signal(const int sig, const struct task_struct *t)
+{
+	bool log_signal = false;
+
+	if(signal_log_level >= 1 && (sig == SIGSEGV || sig == SIGILL
+			|| sig == SIGBUS || sig == SIGFPE))
+		log_signal = true;
+	else if (signal_log_level >= 2 && (sig == SIGKILL || sig == ...
From: Marcin Slusarz
Date: Thursday, September 18, 2008 - 1:20 pm

When fix typos you can add:
Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com>
--

From: Thomas Jarosch
Date: Saturday, September 20, 2008 - 10:12 am

Hello together,

here's the reworked version of the signal logging patch.
Many thanks to Marcin Slusarz for reviewing it. 

The idea is to briefly log "fatal" signals by default, the output
should be close to zero during normal system operation.
Faulty hardware gets detected much quicker
if you see crashing apps in the logs.

The patch applies to 2.6.26 and git HEAD.
scripts/checkpatch.pl is happy, too.

I'm off for a vacation, so expect a delay of some days
if there should be additional questions or comments.

Hope the new feature is as useful to others as it has been to me.

Enjoy,
Thomas

------------------------------------------------------
From: Thomas Jarosch <thomas.jarosch@intra2net.com>

Log signals like SIGSEGV, SIGILL, SIGBUS or SIGFPE to aid tracing
of obscure problems. Also logs the sender of the signal.

The log message looks like this:
"kernel: signal 9 sent to freezed[2634] uid:100,
 parent init[1] uid:0 by bash[3168] uid:0, parent sshd[3164] uid:0"

You can control the degree of logging via sysctl: "signal-log-level"
    0 - Signal logging disabled
    1 - Log SIGSEGV, SIGILL, SIGBUS and SIGFPE (default)
    2 - Log SIGKILL and SIGABRT and all signals from log level 1
    3 or higher: Log all signals

The printing code is based on grsecurity's signal logger.

Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Gerd v. Egidy <gve@intra2net.com>
------------------------------------------------------
diff -u -r -p linux-2.6.26.vanilla/kernel/signal.c linux-2.6.26/kernel/signal.c
--- linux-2.6.26.vanilla/kernel/signal.c	Tue Sep 16 13:45:34 2008
+++ linux-2.6.26/kernel/signal.c	Thu Sep 18 10:43:27 2008
@@ -796,6 +796,35 @@ static void complete_signal(int sig, str
 	return;
 }
 
+int signal_log_level __read_mostly = 1;
+
+static void log_signal(const int sig, const struct task_struct *t)
+{
+	bool log_signal = false;
+
+	if (signal_log_level >= 1 && (sig == SIGSEGV ...
From: Mikael Pettersson
Date: Sunday, September 21, 2008 - 12:05 pm

Thomas Jarosch writes:
 > Hello together,
 > 
 > here's the reworked version of the signal logging patch.
 > Many thanks to Marcin Slusarz for reviewing it. 
 > 
 > The idea is to briefly log "fatal" signals by default, the output
 > should be close to zero during normal system operation.
 > Faulty hardware gets detected much quicker
 > if you see crashing apps in the logs.
 > 
 > The patch applies to 2.6.26 and git HEAD.
 > scripts/checkpatch.pl is happy, too.
 > 
 > I'm off for a vacation, so expect a delay of some days
 > if there should be additional questions or comments.
 > 
 > Hope the new feature is as useful to others as it has been to me.
 > 
 > Enjoy,
 > Thomas
 > 
 > ------------------------------------------------------
 > From: Thomas Jarosch <thomas.jarosch@intra2net.com>
 > 
 > Log signals like SIGSEGV, SIGILL, SIGBUS or SIGFPE to aid tracing
 > of obscure problems. Also logs the sender of the signal.

I believe the approach taken in this patch is broken:

1. The signal logging decision is taken before signal delivery,
   which causes *handled* signals in the above list to be logged.
   So your printk_ratelimit() can be swamped by handled signals
   causing it to not log unhandled fatal signals.

   Applications that handle SEGV/BUS/ILL/FPE aren't that uncommon.

2. Fatal signals are only interesting if they are self-generated.
   Signals sent from other processes or threads are uninteresting,
   if the purpose is to detect program errors or faulty hardware.

3. Similar functionality already exists in the kernel, except
   it correctly runs much later in the signal delivery path.
   Grep for print_fatal_signals and show_unhandled_signals.

   There's also some trace hooks in the signal delivery path
   that look like they could log actual fatal signals.

/Mikael
--

From: Bernd Eckenfels
Date: Sunday, September 21, 2008 - 12:15 pm

JRE uses SIGSEGV for (expected) NullPointers. So you might see a lot of
those on the logs. However: not in all cases you dont want to log handled
(fatal) signals. Not sure how to detect those. Maybe it is better to fix
applications and remove their home grown crash logging.

Gruss
Bernd
--

From: Mikael Pettersson
Date: Sunday, September 21, 2008 - 12:40 pm

Bernd Eckenfels writes:
 > In article <18646.39701.819497.337394@harpo.it.uu.se> you wrote:
 > >   Applications that handle SEGV/BUS/ILL/FPE aren't that uncommon.
 > 
 > JRE uses SIGSEGV for (expected) NullPointers. So you might see a lot of
 > those on the logs. However: not in all cases you dont want to log handled
 > (fatal) signals. Not sure how to detect those. Maybe it is better to fix
 > applications and remove their home grown crash logging.

I was referring to applications that properly handle these signals,
e.g. by emulation, fixing up fault-causing conditions, or turning
faults into application-level exceptions.
--

From: Thomas Jarosch
Date: Monday, October 6, 2008 - 1:53 am

Thanks for your review. We already run the new patch on 500+ boxes and
didn't get any complaints about noisy messages in the logs (yet?).
Some of those boxes run "logcheck" and generate a daily report,
so there should be -something-.

I'm not sure if separating between kernel and process generated signals
makes much of a difference as there should be no log output anyway.
If you can provide me input that this will in fact generate noisy output,

print_fatal_signals is debug-only, see the mails
from the first review phase about that.

show_unhandled_signals seems to be implemented on x86 and PPC only.

Concerning x86: Both pieces of code are integrated in 
arch/x86/kernel/traps_32.c: do_general_proection().


Do you have a particular one in mind?


Yes, I like that. The new code is architecture-independant,
perhaps the architecture-dependant code could even be obsoleted
and all platforms would benefit from the new logging.

Thomas

--

From: Jiri Kosina
Date: Monday, September 22, 2008 - 4:52 pm

BTW be aware that for example x86 arch-specific code does this on its own, 
and therefore with your patch, the information will be duplicated. See 
page fault handler for x86.

-- 
Jiri Kosina
SUSE Labs

--

Previous thread: Re: Laptop shock detection and harddisk protection by Bodo Eggert on Friday, September 12, 2008 - 6:28 am. (1 message)

Next thread: ARM 2.6.26.5-rt8 by Alexander Pazdnikov on Friday, September 12, 2008 - 6:04 am. (1 message)