oops

Submitted by sreejithmm
on October 18, 2008 - 5:56am

Hi all,

We are hitting a peculiar Oops while running our board (sh4 architecture).
We sometimes get Oops messages like this:
Unable to handle kernel NULL pointer dereference at virtual address 00000004
pc = 844240f8
*pde = 00000000
Oops: 0001 [#1]

Pid : 529, Comm: cvm
PC is at run_timer_softirq+0x58/0x220
PC : 844240f8 SP : 88d1ff44 SR : 400080f0 TEA : c0169d64 Tainted: P
R0 : 00000000 R1 : 88d1ff44 R2 : 00000000 R3 : 846fa08c
R4 : 846fa084 R5 : 846fae8c R6 : 00000001 R7 : 00000000
R8 : 00000000 R9 : 846fa084 R10 : 84424020 R11 : 88d1ff0c
R12 : 88d1ff44 R13 : 846fba08 R14 : ffffffd3
MACH: 00000050 MACL: 00000078 GBR : 397b6938 PR : 844241a2

Call trace:
[<8442137a>] __do_softirq+0x7a/0x120
[<844218a6>] irq_exit+0x66/0x80
[<84407e80>] do_IRQ+0x0/0x60
[<84407eb8>] do_IRQ+0x38/0x60
[<84405070>] ret_from_irq+0x0/0x10

Kernel panic - not syncing: Aiee, killing interrupt handler!
I think this crash is a generic problem with our kernel configuration. Has anyone seen this kind of crash? Can anyone tell me at least when these types of crashes can happen?
From the log, is it possible to tell what may cause this kind of behavior? The same crash happens at different times during different operations. Please
give your valuable suggestions!

any idea???

on
October 20, 2008 - 12:37am

any idea???

Any idea for anyone about

on
October 20, 2008 - 12:40am

Does anyone have any idea about this issue?

what

on
October 20, 2008 - 1:35am

What did you find out, sreejithmm? Have you looked at the code at the crash IP? Is the call stack correct? Which IRQ(s) were running? Which softirq? Can you rule out the proprietary module? The oops text tells you little more than where to look.

thanks for your reply. In

on
October 20, 2008 - 1:48am

Thanks for your reply.
In our kernel configuration, CONFIG_PREEMPT was enabled.
This crash is random and we are unable to make anything out of the call trace. I was asking whether anyone is familiar with this kind of crash, or has any guess as to how it can happen.

which 'this kind'

on
October 20, 2008 - 4:01am

What do you mean when you say 'this kind of crashes'? This is just a NULL pointer exception. You may know this from user space code (the program dies with 'Segmentation fault'), only here it happened in kernel code, but you have to apply the same debugging methods. The exact instruction where this happened is located at run_timer_softirq+0x58, i.e. 88 bytes (0x58) into the machine code of the function run_timer_softirq. You should have a look there (use debugging symbols to see the C code for this address) to see which pointer was NULL, i.e. which data structure was corrupt or which assumption in the code didn't hold. The softirq ran on top of a process named 'cvm'; does this program behave strangely?

When you say 'random' but also 'this kind of crashes', you contradict yourself: either the crash is totally random, i.e. the call stack is different every time, or there is something that defines the kind of crash. Are there similar elements in the call trace, similar circumstances etc. that lead you to think the crashes have the same cause? These links are important, because they help to focus on the real cause.
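
As a small illustration of the symbol+offset notation discussed above, the following C sketch splits a string such as run_timer_softirq+0x58/0x220 into its parts and derives the function's start address from the PC. The names sym_ref, parse_sym_ref and function_start are hypothetical helpers made up for this example; they are not part of any kernel tool.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for one "symbol+0xOFFSET/0xSIZE" entry of an oops. */
struct sym_ref {
    char     name[64];  /* symbol name, e.g. "run_timer_softirq"      */
    uint32_t offset;    /* bytes from the start of the function       */
    uint32_t size;      /* total size of the function in bytes        */
};

/* Parse "name+0xOFF/0xSIZE". Returns 0 on success, -1 on malformed input. */
static int parse_sym_ref(const char *text, struct sym_ref *out)
{
    /* %63[^+] reads the name up to '+'; the hex conversions accept a 0x prefix. */
    int n = sscanf(text, "%63[^+]+%" SCNx32 "/%" SCNx32,
                   out->name, &out->offset, &out->size);
    return (n == 3) ? 0 : -1;
}

/* The function starts 'offset' bytes before the faulting PC. */
static uint32_t function_start(uint32_t pc, const struct sym_ref *ref)
{
    return pc - ref->offset;
}
```

With the values from the oops (PC 844240f8, offset 0x58) this gives a function start of 844240a0, and 0x220 is the size of the function in bytes. Disassembling the matching vmlinux around that address, assuming an image with debug information is available, should show the faulting instruction and the C line it belongs to.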

@strcmp thanks for the

on
October 20, 2008 - 6:40am

@strcmp

thanks for the detailed reply.

[<84407e80>] do_IRQ+0x0/0x60
[<84407eb8>] do_IRQ+0x38/0x60

From this call trace, can you make out how these two do_IRQ entries can appear in succession?

Note:

As I said earlier, our kernel had preemption enabled and we were calling schedule() from the function resume_kernel (in entry.S).

locking

on
October 20, 2008 - 8:34am

You still haven't said what triggered the oops in run_timer_softirq+0x58... Which data structure got damaged?

Is your code preempt-safe, i.e. do you use proper spinlocks and have you double-checked your locking? Does it work without preempt? Do you need preempt?

Did you insert the schedule() call yourself? If so, why doesn't the original logic work for you, and do you really understand what you are doing there?

The crash comes from the

on
October 21, 2008 - 5:16am

The crash comes from standard Linux kernel code. It occurs in the list_splice_init() function called from __run_timers(), where two lists are joined.
The schedule() call was also not inserted by me; it is in the kernel code.
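
For readers following along: the fault address 00000004 would be consistent with a write to the prev member of a list head through a NULL pointer, since on a 32-bit target prev sits 4 bytes into the structure. The sketch below is a user-space model written for this thread, not the kernel source. list_head32, addr_of_prev and checked_splice_init are hypothetical names, and the splice body only follows the general shape of a doubly linked list splice.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit model of a list head: each pointer is stored as a 4-byte value. */
struct list_head32 {
    uint32_t next;  /* offset 0 */
    uint32_t prev;  /* offset 4 */
};

/* Address the CPU touches when accessing ->prev through pointer value p. */
static uint32_t addr_of_prev(uint32_t p)
{
    return p + (uint32_t)offsetof(struct list_head32, prev);
}

/* Native-pointer list head used for the splice model below. */
struct list_head {
    struct list_head *next, *prev;
};

/*
 * Move all entries of 'list' to the front of 'head' and reinitialise 'list'.
 * Unlike an unchecked splice, this returns -1 when a link of 'list' is NULL
 * (a corrupted list) instead of dereferencing it. Returns 0 otherwise.
 */
static int checked_splice_init(struct list_head *list, struct list_head *head)
{
    struct list_head *first = list->next;
    struct list_head *last = list->prev;
    struct list_head *at;

    if (first == NULL || last == NULL)
        return -1;          /* an unchecked "first->prev = head" would fault here */
    if (first == list)
        return 0;           /* source list is empty, nothing to move */

    at = head->next;
    first->prev = head;     /* with first == NULL this is a write to address 4 (32-bit) */
    head->next = first;
    last->next = at;
    at->prev = last;

    list->next = list->prev = list;  /* leave the source list empty */
    return 0;
}
```

If the timer list really contains a NULL link, the question becomes what cleared it: a timer that was freed or re-initialised while still pending is one possibility worth checking, though the oops alone does not prove it.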
