Hi all,
We are seeing a peculiar Oops while running our board (sh4 architecture).
From time to time we get Oops messages like this:
Unable to handle kernel NULL pointer dereference at virtual address 00000004
pc = 844240f8
*pde = 00000000
Oops: 0001 [#1]
Pid : 529, Comm: cvm
PC is at run_timer_softirq+0x58/0x220
PC : 844240f8 SP : 88d1ff44 SR : 400080f0 TEA : c0169d64 Tainted: P
R0 : 00000000 R1 : 88d1ff44 R2 : 00000000 R3 : 846fa08c
R4 : 846fa084 R5 : 846fae8c R6 : 00000001 R7 : 00000000
R8 : 00000000 R9 : 846fa084 R10 : 84424020 R11 : 88d1ff0c
R12 : 88d1ff44 R13 : 846fba08 R14 : ffffffd3
MACH: 00000050 MACL: 00000078 GBR : 397b6938 PR : 844241a2
Call trace:
[<8442137a>] __do_softirq+0x7a/0x120
[<844218a6>] irq_exit+0x66/0x80
[<84407e80>] do_IRQ+0x0/0x60
[<84407eb8>] do_IRQ+0x38/0x60
[<84405070>] ret_from_irq+0x0/0x10
Kernel panic - not syncing: Aiee, killing interrupt handler!
I think this crash is a generic problem with our kernel configuration. Has anyone seen this kind of crash? Can anyone tell me at least when this type of crash can happen?
From the log, is it possible to tell what may cause this kind of behavior? The same crash happens at different times during different operations.
Please give your valuable suggestions!
any idea???
Does anyone have any idea about this issue?
What did you find out, sreejithmm? Have you looked at the code at the crash IP? Is the call stack correct? Which IRQ(s) were running? Which softirq? Can you rule out the proprietary module? The oops text only tells you where to look.
Thanks for your reply.
In our kernel configuration, CONFIG_PREEMPT was enabled.
This crash is random and we are unable to make anything out of the call trace. I was asking whether anyone is familiar with this kind of crash, or has any idea how it can happen.
which 'this kind'
What do you mean when you say 'this kind of crash'? This is just a NULL pointer exception. You may know this from user space code (the program dies with 'Segmentation fault'); here it happened in kernel code, but you have to apply the same debugging methods. The exact instruction where this happened is located at run_timer_softirq+0x58, i.e. at byte offset 88 (0x58) into the machine code of the function run_timer_softirq. You should have a look there (use debugging symbols to see the C code for this address) to see which pointer was NULL, i.e. which data structure was corrupt or which assumption in the code didn't hold. The softirq itself runs in interrupt context, but it interrupted a process named 'cvm'. Does this program behave strangely?
When you say 'random' but also 'this kind of crash', you contradict yourself: either the crash is totally random, i.e. the call stack is different every time, or there is something that defines the kind of crash. Are there similar elements in the call trace, similar circumstances etc. that lead you to think the crashes have the same cause? These links are important, because they help to focus on the real cause.
@strcmp
Thanks for the detailed reply.
[<84407e80>] do_IRQ+0x0/0x60
[<84407eb8>] do_IRQ+0x38/0x60
From this call trace, can you make out how these two do_IRQ entries can appear one after the other?
Note:
As I said earlier, our kernel had preemption enabled and we were calling schedule() from the function resume_kernel (in entry.S).
locking
You still didn't say what triggered the oops in run_timer_softirq+0x58... Which data structure got damaged?
Is your code preempt-safe, i.e. do you implement proper spinlocks and have you double-checked your locking? Does it work without preempt? Do you need preempt?
Did you insert the schedule() call yourself? If so, why doesn't the original logic work for you, and do you really understand what you are doing there?
The crash comes from standard Linux kernel code. It occurs in the list_splice_init() function, called from __run_timers(), where two lists are joined.
The schedule() call was not inserted by me either; it is part of the kernel code.