Re: ath5k misbehaving affecting other kernel parts unrelated?

Previous thread: [git pull] drm radeon fixes by Dave Airlie on Thursday, April 22, 2010 - 11:35 pm. (1 message)

Next thread: [PATCH 01/14] platform: Make platform resource input parameters const by Geert Uytterhoeven on Friday, April 23, 2010 - 1:00 am. (14 messages)
From: Pedro Francisco
Date: Friday, April 23, 2010 - 12:06 am

Greetings!

Is it possible for a faulty ath5k module to affect diverse parts of the kernel,
causing different oopses that reference things as different as cdrom_ioctl,
find_vma, i915_gem_object_get_pages, get_vfs_caps_from_disk,
warn_slowpath_common (ath5k_tasklet_rx), proc_lookup_de and
_spin_lock(ext4_getattr) [soft lockup]?

The errors happen while capturing packets with kismet overnight. They aren't
easy to reproduce: sometimes they've already happened when I check the computer
in the morning, and sometimes they require stirring up the computer a bit, such
as by starting X.

I memtested the machine for 7 hours and no error was detected.

I've been trying different kernels, but the main one is 2.6.32-21-generic,
Ubuntu flavour, with linux-backports-modules-wireless-lucid-generic installed.
I would use vanilla 2.6.34-rc5 if shutdown and suspend worked ;)

As such:
a) is it "normal"? I thought modules were somewhat isolated these days.
b) any advice for debugging? I've yet to have two oopses that are similar...

Thank you in advance,
-- 
Pedro
--

From: John W. Linville
Date: Friday, April 23, 2010 - 6:48 am

It is certainly possible -- improper setup of DMA could be scribbling
over memory that doesn't belong to ath5k.  That sort of thing can be
ugly to track down...

You mentioned Kismet.  Do you experience this sort of problem when
using ath5k for "normal" purposes (e.g. browsing the web)?  Or only
when using it to monitor the network?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.
--

From: Bob Copeland
Date: Friday, April 23, 2010 - 9:37 am

Advice for debugging: turn on slub/slab debug options, and possibly
kmemcheck.  kmemcheck was very helpful for me last time I had such
a corruption issue.
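
A minimal sketch, in Python, of how one might check whether a given kernel
already offers these options. It parses kernel config text (for example the
contents of a /boot/config-* file) and a kernel command line. The symbols
CONFIG_SLUB_DEBUG, CONFIG_SLUB_DEBUG_ON, CONFIG_DEBUG_SLAB and CONFIG_KMEMCHECK
are the option names used by kernels of this era; the function names and the
sample strings are made up for illustration.

```python
import re

# Options relevant to the advice above (names as used by 2.6.3x kernels).
DEBUG_OPTIONS = ("CONFIG_SLUB_DEBUG", "CONFIG_SLUB_DEBUG_ON",
                 "CONFIG_DEBUG_SLAB", "CONFIG_KMEMCHECK")


def enabled_options(config_text):
    """Return the set of CONFIG_* symbols set to y or m in config text."""
    enabled = set()
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=([ym])$", line.strip())
        if match:
            enabled.add(match.group(1))
    return enabled


def debug_status(config_text, cmdline):
    """Report which debug options are built in and whether slub_debug was passed."""
    enabled = enabled_options(config_text)
    status = {opt: opt in enabled for opt in DEBUG_OPTIONS}
    # slub_debug may appear bare or with flags, e.g. slub_debug=FZP
    status["slub_debug on cmdline"] = any(
        arg == "slub_debug" or arg.startswith("slub_debug=")
        for arg in cmdline.split())
    return status


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the reporter's machine.
    sample_config = "CONFIG_SLUB_DEBUG=y\n# CONFIG_SLUB_DEBUG_ON is not set\n"
    sample_cmdline = "root=/dev/sda1 ro quiet slub_debug"
    for name, value in debug_status(sample_config, sample_cmdline).items():
        print(name, value)
```

If CONFIG_SLUB_DEBUG is built in but CONFIG_SLUB_DEBUG_ON is not, the checks
stay off until slub_debug is passed on the kernel command line, so no rebuild
is needed in that case.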

-- 
Bob Copeland %% www.bobcopeland.com

--

From: Maciej Żenczykowski
Date: Friday, April 23, 2010 - 9:43 am

Do you have more than ~2.5-3.5 GB of RAM (enough to make some RAM
non-32-bit-DMA-accessible) with swiotlb enabled?
If so, are you seeing "DMA: Out of SW-IOMMU space" kernel messages?

--

From: Pedro Francisco
Date: Saturday, April 24, 2010 - 1:59 am

I have 2 GB of RAM + 5 GB of swap (got fed up with hibernation not working
sometimes), so no such errors.

--

From: Pedro Francisco
Date: Saturday, April 24, 2010 - 1:56 am

For some reason I am unable to get a kernel I compiled myself to boot. Since I
couldn't build a kernel with those debugging options, I just turned on
slub_debug on the kernel command line in GRUB.

I got a few messages, which I am pasting here only partially to check their
significance. If they are significant I'll post them in full:

[ 2658.663308] =============================================================================
[ 2658.663424] BUG kmalloc-4096: Poison overwritten
[ 2658.663483] -----------------------------------------------------------------------------
[ 2658.663486] 
[ 2658.663606] INFO: 0xed3db0c0-0xed3db0cf. First byte 0xc4 instead of 0x6b
[ 2658.663698] INFO: Allocated in ath_rxbuf_alloc+0x30/0xa0 [ath] age=7117 cpu=0 pid=0
[ 2658.663799] INFO: Freed in skb_release_data+0x70/0xa0 age=0 cpu=0 pid=0
[ 2658.663882] INFO: Slab 0xc15fbb00 objects=7 used=5 fp=0xed3db090 flags=0x400040c3
[ 2658.663975] INFO: Object 0xed3db090 @offset=12432 fp=0xed3da060
[ 2658.663977] 
[ 2658.664069] Bytes b4 0xed3db080:  00 00 00 00 61 ff 08 00 5a 5a 5a 5a 5a 5a 5a 5a ....a<FF>..ZZZZZZZZ
[ 2658.664258]   Object 0xed3db090:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk

(a lot of lines with memory contents omitted)

[ 2658.667258]  Redzone 0xed3dc090:  bb bb bb bb <BB><BB><BB><BB>
[ 2658.667258]  Padding 0xed3dc0b8:  5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[ 2658.667258] Pid: 0, comm: swapper Not tainted 2.6.32-21-generic #32-Ubuntu
[ 2658.667258] Call Trace:
[ 2658.667258]  [<c01faf63>] print_trailer+0xd3/0x120
[ 2658.667258]  [<c01fb07c>] check_bytes_and_report+0xcc/0xf0
[ 2658.667258]  [<c01fc021>] check_object+0x1a1/0x1e0
[ 2658.667258]  [<c01fcc98>] alloc_debug_processing+0xc8/0x190
(...)


[ 2658.667258] FIX kmalloc-4096: Restoring 0xed3db0c0-0xed3db0cf=0x6b
[ 2658.667258] 
[ 2658.667258] FIX kmalloc-4096: Marking all objects used
[ 4689.941595] 


And I've two or three more similar ...
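
To make the report above easier to read, here is a small Python sketch that
pulls the overwritten range and the object address out of a SLUB "Poison
overwritten" report and works out where inside the object the stray write
landed. The fill-byte meanings come from the kernel's include/linux/poison.h;
the function and variable names are made up for illustration, and the regular
expressions only cover the two INFO lines quoted in this thread.

```python
import re

# Fill patterns defined in include/linux/poison.h
POISON_NAMES = {
    0x6b: "POISON_FREE (contents of a freed object)",
    0x5a: "POISON_INUSE (uninitialised object or padding)",
    0xa5: "POISON_END (last byte of a poisoned object)",
    0xbb: "SLUB_RED_INACTIVE (red zone of a free object)",
    0xcc: "SLUB_RED_ACTIVE (red zone of an allocated object)",
}

RANGE_RE = re.compile(
    r"INFO: 0x([0-9a-f]+)-0x([0-9a-f]+)\. "
    r"First byte 0x([0-9a-f]+) instead of 0x([0-9a-f]+)")
OBJECT_RE = re.compile(r"INFO: Object 0x([0-9a-f]+) @offset=(\d+)")


def summarize(report):
    """Return where the overwrite sits inside the object, or None if not found."""
    rng = RANGE_RE.search(report)
    obj = OBJECT_RE.search(report)
    if not rng or not obj:
        return None
    start, end, found, expected = (int(g, 16) for g in rng.groups())
    obj_addr = int(obj.group(1), 16)
    return {
        "offset_in_object": start - obj_addr,  # bytes from start of object
        "length": end - start + 1,             # the reported range is inclusive
        "found": found,
        "expected": POISON_NAMES.get(expected, hex(expected)),
    }


if __name__ == "__main__":
    # The two INFO lines from the report quoted above.
    sample = (
        "INFO: 0xed3db0c0-0xed3db0cf. First byte 0xc4 instead of 0x6b\n"
        "INFO: Object 0xed3db090 @offset=12432 fp=0xed3da060\n"
    )
    print(summarize(sample))
```

For this report the overwrite is 16 bytes long and starts 48 bytes into an
object that was expected to hold the 0x6b free-poison pattern, which means
something wrote into the buffer after it had been freed.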
--

From: Bob Copeland
Date: Saturday, April 24, 2010 - 5:05 am

Yes, please post the whole message.

-- 
Bob Copeland %% www.bobcopeland.com

--

From: Vegard Nossum
Date: Saturday, April 24, 2010 - 5:09 am

Please post all the info you have. Full crash log and full slub-debug
report are both very interesting.

Not sure what exactly linux-backports-modules-wireless-lucid-generic
is, but one way to narrow it down is to NOT use that and see if it
still crashes.

You might find your distro config in /boot/config-*; you could copy
that to .config in your kernel directory and try to compile (and boot)
a more recent kernel with that.

Good luck,


Vegard
--

From: Pedro Francisco
Date: Saturday, April 24, 2010 - 10:28 am

Mashup of two different mails with identical info.


I *believe* it includes compat-wireless.

As I've yet to fully understand what compat-wireless is, I'm following your 

I get an error on boot as if I'd forgotten to compile in my disk controller's
driver; since I used the .config of the distro's kernel I doubt that (and I
checked). I must be missing some hocus-pocus in using Ubuntu/Debian's
`make-kpkg'.

Ubuntu provides mainline kernel builds. I'm currently on the one from two or
three days ago, since the current one seems to cause X issues. My plan is to
boot it (linux-image-2.6.34-999-generic_2.6.34-999.201004211003_i386) just for
these tests.


Will do ;)


I left the three extra lines at the top to provide a "timeline" of the time 
between leaving kismet running and the time of the first error.

[  109.876670] device wlan1 entered promiscuous mode
[  109.882126] device wlan1mon entered promiscuous mode
[  311.656242] usb 5-1: USB disconnect, address 2
[ 2658.663308] =============================================================================
[ 2658.663424] BUG kmalloc-4096: Poison overwritten
[ 2658.663483] -----------------------------------------------------------------------------
[ 2658.663486] 
[ 2658.663606] INFO: 0xed3db0c0-0xed3db0cf. First byte 0xc4 instead of 0x6b
[ 2658.663698] INFO: Allocated in ath_rxbuf_alloc+0x30/0xa0 [ath] age=7117 cpu=0 pid=0
[ 2658.663799] INFO: Freed in skb_release_data+0x70/0xa0 age=0 cpu=0 pid=0
[ 2658.663882] INFO: Slab 0xc15fbb00 objects=7 used=5 fp=0xed3db090 flags=0x400040c3
[ 2658.663975] INFO: Object 0xed3db090 @offset=12432 fp=0xed3da060
[ 2658.663977] 
[ 2658.664069] Bytes b4 0xed3db080:  00 00 00 00 61 ff 08 00 5a 5a 5a 5a 5a 5a 5a 5a ....a<FF>..ZZZZZZZZ
[ 2658.664258]   Object 0xed3db090:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 2658.664443]   Object 0xed3db0a0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 2658.664629]   Object 0xed3db0b0:  6b 6b 6b 6b 6b 6b 6b 6b ...
--

From: Bob Copeland
Date: Sunday, April 25, 2010 - 12:22 pm

Ok there are 4 messages here, two of them are definitely 802.11 beacons,
which would point the finger squarely at ath5k.  We had reports of this
some time ago but a few things got rewritten around that time.  I'd guess


These two are definitely beacons.

-- 
Bob Copeland %% www.bobcopeland.com

--

From: Jiri Slaby
Date: Sunday, April 25, 2010 - 1:29 pm

These are CTS frames. So I think ath5k is to blame too :/.
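
For readers wondering how a few stray bytes can be recognised as 802.11
frames: the first octet of a frame's Frame Control field encodes the protocol
version (bits 0-1), the frame type (bits 2-3) and the subtype (bits 4-7). The
Python sketch below decodes that octet; the function name and the small lookup
tables are illustrative and list only a handful of subtypes. As an example,
the first overwritten byte quoted earlier in the thread, 0xc4, decodes as a
control frame with the CTS subtype under this layout.

```python
# Frame types from the two type bits of the Frame Control field.
FRAME_TYPES = {0: "management", 1: "control", 2: "data"}

# A few well-known (type, subtype) pairs; this table is not exhaustive.
SUBTYPES = {
    (0, 8): "beacon",
    (1, 11): "RTS",
    (1, 12): "CTS",
    (1, 13): "ACK",
}


def decode_frame_control(first_octet):
    """Split the first Frame Control octet into (version, type, subtype)."""
    version = first_octet & 0x3
    ftype = (first_octet >> 2) & 0x3
    subtype = (first_octet >> 4) & 0xF
    return (version,
            FRAME_TYPES.get(ftype, "other"),
            SUBTYPES.get((ftype, subtype), "subtype %d" % subtype))


if __name__ == "__main__":
    for octet in (0xc4, 0x80):
        print(hex(octet), decode_frame_control(octet))
```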

regards,
-- 
js
--

From: Pedro Francisco
Date: Sunday, April 25, 2010 - 2:24 pm

Do note that's on a 2.6.32 kernel. However, I have a new dmesg/syslog from a
2.6.34-rc5-daily kernel which shows the same kind of issues. It's 400 kB in
size. Do you want me to send it to the mailing list?

Or should I open a bug report on the kernel bugzilla and post it there? Or
do both?

-- 
Pedro
--

From: Pedro Francisco
Date: Tuesday, April 27, 2010 - 4:04 am

From: Pedro Francisco
Date: Wednesday, April 28, 2010 - 2:52 pm

On Saturday, 24 April 2010 18:28:36, Pedro Francisco wrote:
-snip

I am now able to compile my own kernels using `make all deb-pkg' on the 
vanilla source and additional hocus-pocus to create the initramfs, after 
having finally given up on using the Debian way (make-kpkg).


On Friday, 23 April 2010 17:37:03, me@bobcopeland.com wrote:

I can't seem to find the "kmemcheck: trap use of uninitialized memory" option 
in make menuconfig, section Kernel Hacking, though a search shows it should be 
there.

Would it be useful for me to run such tests now that I've opened a bug report
with data from slub_debug?
If so, how can I enable that feature?

P.S.: considering I'm only planning on using this kernel for the tests, I don't
mind enabling options which hurt interactivity, as long as a terminal works
well.

Thanks again,
-- 
Pedro
--

From: Pedro Francisco
Date: Thursday, April 29, 2010 - 3:25 am

It was "CONFIG_FUNCTION_TRACER" that was on and hiding the option;
incidentally, that dependency doesn't appear in the "Depends on" line, so I had
to do a web search.

Will post the results tomorrow, unless there aren't any.

-- 
Pedro
--

From: Vegard Nossum
Date: Thursday, April 29, 2010 - 12:17 am

Most likely you need to disable "Optimize for size", under "General
setup" in the main menu to make it appear. But it could also be one of
the other dependencies; check the "Depends on" line of KMEMCHECK when
you do the search.
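
A rough Python sketch of that check: it scans kernel config text for options
that, according to this thread, keep the kmemcheck entry from showing up in
menuconfig. Only the two options named in the thread are listed
(CONFIG_CC_OPTIMIZE_FOR_SIZE is the symbol behind "Optimize for size"); KMEMCHECK
may have other dependencies that this does not cover, and the function name and
sample text are made up for illustration.

```python
import re

# Options reported in this thread to hide KMEMCHECK while they are enabled.
BLOCKERS = {
    "CONFIG_CC_OPTIMIZE_FOR_SIZE": '"Optimize for size" under General setup',
    "CONFIG_FUNCTION_TRACER": "the function tracer",
}


def kmemcheck_blockers(config_text):
    """Return the blocking symbols that are set to y or m in config text."""
    enabled = {
        match.group(1)
        for match in re.finditer(r"^(CONFIG_\w+)=[ym]$", config_text, re.M)
    }
    return sorted(sym for sym in BLOCKERS if sym in enabled)


if __name__ == "__main__":
    # Hypothetical .config excerpt.
    sample = ("CONFIG_FUNCTION_TRACER=y\n"
              "# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set\n")
    for sym in kmemcheck_blockers(sample):
        print("disable", sym, "-", BLOCKERS[sym])
```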


Vegard
--

From: Dan Carpenter
Date: Friday, April 23, 2010 - 12:13 pm

Yes.

A bug in a kernel module could corrupt any memory.

If you know a working version of the kernel, try a git bisect; otherwise
just post your dmesg with the crash etc.

regards,
dan carpenter
--

From: Pedro Francisco
Date: Saturday, April 24, 2010 - 2:27 am

I don't. I doubt the dmesg will be significant; as I said, the oopses are never
the same.
However, I posted the dmesgs here
[ http://ubuntuforums.org/showthread.php?t=1457568 ] prior to asking for help on
LKML. The kernel version is Ubuntu's 2.6.32, not vanilla.

I believe nothing significant can be extracted from them. Am I correct?

-- 
Pedro
--

From: Dan Carpenter
Date: Saturday, April 24, 2010 - 5:13 am

Well they have the 'M' taint for "Machine Check Exception" so that
probably means the hardware is failing.
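
For reference, the taint letters appear after "Tainted:" on the "Pid:" line of
an oops. The Python sketch below extracts them and maps each to its meaning as
described in the kernel's oops-tracing documentation. The function name is
made up, the table lists only the flags known for kernels of this era, and the
tainted sample line is invented for illustration (the untainted one is from
the SLUB report earlier in the thread).

```python
# Taint letters as described in the kernel's oops-tracing documentation.
TAINT_MEANINGS = {
    "G": "only GPL-compatible modules loaded",
    "P": "proprietary module loaded",
    "F": "module was force loaded",
    "S": "SMP kernel on hardware not certified for SMP",
    "R": "module was force unloaded",
    "M": "a machine check exception was reported",
    "B": "bad page state was detected",
    "U": "taint requested by userspace",
    "D": "kernel has died recently (oops or BUG)",
    "A": "ACPI table was overridden",
    "W": "a warning was issued earlier",
    "C": "staging driver loaded",
}


def taint_flags(line):
    """Return the list of taint letters found on an oops 'Pid:' line."""
    marker = "Tainted: "
    if "Not tainted" in line or marker not in line:
        return []
    flags = []
    for ch in line.split(marker, 1)[1]:
        if ch == " ":
            continue          # unset positions are printed as spaces
        if ch not in TAINT_MEANINGS:
            break             # reached the kernel version string
        flags.append(ch)
    return flags


if __name__ == "__main__":
    lines = [
        # Invented example of a tainted line.
        "Pid: 1234, comm: example Tainted: G   M  D    2.6.32-21-generic",
        # Line from the SLUB report quoted earlier in the thread.
        "Pid: 0, comm: swapper Not tainted 2.6.32-21-generic #32-Ubuntu",
    ]
    for line in lines:
        for flag in taint_flags(line):
            print(flag, "-", TAINT_MEANINGS[flag])
```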

regards,
--

From: Pedro Francisco
Date: Saturday, April 24, 2010 - 6:13 am

Only on two of them. When I get a "BUG: Unable to handle paging request", the
system seems to get stuck at 100% CPU, and after a while the CPU overheats. I
assumed those Machine Check Exceptions were due to the overheating caused by
that 100% CPU usage. If so, the CPU throttles to prevent further overheating,
so it shouldn't affect anything.

-- 
Pedro
--
