Re: tty-related oops in latest kernel(s)?

Previous thread: Re: [PATCH] add a trivial patch style checker by Randy Dunlap on Sunday, May 27, 2007 - 1:10 pm. (31 messages)

Next thread: [PATCH] Delete useless reference to dead MODULE_PARM macro. by Robert P. J. Day on Sunday, May 27, 2007 - 3:00 pm. (1 message)
To: <linux-kernel@...>
Cc: <akpm@...>
Date: Sunday, May 27, 2007 - 5:06 am

Hi,

2.6.22-rc3 (with Reiser4 patch) oopses when watching videos with
mplayer using neofb console.

When mplayer starts I get these messages
(this is normal, repeating lines omitted):

neofb: no support for 32bpp
Mode (1024x768) larger than the LCD panel (800x600)
Mode (1152x864) larger than the LCD panel (800x600)
Mode (1024x1024) larger than the LCD panel (800x600)
Mode (1280x1024) larger than the LCD panel (800x600)

Ok, everything seems to work and I can watch the video.
However, when mplayer stops I get these warnings:

release_dev: driver.table[3] not tty for (tty4)
Warning: dev (tty4) tty->count(3) != #fd's(2) in release_dev
release_dev: driver.table[3] not tty for (tty4)

When I try to repeat the previous step the kernel oopses:

neofb: no support for 32bpp
Mode (1024x768) larger than the LCD panel (800x600)
Mode (1152x864) larger than the LCD panel (800x600)
Mode (1024x1024) larger than the LCD panel (800x600)
Mode (1280x1024) larger than the LCD panel (800x600)
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000731
printing eip:
c021e50e
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: binfmt_misc floppy loop xirc2ps_cs pcmcia usb_storage parport_pc parport yenta_socket rsrc_nonstatic pcmcia_core evdev uhci_hcd usbcore
CPU: 0
EIP: 0060:[<c021e50e>] Not tainted VLI
EFLAGS: 00010202 (2.6.22-rc3-atr #4)
EIP is at vt_ioctl+0xda8/0x1482
eax: 00000679 ebx: 00005600 ecx: c3d41200 edx: 00000000
esi: 0893159c edi: c0386a2f ebp: 00000003 esp: c26a5e28
ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068
Process mplayer (pid: 1502, ti=c26a5000 task=c3a22500 task.ti=c26a5000)
Stack: fffffffe c26a5f30 c0149fd9 c29fe400 c3d68600 00000001 4642f7b0 c3d0bcc0
c3c32464 00000101 00000001 00000000 4644793c 4642f7b0 00000002 00000005
0000001b 00000005 c34ad240 c3d0bcc0 c1386000 00000002 00000000 c021d766
Call Trace:
[<c0149fd9>] l...

To: Tero Roponen <teanropo@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>
Date: Monday, May 28, 2007 - 3:22 am

Hi Tero,

(I am cc'ing Alan as he's been working on tty code recently.)

Can we have your .config please? Also, could you work out the file and
line number of vt_ioctl+0xda8/0x1482 like this:

gdb vmlinux
(gdb) l *0xc021e50e

Pekka
-

To: Tero Roponen <teanropo@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>
Date: Monday, May 28, 2007 - 5:34 am

[snip]

I am getting this with your config:

(gdb) p vt_ioctl
$1 = {int (struct tty_struct *, struct file *, unsigned int, long
unsigned int)} 0xc01e404a <vt_ioctl>
(gdb) l *(0xc01e404a + 0xda8)
0xc01e4df2 is in vt_ioctl (drivers/char/vt_ioctl.c:720).
715 /*
716 * Returns the first available (non-opened) console.
717 */
718 case VT_OPENQRY:
719 for (i = 0; i < MAX_NR_CONSOLES; ++i)
720 if (! VT_IS_IN_USE(i))
721 break;
722 ucval = i < MAX_NR_CONSOLES ? (i+1) : -1;
723 goto setint;
724

Which seems to match the code dump in the OOPS as well. I am not sure
what %edx (which is zero and causes problems) should contain but I am
guessing tty_driver->ttys is corrupted which seems consistent with the
reference count sanity check failure. Unfortunately I am not familiar
enough with tty internals to immediately see why this is happening.
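
The address arithmetic in the gdb session above is simply the function's start address plus the offset from the "EIP is at" line. A minimal C sketch of that step follows; resolve_oops_offset() is a hypothetical helper written for illustration, not a kernel or gdb interface.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: given the start address of a
 * function and an oops location such as "vt_ioctl+0xda8/0x1482", return the
 * absolute address (start + offset). Returns 0 if the string has no
 * "+off/len" part or if the offset lies outside the function length.
 */
unsigned long resolve_oops_offset(unsigned long start, const char *loc)
{
    const char *plus, *slash;
    unsigned long off, len;

    assert(loc != NULL);
    plus = strchr(loc, '+');
    slash = strchr(loc, '/');
    if (!plus || !slash || slash < plus)
        return 0;

    off = strtoul(plus + 1, NULL, 16);   /* parsing stops at the '/' */
    len = strtoul(slash + 1, NULL, 16);
    if (off >= len)
        return 0;

    return start + off;
}
```

With the start address gdb printed above, resolve_oops_offset(0xc01e404a, "vt_ioctl+0xda8/0x1482") gives 0xc01e4df2, which is the address shown by "l *".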
-

To: Pekka Enberg <penberg@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>
Date: Tuesday, May 29, 2007 - 12:04 pm

FYI, I just tested 2.6.21.3. I couldn't reproduce the problem with
that kernel.

--
Tero Roponen
-

To: Tero Roponen <teanropo@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Tuesday, May 29, 2007 - 2:57 pm

Hi Tero,

Well, I went through all tty related patches that went in after 2.6.21
and didn't really find anything interesting, except this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...

But it does seem correct. You could try reverting that from 2.6.22-rc3
and see if you can trigger the bug.

So tty from filp->private_data does not match
tty->driver->ttys[tty->index] and release_dev bails out (thus messing

Presumably someone tries to close the file again which is why we get a
new complaint that reference counting has gone bad.
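
The sanity check described here can be pictured with a small user-space model. This is only a sketch under assumptions: the struct and function names below are simplified stand-ins, not the real kernel definitions, and the real release_dev() does much more than this comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumed, not real). */
struct tty_model;

struct tty_driver_model {
    struct tty_model **ttys;   /* table of open ttys, indexed by tty->index */
    int num;                   /* number of slots in the table */
};

struct tty_model {
    struct tty_driver_model *driver;
    int index;
};

/*
 * Return 1 if the tty taken from the file's private data is the one the
 * driver table holds at tty->index, 0 otherwise. A mismatch corresponds to
 * the "driver.table[N] not tty" warning quoted in the report.
 */
int tty_matches_driver_table(const struct tty_model *tty)
{
    assert(tty != NULL && tty->driver != NULL);
    if (tty->index < 0 || tty->index >= tty->driver->num)
        return 0;
    return tty->driver->ttys[tty->index] == tty;
}
```

In the report, the mismatch was flagged for slot 3 of the table (tty4).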

Unfortunately, I have no idea why driver->ttys does not match. It
could be a race with release_tty() or real use-after-free but I am
unable to find anything obvious in 2.6.21 -> 2.6.22-rc3 that would
break it. Doing the git bisect dance here would really help...
-

To: Pekka Enberg <penberg@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Tuesday, May 29, 2007 - 11:57 pm

[resend, mailer didn't like unzipped applications]

Hmmm, I just found something interesting. In 2.6.21.3 the /sbin/init
gets corrupted when I watch the video!

$ cp /sbin/init init.before
$ mplayer kiwi.flv
$ cp /sbin/init init.after

The sha1sums are here:

52c8d643057619cbe137b8e69d4709ce3bdd832d init.after
8efc7864a5b535a9e336fa82e9d7f112f3d956c1 init.before

It seems that something corrupts memory somewhere...

I attached those files in case someone can figure out
what is happening.

--
Tero Roponen

To: Tero Roponen <teanropo@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 1:54 am

To debug this a bit further:

$ od -a -t x1 -v init.after > init.after.dump
$ od -a -t x1 -v init.before > init.before.dump
$ diff -u init.before.dump init.after.dump | less

-0011340 nul nul nul e9 f0 fe ff ff ff % < soh enq bs h 80
- 00 00 00 e9 f0 fe ff ff ff 25 3c 01 05 08 68 80
+0010000 y ack nul nul y ack nul nul y ack nul nul y ack nul nul
+ 79 06 00 00 79 06 00 00 79 06 00 00 79 06 00 00
+0010020 y ack nul nul y ack nul nul y ack nul nul y ack nul nul
+ 79 06 00 00 79 06 00 00 79 06 00 00 79 06 00 00
+0011340 y ack nul nul y ack nul nul ff % < soh enq bs h 80
+ 79 06 00 00 79 06 00 00 ff 25 3c 01 05 08 68 80

The file is overwritten with the byte pattern 79 06 00 00 from offset
0010000 up to, but not including, 0011350 (od offsets are octal).
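
Since od prints offsets in octal and the dump shows raw bytes, it can help to convert both. A minimal C sketch, assuming a little-endian reading of the pattern (the machine is 32-bit x86); od_offset_to_bytes() and le32() are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert an od(1) offset column, which is octal by default, to a byte count. */
unsigned long od_offset_to_bytes(const char *octal)
{
    assert(octal != NULL);
    return strtoul(octal, NULL, 8);
}

/* Assemble four dump bytes into a 32-bit value, least significant byte first. */
uint32_t le32(const unsigned char b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

od_offset_to_bytes("0010000") is 4096 and od_offset_to_bytes("0011350") is 4840, so if the damaged range is contiguous it covers about 744 bytes. le32() on 79 06 00 00 gives 0x00000679, which is also the value shown in eax in the oops report at the start of the thread.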

Do you see anything in the logs or is this a silent corruption? Did
you see this corruption with 2.6.19 or 2.6.22-rc3?
-

To: Pekka Enberg <penberg@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 8:02 am

I recompiled 2.6.22-rc3 and booted it with slub_debug. Now I can't oops
the kernel, but ./slab_info -v gives me a warning:

neofb: no support for 32bpp
[repeated 13 times]
Mode (1024x768) larger than the LCD panel (800x600)
[repeated 20 times]
Mode (1152x864) larger than the LCD panel (800x600)
[repeated 9 times]
Mode ...

To: Tero Roponen <teanropo@...>
Cc: Pekka Enberg <penberg@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 11:39 am

So something did an overwrite of a 1024-byte kmalloc. Unfortunately that
overwrite seems to have trashed our last-alloc info, so we don't know who
allocated that memory. Darn.

Does the problem go away if you disable CONFIG_SLUB and enable CONFIG_SLAB?

-

To: Andrew Morton <akpm@...>
Cc: Pekka Enberg <penberg@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 12:01 pm

> > Object 0xc10be8c0: ff ff ff ff ff ff ff ff 00 00 00 00 a8 61 00 00

To: Tero Roponen <teanropo@...>
Cc: Andrew Morton <akpm@...>, Pekka Enberg <penberg@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 7:17 pm

BTW, that was impressive. You exposed a long-standing bug in neofb,
thanks.

And just FYI, you can also trigger it by doing fbset -depth 24.

Tony

-

To: Tero Roponen <teanropo@...>
Cc: Andrew Morton <akpm@...>, Pekka Enberg <penberg@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 6:14 pm

It's a fb_setcolreg() bug in neofb. Try this patch?

Tony

To: Tero Roponen <teanropo@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 2:13 pm

Nice. Does this also trigger the file corruption on 2.6.21.3?
-

To: Pekka Enberg <penberg@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 2:27 pm

Yes:

[root@terrop ~]# init
Usage: init 0123456SsQqAaBbCcUu
[root@terrop ~]# ./oops
[root@terrop ~]# init
init: error while loading shared libraries: unexpected PLT reloc type 0xcc

--
Tero Roponen
-

To: Tero Roponen <teanropo@...>
Cc: Pekka Enberg <penberg@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>, <linux-fbdev-devel@...>, Antonino A. Daplas <adaplas@...>
Date: Wednesday, May 30, 2007 - 12:09 pm

cc's added ;)

Thanks.

Tony, this is with SLUB enabled, which might be detecting a
hitherto-undetected bug.

Config is at http://userweb.kernel.org/~akpm/config-tero.txt

-

To: Andrew Morton <akpm@...>
Cc: Tero Roponen <teanropo@...>, Pekka Enberg <penberg@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>, <linux-fbdev-devel@...>, Antonino A. Daplas <adaplas@...>
Date: Wednesday, May 30, 2007 - 2:04 pm

Two suspicious things for me:

1)

--- a/drivers/video/neofb.c
+++ b/drivers/video/neofb.c
@@ -1295,7 +1295,7 @@ static int neofb_setcolreg(u_int regno,
outb(blue >> 10, 0x3c9);
break;
case 16:
- ((u32 *) fb->pseudo_palette)[regno] =
+ ((u16 *) fb->pseudo_palette)[regno] =
((red & 0xf800)) | ((green & 0xfc00) >> 5) |
((blue & 0xf800) >> 11);
break;

2) palette in neofb_par is "u32 palette[16];", which is 4x16 = 64 bytes.
struct fb_info::pseudo_palette is assigned to it in neo_alloc_fb_info().
Yet the check at the beginning of neofb_setcolreg() is against the color
map length, which neofb advertises as 256. That seems too many.

printk()s showing "regno" at the beginning of neofb_setcolreg() are
welcome.

Alexey, who only knows how to spell framebuffer and a bit.

-

To: Alexey Dobriyan <adobriyan@...>
Cc: Andrew Morton <akpm@...>, Tero Roponen <teanropo@...>, Pekka Enberg <penberg@...>, <linux-kernel@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>, <linux-fbdev-devel@...>
Date: Wednesday, May 30, 2007 - 7:14 pm

Yes, 256 is too many. The pseudo_palette is used for the 16-color
console only.

I'm impressed that this bug has escaped notice for this long. It has
been present since the 2.5.x era.

Probably the best thing to do is to hide the pseudo_palette from the
drivers and move it to the console layer where it belongs, to spare
future driver writers from palette usage confusion. That will be a
thankless job.

Tony

-

To: <linux-fbdev-devel@...>
Cc: Alexey Dobriyan <adobriyan@...>, <linux-kernel@...>, Pekka Enberg <penberg@...>, Tero Roponen <teanropo@...>, Andy Whitcroft <apw@...>, Andrew Morton <akpm@...>, Alan Cox <alan@...>
Date: Thursday, May 31, 2007 - 3:17 am

The console layer doesn't know how to fill in the pseudo palette in all
cases; that's why the drivers have to do it.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
-

To: <linux-fbdev-devel@...>
Cc: <linux-kernel@...>, Pekka Enberg <penberg@...>, Tero Roponen <teanropo@...>, Andy Whitcroft <apw@...>, Andrew Morton <akpm@...>, Alexey Dobriyan <adobriyan@...>, Alan Cox <alan@...>, Geert Uytterhoeven <geert@...>
Date: Thursday, May 31, 2007 - 5:04 am

I have actually started working on that. It involves breaking down
fb_setcolreg() so it deals only with writing to the actual hardware
registers. The part of fb_setcolreg() that adds entries to the
pseudo_palette can be separated out as a new method, fb_get_pixel():
given red, green, blue and transp, the driver returns a u32 pixel value
that can be written to the pseudo_palette.

So fbcon can hold a copy of the pseudo_palette and fill it up by
calling info->fbops->fb_get_pixel() successively.

This will touch the logo code, the drawing libraries, each driver, etc.,
so it's a lot of work. During the conversion period, we would support
having info->pseudo_palette and fbcon->pseudo_palette at the same time.
Once all drivers are converted, we can remove info->pseudo_palette.

One use for having an fb_get_pixel() method is we can use this as an
rgb888-image-to-raw-framebuffer-format converter.

Currently, I have only converted vesafb. Once the core code is done,
I'll start converting the rest of the drivers one by one.

Tony

-

To: <adaplas@...>
Cc: <adobriyan@...>, <akpm@...>, <teanropo@...>, <penberg@...>, <linux-kernel@...>, <alan@...>, <apw@...>, <linux-fbdev-devel@...>
Date: Wednesday, May 30, 2007 - 7:18 pm

From: "Antonino A. Daplas" <adaplas@gmail.com>

Many many drivers allocate 256 entries, just FYI :-) They
all should be fixed up I guess.
-

To: David Miller <davem@...>
Cc: <adobriyan@...>, <akpm@...>, <teanropo@...>, <penberg@...>, <linux-kernel@...>, <alan@...>, <apw@...>, <linux-fbdev-devel@...>
Date: Wednesday, May 30, 2007 - 7:28 pm

I did a pseudo_palette allocation audit before; it might be high time to
run one again :-(

Tony

-

To: Pekka Enberg <penberg@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 2:00 am

2.6.19.2 has been very stable for me.
2.6.21.3 has this silent corruption (nothing in logs)
2.6.22-rc3 oopses when watching videos.

--
Tero Roponen
-

To: Tero Roponen <teanropo@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>, Andy Whitcroft <apw@...>
Date: Wednesday, May 30, 2007 - 1:59 am

Btw, please send us a strace log of the mplayer run for 2.6.21.3 and
2.6.22-rc3 so that we can see what it's doing. Furthermore, if you can
bisect this, please treat the bugs as separate for now.
-

To: Pekka Enberg <penberg@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>
Date: Monday, May 28, 2007 - 4:08 am

My .config is appended. I don't have access to that computer right
now, but I'll try to do that later.

--
Tero Roponen

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.22-rc3-atr
# Sun May 27 11:07:24 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CON...

To: Tero Roponen <teanropo@...>
Cc: <linux-kernel@...>, <akpm@...>, Alan Cox <alan@...>
Date: Monday, May 28, 2007 - 3:47 am

Btw, this only works if you have CONFIG_DEBUG_INFO set. In case it
wasn't, please do:

- Enable CONFIG_DEBUG_INFO
- make vmlinux
- gdb vmlinux
(gdb) p vt_ioctl
(gdb) l *(0x<address of vt_ioctl> + 0xda8)
-
