Re: Strange system hangs

Previous thread: Re: kernel 2.6.23-rc6 hangs on Geode GX1 by Marco Tralli on Friday, September 28, 2007 - 4:35 am. (1 message)

Next thread: [GIT PULL] Correct the SMAP check in the e820 probe by H. Peter Anvin on Friday, September 28, 2007 - 4:56 am. (4 messages)
To: Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, September 28, 2007 - 4:42 am

Hello,

I am experiencing weird system hangs. Once about 2-5 weeks system freezes=
=20
and stops accepting remote connections, so it is no longer possible to=20
connect to most important services: smtp (postfix), www (squid) or even=20
ssh. Such connection is accepted but then it hangs.

What is strange, that previously established ssh session is usable. It is=
=20
possible to work on such system until you do something stupid like "less=20
/var/log/all.log". Using strace I found that process blocks on:

--- strace: being ---
execve("/usr/bin/tail", ["tail", "-f", "/var/log/all.log"], [/* 33 vars */]=
) =3D 0
brk(0) =3D 0x8052000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
=3D 0x6ff00000
access("/etc/ld.so.preload", R_OK) =3D -1 ENOENT (No such file or dire=
ctory)
open("/etc/ld.so.cache", O_RDONLY) =3D 3
fstat64(3, {st_mode=3DS_IFREG|0644, st_size=3D20944, ...}) =3D 0
mmap2(NULL, 20944, PROT_READ, MAP_PRIVATE, 3, 0) =3D 0x6fefa000
close(3) =3D 0
open("/lib/libc.so.6", O_RDONLY) =3D 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0RY\1\0004\0\0\0"...=
, 512) =3D 512
fstat64(3, {st_mode=3DS_IFREG|0755, st_size=3D1175920, ...}) =3D 0
mmap2(NULL, 1185212, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =
=3D 0x6fdd8000
mmap2(0x6fef4000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DE=
NYWRITE, 3, 0x11b) =3D 0x6fef4000
mmap2(0x6fef7000, 9660, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANO=
NYMOUS, -1, 0) =3D 0x6fef7000
close(3) =3D 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
=3D 0x6fdd7000
set_thread_area({entry_number:-1 -> 6, base_addr:0x6fdd76b0, limit:1048575,=
seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_prese=
nt:0, useable:1}) =3D 0
mprotect(0x6fef4000, 4096, PROT_READ) =3D 0
mprotect(0x6ff1c000, 4096, PROT_READ) =3D 0
munmap(0x6fefa...

To: Krzysztof Oledzki <olel@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, September 28, 2007 - 4:14 pm

Is this a regression? If so, what's the most recent kernel that didn't show
the problem?

The symptoms could be consistent with some place doing a
balance_dirty_pages while holding a lock that is required for IO, but I can't
see a smoking gun (you've got contention on i_mutex, but that should be
OK).

Can you see if there is any memory under writeback that isn't being
completed (sysrq+M), also a list the locks held after the hang might be
helpful (compile in lockdep and sysrq+D)

Is anything currently running? (sysrq+P and even a full sysrq+T task list
could be useful).

Are any IO errors occurring at all?

Thanks,
Nick
-

To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, <osterried@...>, Andrew Morton <akpm@...>, Peter Zijlstra <peterz@...>
Date: Sunday, December 2, 2007 - 11:09 am

It seems that 2.6.23.x still fails but somehow different. I updated my=20
bugreport at: http://bugzilla.kernel.org/show_bug.cgi?id=3D9182. There are=
=20
new attachments with traces and an oops that happened while I was taking=20
the debugging data.

Thank you.

Best regards,

=09=09=09Krzysztof Ol=EAdzki

To: Nick Piggin <nickpiggin@...>
Cc: Peter Zijlstra <peterz@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 3:54 am

OK. It did it again :(, so I send additional debug information. Today I=20
discovered that "echo i > /proc/sysrq-trigger" makes my system being=20

SysRq : Show Memory
Mem-info:
DMA per-cpu:
Normal per-cpu:
Active:281859 inactive:140551 dirty:46161 writeback:0 unstable:0
free:39081 slab:50571 mapped:3261 pagetables:356 bounce:0
DMA free:8704kB min:512kB low:640kB high:768kB active:48kB inactive:0kB pre=
sent:16256kB pages_scanned:12 all_unreclaimable? no
lowmem_reserve[]: 0 2015
Normal free:147620kB min:65016kB low:81268kB high:97524kB active:1127388kB =
inactive:562204kB present:2064260kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
DMA: 4*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*20=
48kB 2*4096kB =3D 8704kB
Normal: 15781*4kB 522*8kB 264*16kB 56*32kB 9*64kB 8*128kB 6*256kB 1*512kB 1=
*1024kB 2*2048kB 16*4096kB =3D 147620kB
Swap cache: add 5022, delete 4928, find 2281/2804, race 0+0
Free swap =3D 2927208kB
Total swap =3D 2927608kB
Free swap: 2927208kB
524224 pages of RAM
0 pages of HIGHMEM
6345 reserved pages
249411 pages shared
94 pages swap cached
46161 pages dirty
0 pages writeback
3261 pages mapped
50571 pages slab

SysRq : Show Locks Held

Showing all locks held in the system:
1 lock held by syslogd/2197:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by freshclam/2898:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by crond/3061:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by snmpd/3546:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by squid/3606:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by zabbix_agentd/3728:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_a...

To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, October 9, 2007 - 4:08 pm

Mem-info:
DMA per-cpu:
Normal per-cpu:
Active:340169 inactive:117325 dirty:48579 writeback:0 unstable:0
free:25266 slab:28357 mapped:3109 pagetables:816 bounce:0
DMA free:8784kB min:512kB low:640kB high:768kB active:0kB inactive:72kB pre=
sent:16256kB pages_scanned:36 all_unreclaimable? no
lowmem_reserve[]: 0 2015
Normal free:92280kB min:65016kB low:81268kB high:97524kB active:1360676kB i=
nactive:469228kB present:2064260kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
DMA: 12*4kB 6*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2=
048kB 2*4096kB =3D 8784kB
Normal: 2724*4kB 1863*8kB 95*16kB 0*32kB 1*64kB 3*128kB 0*256kB 0*512kB 1*1=
024kB 1*2048kB 15*4096kB =3D 92280kB
Swap cache: add 6396, delete 6331, find 3048/3691, race 0+0
Free swap =3D 2927260kB
Total swap =3D 2927608kB
Free swap: 2927260kB
524224 pages of RAM
0 pages of HIGHMEM
6345 reserved pages
302073 pages shared
65 pages swap cached
48579 pages dirty
0 pages writeback
3109 pages mapped
28357 pages slab
816 pages pagetables

SysRq : Show Locks Held
Showing all locks held in the system:
1 lock held by syslogd/2163:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by freshclam/2864:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by named/2925:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by crond/3027:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by squid/3559:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by zabbix_agentd/3701:
#0: (&inode->i_mutex){--..}, at: [<7814aabc>] generic_file_aio_write+0x4=
3/0xb6
1 lock held by agetty/3753:
#0: (&tty->atomic_read_lock){--..}, at: [<7829cdde>] read_chan+0x1b1/0x5=
4b
1 lock held...

To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, September 29, 2007 - 11:54 am

OK. I'll try to do it next time if there will be a chance. It may take=20

I'll have to check - maybe I have this captured. If not I'll check it next=
=20

Didn't notice - so no.

Thank you.

Best regards,

=09=09=09Krzysztof Ol=EAdzki

To: Krzysztof Oledzki <olel@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, September 28, 2007 - 5:17 am

This trace puzzles me, what is: unix_dgram_recvmsg doing there.
Also, it has two invocations of: ext3_file_write

single write, no networking, also stuck in balance_dirty_pages().

-

To: Peter Zijlstra <peterz@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, September 28, 2007 - 5:44 am

No, no loopback:

# mount
/dev/md0 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
devpts on /dev/pts type devpts (rw,nosuid,noexec)
/dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=3Djournal)
/dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=3Djournal)
/dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 (rw,nosui=
d,nodev,noatime,data=3Dwriteback)
/dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 (rw,nosu=
id,nodev,noatime,data=3Dwriteback)
/dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 (rw,nosuid,node=
v,noatime)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=3D0664,devgid=
=3D85)
owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs (ro,nosuid,nodev,noatime,bg=
,intr,tcp,addr=3D192.168.129.26)

Exactly. Strange, isn't it?

Thanks.

Best regards,

=09=09=09=09Krzysztof Ol=EAdzki

Previous thread: Re: kernel 2.6.23-rc6 hangs on Geode GX1 by Marco Tralli on Friday, September 28, 2007 - 4:35 am. (1 message)

Next thread: [GIT PULL] Correct the SMAP check in the e820 probe by H. Peter Anvin on Friday, September 28, 2007 - 4:56 am. (4 messages)