2.6.24-rc1 - Regularly getting processes stuck in D state on startup

To: Linux Kernel Mailing List <linux-kernel@...>
Date: Monday, November 5, 2007 - 2:23 pm

I've been testing rc1 for a week or so, and about 25% of the time I'm
seeing Firefox and Thunderbird getting stuck in 'D' state as they start up.

I've attached the output of Sysrq-T to this mail. The system is a
dual-core AMD64, and files are on a RAID-1 root partition connected to two
SATA disks on the on-board NVidia controller. I've had no problems
before 2.6.24-rc1.
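
For anyone trying to reproduce this, a minimal sketch of how stuck tasks can be spotted without a full Sysrq-T dump. It assumes a procps-style `ps`; the function names `filter_dstate` and `list_dstate` are made up for this example and are not part of the original report.

```shell
#!/bin/sh
# Keep only rows whose STAT column (field 2) starts with "D"
# (uninterruptible sleep). Expects "PID STAT WCHAN COMMAND" rows on stdin;
# the header row is dropped because "STAT" does not start with "D".
filter_dstate() {
    awk '$2 ~ /^D/'
}

# List PID, state, kernel wait channel and command name of every D-state task.
# Assumption: procps ps, which accepts the "wchan:32" width suffix.
list_dstate() {
    ps -eo pid,stat,wchan:32,comm | filter_dstate
}
```

Running `list_dstate` right after a hung startup should show the affected processes together with the kernel function they are sleeping in.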

Cheers
David

To: David <david@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, November 6, 2007 - 4:00 am

David, thank you for the report.

Could you try the 4 attached patches? Two of them are expected to
fix your problem; the other two are for debugging (in case the problem
persists).

Thank you,
Fengguang

To: Fengguang Wu <wfg@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, November 6, 2007 - 2:03 pm

I've applied the patches, and have tried a few reboots with no problems
so far. I will report back if I see any further problems.

Thanks
David

To: David <david@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, November 6, 2007 - 2:46 am

I am seeing something very similar on a PowerPC machine where copying a
file from an LVM volume with ext3 on it to a simple scsi partition (again
ext3) on the same disk will hang in congestion_wait. If I am patient
enough, the copy makes very slow progress. A kill -9 will kill it
eventually, but a simple control-C will not.

This hang occurs more often than not (and usually when I am trying to
install a new kernel into /boot for testing :-)).

I don't have access to the machine today, but if more information would
be useful, I could boot into 2.6.24-rc1-<mumble> again tomorrow.

-- 
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

To: David <david@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, November 6, 2007 - 11:24 pm

On Tue, 6 Nov 2007 17:46:26 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:

Turns out a simple control-C would kill the copy, I was just not patient
enough :-)

-- 
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

To: Stephen Rothwell <sfr@...>
Cc: David <david@...>, Linux Kernel Mailing List <linux-kernel@...>, Fengguang Wu <fengguang.wu@...>, Andrew Morton <akpm@...>, Dave Chinner <dgc@...>, Christoph Lameter <clameter@...>
Date: Tuesday, November 6, 2007 - 8:20 am

LVM will provide a different BDI (backing device info) even though it
could be on the same disk as another 'real' partition. Still, that should
not make the copy take that long.

I tried copying a 1M file from the LVM volume to a real partition on the
same disk (after ensuring the LVM volume had all the dirty limit), and it
works as advertised.

x86_64 SMP PREEMPT v2.6.24-rc1-748-g2655e2c + the four attached patches
rawhide x86_64 userland

To test this scenario I made an lvm thingy /dev/lvm/foo on /dev/sdb6

/dev/sda3 -> /
/dev/sdb1 -> /mnt/sdb1
/dev/lvm/foo -> /mnt/foo

All ext3 for this test.

The pretty numbers come from:

# while sleep 1; do cat /sys/class/bdi/*/bdi_dirty_kb | awk '{ t=$0; n+=$0;
    while (getline) { t=t " " $0; n+=$0; } ;
    getline total < "/sys/class/bdi/sda/dirty_kb" ;
    print t " : " n "/" total }' ; done

while doing:

# dd if=/dev/zero of=/mnt/foo/zero bs=4096 count=$((1024*1024/4))
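
For scale, the dd invocation above writes 1 GiB of zeroes. A small sketch of the arithmetic, using only the block size and count from that command (the variable names are illustrative):

```shell
#!/bin/sh
# Size written by: dd bs=4096 count=$((1024*1024/4))
blocks=$((1024 * 1024 / 4))   # number of 4096-byte blocks
bytes=$((blocks * 4096))      # total bytes written
echo "$blocks blocks, $bytes bytes, $((bytes / 1024 / 1024)) MiB"
```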

dm-0 ............................................. sda sdb ..........

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 159440 0 0 0 0 0 0 : 159440/193540
5848 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 89588 0 0 0 0 0 0 : 95436/193092
41488 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 82908 0 0 0 0 0 0 : 124396/192576
69984 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 62100 0 0 0 0 0 0 : 132084/191952
93488 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 67132 0 0 0 0 0 0 : 160620/191752
114452 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 57676 0 0 0 0 0 0 : 172128/191696
124260 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 53508 0 0 0 0 0 0 : 177768/191544
138072 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 53140 0 0 0 0 0 0 : 191212/191252
145004 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 45748 0 0 0 0 0 0 : 190752/190804
155408 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 35508 0 0 0 0 0 0 : 190916/190920
162252 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29192 0 0 0 0 0 0 : 191444/191392
165968 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25108 0 0 0 0 0 0 : 191076/...
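
Each row above is one sample: the dirty count of every BDI, then their sum over the limit read from sda. A more readable sketch of the same sampling step follows. The file names `bdi_dirty_kb` and `dirty_kb` are taken from the one-liner in this message, and whether they exist depends on the kernel and patches in use. The function name `bdi_sample` and its two parameters are assumptions added for illustration.

```shell
#!/bin/sh
# Print one sample: every per-BDI dirty count, then "sum/limit".
# $1: directory holding the per-device entries (default /sys/class/bdi)
# $2: device whose dirty_kb file supplies the limit (default sda)
bdi_sample() {
    root=${1:-/sys/class/bdi}
    dev=${2:-sda}
    cat "$root"/*/bdi_dirty_kb | awk -v limit_file="$root/$dev/dirty_kb" '
        # Append each count to the row and add it to the running sum.
        { row = row (NR > 1 ? " " : "") $0; sum += $0 }
        END {
            # Read the limit once, after all per-device counts are in.
            getline total < limit_file
            print row " : " sum "/" total
        }'
}
```

To reproduce the once-per-second output, wrap it in the same loop: `while sleep 1; do bdi_sample; done`.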
