Re: Problems with disk + network on 2.6.25.11-60.fc8

Previous thread: [PATCH] KBUILD: Extend "menuconfig" for modules to simplify Kconfig file. by Robert P. J. Day on Monday, August 4, 2008 - 10:31 am. (3 messages)

Next thread: [PATCH] BLOCK: Simplify the Kconfig structure for the BLOCK layer. by Robert P. J. Day on Monday, August 4, 2008 - 10:50 am. (1 message)
From: David Stuart
Date: Monday, August 4, 2008 - 10:39 am

Hello everyone,

I have trawled the depths of the Internet, scoured the innermost reaches 
of the Usenet, and finally I arrive, beaten and bruised, at the steps to 
the Linux kernel mailing list, to seek advice from the penguins 
themselves. I humbly prostrate myself .. :)

My first request: Please CC me directly on replies as I am not 
subscribed to the list.

Now to the meat of it; I have been experiencing a lot of trouble with 
system freezes; but these are not crippling freezes in the sense that 
they come back after a few seconds. They are always accompanied by the 
following log in /var/log/messages:

----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<
Aug  3 21:44:34 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Aug  3 21:44:34 localhost kernel: ata1.00: cmd ca/00:40:b8:c9:f3/00:00:00:00:00/e8 tag 0 dma 32768 out
Aug  3 21:44:34 localhost kernel:          res 40/00:78:00:00:00/00:00:00:00:00/50 Emask 0x4 (timeout)
Aug  3 21:44:34 localhost kernel: ata1.00: status: { DRDY }
Aug  3 21:44:39 localhost kernel: ata1: port is slow to respond, please be patient (Status 0x80)
Aug  3 21:44:44 localhost kernel: ata1: device not ready (errno=-16), forcing hardreset
Aug  3 21:44:44 localhost kernel: ata1: soft resetting link
Aug  3 21:44:45 localhost kernel: ata1.00: configured for UDMA/100
Aug  3 21:44:45 localhost kernel: ata1.01: configured for UDMA/100
Aug  3 21:44:45 localhost kernel: ata1: EH complete
Aug  3 21:44:45 localhost kernel: sd 0:0:0:0: [sda] 160086528 512-byte hardware sectors (81964 MB)
Aug  3 21:44:45 localhost kernel: sd 0:0:0:0: [sda] Write Protect is off
Aug  3 21:44:45 localhost kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Aug  3 21:44:45 localhost kernel: sd 0:0:1:0: [sdb] 488397168 512-byte hardware sectors (250059 MB)
Aug  3 21:44:45 localhost kernel: sd 0:0:1:0: [sdb] Write Protect is off
Aug  3 21:44:45 ...
From: Alan Cox
Date: Monday, August 4, 2008 - 10:42 am

To start with, can I have a dmesg after boot and a description of what is
plugged in where (disk- and CD-wise)?

Alan
--

From: David Stuart
Date: Monday, August 4, 2008 - 1:31 pm

Hi Alan,

Sure, no problem. First the description, I'll append the dmesg output at 
the end.

I have an ASUS A8V motherboard, with a Maxtor Ultra100 IDE controller 
card on one of the PCI ports. On the mainboard:
- Primary IDE Master : 13GB Quantum Fireball HD (Where Windows resides, 
not that I use it).
- Primary IDE Slave : LG CD-ROM, 52X
- Secondary IDE : Nothing

On the IDE controller card:
- "IDE1 slot" Master : 80GB Maxtor
- "IDE1 slot" Slave : 250GB Western Digital
- "IDE2 slot" : Nothing

The output of dmesg (after booting) follows:
==================================

Initializing cgroup subsys cpuset
Linux version 2.6.25.11-60.fc8 (mockbuild@x86-7) (gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)) #1 SMP Mon Jul 21 01:40:51 EDT 2008
Command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003ffb0000 (usable)
 BIOS-e820: 000000003ffb0000 - 000000003ffc0000 (ACPI data)
 BIOS-e820: 000000003ffc0000 - 000000003fff0000 (ACPI NVS)
 BIOS-e820: 000000003fff0000 - 0000000040000000 (reserved)
 BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 3200 used
Entering add_active_range(0, 256, 262064) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP 000FA810, 0021 (r2 ACPIAM)
ACPI: XSDT 3FFB0100, 003C (r1 A M I  OEMXSDT   5000729 MSFT       97)
ACPI: FACP 3FFB0290, 00F4 (r3 A M I  OEMFACP   5000729 MSFT       97)
ACPI: DSDT 3FFB03F0, 391B (r1  A0277 A0277001        1 MSFT  100000D)
ACPI: FACS 3FFC0000, 0040
ACPI: APIC 3FFB0390, 0052 (r1 A M I  OEMAPIC   5000729 MSFT       97)
ACPI: OEMB 3FFC0040, 003F (r1 A M I  OEMBIOS   5000729 MSFT       97)
Scanning NUMA topology in Northbridge 24
No NUMA configuration found
Faking a node at ...
From: Alan Cox
Date: Monday, August 4, 2008 - 1:44 pm

Ok so the pdc202xx_old hardware flakes out when you have very high
network load (I'd guess in fact very high bus traffic).

The actual log is the disk I/O timing out, then the drive being busy
(probably due to the timeout and a DMA transfer getting stuck). We reset
it and carry on.
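
To make that reading of the log concrete, here is a minimal sketch in C that decodes the "cmd" line from the first message. It assumes the usual libata layout of that line (command/feature:nsect:lbal:lbam:lbah, then the five high-order bytes, then the device register); the struct and function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ATA status register bits that appear in the log. */
#define ATA_STATUS_BSY  0x80u  /* "Status 0x80": drive busy   */
#define ATA_STATUS_DRDY 0x40u  /* "status: { DRDY }": ready   */

/* Hypothetical container for one taskfile as printed by libata. */
struct tf_fields {
    unsigned command;  /* opcode on a "cmd" line, status on a "res" line */
    unsigned feature;  /* feature on "cmd", error register on "res" */
    unsigned nsect, lbal, lbam, lbah;
    unsigned hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah;
    unsigned device;
};

/* Parse e.g. "ca/00:40:b8:c9:f3/00:00:00:00:00/e8". Returns 0 on success. */
static int parse_tf(const char *s, struct tf_fields *tf)
{
    assert(s != NULL && tf != NULL);
    int n = sscanf(s, "%2x/%2x:%2x:%2x:%2x:%2x/%2x:%2x:%2x:%2x:%2x/%2x",
                   &tf->command, &tf->feature, &tf->nsect,
                   &tf->lbal, &tf->lbam, &tf->lbah,
                   &tf->hob_feature, &tf->hob_nsect, &tf->hob_lbal,
                   &tf->hob_lbam, &tf->hob_lbah, &tf->device);
    return n == 12 ? 0 : -1;
}

/* 28-bit LBA: low nibble of the device register holds bits 27..24. */
static uint32_t tf_lba28(const struct tf_fields *tf)
{
    return ((uint32_t)(tf->device & 0x0fu) << 24) |
           ((uint32_t)tf->lbah << 16) |
           ((uint32_t)tf->lbam << 8) |
           (uint32_t)tf->lbal;
}

/* Sector count for 28-bit commands: a count of 0 means 256 sectors. */
static unsigned tf_sectors28(const struct tf_fields *tf)
{
    return tf->nsect ? tf->nsect : 256u;
}
```

Fed the string "ca/00:40:b8:c9:f3/00:00:00:00:00/e8", this gives opcode 0xCA (WRITE DMA) and 0x40 = 64 sectors, i.e. 64 * 512 = 32768 bytes, which matches the "dma 32768 out" in the log. The 0x40 at the start of the "res" line is the DRDY status bit, and the later "Status 0x80" is BSY, the busy drive described above.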

Libata happens to log this a lot more visibly than old kernels which is
useful but does mean people sometimes don't notice.

The rest then fits - the freeze I'd expect as we block I/O while trying
to get the drive back.

Doubt the Nvidia module is involved as I'd then expect problems under
high graphical load but you can certainly test that. I don't suppose
you've got a spare PCI network card you could try instead to see if it is
the network card bits?

Alan
--

From: David Stuart
Date: Monday, August 4, 2008 - 4:27 pm

Hi Alan,

Actually I do not really have another PCI network card, but I could 
switch the computer back to the other interface which is on the 
motherboard (does that one function as a PCI device?). As I mentioned in 
my first post, the current card I am using is an attempt to try to work 
around the problem (originally I thought it was the on-board 
controller), so I have my doubts as to whether switching back would help.

Nonetheless, I will give it a try again and let you know the result.

Thanks,
David

-- 
The only way to keep your health is to eat what you don't want, drink
what you don't like, and do what you'd rather not.
                -- Mark Twain

From: Alan Cox
Date: Monday, August 4, 2008 - 4:13 pm

That would be great - if it makes no difference, yet high network traffic
is the key factor then it mostly eliminates bugs in the network drivers
from suspicion. At that point it's time to dig deeper into the chip config.
--

From: David Stuart
Date: Monday, August 4, 2008 - 4:58 pm

Hi Alan,

So I switched back to my old on-board network card (removing the prior 
card altogether from the case). I tried my test-case, which involves 
downloading some big files while simultaneously running a find command.

I *was* able to reproduce the issue fairly quickly.

The key difference is that this time I was using the skge kernel module 
for networking. So, I tend to agree that it is most likely not a network 
driver problem.

Please let me know if there is anything further I can do to assist in 
debugging.

Thanks,
David

-- 
The only way to keep your health is to eat what you don't want, drink
what you don't like, and do what you'd rather not.
                -- Mark Twain

From: David Stuart
Date: Wednesday, August 6, 2008 - 6:43 pm

Hello,

It occurs to me that you might be implying there is some "chip config" 
that I can retrieve and give to you. Is there anything I can/should do 
in order to give you more information, or is this pretty much out of my 
hands at this point?

Should I be filing a bug report for this?


-- 
The only way to keep your health is to eat what you don't want, drink
what you don't like, and do what you'd rather not.
                -- Mark Twain

From: Alan Cox
Date: Thursday, August 7, 2008 - 3:36 am

There are two things you can play with. One is the UDMA burst mode on the
chip which should be getting set, the other is PCI latencies (which the

Probably a good idea

Next things to try are

1.	edit drivers/ata/pata_pdc202xx_old.c
	after the line

	iowrite8(burst | 0x01, bmdma + 0x1f);

add

	printk(KERN_ERR "BURST was %02X\n", burst);

build/install/boot that kernel and see what it says.

The second (sledgehammer) approach would be to use setpci to set the
LATENCY_TIMER value on the pdc202xx_old and network card differently.
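
Before changing anything with setpci it may help to see what the latency timers are currently set to. The latency timer is the byte at offset 0x0D of a device's PCI configuration header, and Linux exposes that header as /sys/bus/pci/devices/<address>/config. The following is a minimal read-only sketch in C; the function name is made up for illustration, and the device address has to come from lspci on the machine in question.

```c
#include <assert.h>
#include <stdio.h>

/* Offset of the latency timer byte in the standard PCI config header. */
#define PCI_LATENCY_TIMER_OFFSET 0x0d

/*
 * Read the latency timer from an already opened PCI config stream,
 * e.g. fopen("/sys/bus/pci/devices/<address>/config", "rb").
 * Returns the value (0..255), or -1 if the byte could not be read.
 */
static int pci_latency_timer(FILE *cfg)
{
    assert(cfg != NULL);
    if (fseek(cfg, PCI_LATENCY_TIMER_OFFSET, SEEK_SET) != 0)
        return -1;
    int byte = fgetc(cfg);
    return byte == EOF ? -1 : byte;
}
```

Run it once against the pdc202xx_old controller and once against the network card to compare the two values. Writing a new value is what setpci is for, typically in the form "setpci -s <bus:dev.fn> LATENCY_TIMER=<hex>" as root; which values are worth trying is not stated in this thread.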

Alan
--
