Hi,

Kernel version: 2.6.22-rc5 (confirmed also on 2.6.20)
Kernel config : Ubuntu 7.04 default (SMP)
Relevant hardware:
  Asus P5K (Intel P35 chipset)
  Core 2 Duo E6600 2.4GHz
  Western Digital 10KRPM 150GB HDD on JMicron 20360/20363 AHCI

Netconsoled dump:

[ 724.350222] general protection fault: 0000 [1] SMP
[ 724.350413] CPU 1
[ 724.350520] Modules linked in: usb_storage libusual netconsole binfmt_misc rfcomm l2cap bluetooth ppdev capability commoncap acpi_cpufreq cpufreq_stats cpufreq_userspace cpufreq_ondemand cpufreq_conservative cpufreq_powersave freq_table video container battery dock asus_acpi ac sbs button af_packet nls_utf8 ntfs w83627ehf i2c_isa parport_pc lp parport fuse mt2060 snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm cx22702 snd_seq_dummy snd_seq_oss dvb_usb_dib0700 dib7000m dib7000p dvb_usb cx88_dvb cx88_vp3054_i2c snd_seq_midi snd_rawmidi video_buf_dvb dvb_core ipv6 snd_seq_midi_event snd_seq snd_timer dvb_pll cx8800 cx8802 cx88xx sr_mod ir_common snd_seq_device cdrom i2c_algo_bit dib3000mc dibx000_common tveeprom atl1 usbhid psmouse videodev compat_ioctl32 hid mii i2c_core v4l2_common v4l1_compat btcx_risc video_buf serio_raw snd soundcore pcspkr shpchp pci_hotplug snd_page_alloc intel_agp tsdev evdev ext3 jbd mbcache sg sd_mod pata_jmicron ata_generic ata_piix ahci libata scsi_mod ehci_hcd generic uhci_hcd usbcore thermal processor fan
[ 724.355028] Pid: 199, comm: pdflush Not tainted 2.6.22-rc5-edge #1
[ 724.355305] RSP: 0018:ffff8101322e7bb0 EFLAGS: 00010202
[ 724.355394] RAX: 0000000000000000 RBX: 000000009d8145bd RCX: 0000000000001000
[ 724.355491] RDX: 000000009d8145bd RSI: 908553557cc5eb6f RDI: ffff81012e1052a0
[ 724.355587] RBP: 000000003b028b7a R08: 0000000000000000 R09: ffffffff880f1ba0
[ 724.355684] R10: 0000000000000000 R11: 0000000000000001 R12: 000000009d8145bd
[ 724.355780] R13: 908553557cc5eb6f R14: ffff8100369a5200 R15: 0000000000000000
[ 724.357278] FS: 0000000000000000(0000) ...
Already done. The filesystem came back as clean after the first oops, but I forced a recheck with fsck to be safe - it found no problems. This is reproducible on a clean filesystem. -- Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/ PhD Student Imperial College London -
Following up on this, I've now extracted another oops (at the bottom of this mail). The common factor here seems to be the buffer_head circular list leading to invalid pointers in bh->b_this_page.

I'm beginning to suspect the Attansic L1 Gigabit Ethernet driver (marked as EXPERIMENTAL in 2.6.22-rc5). I can't reproduce these panics on disk-to-disk copies or SCP across the localhost interface. However, SCP from a server onto either of two different HDDs hits these oopses fairly quickly.

Is it even possible for the Ethernet driver to corrupt ext3 data structures, short of trashing memory?

[ 628.135241] general protection fault: 0000 [1] SMP
[ 628.135422] CPU 1
[ 628.135522] Modules linked in: usb_storage libusual netconsole binfmt_misc rfcomm l2cap bluetooth ppdev capability commoncap acpi_cpufreq cpufreq_stats cpufreq_userspace cpufreq_ondemand cpufreq_conservative cpufreq_powersave freq_table video container battery dock asus_acpi ac sbs button af_packet ipv6 nls_utf8 ntfs w83627ehf i2c_isa parport_pc lp parport fuse snd_hda_intel snd_pcm_oss snd_mixer_oss mt2060 snd_pcm snd_seq_dummy cx22702 snd_seq_oss cx88_dvb cx88_vp3054_i2c video_buf_dvb snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq cx8800 nls_cp437 dvb_usb_dib0700 dib7000m dib7000p dvb_usb cx8802 cx88xx dvb_core cifs ir_common dvb_pll i2c_algo_bit dib3000mc nvidia(P) dibx000_common snd_timer tveeprom atl1 compat_ioctl32 i2c_core videodev mii psmouse snd_seq_device v4l1_compat video_buf v4l2_common btcx_risc pcspkr shpchp snd soundcore snd_page_alloc intel_agp pci_hotplug serio_raw tsdev evdev sr_mod cdrom ext3 jbd mbcache sg sd_mod pata_jmicron usbhid hid ata_generic ata_piix ahci libata scsi_mod generic ehci_hcd uhci_hcd usbcore thermal processor fan
[ 628.139866] Pid: 201, comm: kswapd0 Tainted: P 2.6.22-rc5-edge #1
[ 628.139952] RIP: 0010:[<ffffffff802929ee>] [<ffffffff802929ee>] free_block+0x10e/0x160
[ 628.140108] RSP: 0018:ffff8101322ebaf0 EFLAGS: 00010046
[ 628.140190] RAX: ...
On Sat, 23 Jun 2007 13:14:40 +0100 "Jay L. T. Cornwall" <jay@esuna.co.uk> wrote:

That sounds like a good theory: you're getting easily-hit oopses in one of the kernel's most-used codepaths, which hasn't changed much in a long time.

As for whether the Ethernet driver could be corrupting ext3 data structures: I suppose so.

I'd suggest that you enable every kernel debugging feature you can get your hands on (in the Kernel Hacking menu) and see if that turns anything up. Failing that, if you can whack a different network card in that machine, it would help to confirm or deny your suspicion. -
* From: Andrew Morton
* Newsgroups: linux.kernel

Maybe this time it's just "Tainted: P"?

|-*- <467D0EB0.9030100@esuna.co.uk> -*-
... i2c_algo_bit dib3000mc nvidia(P) dibx000_common snd_timer tveeprom atl1 compat_ioctl32 i2c_core videodev mii psmouse snd_seq_device v4l1_compat video_buf v4l2_common btcx_risc pcspkr shpchp snd soundcore snd_page_alloc intel_agp pci_hotplug serio_raw tsdev evdev sr_mod cdrom ext3 jbd mbcache sg sd_mod pata_jmicron usbhid hid ata_generic ata_piix ahci libata scsi_mod generic ehci_hcd uhci_hcd usbcore thermal processor fan
[ 628.139866] Pid: 201, comm: kswapd0 Tainted: P ...
|-*-

And this oops has no ext3 in it, unlike the previous one.

[ as you know we have no automatic noise tracking system, and ]
[ developers were not so productive in the last discussion of it ]

Jay, check your oops for the "Tainted: P" flag, which is not supported here, and don't drop people who assisted you from the CC list.
____ -
That's the NVIDIA module, which isn't doing much with X shut down regardless. It was bad form to forget this, of course, but I know it is unrelated.

This isn't ext3 related and I'm fairly certain drivers/net/atl1 is trashing... something. Perhaps the page table, because:

[ 153.785325] Bad page state in process 'scp'
[ 153.785327] page:ffff81000308d020 flags:0x0040ad41dc050845 mapping:53dfe57d17cc59cf mapcount:16885953 count:292554304
[ 153.785329] Trying to fix it up, but a reboot is needed

This one rules out a reference counting issue, because the page data here looks like garbage. And a panic in VLC, playing a video across the network, hits a similar problem:

[ 9194.281809] [<ffffffff802849e3>] page_remove_rmap+0x53/0x110
[ 9194.281819] [<ffffffff8027c32c>] unmap_vmas+0x4ec/0x7c0
[ 9194.281852] [<ffffffff802807ac>] unmap_region+0xcc/0x170
[ 9194.281867] [<ffffffff8028160a>] do_munmap+0x22a/0x2f0
[ 9194.281877] [<ffffffff80439ee2>] __down_write_nested+0x12/0xb0
[ 9194.281892] [<ffffffff802ef936>] sys_shmdt+0xb6/0x150
[ 9194.281903] [<ffffffff80209e8e>] system_call+0x7e/0x83
[ 9194.281921]
[ 9194.281924]
[ 9194.281925] Code: 48 2b ba 98 21 00 00 48 c1 ff 03 48 0f af f8 48 03 ba a8 21

My apologies, I had thought the etiquette was to only include maintainers on the CC list. I'll try to locate a maintainer for the Attansic driver a bit later, but I've only seen people loosely related to it.

In any case we may as well let this thread die, because it's not related to a filesystem bug (which the CC list is presumably interested in).

-- Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/ PhD Student Imperial College London -
The last oops log was tainted (as the subject reflects); before that I saw ext3 and the "run fsck" reply. So what is needed is a really clean oops log with everything OK.

I see you are in Windows now, but I would like to ask you to make a test case using `netcat' or `curl'. If the hardware is in trouble, stressing the network could probably trigger that. And a clean, *single* test script with no X (or other stuff) will surely help.

[ Netiquette here is to be a voluntary noise filter after joining any ]
[ thread, because reply-to-all is the way of communication on the LKML ]
____ -
On Sat, 23 Jun 2007 13:14:40 +0100 How much RAM is installed in your machine? If it's 4GB or more, does your problem go away if you boot with mem=3000M? Jay -
Intriguing. Yes, this machine has 4GB of RAM. If I boot with mem=3000M the problem does indeed go away - I can't induce an oops even after transferring tens of GB across the interface. I'm not sure I follow why that would be the case, except that it relates to pci_map_page behaviour. But I guess you have an inkling? -- Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/ PhD Student Imperial College London -
On Sun, 24 Jun 2007 21:31:36 +0100

For reasons not yet clear to me, it appears the L1 driver has a bug or the device itself has trouble with DMA in high memory. This patch, drafted by Luca Tettamanti, is being explored as a workaround. I'd be interested to know if it fixes your problem.

[Aside: For future reference, atl1-devel@lists.sourceforge.net is a mailing list devoted to L1 driver development.]

Jay

diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 6862c11..a600601 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -2104,15 +2104,12 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	if (err)
 		return err;
 
-	err = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
+	err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
 	if (err) {
-		err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
-		if (err) {
-			dev_err(&pdev->dev, "no usable DMA configuration\n");
-			goto err_dma;
-		}
-		pci_using_64 = false;
+		dev_err(&pdev->dev, "no usable DMA configuration\n");
+		goto err_dma;
 	}
+	pci_using_64 = false;
 	/* Mark all PCI regions associated with PCI device
 	 * pdev as being reserved by owner atl1_driver_name
 	 */
-
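To illustrate what the mask change in the patch above amounts to, here is a minimal userspace sketch in C. It is not driver code and does not use any kernel API: dma_bit_mask() and dma_addr_reachable() are hypothetical helpers made up for this illustration. They only model the idea that a 32-bit DMA mask restricts the device to bus addresses below 4GB, while a 64-bit mask places no such restriction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: build an n-bit address mask,
 * e.g. 32 -> 0x00000000ffffffff, 64 -> all ones. */
static uint64_t dma_bit_mask(unsigned int bits)
{
	return bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* Hypothetical helper: true if every byte of [addr, addr + len)
 * lies at or below the highest address the mask allows. */
static bool dma_addr_reachable(uint64_t addr, uint64_t len, uint64_t mask)
{
	uint64_t last;

	if (len == 0)
		return true;
	last = addr + (len - 1);
	if (last < addr)	/* address range wrapped around */
		return false;
	return last <= mask;
}
```

Under this model, a page at physical address 0x100000000 (the first byte above 4GB) is reachable with dma_bit_mask(64) but not with dma_bit_mask(32), which is the situation the 32-bit mask keeps the device out of.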
Yes, it certainly seems to. Now running with this patch and 4GB active, I've transferred about 15GB with no problem so far. It usually oopses after a GB or two. I guess it's not an ideal solution, architecturally speaking, but it's a good deal better than an unstable driver. If there's any other patches you'd like me to test or traces to capture, I'm happy to help out. Otherwise I'll run with this one for now since it does the job! Thanks. -- Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/ PhD Student Imperial College London -
Hi Jeff,
a couple of users reported hard lockups when using L1 NICs on machines
with 4GB or more of RAM. We're still waiting for official confirmation from
the vendor, but it seems that the L1 has problems doing DMA to/from high
memory (physical addresses above the 4GB limit). Passing a 32-bit DMA mask
cures the problem.
Signed-Off-By: Luca Tettamanti <kronos.it@gmail.com>
---
I think that the patch should be included in 2.6.22.
drivers/net/atl1/atl1_main.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 6862c11..a730f15 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -2097,21 +2097,16 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	struct net_device *netdev;
 	struct atl1_adapter *adapter;
 	static int cards_found = 0;
-	bool pci_using_64 = true;
 	int err;
 	err = pci_enable_device(pdev);
 	if (err)
 		return err;
-	err = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
+	err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
 	if (err) {
-		err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
-		if (err) {
-			dev_err(&pdev->dev, "no usable DMA configuration\n");
-			goto err_dma;
-		}
-		pci_using_64 = false;
+		dev_err(&pdev->dev, "no usable DMA configuration\n");
+		goto err_dma;
 	}
 	/* Mark all PCI regions associated with PCI device
 	 * pdev as being reserved by owner atl1_driver_name
@@ -2176,7 +2171,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	netdev->ethtool_ops = &atl1_ethtool_ops;
 	adapter->bd_number = cards_found;
-	adapter->pci_using_64 = pci_using_64;
 	/* setup the private structure */
 	err = atl1_sw_init(adapter);
@@ -2193,9 +2187,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	 */
 	/* netdev->features |= NETIF_F_TSO; */
-	if (pci_using_64)
-		netdev->features |= NETIF_F_HIGHDMA;
-
 	netdev->features |= NETIF_F_LLTX;
 	/*
Luca
--
I still haven't figured out whether my dog ...

What boards have we seen this on? It's quite possible this is:

a) an iommu-related problem specific to AMD or specific to Intel
b) a BIOS problem that atl1 happens to be a victim of

I'd rather not disable this unconditionally if we can get more information about why it's breaking. Doing so might just end up covering up the most obvious manifestation of a larger problem.

-- Chris -
I can reproduce on an Asus P5K with a Core 2 Duo E6600. lspci identifies the controller as: 02:00.0 Ethernet controller: Attansic Technology Corp. L1 Gigabit Ethernet Adapter (rev b0) dmesg notes the PCI-DMA mapping implementation: PCI-DMA: Using software bounce buffering for IO (SWIOTLB) -- Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/ PhD Student Imperial College London -
I had a hunch this was on Intel. I'd rather just disable this when swiotlb is in use, unless we get more complaints. It's probably ultimately a BIOS quirk anyway. -- Chris -
On Mon, 25 Jun 2007 17:57:20 -0400

So far we have reports from both camps:

Asus M2N8-VMX (AM2): 1 report of lockup
http://sourceforge.net/mailarchive/forum.php?thread_name=46780384.063603.26165%40m12-1...

Asus P5K (LGA775): 2 reports of lockups
http://sourceforge.net/mailarchive/forum.php?thread_name=467E7E34.4010603%40gmail.com&...
http://lkml.org/lkml/2007/6/25/107

The common denominator in these reports is 4GB RAM. -
Although it's possible this device doesn't really support 64-bit, it's more likely that this is a platform problem or a driver bug of some sort. In the driver, maybe it has a problem when a buffer -crosses- a 4GB boundary, which is not uncommon. Jeff -
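Jeff's point about crossing a 4GB boundary can be made concrete with a small userspace sketch in C. This is only an illustration of the arithmetic, not something taken from the atl1 driver: crosses_4g_boundary() is a hypothetical name, and whether the L1 hardware actually has such a limitation is not established anywhere in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the buffer [addr, addr + len) starts in
 * one 4GB window and ends in another, i.e. the upper 32 bits of its
 * first and last byte addresses differ. A buffer lying entirely above
 * 4GB does not count as crossing. Address wrap-around is not handled
 * specially in this sketch. */
static bool crosses_4g_boundary(uint64_t addr, uint64_t len)
{
	uint64_t last;

	if (len == 0)
		return false;
	last = addr + (len - 1);
	return (addr >> 32) != (last >> 32);
}
```

A device with this kind of limitation can handle a buffer that ends exactly at 0xffffffff, or one that sits wholly above 4GB, but not one that straddles the two; a driver working around it would typically split or reallocate such a buffer.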
I'm going on the record to say I don't trust the chipsets on these boards, and I'd like anyone having these problems to let us (atl1-devel@lists.sourceforge.net) know if they encounter similar problems with any other hardware. That said, I'm not going to stand in the way of stability just because it *might* be someone else's fault. I don't think limiting ourselves to dma32, at least while we track this down, is much of a loss on current hardware. Acked-By: Chris Snook <csnook@redhat.com> -
I don't follow you :| What kind of "common" mistakes should we check for in the driver? Luca -
On Mon, 25 Jun 2007 23:18:55 +0200 Acked-by: Jay Cliburn <jacliburn@bellsouth.net> -
Limiting the driver to dma32 may cause a "bounce" (i.e. data is copied to another buffer in lower memory) when an skb is allocated in high memory. Furthermore - at least on AMD systems - it should be possible to use the IOMMU to remap the memory to a bus address < 4GB.

Xiong, can you comment on this issue? To recap: users are seeing hard locks when the L1 driver does a DMA to/from a high memory area (physical address > 4GB). Limiting DMA to the lower 4GB with:

pci_set_dma_mask(pdev, DMA_32BIT_MASK);

cures the issue. Does the L1 have any known problems decoding 64-bit addresses?

Luca -
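The "bounce" Luca describes can be modelled with a toy userspace sketch in C. Everything here is an assumption made for illustration: struct sim_buf, SIM_BUF_SIZE and sim_map_for_device() are invented names, the bus addresses are just numbers, and the real kernel bounce-buffer code (SWIOTLB, as seen in Jay Cornwall's dmesg) is considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_BUF_SIZE 64

/* Toy stand-in for a DMA buffer: a pretend bus address plus payload. */
struct sim_buf {
	uint64_t bus_addr;
	size_t len;
	unsigned char data[SIM_BUF_SIZE];
};

/*
 * Pick the buffer the device would be handed for a transmit.
 * If src lies (even partly) above dma_mask, its payload is copied into
 * the low "bounce" buffer and that buffer is returned instead.
 * Returns NULL if the bounce buffer cannot be used either.
 * Address wrap-around is ignored in this sketch.
 */
static const struct sim_buf *sim_map_for_device(const struct sim_buf *src,
						struct sim_buf *bounce,
						uint64_t dma_mask,
						bool *bounced)
{
	*bounced = false;

	/* Device can reach the source directly: no copy needed. */
	if (src->len == 0 || src->bus_addr + (src->len - 1) <= dma_mask)
		return src;

	/* The bounce buffer must be big enough and itself reachable. */
	if (src->len > SIM_BUF_SIZE ||
	    bounce->bus_addr + (src->len - 1) > dma_mask)
		return NULL;

	memcpy(bounce->data, src->data, src->len);
	bounce->len = src->len;
	*bounced = true;
	return bounce;
}
```

The extra memcpy() is the cost of limiting the device to 32-bit addressing; an IOMMU remap, where available, avoids the copy by giving the high page a low bus address instead.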
On 22/06/07, Chuck Ebbert <cebbert@redhat.com> wrote: I agree that running fsck on the filesystem is a good idea, but still, even a corrupt filesystem should never be able to cause an Oops. In fact, nothing done from userspace should be able to cause an Oops. -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html -
