Hi,

I've found a strange problem either in the arcmsr driver, or maybe in the areca-1660 card...

When a system running on a SAS disk RAID connected to an areca-1660 card gets under heavy I/O load, it becomes unusable after some time. I can reproduce this 100% of the time, although it needs quite specific conditions: it can be reproduced on a 2x quad-core machine, and RAM has to be limited to ~192MB to cause heavy paging. The only thing needed to trigger the problem is to start a loop doing kernel compilation using make -j 8; this loads the system heavily because of the lack of memory. After a few correct compile runs the system gets into a state where all programs, including the basic ones (ls, cp, ..), start crashing... dmesg (when it works) doesn't say anything strange... After a reboot, the system is OK again.

I have tested it on different motherboards, with different CPUs and RAM (all properly tested with memtest), with two different Areca cards and different drives. I can't reproduce the problem on the same hardware when using a different RAID card (e.g. Adaptec). All test systems were properly cooled.

I have tried all available Areca firmwares, two different distributions (Oracle Linux and CentOS), and kernels ranging from the distribution ones to the latest git snapshot.

Could somebody please give me some hints on how to hunt this problem down? Areca support doesn't seem to be very interested in the problem :-(

Thanks a lot in advance
BR
nik

-------------------------------------
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava
tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------
--
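The reproduction loop described above could be sketched roughly as follows. This is only a minimal sketch, not the script actually used in the report: the function name run_until_failure and the run limit are hypothetical, and it assumes it is started from an already configured kernel source tree on a machine whose RAM has been limited beforehand (for example with a "mem=" boot option).

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it fails
# or until MAX_RUNS successful runs have completed.
# usage: run_until_failure MAX_RUNS COMMAND [ARGS...]
run_until_failure() {
    max_runs=$1
    shift
    run=0
    while [ "$run" -lt "$max_runs" ]; do
        run=$((run + 1))
        if ! "$@"; then
            # First failing run: report it and stop so the state can be inspected.
            echo "run $run failed"
            return 1
        fi
        echo "run $run ok"
    done
    return 0
}

# Example invocation (assumption: the current directory is a configured kernel tree):
#   run_until_failure 100 sh -c 'make clean >/dev/null && make -j 8 >/dev/null'
```

Stopping at the first failed build, instead of looping forever, leaves the machine in the broken state so that diagnostics can be collected right away.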
(cc's added)

Please get the machine into this state of memory exhaustion, then take copies of the output of the following, and send them via reply-to-all to this email:

- cat /proc/meminfo
- cat /proc/slabinfo
- dmesg -c > /dev/null ; echo m > /proc/sysrq-trigger ; dmesg -c

Thanks.
--
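The three commands requested above could be wrapped into one small script so that everything ends up in files ready to attach. This is a sketch under assumptions: the function name collect_diag, the optional "sysrq" argument, the output file names, and the PROC_ROOT override are all hypothetical conveniences and not part of the original request.

```shell
#!/bin/sh
# Hypothetical helper: save the requested diagnostics into OUTDIR.
# usage: collect_diag OUTDIR [sysrq]
# PROC_ROOT (hypothetical override) defaults to /proc.
# With "sysrq" as second argument it also triggers the SysRq-m memory dump;
# that step needs root and clears the kernel ring buffer (dmesg -c).
collect_diag() {
    outdir=$1
    mode=${2:-}
    proc=${PROC_ROOT:-/proc}
    mkdir -p "$outdir" || return 1
    cat "$proc/meminfo" > "$outdir/meminfo" || return 1
    # slabinfo is usually readable by root only, so do not treat failure as fatal
    cat "$proc/slabinfo" > "$outdir/slabinfo" 2>/dev/null ||
        echo "slabinfo not readable (root needed?)" >&2
    if [ "$mode" = "sysrq" ]; then
        dmesg -c > /dev/null              # discard old kernel messages
        echo m > "$proc/sysrq-trigger"    # ask the kernel to dump memory info
        dmesg -c > "$outdir/sysrq-m.txt"  # keep only the fresh dump
    fi
    return 0
}

# Example (as root, on the affected machine):
#   collect_diag /tmp/arcmsr-diag sysrq
```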
On Sun, 24 Feb 2008, Andrew Morton wrote:

Hi Andrew,

thanks a lot for the reply, I'm attaching the requested information. Please let me know if you need more information/testing, whatever. I'll be glad to help.

BR
--
Alas, that all looks OK to me. You never get any out-of-memory messages, and no oom-killing messages? Possibly what is happening here is that in this low-memory condition, some of the driver's internal memory-allocation attempts are failing, and the driver isn't correctly handling this. This is a rare situation which may well not have been hit in anyone else's testing. I expect that the Areca engineers will be able to reproduce this with a suitably small "mem=" kernel boot option. If not, they could perhaps investigate the kernel's fault-injection framework, which permits simulation of page allocation failures. --
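For the fault-injection route mentioned above, page allocation failures are configured through debugfs. The following is only a sketch under assumptions: it presumes a kernel built with CONFIG_FAULT_INJECTION, CONFIG_FAIL_PAGE_ALLOC and CONFIG_FAULT_INJECTION_DEBUG_FS, debugfs mounted at /sys/kernel/debug, and root privileges. The function name and the default values are hypothetical and would need tuning for a real test.

```shell
#!/bin/sh
# Hypothetical helper: enable simulated page allocation failures.
# usage: configure_fail_page_alloc [DIR] [PROBABILITY_PERCENT] [TIMES]
configure_fail_page_alloc() {
    dir=${1:-/sys/kernel/debug/fail_page_alloc}
    probability=${2:-10}   # percent of eligible allocations that fail
    times=${3:-100}        # stop injecting after this many failures
    if [ ! -d "$dir" ]; then
        echo "fault injection not available at $dir" >&2
        return 1
    fi
    echo "$probability" > "$dir/probability" &&
    echo 1 > "$dir/interval" &&     # consider every allocation
    echo "$times" > "$dir/times" &&
    echo 1 > "$dir/verbose"         # log each injected failure to the kernel log
}

# Example (as root): fail about 5% of page allocations, at most 50 times
#   configure_fail_page_alloc /sys/kernel/debug/fail_page_alloc 5 50
```

Capping the number of injected failures keeps the test machine usable after the run; writing 0 to probability switches injection off again.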
Hi Andrew,

no, right now I have the machine in the weird state: swap (3GB) is empty, and so is the bigger part of RAM (~100MB free), yet gcc crashes even when trying to compile a C program with an empty main function. So it doesn't seem to be a problem with memory exhaustion.

Hopefully the Areca guys will be able to find out what is going on. But anyway, if you have any other idea what I should check/try, please let me know, as I have to admit that I'd really like to hunt it down myself (and yes, there is some vanity on my side here :))

thanks a lot once more

cheers
nik

On Tue, 26 Feb 2008, --
--
Maybe memory fragmentation? Perhaps the driver tries to allocate a large block of memory and cannot find a contiguous block of the right size. Maybe the driver developers used different kernel .config options than you are using.

Try increasing the value in /proc/sys/vm/min_free_kbytes. Try switching some things like SLAB or SLUB, try booting with kernelcore=512M to enable the Movable memory zone, or try 64-bit vs 32-bit kernels.

--
Zan Lynx <zlynx@acm.org>
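Raising min_free_kbytes, as suggested above, is a single write to a /proc file. A minimal sketch, assuming root privileges; the function name and the doubling factor are hypothetical, and a sensible value depends on the machine:

```shell
#!/bin/sh
# Hypothetical helper: multiply the current min_free_kbytes value by FACTOR.
# usage: raise_min_free_kbytes [FILE] [FACTOR]
raise_min_free_kbytes() {
    file=${1:-/proc/sys/vm/min_free_kbytes}
    factor=${2:-2}
    current=$(cat "$file") || return 1
    new=$((current * factor))
    echo "$new" > "$file" || return 1
    # Print old and new value so the change can be noted in the test log.
    echo "$current -> $new"
}

# Example (as root):
#   raise_min_free_kbytes
```

The change does not survive a reboot, which is convenient when comparing runs with and without it.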
Hi Nikola,

Please keep support@areca.com.tw in the loop. I am sure Kevin from Areca support has taken over your case. If you like, please let him know your configuration and the steps you perform, so that both sides stay in sync.

Thank you for your patience, and sorry for the inconvenience,

-----Original Message-----
From: Zan Lynx [mailto:zlynx@acm.org]
Sent: Wednesday, February 27, 2008 5:04 AM
To: Nikola Ciprich
Cc: Andrew Morton; linux-kernel@vger.kernel.org; linux-scsi@vger.kernel.org; Nick Cheng; Erich Chen; kopi@linuxbox.cz
Subject: Re: arcmsr & areca-1660 - strange behaviour under heavy load
--
Hi Nikola,

As I said, we will test this at our site. Our support team will help you resolve the issue.

Sorry for the inconvenience,

-----Original Message-----
From: Nikola Ciprich [mailto:extmaillist@linuxbox.cz]
Sent: Tuesday, February 26, 2008 5:36 PM
To: Andrew Morton
Cc: linux-kernel@vger.kernel.org; linux-scsi@vger.kernel.org; Nick Cheng; Erich Chen; kopi@linuxbox.cz
Subject: Re: arcmsr & areca-1660 - strange behaviour under heavy load
--
