Hi all,

this is a re-send of my original email to e1000-devel@lists.sf.net. It includes all information at once and is sent to LKML as well. I am CC'ing Andrew Pochinsky because http://lkml.org/lkml/2008/6/10/3 is related.

I deployed a Bacula storage director (i.e. a backup target) on a machine which has been running in test mode for some time now. This past night is the first one in which it received significant traffic from several other machines, 62 in total.

My host has four Intel E1000 which are bonded into one virtual interface. The other side is a Cisco 6500.

At 00:00, at a peak rate of 1.2 Gbit/s, I had ten(!) page allocation failures within seconds:

# dmesg | grep __alloc_pages_nodemask
[290645.350781] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290645.351129] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290645.493159] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290645.508153] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290645.660543] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290645.661091] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290645.801266] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290645.818294] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290646.197948] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
[290646.206736] <IRQ> [<ffffffff810afce1>] ? __alloc_pages_nodemask+0x569/0x5e4
#

I am confident that I will be able to reproduce this issue, but did not get around to trying yet. As this is a critical system, the amount of time I will have available to debug this may be limited.

Other relevant information:

The machine is a 64 bit Debian Lenny with an almost current kernel:

Linux host 2.6.32.6 #3 SMP Tue Jan 26 12:39:17 CET 2010 x86_64 GNU/Linux

# lspci
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev ...
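To see how tightly the failures cluster, the dmesg output above can be summarized per second of uptime. This is only a sketch: the function names `count_alloc_failures` and `failures_per_second` are made up for illustration, and it assumes the bracketed `[seconds.microseconds]` timestamp format shown in the report.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any tool).
# Both read kernel log text on stdin, e.g. piped from dmesg.

# Print how many lines mention __alloc_pages_nodemask.
count_alloc_failures() {
    grep -c '__alloc_pages_nodemask'
}

# Print "<count> <uptime second>" for each second that has matching lines.
# Assumes timestamps of the form "[290645.350781]" at the start of the line.
failures_per_second() {
    grep '__alloc_pages_nodemask' \
        | sed -n 's/^\[ *\([0-9]*\)\..*/\1/p' \
        | sort -n \
        | uniq -c
}
```

Usage would be `dmesg | count_alloc_failures` or `dmesg | failures_per_second`. Note that grep matches every stack-trace line containing the symbol, so this counts trace lines, which here happens to be one per failure.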
Other information that might help:

For some reason, the e1000e module was _not_ loaded. Tbh, I am not sure why that was the case, unless e1000e is unrelated to this issue. FWIW, I loaded it, just now.

# for i in /sys/class/net/bond0/bonding/*; do echo -n "$i : "; cat $i; done
/sys/class/net/bond0/bonding/active_slave :
/sys/class/net/bond0/bonding/ad_actor_key : 17
/sys/class/net/bond0/bonding/ad_aggregator : 1
/sys/class/net/bond0/bonding/ad_num_ports : 4
/sys/class/net/bond0/bonding/ad_partner_key : 2
/sys/class/net/bond0/bonding/ad_partner_mac : 00:d0:05:42:cc:00
/sys/class/net/bond0/bonding/ad_select : stable 0
/sys/class/net/bond0/bonding/arp_interval : 0
/sys/class/net/bond0/bonding/arp_ip_target :
/sys/class/net/bond0/bonding/arp_validate : none 0
/sys/class/net/bond0/bonding/downdelay : 0
/sys/class/net/bond0/bonding/fail_over_mac : none 0
/sys/class/net/bond0/bonding/lacp_rate : slow 0
/sys/class/net/bond0/bonding/miimon : 100
/sys/class/net/bond0/bonding/mii_status : up
/sys/class/net/bond0/bonding/mode : 802.3ad 4
/sys/class/net/bond0/bonding/num_grat_arp : 1
/sys/class/net/bond0/bonding/num_unsol_na : 1
/sys/class/net/bond0/bonding/primary :
/sys/class/net/bond0/bonding/slaves : eth0 eth1 eth2 eth3
/sys/class/net/bond0/bonding/updelay : 0
/sys/class/net/bond0/bonding/use_carrier : 1
/sys/class/net/bond0/bonding/xmit_hash_policy : layer2 0

# lsmod
Module                  Size  Used by
xfs                   432237  1
xt_tcpudp               2287  3
nf_conntrack_ipv4       9499  1
nf_defrag_ipv4          1139  1 nf_conntrack_ipv4
xt_state                1303  1
nf_conntrack           46185  2 nf_conntrack_ipv4,xt_state
iptable_filter          1410  1
ip_tables              13568  1 iptable_filter
x_tables               12754  3 xt_tcpudp,xt_state,ip_tables
sr_mod                 12410  0
nfsd                  256208  13
exportfs                3026  2 xfs,nfsd
nfs                   235903  1
lockd                  56807  2 nfsd,nfs
nfs_acl ...
Hi Richard,

On Fri, Feb 26, 2010 at 2:42 AM, Richard Hartmann wrote:

The memory allocation failures (order:0), while unexpected, are not fatal, and the e1000 driver is written to handle failures during allocation. Does something else happen to the system after this, or does operation continue?

You might be able to try the sysctl tweak to reserve a little more memory for driver allocations.

# sysctl vm.min_free_kbytes
# sysctl -e vm.min_free_kbytes=<double what you have>

--
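The "double what you have" step can be scripted so the arithmetic is not done by hand. This is a minimal sketch, not something posted in the thread: `double_kb` is a hypothetical helper name, and the script only prints the command it would run instead of changing the setting.

```shell
#!/bin/sh
# double_kb: hypothetical helper, prints twice the given kilobyte value.
double_kb() {
    echo $(( $1 * 2 ))
}

# /proc/sys/vm/min_free_kbytes is the file behind the vm.min_free_kbytes sysctl.
if [ -r /proc/sys/vm/min_free_kbytes ]; then
    current=$(cat /proc/sys/vm/min_free_kbytes)
    new=$(double_kb "$current")
    echo "current: $current kB, proposed: $new kB"
    # Printed, not executed: applying it requires root and a deliberate decision.
    echo "sysctl -w vm.min_free_kbytes=$new"
fi
```

A value set this way does not survive a reboot; making it permanent would mean adding it to the system's sysctl configuration.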
I can not be sure, but I _think_ some bogus data made it into userspace. I did have some binary in a text string I received & logged, which is a

No. Should I?

Thanks,
Richard
--
In the future, please copy netdev@vger.kernel.org on networking issues.

On Mon, Mar 1, 2010 at 9:34 AM, Richard Hartmann wrote:

Hm, if that did occur it would be bad. But it does sound like

I wouldn't recommend it if you're already having issues getting order:0 allocations; it would just make the problem worse. I wanted to make sure you were not.

Jesse
--
On Mon, Mar 1, 2010 at 18:34, Richard Hartmann wrote:

It was

host:~# sysctl vm.min_free_kbytes
vm.min_free_kbytes = 16174
host:~#

btw.

Richard
--
Can you post your kernel config? Thanks, Emil --
On Mon, Mar 1, 2010 at 19:53, Tantilov, Emil S wrote:

Ugh, sorry... I totally forgot that. Debian stock kernel.

host:~# cat /boot/config-$(uname -r)
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.32.6
# Tue Jan 26 12:38:33 2010
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_CONSTRUCTORS=y
#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not ...
