Hello,
I think I hit a bug in the in-kernel e1000e driver. I have 3 machines running 2.6.28.x kernels, and the network suddenly stopped working on all of them. I have KVM access, so I logged in and found the following in dmesg:
Mar 2 20:48:16 server01 [4477887.193970] ------------[ cut here ]------------
Mar 2 20:48:16 server01 [4477887.193973] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x205/0x220()
Mar 2 20:48:16 server01 [4477887.193976] NETDEV WATCHDOG: eth0 (e1000e): transmit timed out
Mar 2 20:48:16 server01 [4477887.193977] Modules linked in: nfsd lockd auth_rpcgss sunrpc exportfs e1000e
Mar 2 20:48:16 server01 [4477887.193985] Pid: 0, comm: swapper Not tainted 2.6.28 #1
Mar 2 20:48:16 server01 [4477887.193987] Call Trace:
Mar 2 20:48:16 server01 [4477887.193989] [] warn_slowpath+0xcd/0x110
Mar 2 20:48:16 server01 [4477887.193998] [] __inet_lookup_established+0xc0/0x1b0
Mar 2 20:48:16 server01 [4477887.194002] [] dev_queue_xmit+0xe3/0x4c0
Mar 2 20:48:16 server01 [4477887.194004] [] ip_queue_xmit+0x18f/0x390
Mar 2 20:48:16 server01 [4477887.194008] [] __alloc_skb+0x8a/0x150
Mar 2 20:48:16 server01 [4477887.194011] [] copy_skb_header+0xe/0x90
Mar 2 20:48:16 server01 [4477887.194015] [] tcp_enter_cwr+0xae/0xe0
Mar 2 20:48:16 server01 [4477887.194017] [] tcp_transmit_skb+0x3bb/0x710
Mar 2 20:48:16 server01 [4477887.194020] [] lock_timer_base+0x34/0x70
Mar 2 20:48:16 server01 [4477887.194024] [] strlcpy+0x4e/0x80
Mar 2 20:48:16 server01 [4477887.194027] [] dev_watchdog+0x205/0x220
Mar 2 20:48:16 server01 [4477887.194030] [] sk_reset_timer+0xf/0x20
Mar 2 20:48:16 server01 [4477887.194033] [] tcp_write_timer+0x3bd/0x610
Mar 2 20:48:16 server01 [4477887.194037] [] hrtimer_run_pending+0x29/0xc0
Mar 2 20:48:16 server01 [4477887.194040] [] dev_watchdog+0x0/0x220
Mar 2 20:48:16 server01 [4477887.194042] [] run_timer_softirq+0x15f/0x1c0
Mar 2 20:48:16 server01 [4477887.194045] [] __do_softirq+0x9c/0x170
Mar 2 20:48:16 server01 [4477887.194049] [] call_softirq+0x1c/0x30
Mar 2 20:48:16 server01 [4477887.194051] [] do_softirq+0x35/0x70
Mar 2 20:48:16 server01 [4477887.194055] [] smp_apic_timer_interrupt+0x85/0xd0
Mar 2 20:48:16 server01 [4477887.194058] [] apic_timer_interrupt+0x6b/0x70
Mar 2 20:48:16 server01 [4477887.194059] [] blk_backing_dev_unplug+0x0/0x10
Mar 2 20:48:16 server01 [4477887.194066] [] mwait_idle+0x41/0x50
Mar 2 20:48:16 server01 [4477887.194068] [] cpu_idle+0x3a/0x70
Mar 2 20:48:16 server01 [4477887.194070] ---[ end trace 02979ba9c60bb4f2 ]---
The second server:
Mar 2 20:49:17 server02 [75405.093739] ------------[ cut here ]------------
Mar 2 20:49:17 server02 [75405.093744] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x205/0x220()
Mar 2 20:49:17 server02 [75405.093746] NETDEV WATCHDOG: eth0 (e1000e): transmit timed out
Mar 2 20:49:17 server02 [75405.093748] Modules linked in: nfsd lockd auth_rpcgss sunrpc exportfs e1000e
Mar 2 20:49:17 server02 [75405.093758] Pid: 0, comm: swapper Not tainted 2.6.28 #1
Mar 2 20:49:17 server02 [75405.093760] Call Trace:
Mar 2 20:49:17 server02 [75405.093763] [] warn_slowpath+0xcd/0x110
Mar 2 20:49:17 server02 [75405.093773] [] __ip_route_output_key+0x16f/0x9f0
Mar 2 20:49:17 server02 [75405.093777] [] source_load+0x37/0x70
Mar 2 20:49:17 server02 [75405.093781] [] __next_cpu+0x19/0x30
Mar 2 20:49:17 server02 [75405.093785] [] find_busiest_group+0x19a/0x840
Mar 2 20:49:17 server02 [75405.093789] [] strlcpy+0x4e/0x80
Mar 2 20:49:17 server02 [75405.093793] [] dev_watchdog+0x205/0x220
Mar 2 20:49:17 server02 [75405.093797] [] read_tsc+0x9/0x20
Mar 2 20:49:17 server02 [75405.093801] [] sk_reset_timer+0xf/0x20
Mar 2 20:49:17 server02 [75405.093806] [] getnstimeofday+0x41/0xc0
Mar 2 20:49:17 server02 [75405.093809] [] hrtimer_run_pending+0x29/0xc0
Mar 2 20:49:17 server02 [75405.093813] [] dev_watchdog+0x0/0x220
Mar 2 20:49:17 server02 [75405.093817] [] run_timer_softirq+0x15f/0x1c0
Mar 2 20:49:17 server02 [75405.093820] [] __do_softirq+0x9c/0x170
Mar 2 20:49:17 server02 [75405.093825] [] call_softirq+0x1c/0x30
Mar 2 20:49:17 server02 [75405.093828] [] do_softirq+0x35/0x70
Mar 2 20:49:17 server02 [75405.093832] [] smp_apic_timer_interrupt+0x85/0xd0
Mar 2 20:49:17 server02 [75405.093836] [] apic_timer_interrupt+0x6b/0x70
Mar 2 20:49:17 server02 [75405.093838] [] mwait_idle+0x41/0x50
Mar 2 20:49:17 server02 [75405.093844] [] cpu_idle+0x3a/0x70
Mar 2 20:49:17 server02 [75405.093846] ---[ end trace fec860feec4b0477 ]---
The third server:
Mar 2 20:53:49 server03 [1237595.966215] ------------[ cut here ]------------
Mar 2 20:53:49 server03 [1237595.966220] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x20e/0x220()
Mar 2 20:53:49 server03 [1237595.966222] NETDEV WATCHDOG: eth0 (e1000e): transmit timed out
Mar 2 20:53:49 server03 [1237595.966224] Modules linked in: nfsd lockd sunrpc exportfs reiserfs e1000e
Mar 2 20:53:49 server03 [1237595.966232] Pid: 0, comm: swapper Not tainted 2.6.28.4mce #2
Mar 2 20:53:49 server03 [1237595.966234] Call Trace:
Mar 2 20:53:49 server03 [1237595.966236] [] warn_slowpath+0xcd/0x110
Mar 2 20:53:49 server03 [1237595.966245] [] scsi_mode_select+0x190/0x1d0
Mar 2 20:53:49 server03 [1237595.966248] [] source_load+0x37/0x70
Mar 2 20:53:49 server03 [1237595.966252] [] __next_cpu+0x19/0x30
Mar 2 20:53:49 server03 [1237595.966254] [] find_busiest_group+0x19a/0x840
Mar 2 20:53:49 server03 [1237595.966258] [] strlcpy+0x4e/0x80
Mar 2 20:53:49 server03 [1237595.966261] [] dev_watchdog+0x20e/0x220
Mar 2 20:53:49 server03 [1237595.966264] [] read_tsc+0x12/0x40
Mar 2 20:53:49 server03 [1237595.966269] [] getnstimeofday+0x48/0xe0
Mar 2 20:53:49 server03 [1237595.966272] [] tcp_write_timer+0x96/0x630
Mar 2 20:53:49 server03 [1237595.966276] [] _spin_lock_irq+0xd/0x10
Mar 2 20:53:49 server03 [1237595.966278] [] dev_watchdog+0x0/0x220
Mar 2 20:53:49 server03 [1237595.966282] [] run_timer_softirq+0x16f/0x1e0
Mar 2 20:53:49 server03 [1237595.966285] [] __do_softirq+0x9c/0x180
Mar 2 20:53:49 server03 [1237595.966288] [] call_softirq+0x1c/0x30
Mar 2 20:53:49 server03 [1237595.966291] [] do_softirq+0x49/0x90
Mar 2 20:53:49 server03 [1237595.966295] [] smp_apic_timer_interrupt+0x85/0xd0
Mar 2 20:53:49 server03 [1237595.966298] [] apic_timer_interrupt+0x88/0x90
Mar 2 20:53:49 server03 [1237595.966299] [] blk_backing_dev_unplug+0x0/0x10
Mar 2 20:53:49 server03 [1237595.966306] [] mwait_idle+0x41/0x50
Mar 2 20:53:49 server03 [1237595.966309] [] cpu_idle+0x41/0x70
Mar 2 20:53:49 server03 [1237595.966311] ---[ end trace e4cfb04005031965 ]---
The network stayed down, and roughly every 10-15 seconds dmesg logged another line like these:
Mar 2 20:55:49 [1237716.131221] 0000:04:00.0: eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 2 20:56:02 [1237728.734761] 0000:04:00.0: eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 2 20:56:10 [1237736.794499] 0000:04:00.0: eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
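To double-check the interval between these link renegotiations, here is a minimal Python sketch that pulls the kernel timestamps (the number in square brackets) out of the three lines quoted above and prints the gaps between them. The function name link_up_intervals is my own, not part of any tool, and the regex assumes the dmesg line format shown here. For these three lines the gaps come out at about 12.6 and 8.1 seconds.

```python
import re

# The three "Link is Up" lines quoted above, copied verbatim.
LOG = """\
Mar 2 20:55:49 [1237716.131221] 0000:04:00.0: eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 2 20:56:02 [1237728.734761] 0000:04:00.0: eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 2 20:56:10 [1237736.794499] 0000:04:00.0: eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
"""


def link_up_intervals(log_text):
    """Return the seconds between consecutive 'Link is Up' messages.

    Hypothetical helper: it relies on the bracketed kernel timestamp
    (seconds since boot), not on the syslog wall-clock prefix.
    """
    stamps = [
        float(m.group(1))
        for m in re.finditer(r"\[\s*(\d+\.\d+)\]\s.*Link is Up", log_text)
    ]
    # Pair each timestamp with the next one and take the difference.
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in link_up_intervals(LOG):
        print(f"{gap:.3f} s between link-up events")
```

The same function can be pointed at a full log, for example link_up_intervals(open("/var/log/messages").read()), assuming that is where your syslog writes kernel messages.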
The servers were streaming video and were at peak load when this happened. The other 10 servers, which run older kernels, were fine; only the ones running 2.6.28.x were affected.