kernel 2.6.24.2 network problem

Submitted by Anonymous
on February 19, 2008 - 6:31pm

Hello guys,
I have a really big problem with the newest kernel. I will try to explain it as simply and accurately as I can:
I run an old kernel (2.6.20) on all of my machines (more than 10 servers, new machines each with 2 quad-core CPUs and 2 Intel gigabit network adapters) and I decided to switch to the new kernel because of the "vmsplice Local Root Exploit". I downloaded the kernel from www.kernel.org and built it with my default configuration (nothing new, nothing experimental, just the options I had in my last kernel). Everything seemed to work at first, but then I noticed that when I copy a big file off a machine running that kernel, the process soon goes to sleep and the transfer stalls. Because I use my own program to copy things to my backup server, I tried sftp to see what would happen, and the result was the same: after a while the ssh process went to sleep in a write operation, write(3,"some data",123213), for an unpredictable time, and the transfer stalled. In all cases netstat shows 232305~260000 in the Send-Q column (the count of bytes not acknowledged by the remote host) for this connection.
It is strange; I never had this problem before. I backed up almost 1 TB every night until I switched to the new kernel 2.6.24.2. Now all servers have this problem when backing up to my 2 backup servers, and barely a few gigabytes reach my backups. 3 days ago I switched two of my servers back to the old kernel, and they back up beautifully with no problems!
So any suggestions?
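
For anyone who wants to spot the stalled connections described above, here is a minimal sketch that filters netstat output for a large Send-Q. The function name large_sendq and the 200000-byte default threshold are assumptions picked for illustration; it expects the usual `netstat -tn` column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address).

```shell
#!/bin/sh
# Print Send-Q, local address and remote address for every TCP connection
# whose Send-Q (bytes not yet acknowledged by the peer) is above a threshold.
# Reads `netstat -tn`-style output on stdin.
# large_sendq and the default threshold are hypothetical, not from the thread.
large_sendq() {
    threshold="${1:-200000}"
    awk -v limit="$threshold" '
        # Only lines that start with tcp/tcp6; skip the header lines.
        $1 ~ /^tcp/ && $3 + 0 > limit + 0 { print $3, $4, $5 }
    '
}

# Usage:
#   netstat -tn | large_sendq 200000
```

A Send-Q that stays at the same high value across several runs, while the sending process sleeps in write(), matches the symptom in this post.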

I have many issues with 2.6.23+ also

on
February 19, 2008 - 8:19pm

I think the decision to maintain "stable" 2.6.16.X and 2.6.22.X trees (which do have your vmsplice root exploit fix as well) means fewer people are testing the most important tree, which is 2.6.24.X. For example, VMware used to always have an unofficial patch for the current kernel tree, but now they don't even bother; their 115 patch will barely run 2.6.23 kernels. So those of us trying to run the current stable tree are now "bleeding edge" experimenters.

I was very critical of Greg KH on this "multiple stable tree issue". We are going to see massive fragmentation to where "the latest stable kernel tree" as announced on kernel.org is going to be an untested joke, particularly with the massive number of patches being accepted per release now.

Of course IBM/Microsoft/Suse's business model fits very well with this concept; they will selectively backport kernel patches to abominations like 2.6.5-suse-patch_attempt_873, etc.

2.6.22.X

Anonymous (not verified)
on
February 22, 2008 - 5:00pm

Yes, 2.6.22.18 is cool: no problems so far, and the machines are really stable.
BTW, with kernel 2.6.24.2 some of the machines had this in dmesg.
I hope this helps some devs:

WARNING: at net/ipv4/tcp_output.c:1799 tcp_simple_retransmit()
Pid: 0, comm: swapper Not tainted 2.6.24.2 #1
[] tcp_simple_retransmit+0x19d/0x1b0
[] tcp_v4_err+0x518/0x520
[] icmp_unreach+0x12b/0x330
[] tcp_v4_rcv+0x726/0x770
[] icmp_rcv+0xec/0x190
[] ip_local_deliver_finish+0x65/0x140
[] ip_local_deliver+0xb7/0xc0
[] enqueue_entity+0x2b/0x50
[] ip_rcv_finish+0xbf/0x340
[] activate_task+0x1e/0x60
[] getnstimeofday+0x3e/0x140
[] ip_rcv+0x1a2/0x2a0
[] netif_receive_skb+0x1f9/0x220
[] e1000_clean_rx_irq+0x1c2/0x540
[] e1000_clean+0x3d/0xc0
[] net_rx_action+0x73/0x160
[] rebalance_domains+0xb3/0x140
[] __do_softirq+0x75/0xf0
[] do_softirq+0x38/0x40
[] do_IRQ+0x3e/0x70
[] common_interrupt+0x23/0x28
[] allocate_slab+0x88/0xe0
[] mwait_idle_with_hints+0x3e/0x50
[] mwait_idle+0x0/0x10
[] cpu_idle+0x73/0x90
=======================
WARNING: at net/ipv4/tcp_input.c:2413 tcp_fastretrans_alert()
Pid: 9029, comm: httpd Not tainted 2.6.24.2 #1
[] tcp_fastretrans_alert+0x526/0x690
[] tcp_ack+0x1d3/0x3e0
[] _spin_lock_bh+0x8/0x20
[] tcp_rcv_established+0x3c0/0x690
[] tcp_v4_do_rcv+0xf3/0x100
[] tcp_prequeue_process+0x50/0x70
[] tcp_recvmsg+0x419/0x760
[] sock_common_recvmsg+0x45/0x70
[] do_sock_read+0x90/0xa0
[] sock_aio_read+0x78/0x90
[] do_sync_read+0xbd/0x110
[] unmap_page_range+0xd4/0x190
[] autoremove_wake_function+0x0/0x50
[] min_pages_to_free+0x15/0x30
[] quicklist_trim+0x33/0x90
[] do_sigaction+0x57/0x1a0
[] copy_to_user+0x32/0x50
[] sys_rt_sigaction+0x98/0xb0
[] vfs_read+0xbe/0xd0
[] sys_read+0x41/0x70
[] sysenter_past_esp+0x5f/0x85
=======================
WARNING: at net/ipv4/tcp_input.c:1675 tcp_enter_frto()
Pid: 0, comm: swapper Not tainted 2.6.24.2 #1
[] tcp_enter_frto+0x2a2/0x2c0
[] tcp_retransmit_timer+0x106/0x410
[] hrtimer_get_softirq_time+0x17/0x80
[] hrtimer_run_queues+0x3f/0xe0
[] tcp_write_timer+0xaf/0xd0
[] run_timer_softirq+0xac/0x180
[] profile_pc+0x36/0x60
[] profile_tick+0x58/0x80
[] __do_softirq+0x75/0xf0
[] do_softirq+0x38/0x40
[] smp_apic_timer_interrupt+0x2a/0x40
[] apic_timer_interrupt+0x28/0x30
[] allocate_slab+0x88/0xe0
[] mwait_idle_with_hints+0x3e/0x50
[] mwait_idle+0x0/0x10
[] cpu_idle+0x73/0x90
=======================

So do not touch kernel 2.6.24.X in a production environment!
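
To check whether another machine has logged the same warnings, a rough sketch like the following counts them per function. The function name count_tcp_warnings is made up for this example, and it assumes the warning lines look like the ones pasted above (function name as the last word on the line).

```shell
#!/bin/sh
# Count "WARNING: at net/ipv4/tcp_*" lines per reporting function.
# Reads dmesg-style text on stdin and prints "<count> <function>" lines,
# sorted by function name. count_tcp_warnings is a hypothetical helper.
count_tcp_warnings() {
    awk '
        # $NF is the last field, e.g. "tcp_enter_frto()".
        /WARNING: at net\/ipv4\/tcp_/ { count[$NF]++ }
        END { for (fn in count) print count[fn], fn }
    ' | sort -k2
}

# Usage:
#   dmesg | count_tcp_warnings
```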

2.6.24.2 wifi issue too

Anonymous (not verified)
on
February 22, 2008 - 8:15pm

2.6.24.2 seems to have caused quite a few regressions in the networking department. There is a bug that is apparently biting a lot of people (including me) trying to use it on recent laptops that have both wired and wifi Broadcom chips. The driver (b44) for the wired one grabs both devices, blocking ndiswrapper's access to the wireless one.

Slowness issue might be caused by FRTO

Brice (not verified)
on
February 29, 2008 - 8:23am

Note that 2.6.24 has frto activated (see the following commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...)

Like you, I learned it the hard way, when some transfers to clients behind a Cisco firewall started to crawl...

It took me three hours of intensive troubleshooting until I found that FRTO was the culprit (in fact I guess the real culprit is the Cisco firewall).

So I suggest you disable FRTO and see if it changes anything:
echo "0" > /proc/sys/net/ipv4/tcp_frto

Hope this helps,
Brice
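
A slightly more careful variant of the command above, as a sketch: it reports the previous value before changing it, so it is easy to restore later. The function name disable_frto and its optional path argument are assumptions for illustration (the argument only exists so the function can be tried on a scratch file); writing to the real file needs root.

```shell
#!/bin/sh
# Set tcp_frto to 0 if it is not already 0, and report old and new values.
# disable_frto is a hypothetical helper; the default path is the one from
# the post above.
disable_frto() {
    frto_file="${1:-/proc/sys/net/ipv4/tcp_frto}"
    old=$(cat "$frto_file") || return 1
    if [ "$old" != "0" ]; then
        echo 0 > "$frto_file" || return 1
    fi
    echo "tcp_frto: was $old, now $(cat "$frto_file")"
}

# Usage (as root):
#   disable_frto
```

A value written under /proc does not survive a reboot. To keep it, the usual approach is a line "net.ipv4.tcp_frto = 0" in /etc/sysctl.conf.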

FRTO caused transfer to stall

Pawel (not verified)
on
March 8, 2008 - 5:40pm

Hi,
Brice, thanks for the hint.
In our network we have some Proxim wireless links connected by an old Cisco switch with VLANs. Additionally, traffic is routed by a Cisco 72xx router (also an old machine). We had serious problems downloading files over these wireless links from servers running 2.6.24 kernels: after a few hundred kB the transfer stalled. These links are known to suffer from packet loss because of radio noise. When packet loss occurred, FRTO somehow caused the whole transfer to stall. After setting echo "0" > /proc/sys/net/ipv4/tcp_frto we are able to download without problems.
