Re: race in skb_splice_bits?

Previous thread: Adding multiple multicast routing tables. by Ben Greear on Monday, May 26, 2008 - 1:23 pm. (1 message)

Next thread: Re: possible double call of kfree_skb in net/llc/llc_sap.c by Patrick McHardy on Monday, May 26, 2008 - 11:16 pm. (2 messages)
From: Octavian Purdila
Date: Monday, May 26, 2008 - 5:25 pm

Hi,

The following socket lock dropping in skb_splice_bits seems to open a race 

Setup: 

- powerpc, non-SMP, no preemption, 2.6.25
- RX side: LRO enabled, splice from socket to /dev/null; 
- TX side: MTU set to 128 bytes (on the TX side), GSO enabled, splice from 
file to socket

The oops - on the RX side: 

Unable to handle kernel paging request for data at address 0x00000030
Faulting instruction address: 0x80109ee0
Oops: Kernel access of bad area, sig: 11 [#1]
Ixia TCPX
Modules linked in: almfmanager(P) filtermanager ixnam_llm(P) ixna
m_tcpx(P) hwstate ixllm ixhostm ixsysctl(P) nlproc_driver
NIP: 80109ee0 LR: 80109edc CTR: 8010c52c
REGS: bcd25b90 TRAP: 0300   Tainted: P          (2.6.25-00005-gf7b547d)
MSR: 00009032 <EE,ME,IR,DR>  CR: 24000822  XER: 20000000
DAR: 00000030, DSISR: 40000000
TASK = bfbe1bf0[156] 'splice' THREAD: bcd24000
GPR00: 8010c94c bcd25c40 bfbe1bf0 00000000 00000000 802835f8 00000001 0000004c 
GPR08: 00024000 00000100 00000032 bcd24000 00010dc4 100198b4 390046a8 0a5042f3 
GPR16: 8028238c bd18fe00 00000008 10010000 6fbcbac0 00000000 10001060 bcd25dd8 
GPR24: 8014b520 00000000 bcd25e30 bccefa00 bf33e300 fffffe00 bcd25d70 00000000 
NIP [80109ee0] lock_sock_nested+0x1c/0x50
LR [80109edc] lock_sock_nested+0x18/0x50
Call Trace:
[bcd25c60] [8010c94c] skb_splice_bits+0x130/0x134
[bcd25dc0] [8014b548] tcp_splice_data_recv+0x28/0x38
[bcd25dd0] [8014d08c] tcp_read_sock+0x108/0x1f8
[bcd25e20] [8014b58c] __tcp_splice_read+0x34/0x44
[bcd25e40] [8014b61c] tcp_splice_read+0x80/0x220
[bcd25e90] [80105730] sock_splice_read+0x2c/0x44
[bcd25ea0] [8008a374] do_splice_to+0x90/0xac
[bcd25ed0] [8008a850] do_splice+0x258/0x2f0
[bcd25f10] [8008b1d4] sys_splice+0xe0/0xe8
[bcd25f40] [8000ff14] ret_from_syscall+0x0/0x38
 --- Exception: c01 at 0x10000894
     LR = 0x10000e2c


Analysis: 

Printks show that __skb->sk is non-NULL before splice_to_pipe and NULL after. 
Using a hardware watchpoint I was able to see that the write in __skb->sk is 
caused by ...
From: Ben Hutchings
Date: Monday, May 26, 2008 - 7:08 pm

Given the previous comment, that certainly looks wrong.


But this could apparently cause deadlock.  Surely the correct fix is
to copy __skb->sk to a local variable before calling splice_to_pipe()
so we can re-lock it?

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
--

From: Octavian Purdila
Date: Tuesday, May 27, 2008 - 3:41 am

I've tried that, but I think that the freed __skb might be touched later in 
tcp_read_sock:

Faulting instruction address: 0x8014d0c0
Oops: Kernel access of bad area, sig: 11 [#1]
Ixia TCPX
Modules linked in: almfmanager(P) filtermanager ixnam_llm(P) ixna
m_tcpx(P) hwstate ixllm ixhostm ixsysctl(P) nlproc_driver
NIP: 8014d0c0 LR: 8014d090 CTR: 8010c52c
REGS: bcd27d20 TRAP: 0300   Tainted: P          (2.6.25-00005-gf7
b547d-dirty)
MSR: 00009032 <EE,ME,IR,DR>  CR: 24000822  XER: 20000000
DAR: 0000000c, DSISR: 40000000
TASK = bd0d7bd0[172] 'splice' THREAD: bcd26000
GPR00: 00000000 bcd27dd0 bd0d7bd0 fffffe00 00000000 802835f8 00000001 0000004c 
GPR08: 00024000 00000000 00000062 bcd26000 0023ac37 100198b4 390046a8 0a5042f3 
GPR16: 8028238c bd18fe00 00000008 10010000 6fd01ac0 00000000 10001060 bcd27dd8 
GPR24: 8014b524 00000000 bcd27e30 bcd07180 0000004c bcd071e4 a7c41e8b bfb3aa60 
NIP [8014d0c0] tcp_read_sock+0x138/0x1f8
LR [8014d090] tcp_read_sock+0x108/0x1f8
Call Trace:
[bcd27dd0] [8014d090] tcp_read_sock+0x108/0x1f8 (unreliable)
[bcd27e20] [8014b590] __tcp_splice_read+0x34/0x44
[bcd27e40] [8014b620] tcp_splice_read+0x80/0x220
[bcd27e90] [80105730] sock_splice_read+0x2c/0x44
[bcd27ea0] [8008a374] do_splice_to+0x90/0xac
[bcd27ed0] [8008a850] do_splice+0x258/0x2f0
[bcd27f10] [8008b1d4] sys_splice+0xe0/0xe8
[bcd27f40] [8000ff14] ret_from_syscall+0x0/0x38
 --- Exception: c01 at 0x10000894
     LR = 0x10000e2c

Thanks,
tavi
--

From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 4:01 am

What about sock_hold() here?
It will prevent from socket freeing and read/write to it will

And appropriate sock_put(sk);

-- 
	Evgeniy Polyakov
--

From: Ben Hutchings
Date: Tuesday, May 27, 2008 - 4:08 am

<snip>

We know the socket isn't going to go away because somewhere up the call
stack someone has to be holding the socket in order to lock it.  However,
the skb may (and evidently sometimes does) go away during splice_to_pipe()
so we can't look up the socket through it.

However, from Octavian's later mail it seems we can't let the skb go away
at all.  So some wider changes seem to be required.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
--

From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 4:52 am

Yes, I misread the original message.

Octavian, could you please decode where bug occured via gdb:
gdb
$ l *(skb_splice_bits+0x130)

-- 
	Evgeniy Polyakov
--

From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 4:56 am

It is lock_sock()?

And also please show what
tcp_read_sock+0x108
tcp_read_sock+0x138

are. And do you have a test case for that?

-- 
	Evgeniy Polyakov
--

From: Octavian Purdila
Date: Tuesday, May 27, 2008 - 5:53 am

Yes it is lock_sock(). The crash in lock_sock() occurs with the following 
sequence:

release_sock(__skb->sk);
ret = splice_to_pipe(pipe, &spd);
lock_sock(__skb->sk);

0x8010c94c is in skb_splice_bits (include/net/sock.h:837).
832
833     extern void lock_sock_nested(struct sock *sk, int subclass);
834
835     static inline void lock_sock(struct sock *sk)
836     {
837             lock_sock_nested(sk, 0);
838     }
839
840     extern void release_sock(struct sock *sk);

This crash occurs with the following sequence:

struct sock *sk= __skb->sk;

release_sock(sk);
ret = splice_to_pipe(pipe, &spd);
lock_sock(sk);

0x8014d090 is in tcp_read_sock (net/ipv4/tcp.c:1225).
1220                            used = recv_actor(desc, skb, offset, len);
1221                            if (used < 0) {
1222                                    if (!copied)
1223                                            copied = used;
1224                                    break;
1225                            } else if (used <= len) {
1226                                    seq += used;
1227                                    copied += used;
1228                                    offset += used;

Yes. I've attached my test program which triggers both crashes. To improve the 
reproducibility rate you need to use a relatively high buffer for splice (I 
used 16384) and a small MTU (I used 128). This will make the 1st splice on 
the receive side block (due to not enough PIPE_BUFFERS it seems), and will 
give plenty of time to the race condition to manifest itself. When you 
interrupt the program, the system will crash.

Thanks,
tavi

From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 6:21 am

Cool!

I've reproduced the problem and will try to work it out, thank you.

-- 
	Evgeniy Polyakov
--

From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 7:03 am

Attached patch fixes the crash for me, Octavian could you please test
it.

David, is __kfree_skb() usage in tcp_read_sock() an optimisation only?
With this patch I do not see any leaks, but I did not investigate it
deep enough. If approach seems correct, I will clean things up.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..d285817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1349,6 +1349,7 @@ done:
 
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1360,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..b8318e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1182,6 +1182,23 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 	return NULL;
 }
 
+#ifdef CONFIG_NET_DMA
+static inline void __sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	if (!copied_early)
+		kfree_skb(skb);
+	else
+		__skb_queue_tail(&sk->sk_async_wait_queue, skb);
+}
+#else
+static inline void __sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	kfree_skb(skb);
+}
+#endif
+
 /*
  * This routine provides an alternative to tcp_recvmsg() for routines
  * that would like to handle copying from skbuffs directly in 'sendfile'
@@ -1231,11 +1248,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				break;
 		}
 		if (tcp_hdr(skb)->fin) {
-			sk_eat_skb(sk, skb, 0);
+			__sk_eat_skb(sk, skb, 0);
 			++seq;
 			break;
 		}
-		sk_eat_skb(sk, skb, 0);
+		__sk_eat_skb(sk, skb, 0);
 		if ...
From: Octavian Purdila
Date: Tuesday, May 27, 2008 - 7:39 am

On my system it still crashes, if I interrupt the program before splice 
blocks:

[   16.170009] NIP [8014d0c0] tcp_read_sock+0x138/0x1f8
[   16.170009] LR [8014d090] tcp_read_sock+0x108/0x1f8
[   16.170009] Call Trace:
[   16.170009] [bcd21dd0] [8014d090] tcp_read_sock+0x108/0x1f8 (unreliable)
[   16.170009] [bcd21e20] [8014b590] __tcp_splice_read+0x34/0x44
[   16.170009] [bcd21e40] [8014b620] tcp_splice_read+0x80/0x220
[   16.170009] [bcd21e90] [80105730] sock_splice_read+0x2c/0x44
[   16.170009] [bcd21ea0] [8008a374] do_splice_to+0x90/0xac
[   16.170009] [bcd21ed0] [8008a850] do_splice+0x258/0x2f0
[   16.170009] [bcd21f10] [8008b1d4] sys_splice+0xe0/0xe8
[   16.170009] [bcd21f40] [8000ff14] ret_from_syscall+0x0/0x38


Thanks,
tavi
--

From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 8:09 am

Yes, that was somewhat stupid patch.
Since I can not reproduce it anymore, let's try to upload this task to you :)
Please test this one assuming there are no other patches applied before.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..e2a99a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1295,7 +1295,7 @@ err:
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1308,16 +1308,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1341,14 +1331,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1344,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..54bb460 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1196,7 +1196,7 @@ static inline struct sk_buff ...
From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 8:12 am

A typo sneaked into the patch, please try this one.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..e2a99a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1295,7 +1295,7 @@ err:
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1308,16 +1308,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1341,14 +1331,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1344,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..fb8bfc2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1196,7 +1196,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t ...
From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 8:22 am

The same wrong one, sorry about that.

Idea is to hold skb between release/lock sock calls and thus do not
allow to free it by core stack when it is being released. Patch still
misses the case, when socket is released and skb was dequeued, so splice
will try to dequeue it again, which will crash. I will think on how to
fix the issue.

-- 
	Evgeniy Polyakov
--

From: Octavian Purdila
Date: Tuesday, May 27, 2008 - 8:33 am

Yes, I think I got the idea you are trying to use here. But, somehow I feel 
uneasy with this approach :) Isn't it cleaner to keep the lock and try to 
avoid the deadlock on the sendfile() side? Or is that unfeasible?

I don't think we can drop the socket lock, it will introduce at least on type 
of races: since the skb we are processing is still on the socket queue, any 
entity accessing the socket queue will possible collide with us. 

Thanks,
tavi
--

From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 8:47 am

Well, providing some flags to ->sendpage() callbacks to say that they
should not grab locks is a bit ugly to me, I think we can fix splice

Each access is being done under socket lock, so it should be safe. But
socket release path drops skb from the list before dropping its
reference counter, so we are not allowed to unlink it again, but we can
check if skb is linked or not and make a decision based on that.


-- 
	Evgeniy Polyakov
--

From: Evgeniy Polyakov
Date: Tuesday, May 27, 2008 - 10:28 am

And actually this will lead to data corruption, since different process
can enter and change socket values under us. Without temporal lock drop
we can deadlock (first process locked socket and then waiting for
i_mutex, while another one grabbed i_mutex and waiting for socket lock

Please try attached patch on top of vanilla tree.
It does not use skb after socket was dropped, but instead search it
again when socket is locked, so if socket is alive, it will find it and
clean otherwise it will exit.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..d285817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1349,6 +1349,7 @@ done:
 
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1360,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..217dc59 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1200,15 +1200,17 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 seq = tp->copied_seq;
 	u32 offset;
-	int copied = 0;
+	int copied = 0, fin;
+	size_t used, len, skb_len;
 
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
 	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+		used = 0;
 		if (offset < skb->len) {
-			size_t used, len;
 
-			len = skb->len - offset;
+			skb_len = skb->len;
+			len = skb_len - offset;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
 				u32 urg_offset = tp->urg_seq - seq;
@@ -1217,6 +1219,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (!len)
 					break;
 			}
+
 			used = recv_actor(desc, skb, offset, ...
From: Octavian Purdila
Date: Tuesday, May 27, 2008 - 4:59 pm

This fixes the crash, thanks.

One doubt though: suppose that while we drop the lock the skb gets aggregated 
with the one after it. If the original skb is fully consumed in the receive 
actor, then the we will eat the new, aggregated skb, loosing data. 

Here is a patch, based on your idea, which tries to cope with the above 
scenario. The !skb check was added for the case in which the actor does not 
consume anything in the current interration. 

tavi

From: Evgeniy Polyakov
Date: Wednesday, May 28, 2008 - 1:52 am

How can it be aggregated with another skb? It is only possible that some
other reader consumes the data, but in that case sequence number will

If it does not get any data, then skb will likely exists and will be
consumed in the next run. I preserved old semantic, when we free skb
only if we read it whole or in case of fin. With your changes we can
also free skb, if it was partially consumed and do not free it at all if
skb was not processed becuase it is too old (i.e. it lives in receive
queue, but we already read data siwth sequnce number, which corresponds
to it), no?


-- 
	Evgeniy Polyakov
--

From: Octavian Purdila
Date: Wednesday, May 28, 2008 - 6:20 am

I think that tcp_collapse does this kind of aggregation when we have memory 
pressure: will create a new skb and move the data from the old skbs in the 

Hmm, I didn't thought about this, but you are right, we could have other 

If the consumer didn't consume data, the seq number will not be updated, and 
seq-1 will correspond to a previous skb, which was already dequeued and 

If the skb was partially consumed then tcp_recv_skb(seq-1) will return the 

I didn't get this part, shouldn't the entity which dequeues the packet also 
free it? 

Anyways, I am a newbie in this area, so if my logic doesn't make any sense 
please be patient with me :)

Thanks,
tavi

--

From: Evgeniy Polyakov
Date: Wednesday, May 28, 2008 - 7:11 am

I understand now, please correct me if I got your idea wrong.
We only ned to search for the skb again only in case we processed it and
it was possible that socket lock was dropped. So, the only needed place
to put tcp_recv_skb() is where you pointed. Next, to find current skb we
have to get sequence number inside its data, and seq-1 is indeed that
case, even if skb was changed under us (like combined with others), we
still will find it in the queue. Offset in tcp_recv_skb() is set to
number of bytes between starting sequence number for skb and given
sequence number, thus if we search for (seq-1), it will point to the
last byte processed by the code, and thus offset will 'point' to last
processed byte, not byte to start from, so we check against (offset+1)
and if (offset+1) equals to the size of the skb found we can free this
skb (in the code below). If it is not the case and skb contains data
behind (offset+1) we break and the rest of the data will be processed
in the next run. If there is no skb, we just break out of the loop.

So yes, your patch is simpler and faster than mine so you should push it
upstream. Fortunately David Miller is in copy and will (David, will

We all are newbies somewhere and even splice code itself is rather new :)


-- 
	Evgeniy Polyakov
--

From: Octavian Purdila
Date: Wednesday, May 28, 2008 - 8:20 am

<snip>


OK, will clean it up and post a proper patch soon. Which tree should I base it 
on?

Thanks,
tavi
--

From: Evgeniy Polyakov
Date: Wednesday, May 28, 2008 - 8:42 am

Vanilla tree I think, it should apply without problems to all trees,
since patch is rather small.

-- 
	Evgeniy Polyakov
--

From: Octavian Purdila
Date: Wednesday, May 28, 2008 - 10:08 am

Hmm, might have found a problem with this approach: say we drop the lock, 
queue the data into the pipe, and at the same time tcp_collapse() is called 
which frees some of the skbs of which data we just queued in the pipe. Later, 
when we will read from the pipe, we will get random data instead of socket 
data...


Thanks,
tavi

--

From: Evgeniy Polyakov
Date: Wednesday, May 28, 2008 - 10:51 am

We queue data under the lock and clone appropriate skb (and then grab it
multiple times), so even it will be dropped, its data will not freed, and
thus we will be able to read it. Or you are talking about different
skbs?

-- 
	Evgeniy Polyakov
--

From: Octavian Purdila
Date: Wednesday, May 28, 2008 - 11:02 am

You are right, I forgot about the clone.

Thanks,
tavi
--

From: Jarek Poplawski
Date: Wednesday, May 28, 2008 - 1:01 pm

Probably I miss something, but how does it help when tcp_collapse()
uses __kfree_skb()?

Regards,
Jarek P.
--

From: Octavian Purdila
Date: Wednesday, May 28, 2008 - 1:09 pm

__kfree_skb() -> skb_release_all() -> skb_release_data():

static void skb_release_data(struct sk_buff *skb)
{
        if (!skb->cloned ||
<snip>
                kfree(skb->head);

Since we clone the skb in skb_splice_bits() the skb's data will only be freed 
when the last clone is deleted.

tavi
--

From: Jarek Poplawski
Date: Wednesday, May 28, 2008 - 1:16 pm

Right!

Thanks for explanation,
Jarek P.
--

Previous thread: Adding multiple multicast routing tables. by Ben Greear on Monday, May 26, 2008 - 1:23 pm. (1 message)

Next thread: Re: possible double call of kfree_skb in net/llc/llc_sap.c by Patrick McHardy on Monday, May 26, 2008 - 11:16 pm. (2 messages)