Hi,
The following socket lock dropping in skb_splice_bits seems to open a race
Setup:
- powerpc, non-SMP, no preemption, 2.6.25
- RX side: LRO enabled, splice from socket to /dev/null;
- TX side: MTU set to 128 bytes (on the TX side), GSO enabled, splice from
file to socket
The oops - on the RX side:
Unable to handle kernel paging request for data at address 0x00000030
Faulting instruction address: 0x80109ee0
Oops: Kernel access of bad area, sig: 11 [#1]
Ixia TCPX
Modules linked in: almfmanager(P) filtermanager ixnam_llm(P) ixna
m_tcpx(P) hwstate ixllm ixhostm ixsysctl(P) nlproc_driver
NIP: 80109ee0 LR: 80109edc CTR: 8010c52c
REGS: bcd25b90 TRAP: 0300 Tainted: P (2.6.25-00005-gf7b547d)
MSR: 00009032 <EE,ME,IR,DR> CR: 24000822 XER: 20000000
DAR: 00000030, DSISR: 40000000
TASK = bfbe1bf0[156] 'splice' THREAD: bcd24000
GPR00: 8010c94c bcd25c40 bfbe1bf0 00000000 00000000 802835f8 00000001 0000004c
GPR08: 00024000 00000100 00000032 bcd24000 00010dc4 100198b4 390046a8 0a5042f3
GPR16: 8028238c bd18fe00 00000008 10010000 6fbcbac0 00000000 10001060 bcd25dd8
GPR24: 8014b520 00000000 bcd25e30 bccefa00 bf33e300 fffffe00 bcd25d70 00000000
NIP [80109ee0] lock_sock_nested+0x1c/0x50
LR [80109edc] lock_sock_nested+0x18/0x50
Call Trace:
[bcd25c60] [8010c94c] skb_splice_bits+0x130/0x134
[bcd25dc0] [8014b548] tcp_splice_data_recv+0x28/0x38
[bcd25dd0] [8014d08c] tcp_read_sock+0x108/0x1f8
[bcd25e20] [8014b58c] __tcp_splice_read+0x34/0x44
[bcd25e40] [8014b61c] tcp_splice_read+0x80/0x220
[bcd25e90] [80105730] sock_splice_read+0x2c/0x44
[bcd25ea0] [8008a374] do_splice_to+0x90/0xac
[bcd25ed0] [8008a850] do_splice+0x258/0x2f0
[bcd25f10] [8008b1d4] sys_splice+0xe0/0xe8
[bcd25f40] [8000ff14] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0x10000894
LR = 0x10000e2c
Analysis:
Printks show that __skb->sk is non-NULL before splice_to_pipe and NULL after.
Using a hardware watchpoint I was able to see that the write in __skb->sk is
caused by ...Given the previous comment, that certainly looks wrong. But this could apparently cause deadlock. Surely the correct fix is to copy __skb->sk to a local variable before calling splice_to_pipe() so we can re-lock it? Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Communications Not speaking for my employer; that's the marketing department's job. --
I've tried that, but I think that the freed __skb might be touched later in
tcp_read_sock:
Faulting instruction address: 0x8014d0c0
Oops: Kernel access of bad area, sig: 11 [#1]
Ixia TCPX
Modules linked in: almfmanager(P) filtermanager ixnam_llm(P) ixna
m_tcpx(P) hwstate ixllm ixhostm ixsysctl(P) nlproc_driver
NIP: 8014d0c0 LR: 8014d090 CTR: 8010c52c
REGS: bcd27d20 TRAP: 0300 Tainted: P (2.6.25-00005-gf7
b547d-dirty)
MSR: 00009032 <EE,ME,IR,DR> CR: 24000822 XER: 20000000
DAR: 0000000c, DSISR: 40000000
TASK = bd0d7bd0[172] 'splice' THREAD: bcd26000
GPR00: 00000000 bcd27dd0 bd0d7bd0 fffffe00 00000000 802835f8 00000001 0000004c
GPR08: 00024000 00000000 00000062 bcd26000 0023ac37 100198b4 390046a8 0a5042f3
GPR16: 8028238c bd18fe00 00000008 10010000 6fd01ac0 00000000 10001060 bcd27dd8
GPR24: 8014b524 00000000 bcd27e30 bcd07180 0000004c bcd071e4 a7c41e8b bfb3aa60
NIP [8014d0c0] tcp_read_sock+0x138/0x1f8
LR [8014d090] tcp_read_sock+0x108/0x1f8
Call Trace:
[bcd27dd0] [8014d090] tcp_read_sock+0x108/0x1f8 (unreliable)
[bcd27e20] [8014b590] __tcp_splice_read+0x34/0x44
[bcd27e40] [8014b620] tcp_splice_read+0x80/0x220
[bcd27e90] [80105730] sock_splice_read+0x2c/0x44
[bcd27ea0] [8008a374] do_splice_to+0x90/0xac
[bcd27ed0] [8008a850] do_splice+0x258/0x2f0
[bcd27f10] [8008b1d4] sys_splice+0xe0/0xe8
[bcd27f40] [8000ff14] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0x10000894
LR = 0x10000e2c
Thanks,
tavi
--
What about sock_hold() here? It will prevent from socket freeing and read/write to it will And appropriate sock_put(sk); -- Evgeniy Polyakov --
<snip> We know the socket isn't going to go away because somewhere up the call stack someone has to be holding the socket in order to lock it. However, the skb may (and evidently sometimes does) go away during splice_to_pipe() so we can't look up the socket through it. However, from Octavian's later mail it seems we can't let the skb go away at all. So some wider changes seem to be required. Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Communications Not speaking for my employer; that's the marketing department's job. --
Yes, I misread the original message. Octavian, could you please decode where bug occured via gdb: gdb $ l *(skb_splice_bits+0x130) -- Evgeniy Polyakov --
It is lock_sock()? And also please show what tcp_read_sock+0x108 tcp_read_sock+0x138 are. And do you have a test case for that? -- Evgeniy Polyakov --
Yes it is lock_sock(). The crash in lock_sock() occurs with the following
sequence:
release_sock(__skb->sk);
ret = splice_to_pipe(pipe, &spd);
lock_sock(__skb->sk);
0x8010c94c is in skb_splice_bits (include/net/sock.h:837).
832
833 extern void lock_sock_nested(struct sock *sk, int subclass);
834
835 static inline void lock_sock(struct sock *sk)
836 {
837 lock_sock_nested(sk, 0);
838 }
839
840 extern void release_sock(struct sock *sk);
This crash occurs with the following sequence:
struct sock *sk= __skb->sk;
release_sock(sk);
ret = splice_to_pipe(pipe, &spd);
lock_sock(sk);
0x8014d090 is in tcp_read_sock (net/ipv4/tcp.c:1225).
1220 used = recv_actor(desc, skb, offset, len);
1221 if (used < 0) {
1222 if (!copied)
1223 copied = used;
1224 break;
1225 } else if (used <= len) {
1226 seq += used;
1227 copied += used;
1228 offset += used;
Yes. I've attached my test program which triggers both crashes. To improve the
reproducibility rate you need to use a relatively high buffer for splice (I
used 16384) and a small MTU (I used 128). This will make the 1st splice on
the receive side block (due to not enough PIPE_BUFFERS it seems), and will
give plenty of time to the race condition to manifest itself. When you
interrupt the program, the system will crash.
Thanks,
tavi
Attached patch fixes the crash for me, Octavian could you please test
it.
David, is __kfree_skb() usage in tcp_read_sock() an optimisation only?
With this patch I do not see any leaks, but I did not investigate it
deep enough. If approach seems correct, I will clean things up.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..d285817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1349,6 +1349,7 @@ done:
if (spd.nr_pages) {
int ret;
+ struct sock *sk = __skb->sk;
/*
* Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1360,9 @@ done:
* we call into ->sendpage() with the i_mutex lock held
* and networking will grab the socket lock.
*/
- release_sock(__skb->sk);
+ release_sock(sk);
ret = splice_to_pipe(pipe, &spd);
- lock_sock(__skb->sk);
+ lock_sock(sk);
return ret;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..b8318e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1182,6 +1182,23 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
return NULL;
}
+#ifdef CONFIG_NET_DMA
+static inline void __sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+ __skb_unlink(skb, &sk->sk_receive_queue);
+ if (!copied_early)
+ kfree_skb(skb);
+ else
+ __skb_queue_tail(&sk->sk_async_wait_queue, skb);
+}
+#else
+static inline void __sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+ __skb_unlink(skb, &sk->sk_receive_queue);
+ kfree_skb(skb);
+}
+#endif
+
/*
* This routine provides an alternative to tcp_recvmsg() for routines
* that would like to handle copying from skbuffs directly in 'sendfile'
@@ -1231,11 +1248,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
break;
}
if (tcp_hdr(skb)->fin) {
- sk_eat_skb(sk, skb, 0);
+ __sk_eat_skb(sk, skb, 0);
++seq;
break;
}
- sk_eat_skb(sk, skb, 0);
+ __sk_eat_skb(sk, skb, 0);
if ...On my system it still crashes, if I interrupt the program before splice blocks: [ 16.170009] NIP [8014d0c0] tcp_read_sock+0x138/0x1f8 [ 16.170009] LR [8014d090] tcp_read_sock+0x108/0x1f8 [ 16.170009] Call Trace: [ 16.170009] [bcd21dd0] [8014d090] tcp_read_sock+0x108/0x1f8 (unreliable) [ 16.170009] [bcd21e20] [8014b590] __tcp_splice_read+0x34/0x44 [ 16.170009] [bcd21e40] [8014b620] tcp_splice_read+0x80/0x220 [ 16.170009] [bcd21e90] [80105730] sock_splice_read+0x2c/0x44 [ 16.170009] [bcd21ea0] [8008a374] do_splice_to+0x90/0xac [ 16.170009] [bcd21ed0] [8008a850] do_splice+0x258/0x2f0 [ 16.170009] [bcd21f10] [8008b1d4] sys_splice+0xe0/0xe8 [ 16.170009] [bcd21f40] [8000ff14] ret_from_syscall+0x0/0x38 Thanks, tavi --
Yes, that was somewhat stupid patch.
Since I can not reproduce it anymore, let's try to upload this task to you :)
Please test this one assuming there are no other patches applied before.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..e2a99a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1295,7 +1295,7 @@ err:
* the frag list, if such a thing exists. We'd probably need to recurse to
* handle that cleanly.
*/
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
struct pipe_inode_info *pipe, unsigned int tlen,
unsigned int flags)
{
@@ -1308,16 +1308,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
.ops = &sock_pipe_buf_ops,
.spd_release = sock_spd_release,
};
- struct sk_buff *skb;
-
- /*
- * I'd love to avoid the clone here, but tcp_read_sock()
- * ignores reference counts and unconditonally kills the sk_buff
- * on return from the actor.
- */
- skb = skb_clone(__skb, GFP_KERNEL);
- if (unlikely(!skb))
- return -ENOMEM;
/*
* __skb_splice_bits() only fails if the output has no room left,
@@ -1341,14 +1331,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
}
done:
- /*
- * drop our reference to the clone, the pipe consumption will
- * drop the rest.
- */
- kfree_skb(skb);
-
if (spd.nr_pages) {
int ret;
+ struct sock *sk = __skb->sk;
/*
* Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1344,9 @@ done:
* we call into ->sendpage() with the i_mutex lock held
* and networking will grab the socket lock.
*/
- release_sock(__skb->sk);
+ release_sock(sk);
ret = splice_to_pipe(pipe, &spd);
- lock_sock(__skb->sk);
+ lock_sock(sk);
return ret;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..54bb460 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1196,7 +1196,7 @@ static inline struct sk_buff ...A typo sneaked into the patch, please try this one.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..e2a99a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1295,7 +1295,7 @@ err:
* the frag list, if such a thing exists. We'd probably need to recurse to
* handle that cleanly.
*/
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
struct pipe_inode_info *pipe, unsigned int tlen,
unsigned int flags)
{
@@ -1308,16 +1308,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
.ops = &sock_pipe_buf_ops,
.spd_release = sock_spd_release,
};
- struct sk_buff *skb;
-
- /*
- * I'd love to avoid the clone here, but tcp_read_sock()
- * ignores reference counts and unconditonally kills the sk_buff
- * on return from the actor.
- */
- skb = skb_clone(__skb, GFP_KERNEL);
- if (unlikely(!skb))
- return -ENOMEM;
/*
* __skb_splice_bits() only fails if the output has no room left,
@@ -1341,14 +1331,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
}
done:
- /*
- * drop our reference to the clone, the pipe consumption will
- * drop the rest.
- */
- kfree_skb(skb);
-
if (spd.nr_pages) {
int ret;
+ struct sock *sk = __skb->sk;
/*
* Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1344,9 @@ done:
* we call into ->sendpage() with the i_mutex lock held
* and networking will grab the socket lock.
*/
- release_sock(__skb->sk);
+ release_sock(sk);
ret = splice_to_pipe(pipe, &spd);
- lock_sock(__skb->sk);
+ lock_sock(sk);
return ret;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..fb8bfc2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1196,7 +1196,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
sk_read_actor_t ...The same wrong one, sorry about that. Idea is to hold skb between release/lock sock calls and thus do not allow to free it by core stack when it is being released. Patch still misses the case, when socket is released and skb was dequeued, so splice will try to dequeue it again, which will crash. I will think on how to fix the issue. -- Evgeniy Polyakov --
Yes, I think I got the idea you are trying to use here. But, somehow I feel uneasy with this approach :) Isn't it cleaner to keep the lock and try to avoid the deadlock on the sendfile() side? Or is that unfeasible? I don't think we can drop the socket lock, it will introduce at least on type of races: since the skb we are processing is still on the socket queue, any entity accessing the socket queue will possible collide with us. Thanks, tavi --
Well, providing some flags to ->sendpage() callbacks to say that they should not grab locks is a bit ugly to me, I think we can fix splice Each access is being done under socket lock, so it should be safe. But socket release path drops skb from the list before dropping its reference counter, so we are not allowed to unlink it again, but we can check if skb is linked or not and make a decision based on that. -- Evgeniy Polyakov --
And actually this will lead to data corruption, since different process
can enter and change socket values under us. Without temporal lock drop
we can deadlock (first process locked socket and then waiting for
i_mutex, while another one grabbed i_mutex and waiting for socket lock
Please try attached patch on top of vanilla tree.
It does not use skb after socket was dropped, but instead search it
again when socket is locked, so if socket is alive, it will find it and
clean otherwise it will exit.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..d285817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1349,6 +1349,7 @@ done:
if (spd.nr_pages) {
int ret;
+ struct sock *sk = __skb->sk;
/*
* Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1360,9 @@ done:
* we call into ->sendpage() with the i_mutex lock held
* and networking will grab the socket lock.
*/
- release_sock(__skb->sk);
+ release_sock(sk);
ret = splice_to_pipe(pipe, &spd);
- lock_sock(__skb->sk);
+ lock_sock(sk);
return ret;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..217dc59 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1200,15 +1200,17 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
struct tcp_sock *tp = tcp_sk(sk);
u32 seq = tp->copied_seq;
u32 offset;
- int copied = 0;
+ int copied = 0, fin;
+ size_t used, len, skb_len;
if (sk->sk_state == TCP_LISTEN)
return -ENOTCONN;
while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+ used = 0;
if (offset < skb->len) {
- size_t used, len;
- len = skb->len - offset;
+ skb_len = skb->len;
+ len = skb_len - offset;
/* Stop reading if we hit a patch of urgent data */
if (tp->urg_data) {
u32 urg_offset = tp->urg_seq - seq;
@@ -1217,6 +1219,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
if (!len)
break;
}
+
used = recv_actor(desc, skb, offset, ...This fixes the crash, thanks. One doubt though: suppose that while we drop the lock the skb gets aggregated with the one after it. If the original skb is fully consumed in the receive actor, then the we will eat the new, aggregated skb, loosing data. Here is a patch, based on your idea, which tries to cope with the above scenario. The !skb check was added for the case in which the actor does not consume anything in the current interration. tavi
How can it be aggregated with another skb? It is only possible that some other reader consumes the data, but in that case sequence number will If it does not get any data, then skb will likely exists and will be consumed in the next run. I preserved old semantic, when we free skb only if we read it whole or in case of fin. With your changes we can also free skb, if it was partially consumed and do not free it at all if skb was not processed becuase it is too old (i.e. it lives in receive queue, but we already read data siwth sequnce number, which corresponds to it), no? -- Evgeniy Polyakov --
I think that tcp_collapse does this kind of aggregation when we have memory pressure: will create a new skb and move the data from the old skbs in the Hmm, I didn't thought about this, but you are right, we could have other If the consumer didn't consume data, the seq number will not be updated, and seq-1 will correspond to a previous skb, which was already dequeued and If the skb was partially consumed then tcp_recv_skb(seq-1) will return the I didn't get this part, shouldn't the entity which dequeues the packet also free it? Anyways, I am a newbie in this area, so if my logic doesn't make any sense please be patient with me :) Thanks, tavi --
I understand now, please correct me if I got your idea wrong. We only ned to search for the skb again only in case we processed it and it was possible that socket lock was dropped. So, the only needed place to put tcp_recv_skb() is where you pointed. Next, to find current skb we have to get sequence number inside its data, and seq-1 is indeed that case, even if skb was changed under us (like combined with others), we still will find it in the queue. Offset in tcp_recv_skb() is set to number of bytes between starting sequence number for skb and given sequence number, thus if we search for (seq-1), it will point to the last byte processed by the code, and thus offset will 'point' to last processed byte, not byte to start from, so we check against (offset+1) and if (offset+1) equals to the size of the skb found we can free this skb (in the code below). If it is not the case and skb contains data behind (offset+1) we break and the rest of the data will be processed in the next run. If there is no skb, we just break out of the loop. So yes, your patch is simpler and faster than mine so you should push it upstream. Fortunately David Miller is in copy and will (David, will We all are newbies somewhere and even splice code itself is rather new :) -- Evgeniy Polyakov --
<snip> OK, will clean it up and post a proper patch soon. Which tree should I base it on? Thanks, tavi --
Vanilla tree I think, it should apply without problems to all trees, since patch is rather small. -- Evgeniy Polyakov --
Hmm, might have found a problem with this approach: say we drop the lock, queue the data into the pipe, and at the same time tcp_collapse() is called which frees some of the skbs of which data we just queued in the pipe. Later, when we will read from the pipe, we will get random data instead of socket data... Thanks, tavi --
We queue data under the lock and clone appropriate skb (and then grab it multiple times), so even it will be dropped, its data will not freed, and thus we will be able to read it. Or you are talking about different skbs? -- Evgeniy Polyakov --
You are right, I forgot about the clone. Thanks, tavi --
Probably I miss something, but how does it help when tcp_collapse() uses __kfree_skb()? Regards, Jarek P. --
__kfree_skb() -> skb_release_all() -> skb_release_data():
static void skb_release_data(struct sk_buff *skb)
{
if (!skb->cloned ||
<snip>
kfree(skb->head);
Since we clone the skb in skb_splice_bits() the skb's data will only be freed
when the last clone is deleted.
tavi
--
