This patch series was discussed before on linux-netdev ("hardware time stamping + igb example implementation"). Since then I have rebased against net-next and addressed all comments sent so far, except Octavian's suggestion to include more information in the packet which is bounced back to the application. As suggested by David, I'm now also including linux-kernel because: * patch 2 adds a new user space API (documentation and example program included, but no man page patch yet) * patch 8 extends the clocksource struct (mostly adds code, but also adds one branch to reading the clock, which may affect gettimeofday) * patch 10 adds generic code for time synchronization (not network specific, so people not subscribed to linux-netdev might have comments) The open question on linux-netdev was whether struct sk_buff should be extended to hold the additional hardware time stamp. The previous patch avoided that at the cost of much more complicated code and side effects on normal time stamping. This patch now adds a 8 byte field unconditionally. The implementation is a lot more straight-forward. The user space API was already designed to cover this case, so it remained unchanged. There's one unsolved problem, though: time synchronization with PTP (the use case I'm working on) requires a transformation of raw hardware time stamps into system time. Currently this is done at the socket level by finding the device which created the time stamp and letting it do the transformation. This fails for incoming packets because skb->rt points to the "lo" device. Perhaps the interface number can be used to find the real hardware device. Alternatively the conversion could be done when generating the time stamp (might also be more accurate), but then another 8 byte field is needed. Delta encoding won't help because one cannot assume that hardware time stamps track system time closely enough. I'm posting the patch despite this problem so that the discussion can move ...
In __sock_recv_timestamp() the additional SCM_TIMESTAMP[NS] is used. This
has the same value as SO_TIMESTAMP[NS], so this is a purely cosmetic change.
---
net/compat.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/compat.c b/net/compat.c
index 67fb6a3..6ce1a1c 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -226,14 +226,14 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
return 0; /* XXX: return error? check spec. */
}
- if (level == SOL_SOCKET && type == SO_TIMESTAMP) {
+ if (level == SOL_SOCKET && type == SCM_TIMESTAMP) {
struct timeval *tv = (struct timeval *)data;
ctv.tv_sec = tv->tv_sec;
ctv.tv_usec = tv->tv_usec;
data = &ctv;
len = sizeof(ctv);
}
- if (level == SOL_SOCKET && type == SO_TIMESTAMPNS) {
+ if (level == SOL_SOCKET && type == SCM_TIMESTAMPNS) {
struct timespec *ts = (struct timespec *)data;
cts.tv_sec = ts->tv_sec;
cts.tv_nsec = ts->tv_nsec;
--
1.6.0.4
--
User space can request hardware and/or software time stamping. Reporting of the result(s) via a new control message is enabled separately for each field in the message because some of the fields may require additional computation and thus cause overhead. When a TX timestamp operation is requested, the TX skb will be cloned and the clone will be time stamped (in hardware or software) and added to the socket error queue of the skb, if the skb has a socket associated with it. The actual TX timestamp will reach userspace as a RX timestamp on the cloned packet. If timestamping is requested and no timestamping is done in the device driver (potentially this may use hardware timestamping), it will be done in software after the device's start_hard_xmit routine. TODO: add SO_TIMESTAMPING define also to other platforms --- Documentation/networking/timestamping.txt | 178 ++++++++ .../networking/timestamping/timestamping.c | 469 ++++++++++++++++++++ arch/x86/include/asm/socket.h | 3 + include/linux/errqueue.h | 1 + include/linux/sockios.h | 3 + include/net/timestamping.h | 95 ++++ 6 files changed, 749 insertions(+), 0 deletions(-) create mode 100644 Documentation/networking/timestamping.txt create mode 100644 Documentation/networking/timestamping/timestamping.c create mode 100644 include/net/timestamping.h diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt new file mode 100644 index 0000000..3ae60b5 --- /dev/null +++ b/Documentation/networking/timestamping.txt @@ -0,0 +1,178 @@ +The existing interfaces for getting network packages time stamped are: + +* SO_TIMESTAMP + Generate time stamp for each incoming packet using the (not necessarily + monotonous!) system time. Result is returned via recv_msg() in a + control message as timeval (usec resolution). + +* SO_TIMESTAMPNS + Same time ...
The new sk_buff->hwtstamp is used to transport time stamping instructions to the device driver (outgoing packets) and to return raw hardware time stamps back to user space (incoming or sent packets). Implements TX time stamping in software if the device driver doesn't support hardware time stamping. The new semantic for hardware/software time stamping around net_device->hard_start_xmit() is based on two assumptions about existing network device drivers which don't support hardware time stamping and know nothing about it: - they leave the skb->hwtstamp field unmodified - the keep the connection to the originating socket in skb->sk alive, i.e., don't call skb_orphan() Given that hwtstamp is a new field, the first assumption is safe. The second is only true for some drivers. As a result, software TX time stamping currently works with the bnx2 driver, but not with the unmodified igb driver (the two drivers this patch series was tested with). --- include/linux/netdevice.h | 11 ++++ include/linux/skbuff.h | 136 ++++++++++++++++++++++++++++++++++++++++++++- net/core/dev.c | 23 +++++++- net/core/skbuff.c | 72 ++++++++++++++++++++++++ 4 files changed, 239 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 12d7f44..24bea0c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -763,6 +763,17 @@ struct net_device /* for setting kernel sock attribute on TCP connection setup */ #define GSO_MAX_SIZE 65536 unsigned int gso_max_size; + +#define HAVE_HW_TIME_STAMP + /* Transforms original raw hardware time stamp to + * system time base. Always required when supporting + * hardware time stamping. + * + * Returns empty stamp (= all zero) if conversion wasn't + * possible. + */ + ktime_t (*hwtstamp_raw2sys)(struct net_device *dev, + ktime_t hwstamp); }; #define to_net_dev(d) container_of(d, struct net_device, dev) diff --git ...
The overlap with the old SO_TIMESTAMP[NS] options is handled so
that time stamping in software (net_enable_timestamp()) is
enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE
is set. It's disabled if all of these are off.
---
include/net/sock.h | 37 ++++++++++++++++++++++--
net/compat.c | 19 ++++++++----
net/core/sock.c | 79 ++++++++++++++++++++++++++++++++++++++++++++--------
net/socket.c | 75 ++++++++++++++++++++++++++++++++++++-------------
4 files changed, 168 insertions(+), 42 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 0a63894..0f71951 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -158,7 +158,7 @@ struct sock_common {
* @sk_allocation: allocation mode
* @sk_sndbuf: size of send buffer in bytes
* @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
- * %SO_OOBINLINE settings
+ * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
* @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
@@ -488,6 +488,13 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_TIMESTAMPING_TX_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_TX_HARDWARE */
+ SOCK_TIMESTAMPING_TX_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_TX_SOFTWARE */
+ SOCK_TIMESTAMPING_RX_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RX_HARDWARE */
+ SOCK_TIMESTAMPING_RX_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RX_SOFTWARE */
+ SOCK_TIMESTAMPING_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_SOFTWARE */
+ SOCK_TIMESTAMPING_RAW_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RAW_HARDWARE */
+ SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_SYS_HARDWARE */
};
static inline void ...Instructions for time stamping outgoing packets are take from the
socket layer and later copied into the new skb.
---
include/net/ip.h | 1 +
net/can/raw.c | 6 ++++++
net/ipv4/icmp.c | 2 ++
net/ipv4/ip_output.c | 2 ++
net/ipv4/raw.c | 1 +
net/ipv4/udp.c | 4 ++++
6 files changed, 16 insertions(+), 0 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index bc026ec..9bc2b65 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -55,6 +55,7 @@ struct ipcm_cookie
__be32 addr;
int oif;
struct ip_options *opt;
+ union sk_buff_hwtstamp tstamp_tx;
};
#define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/can/raw.c b/net/can/raw.c
index 6e0663f..d4a38e3 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -618,6 +618,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
struct raw_sock *ro = raw_sk(sk);
struct sk_buff *skb;
struct net_device *dev;
+ union sk_buff_hwtstamp tstamp_tx;
int ifindex;
int err;
@@ -639,6 +640,10 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
if (!dev)
return -ENXIO;
+ err = sock_tx_timestamp(msg, sk, &tstamp_tx);
+ if (err < 0)
+ return err;
+
skb = sock_alloc_send_skb(sk, size, msg->msg_flags & MSG_DONTWAIT,
&err);
if (!skb) {
@@ -654,6 +659,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
}
skb->dev = dev;
skb->sk = sk;
+ skb->hwtstamp = tstamp_tx;
err = can_send(skb, ro->loopback);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 21e497e..ba739f4 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -375,6 +375,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
inet->tos = ip_hdr(skb)->tos;
daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.tstamp_tx.hwtstamp.tv64 = 0;
if (icmp_param->replyopts.optlen) {
ipc.opt = &icmp_param->replyopts;
if (ipc.opt->srr)
@@ -532,6 +533,7 @@ void ...---
fs/compat_ioctl.c | 1 +
net/core/dev.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 5235c67..a5001a6 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -2555,6 +2555,7 @@ HANDLE_IOCTL(SIOCSIFMAP, dev_ifsioc)
HANDLE_IOCTL(SIOCGIFADDR, dev_ifsioc)
HANDLE_IOCTL(SIOCSIFADDR, dev_ifsioc)
HANDLE_IOCTL(SIOCSIFHWBROADCAST, dev_ifsioc)
+HANDLE_IOCTL(SIOCSHWTSTAMP, dev_ifsioc)
/* ioctls used by appletalk ddp.c */
HANDLE_IOCTL(SIOCATALKDIFADDR, dev_ifsioc)
diff --git a/net/core/dev.c b/net/core/dev.c
index b4b8eb8..4f61f5e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3650,6 +3650,7 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
cmd == SIOCSMIIREG ||
cmd == SIOCBRADDIF ||
cmd == SIOCBRDELIF ||
+ cmd == SIOCSHWTSTAMP ||
cmd == SIOCWANDEV) {
err = -EOPNOTSUPP;
if (dev->do_ioctl) {
@@ -3805,6 +3806,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
case SIOCBONDCHANGEACTIVE:
case SIOCBRADDIF:
case SIOCBRDELIF:
+ case SIOCSHWTSTAMP:
if (!capable(CAP_NET_ADMIN))
return -EPERM;
/* fall through */
--
1.6.0.4
--
---
drivers/net/igb/igb_main.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 89ffc07..be8e2b8 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -34,6 +34,7 @@
#include <linux/ipv6.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <net/timestamping.h>
#include <linux/mii.h>
#include <linux/ethtool.h>
#include <linux/if_vlan.h>
@@ -4115,6 +4116,33 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
}
/**
+ * igb_hwtstamp_ioctl - control hardware time stamping
+ * @netdev:
+ * @ifreq:
+ * @cmd:
+ *
+ * Currently cannot enable any kind of hardware time stamping, but
+ * supports SIOCSHWTSTAMP in general.
+ **/
+static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
+{
+ struct hwtstamp_config config;
+
+ if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
+ return -EFAULT;
+
+ /* reserved for future extensions */
+ if (config.flags)
+ return -EINVAL;
+
+ if (config.tx_type == HWTSTAMP_TX_OFF &&
+ config.rx_filter_type == HWTSTAMP_FILTER_NONE)
+ return 0;
+
+ return -ERANGE;
+}
+
+/**
* igb_ioctl -
* @netdev:
* @ifreq:
@@ -4127,6 +4155,8 @@ static int igb_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
case SIOCGMIIREG:
case SIOCSMIIREG:
return igb_mii_ioctl(netdev, ifr, cmd);
+ case SIOCSHWTSTAMP:
+ return igb_hwtstamp_ioctl(netdev, ifr, cmd);
default:
return -EOPNOTSUPP;
}
--
1.6.0.4
--
Before someone else mentions it: this was meant to be "sizeof(*serr)" of course. -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter. --
Hello Patrick, I still have two questions of understanding regarding these (offset) transformations that make it still difficult: 1. As no one has an insight of how the specific hardware generates the hw time stamp anyway, why can't you put the already transformed values into the hw timestamp field? I won't stick on the fact that the hw time stamp is the value that has been read from specific controller registers that are very hardware depended. 2. From what i've read in the discussion, i understood that the hardware clock and the system clock skew. Assuming both to be strong monotonic(?), is there a linear skew observable from your testing experiences? If so i would suggest to have some kind of sysfs entry like e.g. /sys/class/net/eth0/hwstamp_skew_ns that gives something like -86234765@12277656012584334095 as output. In general: <skew_in_ns>@<systemtime> So if you would check this sysfs entry two or more times, you would get an impression about the hw time stamp behaviour of your hardware by simply calculating the linear (timedepended) offset based on several 'skew-sample-points', right? I hope these suggestions are not completely damaged ;-) Best regards, Oliver --
It certainly would be sufficient for PTP with the "assisted" PTP mode of synchronizing system time. But my understanding is that Octavian and possibly others want to have access to the raw, unmodified hardware time stamps. Octavian, perhaps you can confirm/elaborate what your use case is? I should point out that currently the "raw" hardware time stamp is already "cooked" a bit in the sense that the new field is spec'ed to contain nanoseconds, not some opaque 64 bit blob. This is intentional because even though the time when the clock started to run is unknown, at least deltas can be calculated. The expectation is that this delta is close to world and system time, with some inaccuracies caused by using different crystals to drive the underlying hardware and thus varying frequencies. The "two-way" PTP mode could use these values to synchronize NIC clocks (using a yet to be defined ioctl call which speeds up or slows down NIC clocks, like adjtimex does for system time) and system time on top of that NIC time. At the NIC level synchronization is going to be more accurate than at the system level because it removes the NIC/system time comparison from the equation. I believe this is similar to what Octavian The clock drift changes over time, so on a big scale the skew is not linear. For short periods (minute range) it is reasonable to assume that the drift is constant, so a linear interpolation works reasonably well. For your proposal that means that the user application needs to reread the sysfs file at least once in a while, without really knowing how often that should be. This would work, but it exposes internal mechanisms (skew measurement) to user space. If a driver developer has a better method that is more suitable for his hardware (perhaps something non-linear or based on more frequent skew measurements), then the application doesn't benefit from it. To summarize, I see the following options at this time: 1 only store transformed time stamp... 1a ... ...
From: Patrick Ohly <patrick.ohly@intel.com> We use the hardware timestamp to determine one-way delay: - hardware synchronize the NIC clocks (20ns resolution) - insert the hw timestamp in the packet as its going out the wire (this is done in hw) - received packets will be tagged with a hardware timestamp One way delay is computed by subtracting the TX hw timestamp from the RX hw I also vote for 3 (storing hw timestamps in the skb). Let me throw in another idea: when enabling hw timestamps could we allocate a bigger skb and store the hw timestamp somewhere in the skb data buffer? We can then modify sock_recv_timestamp to call a new netdev method which should return the hw timestamp. This should take care of RX hw timestamps. For TX, we can we get a new skb with skb_copy_expand to make room for the TX timestamp, and queue that in the error socket. There is still the problem of requesting TX timestamps per packets. At this point it seems that on the TX path the tstamp field is not used, so we could use that space. Or, maybe we can use the same dynamic approach: can we modify the hard_header_len after device registration (e.g. when TX timestamps are enabled)? Thanks, tavi --
How does the socket layer detect that the HW timestamp is available in Finding the netdev is non-trivial, see David's comment about the current hacky approach via the route. Besides, if the time stamp is in the skb Agreed in general, but there was one corner case where the tstamp field I don't know. -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter. --
From: Patrick Ohly <patrick.ohly@intel.com> It doesn't know, thats why we need help from the device (the new netdev OK I see now, it seems that we can't get to the device in a generic way. How about this twist: we add a new option at the socket level, to get the whole skb->head - skb->end data into a user buffer. Then, we call an device ioctl and pass this buffer. The device will extract the hw timestamp and give it to the user. use the convention that the device driver must put the timestamp below the mac header. One potential problem I see with this approach is leaking sensitive information into userspace, which means we will have to restrict this to privileged processes only. tavi --
4 would be mine. I assume, we would get kicked somewhere when we are trying to push Ugh. Not every protocol that uses skbuffs, has a mac header (e.g. the CAN protocol doesn't have mac addresses). This twist does not look very maintainable to me ... One additional question for Patrick: As you wrote that your hw timestamp contained in the new skbuff-field is already cocked ... is there any identifier that tells the userspace application about the type of hw timestamp he gets (e.g. cocked, raw registers, offset to whatever, etc.) ? Regards, Oliver --
From: Oliver Hartkopp <oliver@hartkopp.net> OK, then what about this: we use a device ioctl to get the number of bytes to copy from skb->head. We then pass this to the socket level option. This is more complicated than option 3 or 4, but it should address the concerns raised here - no performance impact when not using this feature. The Could you elaborate on the maintainability issues, they are not clear to me. Thanks, tavi --
Do you know what all the people are doing with skbuffs in the kernel? I'm not aware of it. So whenever we touch a vital data structure like the skbuff we must be sure that our handling is a wise approach for all. Btw. your answer brought me to a completely different approach: What about just creating a new pointer in the struct skbuff that points to a struct hwstamp when it is available OR the pointer is NULL when no hwstamps are available. This struct hwstamp may consist of 'everthing we really need for timestamping' and is added somewhere at the head or the tail of the skbuff data section by the device driver. And if the socket sees this pointer to be not NULL it knows where to look for our fancy struct hwstamp ... I don't know, where this hwstamp data could be pasted into the data section in the best way - but surely others know. Especially this approach meets the requirement that the additional hwstamp data is only allocated (per interface!), when it's really needed - and not for everyone by default. What do you thing about this idea? Regards, Oliver --
From: Oliver Hartkopp <oliver@hartkopp.net> I agree. But what in this approach do you think its not wise? :) The skb is not touched at all, we just need to allocate a larger skb when That is still adding a member into the skb. For what I am proposing we don't have to touch the skb and it has all of your approach benefits. I would rather use this approach since it allows us to "extend" the skb in this manner int the future for other hardware features which requires passing per-skb information to userspace. Thanks, tavi --
That just pushes the problem into the device driver: when it gets passed a skb pointer, how can it tell reliably that the data buffer contains the HW timestamping information? I agree with Oliver, this approach doesn't look maintainable to me. I don't know enough about the kernel to tell whether it works reliably at the moment, much less whether it will work in the future. In the proposed API the userspace application gets three time stamps: software, "cooked" hardware time stamp (converted to nanoseconds by the driver, but not tampered with in any other way), hardware time stamp converted to system time. Each of these may be missing (not available, couldn't be calculated). So yes, the userspace application knows what it got and can pick the value that it needs. I like this better than tampering with the data buffer pointers implicitly because it enables usages of the additional information inside the kernel itself. It's similar to skb_shared_info, except that it is not allocated for all skbs. The skb_shared_info is always at the end of the data buffer. Assume that we have a new __netdev_alloc_hw_skb() which increases the allocated data buffer to make room for the additional struct hwtstamp (either before skb_shared_info or after). I cannot think of a way how the rest of the kernel can tell that this additional data is available by just looking at the existing head/data/end fields in a skb - if I missed something, please let me know. So it seems to me that we need the additional 32 bit offset (or pointer, on 32 bit architectures) in skb which points towards the struct hwtstamp. But that's actually less than the additional 64 bit which hold the time stamp value, as in the current patch. I'll give it a few more days for further debate, then try out this approach. -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the ...
It doesn't even need a 32 bit offset. If it is always at the same location in the buffer (e.g., directly after skb_shared_info), then a single bit is sufficient. The same mechanism could be used to also store other optional structs/fields. -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter. --
That's really fine. Sorry that i did not go that deep into your code I'm not very familiar with skb fields but hiding this with a new __netdev_alloc_hw_skb() looks very good to me and creates a proper way to add hw-specific netdevice information in a generic way and - that's I'll be on a business trip until Thursday, so from my side you would get a 'go ahead' right now ;-) Best regards, Oliver --
