Hello all!
This patch series is related to Octavian's proposals a while ago ("net:
support for TX timestamps", "net: support for hardware timestamps"). It
mostly implements the revised plan discussed recently ("hardware time
stamps + existing time stamp usage").
As part of that discussion there was no consensus on storing hardware
time stamps in sk_buff. One suggested solution for avoiding a new field
was based on storing a transformed hardware time stamp in the existing
field (thus keeping the existing code happy) and then transforming back
to the original hardware time base if needed.
This patch series follows that approach, but the utility code which is
used to transform from hardware time to system time doesn't implement
the necessary inverse transformation. This means that although access to
the original, raw hardware time stamp is part of the API, the
corresponding value is never actually reported by the existing code. If
this value is required, then the final patch shows how a separate
hardware time stamp field solves the problem with a lot less hassle.
If the extra field is considered acceptable unconditionally, then the
#ifdefs become redundant and the code much cleaner. I would prefer this
solution, but I don't know what the performance impact of a larger
sk_buff really is.
This patch series is currently based on v2.6.27. If it is considered for
inclusion, then I'd make sure that PTPd indeed works with this patch
(currently it has only been tested with the included example program)
and resubmit the patch based on the latest netdev branch. This is my
first contribution to the Linux kernel; it hasn't been discussed inside
Intel either. So please give it the treatment that a newbie patch
deserves: thorough, but not too harsh critisism ;-) Thanks!
I'm sure there will be things to correct. To spare you the trouble of
searching for them, here are some I already know of:
* there's currently no locking of the newly added fields in the
igb private data ...We make use of the upper bits in the skb->tstamp to transport the senders time stamping settings into the lower levels. Currently these are per-socket settings, but a per-packet control message could also be added. When a TX timestamp operation is requested, the TX skb will be cloned and the clone will be time stamped (in hardware or software) and added to the socket error queue of the skb, if the skb has a socket associated. The actual timestamp will reach userspace as a RX timestamp on the cloned packet. If timestamping is requested and no timestamping is done in the device driver (potentially this may use hardware timestamping), it will be done in software after the device's start_hard_xmit routine. The new semantic for hardware/software time stamping around net_device->hard_start_xmit() is based on two assumptions about existing network device drivers which don't support hardware time stamping and know nothing about it: - they leave the skb->tstamp field unmodified - the keep the connection to the originating socket in skb->sk alive, i.e., don't call skb_orphan() The first assumption seems to hold for in-tree drivers. The second is only true for some drivers. As a result, software TX time stamping currently works with the bnx2 driver, but not with the unmodified igb driver (the two drivers this patch was tested with). Signed-off-by: Patrick Ohly <patrick.ohly@intel.com> --- Documentation/networking/timestamping.txt | 31 +++++++++++++++++ include/linux/netdevice.h | 10 ++++++ include/linux/skbuff.h | 51 +++++++++++++++++++++++++++++ include/net/sock.h | 14 ++++++++ net/core/dev.c | 34 +++++++++++++++++-- net/core/skbuff.c | 36 ++++++++++++++++++++ net/socket.c | 15 ++++++++ 7 files changed, 188 insertions(+), 3 deletions(-) diff --git a/Documentation/networking/timestamping.txt ...
From: Patrick Ohly <patrick.ohly@intel.com>
I think the addition of the following bits will be of use to applications:
serr = SKB_EXT_ERR(skb);
serr->ee.ee_origin = SO_EE_ORIGIN_TXTSTAMP;
serr->ee.ee_mac = skb->mac.raw - skb->data;
serr->ee.ee_network = skb->nh.raw - skb->data;
serr->ee.ee_transport = skb->h.raw - skb->data;
For example, for UDP PTP we don't have to manually skip the ethernet (and take
into account VLANs) and IP headers.
Thanks,
tavi
--
Yes, that would be useful. Will add it. -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter. --
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
include/net/ip.h | 1 +
net/can/raw.c | 8 ++++++++
net/ipv4/icmp.c | 2 ++
net/ipv4/ip_output.c | 2 ++
net/ipv4/raw.c | 1 +
net/ipv4/udp.c | 4 ++++
6 files changed, 18 insertions(+), 0 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index 250e6ef..76cee15 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -54,6 +54,7 @@ struct ipcm_cookie
__be32 addr;
int oif;
struct ip_options *opt;
+ union ktime tstamp;
};
#define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/can/raw.c b/net/can/raw.c
index 6e0663f..b3a978b 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -618,6 +618,9 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
struct raw_sock *ro = raw_sk(sk);
struct sk_buff *skb;
struct net_device *dev;
+ union ktime tstamp = {
+ .tv64 = 0.
+ };
int ifindex;
int err;
@@ -639,6 +642,10 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
if (!dev)
return -ENXIO;
+ err = sock_tx_timestamp(msg, sk, &tstamp);
+ if (err < 0)
+ return err;
+
skb = sock_alloc_send_skb(sk, size, msg->msg_flags & MSG_DONTWAIT,
&err);
if (!skb) {
@@ -654,6 +661,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
}
skb->dev = dev;
skb->sk = sk;
+ skb->tstamp = tstamp;
err = can_send(skb, ro->loopback);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 55c355e..27cd661 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -375,6 +375,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
inet->tos = ip_hdr(skb)->tos;
daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.tstamp.tv64 = 0;
if (icmp_param->replyopts.optlen) {
ipc.opt = &icmp_param->replyopts;
if (ipc.opt->srr)
@@ -532,6 +533,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
inet_sk(sk)->tos = ...From: Patrick Ohly <patrick.ohly@intel.com> Please use ktime_t --
This happens when IP_MULTICAST_LOOP is on. Apparently the time
stamped packet goes through the loop device's start_hard_xmit?!
TODO: find a clean solution.
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
net/core/skbuff.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7d714b8..7d9f1dd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2574,6 +2574,13 @@ void skb_hwtstamp_tx(struct sk_buff *orig_skb,
struct sk_buff *skb;
int err = -ENOMEM;
+ /* sanity check: extra bits set => might be a real time stamp */
+ if (orig_skb->tstamp.tv64 & ~(SKB_TSTAMP_TX_HARDWARE|SKB_TSTAMP_TX_HARDWARE_IN_PROGRESS|SKB_TSTAMP_TX_SOFTWARE)) {
+ printk(KERN_DEBUG
+ "skb_hwtstamp_tx: invalid command flags\n");
+ return;
+ }
+
if (!sk)
return;
--
1.6.0.4
--
From: Patrick Ohly <patrick.ohly@intel.com>
Line is way too long, split up and group the bits:
if (orig_skb->tstamp.tv64 & ~(SKB_TSTAMP_TX_HARDWARE |
SKB_TSTAMP_TX_HARDWARE_IN_PROGRESS |
SKB_TSTAMP_TX_SOFTWARE)) {
--
In its current form the new ioctl allows time stamping
PTP packets (all currently useful flavors) and all packets.
This should be good enough for the use cases discussed
on Linux netdev so far.
It does not yet allow user space control over the clock
in the NIC. If this should become necessary, then it will
have to be extended.
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
Documentation/networking/timestamping.txt | 23 ++++++++++++++++-
.../networking/timestamping/timestamping.c | 26 ++++++++++++++++++++
fs/compat_ioctl.c | 1 +
include/net/timestamping.h | 7 ++++-
net/core/dev.c | 2 +
5 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 6a87a96..537e55b 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -96,6 +96,24 @@ struct hwtstamp_config {
int rx_filter_type; /**< HWTSTAMP_RX_* */
};
+Desired behavior is passed into the kernel and to a specific device by
+calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
+ifr_data points to a struct hwtstamp_config. The tx_type and
+rx_filter_type are hints to the driver what it is expected to do. If
+the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets.
+
+A driver which supports hardware time stamping shall update the struct
+with the actual, possibly more permissive configuration. If the
+requested packets cannot be time stamped, then nothing should be
+changed and ERANGE shall be returned (in contrast to EINVAL, which
+indicates that SIOCSHWTSTAMP is not supported at all).
+
+Only a processes with admin rights may change the configuration. User
+space is responsible to ensure that multiple processes don't ...New socket option SO_TIMESTAMPING. Supersedes SO_TIMESTAMP[NS]. The overlap with the old SO_TIMESTAMP[NS] options is handled so that time stamping in software (net_enable_timestamp()) is enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE is set. It's disabled if all of these are off. User space can request hardware and/or software time stamping. Reporting of the result(s) via a new control message is enabled separately for each field in the message because some of the fields may require additional computation and thus cause overhead. New SIOCSHWTSTAMP ioctl number which controls the hardware which does the hardware time stamping. Must be implemented by each device driver which supports hardware time stamping together with the new time stamp transformation methods in struct net_device. Signed-off-by: Patrick Ohly <patrick.ohly@intel.com> --- Documentation/networking/timestamping.txt | 147 +++++++ .../networking/timestamping/timestamping.c | 441 ++++++++++++++++++++ include/asm-x86/socket.h | 3 + include/linux/errqueue.h | 1 + include/linux/netdevice.h | 15 +- include/linux/skbuff.h | 4 +- include/linux/sockios.h | 3 + include/net/sock.h | 22 +- include/net/timestamping.h | 92 ++++ net/compat.c | 19 +- net/core/skbuff.c | 13 +- net/core/sock.c | 88 ++++- net/socket.c | 68 +++- 13 files changed, 861 insertions(+), 55 deletions(-) create mode 100644 Documentation/networking/timestamping.txt create mode 100644 Documentation/networking/timestamping/timestamping.c create mode 100644 include/net/timestamping.h diff --git a/Documentation/networking/timestamping.txt ...
From: Patrick Ohly <patrick.ohly@intel.com> It's __KERNEL__ not __kernel__, and user visible interfaces should not be added to <net/foo.h> files. Rather, they should be put into <linux/foo.h> files. Since "timestamping.h" is too generic a name, use something like <linux/net_tstamp.h> --
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
drivers/net/igb/igb_main.c | 32 ++++++++++++++++++++++++++++++++
1 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 634c4c9..becf8d6 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -34,6 +34,7 @@
#include <linux/ipv6.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <net/timestamping.h>
#include <linux/mii.h>
#include <linux/ethtool.h>
#include <linux/if_vlan.h>
@@ -4103,6 +4104,35 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
}
/**
+ * igb_hwtstamp_ioctl - control hardware time stamping
+ * @netdev:
+ * @ifreq:
+ * @cmd:
+ *
+ * Currently cannot enable any kind of hardware time stamping, but
+ * supports SIOCSHWTSTAMP in general.
+ **/
+static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
+{
+ struct hwtstamp_config config;
+
+ printk("igb_hwtstamp_ioctl\n");
+
+ if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
+ return -EFAULT;
+
+ /* reserved for future extensions */
+ if (config.flags)
+ return -EINVAL;
+
+ if (config.tx_type == HWTSTAMP_TX_OFF &&
+ config.rx_filter_type == HWTSTAMP_FILTER_NONE)
+ return 0;
+
+ return -ERANGE;
+}
+
+/**
* igb_ioctl -
* @netdev:
* @ifreq:
@@ -4115,6 +4145,8 @@ static int igb_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
case SIOCGMIIREG:
case SIOCSMIIREG:
return igb_mii_ioctl(netdev, ifr, cmd);
+ case SIOCSHWTSTAMP:
+ return igb_hwtstamp_ioctl(netdev, ifr, cmd);
default:
return -EOPNOTSUPP;
}
--
1.6.0.4
--
Currently only mapping from clock source to system time is implemented. The interface could have been made more versatile by not depending on a clock source, but this wasn't done to avoid writing glue code elsewhere. The method implemented here is the one used and analyzed under the name "assisted PTP" in the LCI PTP paper: http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf Signed-off-by: Patrick Ohly <patrick.ohly@intel.com> --- include/linux/clocksync.h | 141 +++++++++++++++++++++++++++++++++++++++++++++ kernel/time/Makefile | 2 +- kernel/time/clocksync.c | 108 ++++++++++++++++++++++++++++++++++ 3 files changed, 250 insertions(+), 1 deletions(-) create mode 100644 include/linux/clocksync.h create mode 100644 kernel/time/clocksync.c diff --git a/include/linux/clocksync.h b/include/linux/clocksync.h new file mode 100644 index 0000000..a5bec6f --- /dev/null +++ b/include/linux/clocksync.h @@ -0,0 +1,141 @@ +/* + * Utility code which helps transforming between hardware time stamps + * generated by a clocksource and system time. The clocksource is + * assumed to return monotonically increasing time (but this code does + * its best to compensate if that is not the case) whereas system time + * may jump. + */ +#ifndef _LINUX_CLOCKSYNC_H +#define _LINUX_CLOCKSYNC_H + +#include <linux/clocksource.h> +#include <linux/ktime.h> + +/** + * struct clocksync - stores state and configuration for the two clocks + * + * Initialize to zero, then set clock, systime, num_samples. + * + * Transformation between HW time and system time is done with: + * HW time transformed = HW time + offset + + * (HW time - last_update) * skew / CLOCKSYNC_SKEW_RESOLUTION + * + * @clock: the source for HW time stamps (%clocksource_read_time) + * @systime: function returning current system time (ktime_get + * for monotonic time, or ktime_get_real for wall clock) + * @num_samples: number ...
That should be separately allocated to avoid potential stack overflow.
Also as a style nit there are normally no {} around single line
statements.
-Andi
--
Good catch. "make checkstack" also complains about it, but I didn't get
around to fixing it yet.
I'd prefer to allocate a very small array on the stack (10 entries = 160
bytes) and only fall back to dynamic allocation if the user of clocksync
This is the part of the CodingStyle that I had most trouble adapting to
because a) I wrote a lot of code where the required style explicitly
asked for {} and b) I can think of several reasons for adding them
always and only one for not adding them.
Anyway, I'll try to keep this in mind, but would prefer to not reformat
the patches unless I have to touch them for other reasons.
--
Best Regards, Patrick Ohly
The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.
--
From: Patrick Ohly <patrick.ohly@intel.com> That distracts the eyes of the people reviewing the code, because such people spend most of their time reading code that conforms to the proper kernel coding style. Therefore, please fix up these issues rather than defer them. What does it take like 5 minutes of your time? About the same amount of time it took you to say you would defer it? Come on... --
You are right of course. I have changed this and also addressed the other comments. I'll give it a few more days in case that there are further comments, then resubmit with linux-kernel on CC. Should I rebase against net-2.6 or net-next-2.6? -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter. --
net-next-2.6 is the tree you want to use for new network developments net-2.6 is for bug fixes only --
From: Patrick Ohly <patrick.ohly@intel.com> Like patch 9, this will need a full review on linux-kernel --
Adds register definitions and a clocksource accessing the
NIC time.
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
drivers/net/igb/e1000_regs.h | 28 +++++++++++
drivers/net/igb/igb.h | 3 +
drivers/net/igb/igb_main.c | 105 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 136 insertions(+), 0 deletions(-)
diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index 95523af..37f9d55 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -75,6 +75,34 @@
#define E1000_FCRTH 0x02168 /* Flow Control Receive Threshold High - RW */
#define E1000_RDFPCQ(_n) (0x02430 + (0x4 * (_n)))
#define E1000_FCRTV 0x02460 /* Flow Control Refresh Timer Value - RW */
+
+/* IEEE 1588 TIMESYNCH */
+#define E1000_TSYNCTXCTL 0x0B614
+#define E1000_TSYNCRXCTL 0x0B620
+#define E1000_TSYNCRXCFG 0x05F50
+
+#define E1000_SYSTIML 0x0B600
+#define E1000_SYSTIMH 0x0B604
+#define E1000_TIMINCA 0x0B608
+
+#define E1000_RXMTRL 0x0B634
+#define E1000_RXSTMPL 0x0B624
+#define E1000_RXSTMPH 0x0B628
+#define E1000_RXSATRL 0x0B62C
+#define E1000_RXSATRH 0x0B630
+
+#define E1000_TXSTMPL 0x0B618
+#define E1000_TXSTMPH 0x0B61C
+
+#define E1000_ETQF0 0x05CB0
+#define E1000_ETQF1 0x05CB4
+#define E1000_ETQF2 0x05CB8
+#define E1000_ETQF3 0x05CBC
+#define E1000_ETQF4 0x05CC0
+#define E1000_ETQF5 0x05CC4
+#define E1000_ETQF6 0x05CC8
+#define E1000_ETQF7 0x05CCC
+
/* Split and Replication RX Control - RW */
/*
* Convenience macros
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 4ff6f05..2938ab3 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -34,6 +34,8 @@
#include "e1000_mac.h"
#include "e1000_82575.h"
+#include <linux/clocksource.h>
+
struct igb_adapter;
#ifdef CONFIG_IGB_LRO
@@ -262,6 +264,7 @@ struct igb_adapter {
struct napi_struct napi;
struct pci_dev *pdev;
struct net_device_stats net_stats;
+ struct clocksource clock;
...So far struct clocksource acted as the interface between time/timekeeping and hardware. This patch generalizes the concept so that the same interface can also be used in other contexts. The only change as far as kernel/time/timekeeping is concerned is that the hardware access can be done either with or without passing the clocksource pointer as context. This is necessary in those cases when there is more than one instance of the hardware. The extensions in this patch add code which turns the raw cycle count provided by hardware into a continously increasing time value. This reuses fields also used by timekeeping.c. Because of slightly different semantic (__get_nsec_offset does not update cycle_last, clocksource_read_ns does that transparently) timekeeping.c was not modified to use the generalized code. The new code does no locking of the clocksource. This is the responsibility of the caller. Signed-off-by: Patrick Ohly <patrick.ohly@intel.com> --- include/linux/clocksource.h | 119 ++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 118 insertions(+), 1 deletions(-) diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 55e434f..da4c7cd 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -24,6 +24,9 @@ struct clocksource; /** * struct clocksource - hardware abstraction for a free running counter * Provides mostly state-free accessors to the underlying hardware. + * Also provides utility functions which convert the underlying + * hardware cycle values into a non-decreasing count of nanoseconds + * ("time"). * * @name: ptr to clocksource name * @list: list head for registration @@ -43,6 +46,9 @@ struct clocksource; * The ideal clocksource. A must-use where * available. * @read: returns a cycle value + * @read_clock: alternative to read which gets a pointer to the clock + * source so that the same code can read different clocks; + * ...
From: Patrick Ohly <patrick.ohly@intel.com> This patch, since it changes generic facilities in the kernel rather than networking specific ones, will need to get a full review on linux-kernel --
Because of performance reasons, adding a new field to struct sk_buff
was avoided. Hardware time stamps are stored in the existing field, but
in order to not break other code, they must have been transformed to
the system time base.
To obtain the original hardware time stamp before the transformation,
a network device driver must implement the inverse transformation.
The clocksync code has no support for that yet and it would be
difficult to implement 100% accurately (rounding errors, updated
offset/skew values).
Instead of implementing this inverse transformation, this patch
adds another field for hardware time stamps. It is off by default
and mainstream Linux distributions should leave it off (PTP time
synchronization doesn't need it), but special distributions/users
could enable it if needed without having to patch the mainline
kernel source.
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
drivers/net/igb/igb_main.c | 3 +-
include/linux/skbuff.h | 48 ++++++++++++++++++++++++++++++++++---------
net/Kconfig | 16 ++++++++++++++
net/core/skbuff.c | 18 +++++++++++++--
4 files changed, 71 insertions(+), 14 deletions(-)
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index b320fec..2fed508 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -4116,7 +4116,8 @@ send_up:
clocksync_update(&adapter->sync, tstamp);
skb_hwtstamp_set(skb,
ns_to_ktime(clocksource_cyc2time(&adapter->clock,
- tstamp)));
+ tstamp)),
+ ns_to_ktime(tstamp));
}
if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bcca8fc..123711d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,7 @@ typedef unsigned char *sk_buff_data_t;
* thus is recorded in system time. If the lowest bit is set,
* then the value was originally generated by a ...Currently only TX hardware time stamping is implemented. Due to
hardware limitations it is not possible to verify reliably which
packet was time stamped when multiple were pending for sending; this
will be solved by only allowing one packet marked for hardware time
stamping into the queue (not implemented yet).
RX time stamping relies on the flag in the packet descriptor which
marks packets that were time stamped. In "all packet" mode this flag
is not set. TODO: also support that mode (even though it'll suffer
from race conditions).
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
drivers/net/igb/e1000_82575.h | 1 +
drivers/net/igb/e1000_defines.h | 1 +
drivers/net/igb/e1000_regs.h | 40 +++++++
drivers/net/igb/igb.h | 2 +
drivers/net/igb/igb_main.c | 239 +++++++++++++++++++++++++++++++++++++--
5 files changed, 275 insertions(+), 8 deletions(-)
diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..dd32a6f 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -116,6 +116,7 @@ union e1000_adv_tx_desc {
};
/* Adv Transmit Descriptor Config Masks */
+#define E1000_ADVTXD_MAC_TSTAMP 0x00080000 /* IEEE1588 Timestamp packet */
#define E1000_ADVTXD_DTYP_CTXT 0x00200000 /* Advanced Context Descriptor */
#define E1000_ADVTXD_DTYP_DATA 0x00300000 /* Advanced Data Descriptor */
#define E1000_ADVTXD_DCMD_IFCS 0x02000000 /* Insert FCS (Ethernet CRC) */
diff --git a/drivers/net/igb/e1000_defines.h b/drivers/net/igb/e1000_defines.h
index ce70068..2a19698 100644
--- a/drivers/net/igb/e1000_defines.h
+++ b/drivers/net/igb/e1000_defines.h
@@ -104,6 +104,7 @@
#define E1000_RXD_STAT_UDPCS 0x10 /* UDP xsum calculated */
#define E1000_RXD_STAT_TCPCS 0x20 /* TCP xsum calculated */
#define E1000_RXD_STAT_DYNINT 0x800 /* Pkt caused INT via DYNINT */
+#define E1000_RXD_STAT_TS 0x10000 /* Pkt was time stamped */
#define E1000_RXD_ERR_CE ...As a general comment on the patch series I'm still a little sceptical the time stamp offset method is a good idea. Since it tries to approximate several unsynchronized clocks the result will always be of a little poor quality, which will likely lead to problems sooner or later (or rather require ugly workarounds in the user). I think it would be better to just bite the bullet and add new fields for this to the skbs. Hardware timestamps are useful enough to justify this. -Andi --
I'm all for it, as long as it doesn't keep this feature out of the mainline. At least one additional ktime_t field would be needed for the raw hardware time stamp. Transformation to system time (as needed by PTP) would have to be delayed until the packet is delivered via a socket. The code would be easier (and a bit more accurate) if also another ktime_t was added to store the transformed value directly after generating it. An extra field would also solve one of the open problems (tstamp set to time stamp when dev_start_xmit_hard is called for IP_MULTICAST_LOOP). -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter. --
I really wondered if you posted the series to get an impression why adding a new field is a good idea ;-) Ok, i'm not that experienced on timestamps but i really got confused reading the patches and it's documentation (even together with the discussion on the ML). I would also vote for having a new field in the skb instead of this current 'bit-compression' approach which smells quite expensive at runtime and in code size. Not talking about the mentioned potential locking issues ... Regards, Oliver --
New fields in skb are probably the easy way to handle the problem, we
all know that.
But adding fields on such heavy duty structure for less than 0.001 % of
handled frames is disgusting.
Crazy idea here :
Say your NIC is capable of generating hw timestamps at TX or RX time.
Instead of storing them in skb, store them in a local structure (of the driver)
The local structure could contain an array of 4096 (or whatever size) couples of
{pointer to skb, hardware timestamp with whatever format is needed by this NIC}
If an application needs skb hw timestamp, get it when reading message, with appropriate
API, that calls NIC driver method, giving skb pointer as an argument. NIC driver
search in its local table a match of skb pointer (giving the most recent match of course),
and converts hwtimestamp into "generic application format". No need for a fast search, just
a linear search in the table, so that feeding it is really easy (maybe lockless)
For TX side, a flag on skb could ask NIC driver to feed transmited skb (or a copy of them)
to a raw socket (kind of a loopback for selected packets), once TX hstamp is collected in local table.
--
You have a strange definition of "disgusting". But if that's true that applies to the existing timestamp in there then too (and a couple of other fields in there too) Also I suspect that your percent numbers are wrong, depending on the workload. Personally I think hardware time stamps should replace the existing This will probably be a disaster on e.g. high speed network sniffing (which is one of the primary use cases of the hardware As soon as there is any reordering in the queue (and that is inevitable if you scale over multiple CPUs) your linear searches could get quite long and bounce cache lines like mad. Also I doubt it can be really done lockless. Also to be honest such a complicated and likely badly performing scheme just to save 4-8 bytes would match my own definition of "disgusting". -Andi --
You have a strange definition of "disgusting". But if that's true that applies to the existing timestamp in there then too (and a couple of other fields in there too) Also I suspect that your percent numbers are wrong, depending on the workload. Personally I think hardware time stamps should replace the existing This will probably be a disaster on e.g. high speed network sniffing (which is one of the primary use cases of the hardware As soon as there is any reordering in the queue (and that is inevitable if you scale over multiple CPUs) your linear searches could get quite long and bounce cache lines like mad. Also I doubt it can be really done lockless. Also to be honest such a complicated and likely badly performing scheme just to save 4-8 bytes would match my own definition of "disgusting". -Andi --
tstamp is the time stamp at the time NIC driver got the frame. Not the time the NIC got the frame from the wire. This scheme only is needed for special devices, used by PTP. Only *selected* frames really need to gather hwtstamp. TCP trafic wont use hwtstamp. Most UDP trafic wont use hwstamp. I threw a "crazy idea", that can be changed if necessary, say with a cookie that identifies the slot in NIC driver structure. O(1) lookup if really needed. Yes, I find year 2008 not appropriate to enlarge skb with a hwstamp, but then YMMV --
It's going to be supported by a large range of mass market NICs, not special devices. Actually it wouldn't surprise me if one of the numerous TCP congestion avoidance algorithms that get added all the time That depends. For example if you're running a packet sniffer most packets will carry it. Probably also others. e.g. dhcp is using timestamps and it would make sense to switch it to hw timestamps when available. I think "crazy" describes it well because it would be a lot of dubious and likely not performing well effort just to save 8 bytes. BTW it wouldn't surprise me if skb heads had some free space in common situations anyways becaus it's unlikely it fits exactly into 4K pages in slab/slub. -Andi --
HW Timestamps are also state-of-the-art in a large range of Controller Area Network (CAN) NICs. And when you want to write 'really professional' traffic sniffers or you need to deal with sensor fusion, hw timestamps allow big improvements in I would also assume that people will find new use-cases for hw The crazy idea from Eric looks easier and more clearly to me than the discussed patch set from Patrick Ohly - but i wonder if we should give a separate hw timestamp a try ... I know Patrick is not a friend of a CONFIG option here. But when we make it right HW timestamp could only be disabled on CONFIG_EMBEDDED or something like that. Regards, --
For reasons that have been mentioned already here (some hardware can time stamp every packet, new use cases) I think it would be important to have the hwtstamp information right in the skb. I can change the patch series so that it uses one additional ktime_t hwtstamp field; give I'm not a friend of a config option because it was suggested that hardware tstamps should off on *standard* kernels. That's of little use for users of unmodified distributions who want to run PTP. If the feature is only disabled on special distributions, then I really don't mind, but at the same time I wonder whether these distributions are performance sensitive enough to care about the additional field. Bye, Patrick --
Patrick, one question about a new crazy idea: If we would tend to add new space in the skb, won't 4 bytes enough then? A 32 bit value gives a nsec resolution of 4.294967296 seconds or +/- 2.147483648 seconds. If we make a 'full qualified' 64 bit sys-timestamp available anyway, the new 32 bit value could be used as an offest (or it could be given to the userspace directly) to calculate the hw timestamp within the sys-timestamp context, right? Regards, Oliver --
If NIC is going to receive 100.000 frames per second as Andi mentioned earlier my guess is you dont want to make sophisticated computation in NIC rx handler, but storing raw data delivered by NIC. Then, later, for the happy few^Wmany applications that need to get hwstamp, perform the computation if needed ? I hope tcp stack wont need hwstamp before 2013 or so ;) --
The sys-timestamp is normally not generated. The offset scheme would add a call to gettimeofdayns() even if there is no other use for the value. This might be acceptable; the bigger problem IMHO is that without tracking system time in the hardware, hardware and system time will quickly (~ a few days with the hardware I was looking at, if I remember correctly) diverge more than can be stored in the 32 bit offset. I'd prefer to spend 64 bits and be done without the need for further encoding hacks. Bye, Patrick --
I would like point to CUBIC, the Probe Control Protocol, and TCP Santa Cruz as evidence that you are correct. CUBIC v2.3 has a new slow start variant called Hystart. http://marc.info/?l=linux-netdev&m=122531684115306&w=2 In their tech report, they refer to the packet train technique for measuring available bandwidth used by Dovrolis in his Pathload tool. One of the reasons Hystart uses heuristics rather than the algorithms described in the Pathload paper is the unavailability of high precision timestamps. http://www.cc.gatech.edu/fac/Constantinos.Dovrolis/pathload.html The Probe Control Protocol is a non-TCP protocol. The authors implemented the Pathload algorithm for measuring available bandwidth, but they used libpcap to do it. http://www.cs.washington.edu/research/networking/pcp/ TCP Santa Cruz is another, older, variant of TCP that proposed using timestamps to model the depth of the queue in the bottleneck switch between two hosts. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.981 And lastly, I would welcome good TX and RX timestamps for use with my own research in providing better QoS on commodity networks. -- Andrew Shewmaker --
From: Andi Kleen <ak@linux.intel.com> I think that, considered globally, Eric's estimate is an over-estimate in fact. --
Oh dear, my secret plan has been revealed ;-) No, I was really hoping that the patch would be acceptable. After rewriting the patch series with one additional field the code is simpler (or so I hope). I just The locking issues still remain: the hardware reconfiguration in the ioctl needs to be coordinated with the ongoing time stamping. Then there's the raw2sys callback which is made by the socket layer into the device. That one is problematic also because finding that device isn't as easy as I thought (see my other mails), so perhaps we should get rid of the delayed transformation and add two fields. -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter. --
