Re: [RFC PATCH 06/13] workaround: detect time stamp when command flags are expected

Previous thread: Strange TCP dump between a Linux and a Samsung phone by ROUTE 66 - Catalin BOIE on Tuesday, November 11, 2008 - 7:05 am. (3 messages)

Next thread: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING by Johann Baudy on Tuesday, November 11, 2008 - 7:59 am. (2 messages)
From: Patrick Ohly
Date: Tuesday, November 11, 2008 - 7:44 am

Hello all!

This patch series is related to Octavian's proposals a while ago ("net:
support for TX timestamps", "net: support for hardware timestamps"). It
mostly implements the revised plan discussed recently ("hardware time
stamps + existing time stamp usage").

As part of that discussion there was no consensus on storing hardware
time stamps in sk_buff. One suggested solution for avoiding a new field
was based on storing a transformed hardware time stamp in the existing
field (thus keeping the existing code happy) and then transforming back
to the original hardware time base if needed.

This patch series follows that approach, but the utility code which is
used to transform from hardware time to system time doesn't implement
the necessary inverse transformation. This means that although access to
the original, raw hardware time stamp is part of the API, the
corresponding value is never actually reported by the existing code. If
this value is required, then the final patch shows how a separate
hardware time stamp field solves the problem with a lot less hassle.

If the extra field is considered acceptable unconditionally, then the
#ifdefs become redundant and the code much cleaner. I would prefer this
solution, but I don't know what the performance impact of a larger
sk_buff really is.

This patch series is currently based on v2.6.27. If it is considered for
inclusion, then I'd make sure that PTPd indeed works with this patch
(currently it has only been tested with the included example program)
and resubmit the patch based on the latest netdev branch. This is my
first contribution to the Linux kernel; it hasn't been discussed inside
Intel either. So please give it the treatment that a newbie patch
deserves: thorough, but not too harsh critisism ;-) Thanks!

I'm sure there will be things to correct. To spare you the trouble of
searching for them, here are some I already know of:
      * there's currently no locking of the newly added fields in the
        igb private data ...
From: Patrick Ohly
Date: Friday, October 24, 2008 - 6:41 am

We make use of the upper bits in the skb->tstamp to transport the
senders time stamping settings into the lower levels. Currently these
are per-socket settings, but a per-packet control message could also
be added.

When a TX timestamp operation is requested, the TX skb will be cloned
and the clone will be time stamped (in hardware or software) and added
to the socket error queue of the skb, if the skb has a socket
associated.

The actual timestamp will reach userspace as a RX timestamp on the
cloned packet. If timestamping is requested and no timestamping is
done in the device driver (potentially this may use hardware
timestamping), it will be done in software after the device's
start_hard_xmit routine.

The new semantic for hardware/software time stamping around
net_device->hard_start_xmit() is based on two assumptions about
existing network device drivers which don't support hardware
time stamping and know nothing about it:
- they leave the skb->tstamp field unmodified
- the keep the connection to the originating socket in skb->sk
  alive, i.e., don't call skb_orphan()

The first assumption seems to hold for in-tree drivers. The second
is only true for some drivers. As a result, software TX time stamping
currently works with the bnx2 driver, but not with the unmodified
igb driver (the two drivers this patch was tested with).

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 Documentation/networking/timestamping.txt |   31 +++++++++++++++++
 include/linux/netdevice.h                 |   10 ++++++
 include/linux/skbuff.h                    |   51 +++++++++++++++++++++++++++++
 include/net/sock.h                        |   14 ++++++++
 net/core/dev.c                            |   34 +++++++++++++++++--
 net/core/skbuff.c                         |   36 ++++++++++++++++++++
 net/socket.c                              |   15 ++++++++
 7 files changed, 188 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/timestamping.txt ...
From: Octavian Purdila
Date: Tuesday, November 11, 2008 - 4:15 pm

From: Patrick Ohly <patrick.ohly@intel.com>

I think the addition of the following bits will be of use to applications: 

        serr = SKB_EXT_ERR(skb);
        serr->ee.ee_origin = SO_EE_ORIGIN_TXTSTAMP;
        serr->ee.ee_mac = skb->mac.raw - skb->data;
        serr->ee.ee_network = skb->nh.raw - skb->data;
        serr->ee.ee_transport = skb->h.raw - skb->data;

For example, for UDP PTP we don't have to manually skip the ethernet (and take 
into account VLANs) and IP headers.

Thanks,
tavi

--

From: Patrick Ohly
Date: Wednesday, November 12, 2008 - 1:38 am

Yes, that would be useful. Will add it.

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

--

From: Patrick Ohly
Date: Friday, October 24, 2008 - 6:49 am

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 include/net/ip.h     |    1 +
 net/can/raw.c        |    8 ++++++++
 net/ipv4/icmp.c      |    2 ++
 net/ipv4/ip_output.c |    2 ++
 net/ipv4/raw.c       |    1 +
 net/ipv4/udp.c       |    4 ++++
 6 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 250e6ef..76cee15 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -54,6 +54,7 @@ struct ipcm_cookie
 	__be32			addr;
 	int			oif;
 	struct ip_options	*opt;
+	union ktime		tstamp;
 };
 
 #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/can/raw.c b/net/can/raw.c
index 6e0663f..b3a978b 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -618,6 +618,9 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
 	struct raw_sock *ro = raw_sk(sk);
 	struct sk_buff *skb;
 	struct net_device *dev;
+	union ktime tstamp = {
+		.tv64 = 0.
+	};
 	int ifindex;
 	int err;
 
@@ -639,6 +642,10 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
 	if (!dev)
 		return -ENXIO;
 
+	err = sock_tx_timestamp(msg, sk, &tstamp);
+	if (err < 0)
+		return err;
+
 	skb = sock_alloc_send_skb(sk, size, msg->msg_flags & MSG_DONTWAIT,
 				  &err);
 	if (!skb) {
@@ -654,6 +661,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
 	}
 	skb->dev = dev;
 	skb->sk  = sk;
+	skb->tstamp = tstamp;
 
 	err = can_send(skb, ro->loopback);
 
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 55c355e..27cd661 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -375,6 +375,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	inet->tos = ip_hdr(skb)->tos;
 	daddr = ipc.addr = rt->rt_src;
 	ipc.opt = NULL;
+	ipc.tstamp.tv64 = 0;
 	if (icmp_param->replyopts.optlen) {
 		ipc.opt = &icmp_param->replyopts;
 		if (ipc.opt->srr)
@@ -532,6 +533,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	inet_sk(sk)->tos = ...
From: David Miller
Date: Wednesday, November 12, 2008 - 2:59 am

From: Patrick Ohly <patrick.ohly@intel.com>

Please use ktime_t
--

From: Patrick Ohly
Date: Wednesday, October 29, 2008 - 7:48 am

This happens when IP_MULTICAST_LOOP is on. Apparently the time
stamped packet goes through the loop device's start_hard_xmit?!
TODO: find a clean solution.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 net/core/skbuff.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7d714b8..7d9f1dd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2574,6 +2574,13 @@ void skb_hwtstamp_tx(struct sk_buff *orig_skb,
 	struct sk_buff *skb;
 	int err = -ENOMEM;
 
+	/* sanity check: extra bits set => might be a real time stamp */
+	if (orig_skb->tstamp.tv64 & ~(SKB_TSTAMP_TX_HARDWARE|SKB_TSTAMP_TX_HARDWARE_IN_PROGRESS|SKB_TSTAMP_TX_SOFTWARE)) {
+		printk(KERN_DEBUG
+			"skb_hwtstamp_tx: invalid command flags\n");
+		return;
+	}
+
 	if (!sk)
 		return;
 
-- 
1.6.0.4


--

From: David Miller
Date: Wednesday, November 12, 2008 - 3:00 am

From: Patrick Ohly <patrick.ohly@intel.com>

Line is way too long, split up and group the bits:

	if (orig_skb->tstamp.tv64 & ~(SKB_TSTAMP_TX_HARDWARE |
				      SKB_TSTAMP_TX_HARDWARE_IN_PROGRESS |
				      SKB_TSTAMP_TX_SOFTWARE)) {
--

From: Patrick Ohly
Date: Friday, October 31, 2008 - 4:43 am

In its current form the new ioctl allows time stamping
PTP packets (all currently useful flavors) and all packets.
This should be good enough for the use cases discussed
on Linux netdev so far.

It does not yet allow user space control over the clock
in the NIC. If this should become necessary, then it will
have to be extended.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 Documentation/networking/timestamping.txt          |   23 ++++++++++++++++-
 .../networking/timestamping/timestamping.c         |   26 ++++++++++++++++++++
 fs/compat_ioctl.c                                  |    1 +
 include/net/timestamping.h                         |    7 ++++-
 net/core/dev.c                                     |    2 +
 5 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 6a87a96..537e55b 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -96,6 +96,24 @@ struct hwtstamp_config {
     int rx_filter_type;  /**< HWTSTAMP_RX_* */
 };
 
+Desired behavior is passed into the kernel and to a specific device by
+calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
+ifr_data points to a struct hwtstamp_config. The tx_type and
+rx_filter_type are hints to the driver what it is expected to do. If
+the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets.
+
+A driver which supports hardware time stamping shall update the struct
+with the actual, possibly more permissive configuration. If the
+requested packets cannot be time stamped, then nothing should be
+changed and ERANGE shall be returned (in contrast to EINVAL, which
+indicates that SIOCSHWTSTAMP is not supported at all).
+
+Only a processes with admin rights may change the configuration. User
+space is responsible to ensure that multiple processes don't ...
From: Patrick Ohly
Date: Wednesday, October 22, 2008 - 8:01 am

New socket option SO_TIMESTAMPING. Supersedes SO_TIMESTAMP[NS]. The
overlap with the old SO_TIMESTAMP[NS] options is handled so that time
stamping in software (net_enable_timestamp()) is enabled when
SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE is set.  It's
disabled if all of these are off.

User space can request hardware and/or software time stamping.
Reporting of the result(s) via a new control message is enabled
separately for each field in the message because some of the
fields may require additional computation and thus cause overhead.

New SIOCSHWTSTAMP ioctl number which controls the hardware which
does the hardware time stamping. Must be implemented by each
device driver which supports hardware time stamping together
with the new time stamp transformation methods in struct
net_device.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 Documentation/networking/timestamping.txt          |  147 +++++++
 .../networking/timestamping/timestamping.c         |  441 ++++++++++++++++++++
 include/asm-x86/socket.h                           |    3 +
 include/linux/errqueue.h                           |    1 +
 include/linux/netdevice.h                          |   15 +-
 include/linux/skbuff.h                             |    4 +-
 include/linux/sockios.h                            |    3 +
 include/net/sock.h                                 |   22 +-
 include/net/timestamping.h                         |   92 ++++
 net/compat.c                                       |   19 +-
 net/core/skbuff.c                                  |   13 +-
 net/core/sock.c                                    |   88 ++++-
 net/socket.c                                       |   68 +++-
 13 files changed, 861 insertions(+), 55 deletions(-)
 create mode 100644 Documentation/networking/timestamping.txt
 create mode 100644 Documentation/networking/timestamping/timestamping.c
 create mode 100644 include/net/timestamping.h

diff --git a/Documentation/networking/timestamping.txt ...
From: David Miller
Date: Wednesday, November 12, 2008 - 3:02 am

From: Patrick Ohly <patrick.ohly@intel.com>

It's __KERNEL__ not __kernel__, and user visible interfaces
should not be added to <net/foo.h> files.

Rather, they should be put into <linux/foo.h> files.

Since "timestamping.h" is too generic a name, use something
like <linux/net_tstamp.h>
--

From: Patrick Ohly
Date: Friday, October 31, 2008 - 5:21 am

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 drivers/net/igb/igb_main.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 634c4c9..becf8d6 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -34,6 +34,7 @@
 #include <linux/ipv6.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
+#include <net/timestamping.h>
 #include <linux/mii.h>
 #include <linux/ethtool.h>
 #include <linux/if_vlan.h>
@@ -4103,6 +4104,35 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
 }
 
 /**
+ * igb_hwtstamp_ioctl - control hardware time stamping
+ * @netdev:
+ * @ifreq:
+ * @cmd:
+ *
+ * Currently cannot enable any kind of hardware time stamping, but
+ * supports SIOCSHWTSTAMP in general.
+ **/
+static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
+{
+	struct hwtstamp_config config;
+
+	printk("igb_hwtstamp_ioctl\n");
+
+	if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
+		return -EFAULT;
+
+	/* reserved for future extensions */
+	if (config.flags)
+		return -EINVAL;
+
+	if (config.tx_type == HWTSTAMP_TX_OFF &&
+		config.rx_filter_type == HWTSTAMP_FILTER_NONE)
+		return 0;
+
+	return -ERANGE;
+}
+
+/**
  * igb_ioctl -
  * @netdev:
  * @ifreq:
@@ -4115,6 +4145,8 @@ static int igb_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
 	case SIOCGMIIREG:
 	case SIOCSMIIREG:
 		return igb_mii_ioctl(netdev, ifr, cmd);
+	case SIOCSHWTSTAMP:
+		return igb_hwtstamp_ioctl(netdev, ifr, cmd);
 	default:
 		return -EOPNOTSUPP;
 	}
-- 
1.6.0.4


--


Currently only mapping from clock source to system time is implemented.
The interface could have been made more versatile by not depending on a clock source,
but this wasn't done to avoid writing glue code elsewhere.

The method implemented here is the one used and analyzed under the name
"assisted PTP" in the LCI PTP paper:
http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 include/linux/clocksync.h |  141 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/time/Makefile      |    2 +-
 kernel/time/clocksync.c   |  108 ++++++++++++++++++++++++++++++++++
 3 files changed, 250 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/clocksync.h
 create mode 100644 kernel/time/clocksync.c

diff --git a/include/linux/clocksync.h b/include/linux/clocksync.h
new file mode 100644
index 0000000..a5bec6f
--- /dev/null
+++ b/include/linux/clocksync.h
@@ -0,0 +1,141 @@
+/*
+ * Utility code which helps transforming between hardware time stamps
+ * generated by a clocksource and system time. The clocksource is
+ * assumed to return monotonically increasing time (but this code does
+ * its best to compensate if that is not the case) whereas system time
+ * may jump.
+ */
+#ifndef _LINUX_CLOCKSYNC_H
+#define _LINUX_CLOCKSYNC_H
+
+#include <linux/clocksource.h>
+#include <linux/ktime.h>
+
+/**
+ * struct clocksync - stores state and configuration for the two clocks
+ *
+ * Initialize to zero, then set clock, systime, num_samples.
+ *
+ * Transformation between HW time and system time is done with:
+ * HW time transformed = HW time + offset +
+ *                       (HW time - last_update) * skew / CLOCKSYNC_SKEW_RESOLUTION
+ *
+ * @clock:           the source for HW time stamps (%clocksource_read_time)
+ * @systime:         function returning current system time (ktime_get
+ *                   for monotonic time, or ktime_get_real for wall clock)
+ * @num_samples:     number ...

That should be separately allocated to avoid potential stack overflow.

Also as a style nit there are normally no {} around single line
statements.


-Andi
--


Good catch. "make checkstack" also complains about it, but I didn't get
around to fixing it yet.

I'd prefer to allocate a very small array on the stack (10 entries = 160
bytes) and only fall back to dynamic allocation if the user of clocksync

This is the part of the CodingStyle that I had most trouble adapting to
because a) I wrote a lot of code where the required style explicitly
asked for {} and b) I can think of several reasons for adding them
always and only one for not adding them.

Anyway, I'll try to keep this in mind, but would prefer to not reformat
the patches unless I have to touch them for other reasons.

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

--


From: Patrick Ohly <patrick.ohly@intel.com>

That distracts the eyes of the people reviewing the code, because
such people spend most of their time reading code that conforms
to the proper kernel coding style.

Therefore, please fix up these issues rather than defer them.
What does it take like 5 minutes of your time?  About the same
amount of time it took you to say you would defer it?  Come on...
--


You are right of course. I have changed this and also addressed the
other comments. I'll give it a few more days in case that there are
further comments, then resubmit with linux-kernel on CC.

Should I rebase against net-2.6 or net-next-2.6?

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

--


net-next-2.6 is the tree you want to use for new network developments

net-2.6 is for bug fixes only


--


From: Patrick Ohly <patrick.ohly@intel.com>

Like patch 9, this will need a full review on linux-kernel
--

From: Patrick Ohly
Date: Tuesday, November 4, 2008 - 2:27 am

Adds register definitions and a clocksource accessing the
NIC time.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 drivers/net/igb/e1000_regs.h |   28 +++++++++++
 drivers/net/igb/igb.h        |    3 +
 drivers/net/igb/igb_main.c   |  105 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 136 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index 95523af..37f9d55 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -75,6 +75,34 @@
 #define E1000_FCRTH    0x02168  /* Flow Control Receive Threshold High - RW */
 #define E1000_RDFPCQ(_n)  (0x02430 + (0x4 * (_n)))
 #define E1000_FCRTV    0x02460  /* Flow Control Refresh Timer Value - RW */
+
+/* IEEE 1588 TIMESYNCH */
+#define E1000_TSYNCTXCTL 0x0B614
+#define E1000_TSYNCRXCTL 0x0B620
+#define E1000_TSYNCRXCFG 0x05F50
+
+#define E1000_SYSTIML 0x0B600
+#define E1000_SYSTIMH 0x0B604
+#define E1000_TIMINCA 0x0B608
+
+#define E1000_RXMTRL     0x0B634
+#define E1000_RXSTMPL 0x0B624
+#define E1000_RXSTMPH 0x0B628
+#define E1000_RXSATRL 0x0B62C
+#define E1000_RXSATRH 0x0B630
+
+#define E1000_TXSTMPL 0x0B618
+#define E1000_TXSTMPH 0x0B61C
+
+#define E1000_ETQF0   0x05CB0
+#define E1000_ETQF1   0x05CB4
+#define E1000_ETQF2   0x05CB8
+#define E1000_ETQF3   0x05CBC
+#define E1000_ETQF4   0x05CC0
+#define E1000_ETQF5   0x05CC4
+#define E1000_ETQF6   0x05CC8
+#define E1000_ETQF7   0x05CCC
+
 /* Split and Replication RX Control - RW */
 /*
  * Convenience macros
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 4ff6f05..2938ab3 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -34,6 +34,8 @@
 #include "e1000_mac.h"
 #include "e1000_82575.h"
 
+#include <linux/clocksource.h>
+
 struct igb_adapter;
 
 #ifdef CONFIG_IGB_LRO
@@ -262,6 +264,7 @@ struct igb_adapter {
 	struct napi_struct napi;
 	struct pci_dev *pdev;
 	struct net_device_stats net_stats;
+	struct clocksource clock;
 
 ...
From: Patrick Ohly
Date: Tuesday, November 4, 2008 - 2:23 am

So far struct clocksource acted as the interface between time/timekeeping
and hardware. This patch generalizes the concept so that the same
interface can also be used in other contexts.

The only change as far as kernel/time/timekeeping is concerned is that
the hardware access can be done either with or without passing
the clocksource pointer as context. This is necessary in those
cases when there is more than one instance of the hardware.

The extensions in this patch add code which turns the raw cycle count
provided by hardware into a continously increasing time value. This
reuses fields also used by timekeeping.c. Because of slightly different
semantic (__get_nsec_offset does not update cycle_last, clocksource_read_ns
does that transparently) timekeeping.c was not modified to use the
generalized code.

The new code does no locking of the clocksource. This is the responsibility
of the caller.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 include/linux/clocksource.h |  119 ++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 118 insertions(+), 1 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 55e434f..da4c7cd 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -24,6 +24,9 @@ struct clocksource;
 /**
  * struct clocksource - hardware abstraction for a free running counter
  *	Provides mostly state-free accessors to the underlying hardware.
+ *      Also provides utility functions which convert the underlying
+ *      hardware cycle values into a non-decreasing count of nanoseconds
+ *      ("time").
  *
  * @name:		ptr to clocksource name
  * @list:		list head for registration
@@ -43,6 +46,9 @@ struct clocksource;
  *				The ideal clocksource. A must-use where
  *				available.
  * @read:		returns a cycle value
+ * @read_clock:         alternative to read which gets a pointer to the clock
+ *                      source so that the same code can read different clocks;
+ *      ...
From: David Miller
Date: Wednesday, November 12, 2008 - 3:04 am

From: Patrick Ohly <patrick.ohly@intel.com>

This patch, since it changes generic facilities in the kernel rather
than networking specific ones, will need to get a full review on
linux-kernel
--

From: Patrick Ohly
Date: Friday, November 7, 2008 - 2:26 am

Because of performance reasons, adding a new field to struct sk_buff
was avoided. Hardware time stamps are stored in the existing field, but
in order to not break other code, they must have been transformed to
the system time base.

To obtain the original hardware time stamp before the transformation,
a network device driver must implement the inverse transformation.
The clocksync code has no support for that yet and it would be
difficult to implement 100% accurately (rounding errors, updated
offset/skew values).

Instead of implementing this inverse transformation, this patch
adds another field for hardware time stamps. It is off by default
and mainstream Linux distributions should leave it off (PTP time
synchronization doesn't need it), but special distributions/users
could enable it if needed without having to patch the mainline
kernel source.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 drivers/net/igb/igb_main.c |    3 +-
 include/linux/skbuff.h     |   48 ++++++++++++++++++++++++++++++++++---------
 net/Kconfig                |   16 ++++++++++++++
 net/core/skbuff.c          |   18 +++++++++++++--
 4 files changed, 71 insertions(+), 14 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index b320fec..2fed508 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -4116,7 +4116,8 @@ send_up:
 			clocksync_update(&adapter->sync, tstamp);
 			skb_hwtstamp_set(skb,
 					ns_to_ktime(clocksource_cyc2time(&adapter->clock,
-										tstamp)));
+										tstamp)),
+					ns_to_ktime(tstamp));
 		}
 
 		if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bcca8fc..123711d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,7 @@ typedef unsigned char *sk_buff_data_t;
  *               thus is recorded in system time. If the lowest bit is set,
  *               then the value was originally generated by a ...
From: Patrick Ohly
Date: Thursday, November 6, 2008 - 4:13 am

Currently only TX hardware time stamping is implemented.  Due to
hardware limitations it is not possible to verify reliably which
packet was time stamped when multiple were pending for sending; this
will be solved by only allowing one packet marked for hardware time
stamping into the queue (not implemented yet).

RX time stamping relies on the flag in the packet descriptor which
marks packets that were time stamped. In "all packet" mode this flag
is not set. TODO: also support that mode (even though it'll suffer
from race conditions).

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
---
 drivers/net/igb/e1000_82575.h   |    1 +
 drivers/net/igb/e1000_defines.h |    1 +
 drivers/net/igb/e1000_regs.h    |   40 +++++++
 drivers/net/igb/igb.h           |    2 +
 drivers/net/igb/igb_main.c      |  239 +++++++++++++++++++++++++++++++++++++--
 5 files changed, 275 insertions(+), 8 deletions(-)

diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..dd32a6f 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -116,6 +116,7 @@ union e1000_adv_tx_desc {
 };
 
 /* Adv Transmit Descriptor Config Masks */
+#define E1000_ADVTXD_MAC_TSTAMP   0x00080000 /* IEEE1588 Timestamp packet */
 #define E1000_ADVTXD_DTYP_CTXT    0x00200000 /* Advanced Context Descriptor */
 #define E1000_ADVTXD_DTYP_DATA    0x00300000 /* Advanced Data Descriptor */
 #define E1000_ADVTXD_DCMD_IFCS    0x02000000 /* Insert FCS (Ethernet CRC) */
diff --git a/drivers/net/igb/e1000_defines.h b/drivers/net/igb/e1000_defines.h
index ce70068..2a19698 100644
--- a/drivers/net/igb/e1000_defines.h
+++ b/drivers/net/igb/e1000_defines.h
@@ -104,6 +104,7 @@
 #define E1000_RXD_STAT_UDPCS    0x10    /* UDP xsum calculated */
 #define E1000_RXD_STAT_TCPCS    0x20    /* TCP xsum calculated */
 #define E1000_RXD_STAT_DYNINT   0x800   /* Pkt caused INT via DYNINT */
+#define E1000_RXD_STAT_TS       0x10000 /* Pkt was time stamped */
 #define E1000_RXD_ERR_CE       ...
From: Andi Kleen
Date: Wednesday, November 12, 2008 - 9:06 am

As a general comment on the patch series I'm still a little sceptical
the time stamp offset method is a good idea. Since it tries to approximate
several unsynchronized clocks the result will always be of a little poor
quality, which will likely lead to problems sooner or later (or rather
require ugly workarounds in the user).

I think it would be better to just bite the bullet and add new fields
for this to the skbs. Hardware timestamps are useful enough to justify
this.

-Andi

--

From: Patrick Ohly
Date: Wednesday, November 12, 2008 - 9:25 am

I'm all for it, as long as it doesn't keep this feature out of the
mainline.

At least one additional ktime_t field would be needed for the raw
hardware time stamp. Transformation to system time (as needed by PTP)
would have to be delayed until the packet is delivered via a socket. The
code would be easier (and a bit more accurate) if also another ktime_t
was added to store the transformed value directly after generating it.

An extra field would also solve one of the open problems (tstamp set to
time stamp when dev_start_xmit_hard is called for IP_MULTICAST_LOOP).

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

--

From: Oliver Hartkopp
Date: Wednesday, November 12, 2008 - 11:44 am

I really wondered if you posted the series to get an impression why 
adding a new field is a good idea ;-)
Ok, i'm not that experienced on timestamps but i really got confused 
reading the patches and it's documentation (even together with the 
discussion on the ML). I would also vote for having a new field in the 
skb instead of this current 'bit-compression' approach which smells 
quite expensive at runtime and in code size. Not talking about the 
mentioned potential locking issues ...

Regards,
Oliver

--

From: Eric Dumazet
Date: Wednesday, November 12, 2008 - 12:22 pm

New fields in skb are probably the easy way to handle the problem, we
all know that.

But adding fields on such heavy duty structure for less than 0.001 % of
handled frames is disgusting.

Crazy idea here :

Say your NIC is capable of generating hw timestamps at TX or RX time.

Instead of storing them in skb, store them in a local structure (of the driver)
The local structure could contain an array of 4096 (or whatever size) couples of
{pointer to skb, hardware timestamp with whatever format is needed by this NIC}

If an application needs skb hw timestamp, get it when reading message, with appropriate
API, that calls NIC driver method, giving skb pointer as an argument. NIC driver
search in its local table a match of skb pointer (giving the most recent match of course),
and converts hwtimestamp into "generic application format". No need for a fast search, just
a linear search in the table, so that feeding it is really easy (maybe lockless)

For TX side, a flag on skb could ask NIC driver to feed transmited skb (or a copy of them)
 to a raw socket (kind of a loopback for selected packets), once TX hstamp is collected in local table.


--

From: Andi Kleen
Date: Wednesday, November 12, 2008 - 1:23 pm

You have a strange definition of "disgusting".

But if that's true that applies to the existing timestamp in there then too
(and a couple of other fields in there too)

Also I suspect that your percent numbers are wrong, depending on the workload.

Personally I think hardware time stamps should replace the existing

This will probably be a disaster on e.g. high speed network sniffing
(which is one of the primary use cases of the hardware
As soon as there is any reordering in the queue (and that is inevitable
if you scale over multiple CPUs) your linear searches could get quite
long and bounce cache lines like mad. Also I doubt it can be really
done lockless.

Also to be honest such a complicated and likely badly performing scheme just to save 4-8 bytes
would match my own definition of "disgusting".

-Andi
--

From: Andi Kleen
Date: Wednesday, November 12, 2008 - 1:23 pm

You have a strange definition of "disgusting".

But if that's true that applies to the existing timestamp in there then too
(and a couple of other fields in there too)

Also I suspect that your percent numbers are wrong, depending on the workload.

Personally I think hardware time stamps should replace the existing

This will probably be a disaster on e.g. high speed network sniffing
(which is one of the primary use cases of the hardware
As soon as there is any reordering in the queue (and that is inevitable
if you scale over multiple CPUs) your linear searches could get quite
long and bounce cache lines like mad. Also I doubt it can be really
done lockless.

Also to be honest such a complicated and likely badly performing scheme just to save 4-8 bytes
would match my own definition of "disgusting".

-Andi
--

From: Eric Dumazet
Date: Wednesday, November 12, 2008 - 1:56 pm

tstamp is the time stamp at the time NIC driver got the frame. Not the time
the NIC got the frame from the wire.


This scheme only is needed for special devices, used by PTP.

Only *selected* frames really need to gather hwtstamp.

TCP trafic wont use hwtstamp. Most UDP trafic wont use hwstamp.

I threw a "crazy idea", that can be changed if necessary, say with a cookie
that identifies the slot in NIC driver structure. O(1) lookup if really needed.

Yes, I find year 2008 not appropriate to enlarge skb with a hwstamp,
but then YMMV



--

From: Andi Kleen
Date: Wednesday, November 12, 2008 - 2:34 pm

It's going to be supported by a large range of mass market NICs,
not special devices.



Actually it wouldn't surprise me if one of the numerous
TCP congestion avoidance algorithms that get added all the time

That depends. For example if you're running a packet sniffer
most packets will carry it. Probably also others. e.g. dhcp
is using timestamps and it would make sense to switch it to
hw timestamps when available.

I think "crazy" describes it well because it would be a lot of dubious
and likely not performing well effort just to save 8 bytes.

BTW it wouldn't surprise me if skb heads had some free space in common
situations anyways becaus it's unlikely it fits exactly into 4K pages
in slab/slub.

-Andi
--

From: Oliver Hartkopp
Date: Wednesday, November 12, 2008 - 3:26 pm

HW Timestamps are also state-of-the-art in a large range of Controller 
Area Network (CAN) NICs.
And when you want to write 'really professional' traffic sniffers or you 
need to deal with sensor fusion, hw timestamps allow big improvements in 

I would also assume that people will find new use-cases for hw 

The crazy idea from Eric looks easier and more clearly to me than the 
discussed patch set from Patrick Ohly - but i wonder if we should give a 
separate hw timestamp a try ...

I know Patrick is not a friend of a CONFIG option here. But when we make 
it right HW timestamp could only be disabled on CONFIG_EMBEDDED or 
something like that.

Regards,

--

From: Ohly, Patrick
Date: Thursday, November 13, 2008 - 8:53 am

For reasons that have been mentioned already here (some hardware can
time stamp every packet, new use cases) I think it would be important
to have the hwtstamp information right in the skb. I can change the
patch series so that it uses one additional ktime_t hwtstamp field; give

I'm not a friend of a config option because it was suggested that
hardware tstamps should off on *standard* kernels. That's of little
use for users of unmodified distributions who want to run PTP.

If the feature is only disabled on special distributions, then I really
don't mind, but at the same time I wonder whether these distributions
are performance sensitive enough to care about the additional field.

Bye, Patrick
--

From: Oliver Hartkopp
Date: Wednesday, November 12, 2008 - 11:15 pm

Patrick,

one question about a new crazy idea:

If we would tend to add new space in the skb, won't 4 bytes enough then?

A 32 bit value gives a nsec resolution of 4.294967296 seconds or +/- 
2.147483648 seconds.

If we make a 'full qualified' 64 bit sys-timestamp available anyway, the 
new 32 bit value could be used as an offest (or it could be given to the 
userspace directly) to calculate the hw timestamp within the 
sys-timestamp context, right?

Regards,
Oliver

--

From: Eric Dumazet
Date: Wednesday, November 12, 2008 - 11:29 pm

If NIC is going to receive 100.000 frames per second as Andi mentioned earlier
my guess is you dont want to make sophisticated computation in NIC rx handler,
but storing raw data delivered by NIC.

Then, later, for the happy few^Wmany applications that need to get hwstamp, perform
the computation if needed ?

I hope tcp stack wont need hwstamp before 2013 or so ;)

--

From: Ohly, Patrick
Date: Thursday, November 13, 2008 - 9:05 am

The sys-timestamp is normally not generated. The offset scheme would
add a call to gettimeofdayns() even if there is no other use for the
value. This might be acceptable; the bigger problem IMHO is that without
tracking system time in the hardware, hardware and system time will quickly
(~ a few days with the hardware I was looking at, if I remember correctly)
diverge more than can be stored in the 32 bit offset.

I'd prefer to spend 64 bits and be done without the need for further
encoding hacks.

Bye, Patrick
--

From: Andrew Shewmaker
Date: Sunday, November 16, 2008 - 1:15 am

I would like point to CUBIC, the Probe Control Protocol, and TCP Santa
Cruz as evidence that you are correct.

CUBIC v2.3 has a new slow start variant called Hystart.

http://marc.info/?l=linux-netdev&m=122531684115306&w=2

In their tech report, they refer to the packet train technique for
measuring available bandwidth used by Dovrolis in his Pathload tool.
One of the reasons Hystart uses heuristics rather than the algorithms
described in the Pathload paper is the unavailability of high
precision timestamps.

http://www.cc.gatech.edu/fac/Constantinos.Dovrolis/pathload.html

The Probe Control Protocol is a non-TCP protocol.  The authors
implemented the Pathload algorithm for measuring available bandwidth,
but they used libpcap to do it.

http://www.cs.washington.edu/research/networking/pcp/


TCP Santa Cruz is another, older, variant of TCP that proposed using
timestamps to model the depth of the queue in the bottleneck switch
between two hosts.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.981


And lastly, I would welcome good TX and RX timestamps for use with my
own research in providing better QoS on commodity networks.

-- 
Andrew Shewmaker
--

From: David Miller
Date: Wednesday, November 12, 2008 - 3:17 pm

From: Andi Kleen <ak@linux.intel.com>

I think that, considered globally, Eric's estimate is an over-estimate
in fact.
--

From: Patrick Ohly
Date: Wednesday, November 19, 2008 - 5:39 am

Oh dear, my secret plan has been revealed ;-) No, I was really hoping
that the patch would be acceptable. After rewriting the patch series
with one additional field the code is simpler (or so I hope). I just

The locking issues still remain: the hardware reconfiguration in the
ioctl needs to be coordinated with the ongoing time stamping. Then
there's the raw2sys callback which is made by the socket layer into the
device. That one is problematic also because finding that device isn't
as easy as I thought (see my other mails), so perhaps we should get rid
of the delayed transformation and add two fields.

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

--

Previous thread: Strange TCP dump between a Linux and a Samsung phone by ROUTE 66 - Catalin BOIE on Tuesday, November 11, 2008 - 7:05 am. (3 messages)

Next thread: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING by Johann Baudy on Tuesday, November 11, 2008 - 7:59 am. (2 messages)