Hello Tavi,

Interesting initiative. I'm employed by Intel and had the chance to do some exploratory work on software PTP support for Intel's new 82576 Gigabit Ethernet Controller [1], which introduces hardware time stamping for PTP packets. I modified the open source PTPd so that it uses the more accurate hardware time stamps instead of time stamps generated by the Linux IP stack. The advantage was 50x higher accuracy under load. You can read more about that in a paper [2].

[1] http://download.intel.com/design/network/ProdBrf/320025.pdf
[2] http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf

In order to get these time stamps and read the clock inside the NIC which generates them, we had to add ioctl() calls to the igb driver - not nice and certainly not a suitable long-term solution. If there is a consensus on a better user space API and the Linux IP stack gets a general framework for PTP, then perhaps it could also be used with Intel's new NICs.

Note that I'm not speaking in any official capacity for Intel here, just expressing my own opinion (and hope). I'm not even in the network team.

I cannot release the PTPd and igb patches right now because that would require legal approval, but if there is interest I can get that process started. There's no reason not to do that.

So, let's move on to Tavi's proposal:

On Fri, 2008-07-04 at 01:47 +0300, Octavian Purdila wrote:

I agree. Currently there is something similar with SO_TIMESTAMP and SCM_TIMESTAMP, but the problem with those is that only a timeval is returned, i.e., accuracy is limited to microseconds. To make full use of hardware time stamps we'll want a timespec with nanoseconds.

We also need something more flexible than SO_TIMESTAMP.
Depending on what the user space program wants to measure, it would be useful to time stamp
* the various flavors of PTP packets (v1/v2/802.1as, SYNC/DELAY_REQUEST) selectively
* all packets

The hardware might not be capable of supporting all modes, but at least the API should support them and provide room for future extensions. It would be possible to fall back to time stamping using system time if the hardware is incapable of implementing the requested operation. Depending on how that fallback is implemented, PTPd's accuracy might be improved even without any hardware support.

Forgive my ignorance, can you provide more details on how that would work?

How about adding a new flag for send/sendto/sendmsg() instead of a new control message?

Sounds a bit complicated to me. The trick currently used by PTPd might be more elegant and/or require fewer changes: it enables looping of outgoing packets with IP_MULTICAST_LOOP. The RX time stamp of the looped packet is then used as an approximation for the TX time stamp of the original outgoing packet. Clearly this is inaccurate, in particular under load, but it is very easy to use.

When a driver gets an skb with the request to generate a TX time stamp, it could send the packet, obtain the time stamp from the hardware upon completion, and feed the packet and the time stamp back to the upper layers as if it had just been received. Would that work? The user space then obtains TX time stamps just like RX time stamps and can use the payload to determine what kind of time stamp it got. That also avoids the need for special cookies to detect packet loss or reordering.

So far all that we get out of this is access to the raw time stamps. There may be some use for that, as Tavi said, but it would be a lot more interesting if the kernel transformed the raw time stamps into system time stamps when the user space process wants that.
Those system time stamps can then be used by a modified PTPd to synchronize the system time inside a cluster a lot more accurately than is currently possible with NTP (think sub-microsecond accuracy instead of milliseconds).

On Fri, 2008-07-04 at 03:42 +0300, Octavian Purdila wrote:

For the paper I tried out two different ways of synchronizing the system time with the NIC time. The one called "Assisted System Time" could be implemented relatively easily inside the IP stack: the driver only has to provide access to the NIC's hardware clock. Then the layer above it can sample the system time/NIC time offset at regular intervals; when the two clocks drift apart, that drift rate can be tracked as part of the measurements and be taken into account when transforming from one time base into the other.

The other method ("Two-Level PTP") is more complicated and didn't bring much benefit.

Bye, Patrick
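
As an illustration of the "Assisted System Time" idea described above, the transformation from NIC time to system time can be sketched with two clock samples and a linear drift estimate. This is only a sketch under simplifying assumptions (exactly two samples, constant drift between them, no filtering of measurement noise); the struct and function names are hypothetical and are not taken from the paper, PTPd, or the igb driver.

```c
#include <assert.h>
#include <stdint.h>

/* One simultaneous reading of both clocks, in nanoseconds. */
struct clock_sample {
    int64_t nic_ns;
    int64_t sys_ns;
};

/* Linear mapping from NIC time to system time: a reference point plus
 * a drift rate stored as the fraction drift_num / drift_den. */
struct clock_map {
    struct clock_sample ref;
    int64_t drift_num; /* system ns gained relative to the NIC clock */
    int64_t drift_den; /* NIC ns elapsed between the two samples */
};

/* Build a mapping from two samples taken some interval apart. */
static struct clock_map clock_map_from_samples(struct clock_sample older,
                                               struct clock_sample newer)
{
    struct clock_map map;
    int64_t nic_elapsed = newer.nic_ns - older.nic_ns;
    int64_t sys_elapsed = newer.sys_ns - older.sys_ns;

    assert(nic_elapsed > 0); /* samples must be ordered in NIC time */
    map.ref = newer;
    map.drift_num = sys_elapsed - nic_elapsed;
    map.drift_den = nic_elapsed;
    return map;
}

/* Transform a raw NIC time stamp into the system time base by applying
 * the offset at the reference point plus the drift accumulated since. */
static int64_t nic_to_system_ns(const struct clock_map *map, int64_t nic_ns)
{
    int64_t delta = nic_ns - map->ref.nic_ns;

    return map->ref.sys_ns + delta + delta * map->drift_num / map->drift_den;
}
```

With samples taken one second apart in NIC time, during which the system clock gained 10 microseconds, a packet stamped half a second after the newer sample is placed 5 microseconds further along in system time than the plain offset alone would give. A real implementation would keep refreshing the samples and smooth the drift estimate over more than two of them.
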
