Steve got the _worst-case_ cpu_clock() difference down to 60 usecs not
so long ago. It might have regressed since then, it's really hard to do
it without cross-CPU synchronization.
( But it's not impossible, as Steve has proven it, because physical time
goes on linearly on each CPU so we have a chance to do it: by
accurately correlating the GTOD timestamps we get at to-idle/from-idle
times to the TSC. )
And note that i'm not only talking about cross-CPU synchronization, i'm
also talking about _single CPU_ timestamps. How do you get it right with
TSCs via a pure postprocessing method? A very large body of modern CPUs
will halt the TSC when they go into idle. (about 70% of the installed
base or so)
Note, we absolutely cannot do accurate timings in a pure
TSC-post-processing environment: unless you want to trace _every_
to-idle and from-idle event, which can easily be tens of thousands of
extra events per seconds.
What we could do perhaps is a hybrid method:
- save a GTOD+TSC pair at important events, such as to-idle and
from-idle, and in the periodic sched_tick(). [ perhaps also save it
when we change cpufreq. ]
- save the (last_GTOD, _relative_-TSC) pair in the trace entry
with that we have a chance to do good post-processed correlation - at
the cost of having 12-16 bytes of timestamp, per trace entry.
Or we could upscale the GTOD to 'TSC time', at go-idle and from-idle.
Which is rather complicated with cpufreq - which frequency do we want to
upscale to if we have a box with three available frequencies? We could
ignore cpufreq altogether - but then there goes dependable tracing on
another range of boxes.
i very much used early code tracing with ftrace in the past. In fact
once i debugged and early boot hang that happened so early before
_PRINTK_ was not functional yet (!).
So, to solve this bug, i hacked ftrace to use early_printk(), to print
out the last 10,000 functions executed before the hang - and that's how
i found the reason for the hang - i captured a huge trace via a serial
console. It was dead slow to capture, but it worked and sched_clock()
worked just fine in that kind of usecase as well.
[ Note that we added tracing/fastboot recently (for v2.6.28), to enable
the tracing of early boot code timings. Havent had a problem with it
yet on x86. ]
i havent used a TSC-less CPU in 10 years, i'm not sure i get this point
of yours. (and IIRC the division by zero was exactly on such CPUs where
we divided by cpu_khz - that's why it could even regress.)
note that sched_clock() will use the TSC whenever it is there physically
- even if GTOD does not use it anymore.
i wrote my first -pg/mcount based tracer about 11 years ago, to learn
more about the kernel. I traced everything with it. I then used it to
find performance bottlenecks in the kernel, and i used it to learn
kernel internals - when i saw a function in the trace that i did not
recognize, i read the source code.
Scheduler tracing came much later into the picture - the -pg tracer was
written well _before_ it was used for latency tracing purposes. But it
is indeed a pretty popular use of it. (but by no means the only one)
Ingo
--