On Tue, 16 Nov 2010 22:04:40 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:
Interesting, I just tried it using the -tip kernel.
Looks like a good start, but there are some features I'm missing, see below.
How did you generate that? In other words is there a git command I can
use to apply the mm_pagefault/vfs_getname/trace patches on top of
2.6.36? Doing a git merge tip/tmp.perf/trace merged 2.6.37 for me.
For now I'm running a -tip kernel to test.
Nice! At a quick glance it didn't seem to impact performance much, I'll
have to do some measurements.
There is no strace-like "not finished":
clamd/11086 ( 0.000 ms): pread(0x9, 0x7f142f7f1000, 0x2000, 0xa8000, 0x2000, 0xa8) => 0x1
clamd/11087 ( 0.014 ms): futex(uaddr: 0x7f143fad3260, op: 0x80, val: 0x2, utime: 0x0, uaddr2: 0x7f143fad3260, val3: 0x2b4f) => 0x0
clamd/11087 ( 0.001 ms): futex(uaddr: 0x7f143fad3260, op: 0x81, val: 0x1, utime: 0xfe0, uaddr2: 0x7f143fad3260, val3: 0x2b4f) => 0x0
clamd/11086 ( 0.011 ms): ... [continued]: pread() => 0x2000
I assume that the first pread with 0ms time got interrupted, and it resumed/continued later. Is that the case?
Is the 0x1 return value bogus then? If so it would be less confusing if you output what strace usually does (".. not finished" IIRC).
Is it possible to use 'trace record' and 'perf record' at the same time?
I.e to do both strace-like (trace) and oprofile-like (perf record)
recording?
I see --call-graph in the trace record -h output, but it doesn't seem to work on x86_64 for me.
Do I need to pass some additional parameters to trace report, do I need additional CONFIG_ turned on?
I have:
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
# CONFIG_STACK_TRACER is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
Does it use the same code for callgraphs as ftrace, so I could choose userspace stacktrace?
1. I miss sorting.
Sure I can do something like this: trace report -d <N> |
sort -g -k2 -t\(, but I have to choose <N> first, and it has to output
all the data before sorting can begin (including the text part).
It would be better if I 'trace' itself could sort based on the syscall
latency time, and show me the slowest syscalls first.
2. Another feature I miss is ability to sum up syscall times per syscall, and output a summary.
3. The -s output is not very intuitive, the sched tracer output in /sys/debug/tracing was more intuitive IIRC,
showing that you swithced from process A to process B. I would like to see that in the -s output
(and if switching to a process not traced, it should say so).
4. Also it would be useful to know if during a long syscall (futex, pread, etc.) the CPU time was wasted, or
another thread of same process got scheduled in. Calculating the overhead of doing that would be interesting too.
I think this could be done with a tool that postprocesses the results, but maybe its easier to do in the tool itself.
For example:
thread1: futex(...) [... not finished]
thread2: (delta1 ms) ... scheduled in place of thread1 (switch thread1 -> thread2)
thread2: ... do stuff ...
thread1: (delta2 ms) ... scheduled in place of thread2 (switch thread2 -> thread1)
thread1: (delta3 ms) [... continued] futex() => ...
I would be interested to see delta1, and delta2, perhaps totalled up per syscall, per thread or per process.
5. Also it would be good if it could tell what a futex is used for:
1122.567 clamd/11082 (139.120 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x883, utime: 0x0, uaddr2: 0x0, val3: 0x441) => 0x0
1712.922 clamd/11082 (309.925 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x891, utime: 0x0, uaddr2: 0x0, val3: 0x448) => 0x0
2014.289 clamd/11082 (244.312 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x895, utime: 0x0, uaddr2: 0x0, val3: 0x44a) => 0x0
3639.956 clamd/11082 (104.370 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x919, utime: 0x0, uaddr2: 0x0, val3: 0x48c) => 0x0
4371.086 clamd/11082 (158.304 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x951, utime: 0x0, uaddr2: 0x0, val3: 0x4a8) => 0x0
5204.773 clamd/11082 (133.566 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x999, utime: 0x0, uaddr2: 0x0, val3: 0x4cc) => 0x0
Using /proc/`pidof clamd`/maps I found out that 0x1b59d44 is on the heap. Such large delays are probably
due to a pthread_cond_wait/cond_timed_wait, and after spending a bit of time in gdb I found out exactly which condition var it is:
(gdb) p &thr_pool->queueable_bulk_cond.__data.__futex
$10 = (unsigned int *) 0x1b59d44
But I don't want to repeat that for each futex call with a different address.
It would be good if 'trace' could figure out what kind of futex wait this is (i.e. mutex lock or condition variable wait), I hope
the kernel has a wait to tell them apart.
Large delays for waiting on condition variables is expected, and I want to exclude them from my trace.
However I do want to see if there are large delays on mutex locks, that is definetely something I am interested in.
Now of course 'strace' can't make the difference between these two, but 'trace' being a kernel feature might have more information.
Of course it'd be even better if it could show the name of the condition variable, but if the condition variable is on the heap that is hard to do.
I'll have to see if there is a way to script gdb to lookup the variable associated with an address (by looking through the local and global vars on the current thread).
Did you consider providing a tool to post-process the output and
perform that decoding?
Error: switch `p' requires a value
-P works though.
Best regards,
--Edwin
--