Hi Ingo, I looked at the system call instrumentation present in LTTng lately. I tried different solutions, e.g. hooking a kernel-wide syscall trace in do_syscall_trace, but it appears that I ended up re-doing another syscall table, which consists of specialized functions which extracts the string and data structure parameters from user-space. Since code duplication is not exactly wanted, I think that the original approach taken in my patchset, which is to instrument the kernel code at the sys_* level (e.g. sys_open), which is the earliest level where the parameter information is made available to the kernel, is still the best way to go. I would still identify the execution mode changes in the same way I do currently, which is by instrumenting do_syscall_trace, just to know as soon as possible when the mode has changed from user-space to kernel-space so we can do time accounting more accurately. I already have the patchset which adds the KERNEL_TRACE thread flag to every architectures. It's tested in assembly in the same way SYSCALL_TRACE is tested, but is activated globally by iterating on all the threads. So, the currently proposed scheme for a system call would be (for the open() example) shown as : kernel stack trace: event name (parameters) do_syscall_trace() trace: kernel_arch_syscall_entry (syscall id, instruction pointer) do_sys_open() trace: fs_open (fd, filename) do_syscall_trace() kernel_arch_syscall_exit (return value) If we take this open() example, filename is ready only in do_sys_open, which is called by sys_open and sys_openat. So the logical instrumentation site for this would really be do_sys_open(). The information about which system call has been done is made available in the kernel_arch_syscall_entry event. It is not present anymore at the do_sys_open level because this execution path can be called from more than one syscall. What do you think of this approach ? Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD...
hm, i'm not sure about this. I've implemented system call tracing in -rt
[embedded in the latency tracer] and it only needed changes in entry.S,
not in every system call site. Now, granted, that tracer was simpler
than what LTTng tries to do, but do we _really_ need more complexity? A
trace point that simply expresses:
sys_call_event(int sysno, long param1, long param2, long param3,
long param4, long param5, long param6);
would be more than enough in 99% of the cases. If you want to extract
the strings from the system call table, to make the filtering of these
syscall events easier, do it during build time (by for example modifying
the __SYSCALL() macros in unistd.h), instead of a parallel syscall
table.
OTOH, as long as it's just one simple nonintrusive line per syscall,
it's not a big deal - as long as it only traces the parameters as they
come from the syscall ABI - we wont change them anyway. I.e. hide the
ugly string portion, just turn them into a simple:
trace_sys_getpid();
trace_sys_time(tloc);
trace_sys_gettimeofday(tz, tv);
(although even such a solution would still need a general policy level
buy-in i guess - we dont want half of the syscalls traced, the other
half objected to by maintainers. So it's either all or nothing.)
and the question also arises: why not do this on a debuginfo level? This
information can be extracted from the debuginfo. We could change
'asmlinkage' to 'syscall_linkage' to make it clear which functions are
syscalls.
Ingo
--That would work for all system calls that doesn't have parameters like Hrm, not sure about that : grep asmlinkage include/linux/syscalls.h|wc -l 321 grep -e asmlinkage.*__user include/linux/syscalls.h | wc -l 162 The goal of my second syscall table was to call callbacks to extract the __user parameters, nothing else, really. I currently extract the system call function names by iterating on the real system call table and by using kallsyms (it could be turned into a built-time extraction since Given we already have the syscall table, I am not sure we would need any supplementary debuginfo support. Plus, the syscall table gives us the mapping between the syscall number and the symbol of the system call function, which is exactly what we need. The only thing I dislike in my current approach is that I depend on kallsyms; it would be good to have the system call symbols without depending on it. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
what's the problem with them? Do you want to follow those parameters? Ingo --
Ideally, I'd like to have this kind of high-level information : event name : kernel syscall syscall name : open arg1 (%s) : "somefile" <----- arg2 (%d) : flags arg3 (%d) : mode However, "somefile" has to be read from userspace. With the protection involved, it would cause a performance impact to read it a second time rather than tracing the string once it's been copied to kernel-space. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
performance is a secondary issue here, and copies are fast anyway _if_ someone wants to trace a syscall. (because the first copy brings the cacheline into the cache, subsequent copies are almost for free compared to the first copy) Ingo --
Hrm, a quick benchmark on my pentium 4 comparing a normal open() system
call executed in a loop to a modified open() syscall which executes the
lines added in the following patch adds 450 cycles to each open() system
call. I added a putname/getname on purpose to see the cost of a second
userspace copy and it's not exactly free.
The normal getname correctly nested, re-using the string previously
copied, should not suffer from that kind of performance hit. Also, given
that the string would be copied only once from userspace, it would
eliminate race scenarios where multithreaded applications could change
the string underneath, so the kernel would trace a different string than
the one being really used for the system call.
However, strings are not the only userspace arguments passed to system
calls. For all these other arguments, performance could be an issue as
well as racy user-level data modification which would let the kernel
trace a different paramenter than the one being used in the system call.
For those two reasons, I think extracting these parameters could be
faster/cleaner/safer if done in the system call function, where the
parameters are already copied in kernel space.
Mathieu
Index: linux-2.6-lttng/fs/open.c
===================================================================
--- linux-2.6-lttng.orig/fs/open.c 2008-05-19 22:51:16.000000000 -0400
+++ linux-2.6-lttng/fs/open.c 2008-05-19 23:11:07.000000000 -0400
@@ -1043,6 +1043,8 @@ long do_sys_open(int dfd, const char __u
int fd = PTR_ERR(tmp);
if (!IS_ERR(tmp)) {
+ putname(tmp);
+ tmp = getname(filename);
fd = get_unused_fd_flags(flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, flags, mode);
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--On Mon, 19 May 2008 23:44:53 -0400 the audit subsystem already does all of this... why not use that?? copying twice does mean that if the user wants, he can cheat you. He can, in another thread, change the string under you. So say you're doing this for anti-virus purposes, he can make you scan one file and open another. The audit subsystem was carefully designed to avoid this trap... how about using that? --
Hrm, given tracing will have to grab __user * parameters passed to various system calls, not limited to strings, the getname/putname infrastructure would need to be expanded a lot. I doubt it's worth adding such complexity (copy to temporary memory buffers and reference counting) in those system calls to support kernel-wide tracing. On the other hand, adding a marker in the traced function, at a code location where the data copied into the kernel is accessible, won't add such complexity and will help to keep good locality of reference (the stack is meant to be a good cache-hot memory region). Because a dormant marker does not have a significant performance hit (actually, my benchmarks shows a small acceleration of the overall system, probably due to cache line code layout modifications), I think it's legitimate to add this kind of instrumentation in the existing kernel system call functions. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Hi Ingo and Mathieu, I think that the code duplication is also an issue. If we introduce functions which copy userspace strings same as original syscall functions, we have to maintain similar codes. So, I think Mathieu's approach (separating syscall parameters from syscall entering event) is enough reasonable. BTW, it seems that using KERNEL_TRACE per thread flag might be a bit tricky to trace all threads. Out of curiosity:-), why did not you use a global percpu flag to do it? Thanks, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com --
| Alex Samad | page swap allocation error/failure in 2.6.25 |
| Bart Van Assche | Re: Integration of SCST in the mainstream Linux kernel |
| Andrea Arcangeli | [PATCH 06 of 11] rwsem contended |
| Chuck Ebbert | Why do so many machines need "noapic"? |
git: | |
| Andy Parkins | svn:externals using git submodules |
| Elijah Newren | Trying to use git-filter-branch to compress history by removing large, obsolete bi... |
| Marcus Griep | [PATCH] git-svn: Make it scream by minimizing temp files |
| Tommi Virtanen | [PATCH] "git shell" won't work, need "git-shell" |
| Marcos Laufer | dmesg IBM x3650 OpenBSD 4.3 |
| Theo de Raadt | Re: Chatting with developers? Is it soo 1996? |
| Ted Unangst | Re: MAXDSIZ 1GB memory limit for process |
| Richard Stallman | Real men don't attack straw men |
| Denys Fedoryshchenko | Re: thousands of classes, e1000 TX unit hang |
| Suresh Siddha | Re: Kernel oops with 2.6.26, padlock and ipsec: probably problem with fpu state ch... |
| Simon Horman | Re: [PATCH] sendfile() and UDP socket |
| Jeff Garzik | Re: [PATCH] sky2: jumbo frame regression fix |
