login
Header Space

 
 

Re: System call instrumentation

Previous thread: build issue #487 for v2.6.26-rc1-6-gafa26be :undefined reference to `es7000_plat' by Toralf on Sunday, May 4, 2008 - 9:35 am. (3 messages)

Next thread: none
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Sunday, May 4, 2008 - 9:48 am

Hi Ingo,

I looked at the system call instrumentation present in LTTng lately. I
tried different solutions, e.g. hooking a kernel-wide syscall trace in
do_syscall_trace, but it appears that I ended up re-doing another
syscall table, which consists of specialized functions which extracts
the string and data structure parameters from user-space. Since code
duplication is not exactly wanted, I think that the original approach
taken in my patchset, which is to instrument the kernel code at the
sys_* level (e.g. sys_open), which is the earliest level where the
parameter information is made available to the kernel, is still the best
way to go.

I would still identify the execution mode changes in the same way I do
currently, which is by instrumenting do_syscall_trace, just to know as
soon as possible when the mode has changed from user-space to
kernel-space so we can do time accounting more accurately. I already
have the patchset which adds the KERNEL_TRACE thread flag to every
architectures. It's tested in assembly in the same way SYSCALL_TRACE is
tested, but is activated globally by iterating on all the threads.

So, the currently proposed scheme for a system call would be (for the
open() example)

shown as : 
kernel stack
  trace: event name (parameters)


do_syscall_trace()
  trace: kernel_arch_syscall_entry (syscall id, instruction pointer)

do_sys_open()
  trace: fs_open (fd, filename)

do_syscall_trace()
  kernel_arch_syscall_exit (return value)

If we take this open() example, filename is ready only in do_sys_open,
which is called by sys_open and sys_openat. So the logical
instrumentation site for this would really be do_sys_open(). The
information about which system call has been done is made available in
the kernel_arch_syscall_entry event. It is not present anymore at the
do_sys_open level because this execution path can be called from more
than one syscall.

What do you think of this approach ?


Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD...
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Monday, May 5, 2008 - 2:55 am

hm, i'm not sure about this. I've implemented system call tracing in -rt 
[embedded in the latency tracer] and it only needed changes in entry.S, 
not in every system call site. Now, granted, that tracer was simpler 
than what LTTng tries to do, but do we _really_ need more complexity? A 
trace point that simply expresses:

   sys_call_event(int sysno, long param1, long param2, long param3,
                             long param4, long param5, long param6);

would be more than enough in 99% of the cases. If you want to extract 
the strings from the system call table, to make the filtering of these 
syscall events easier, do it during build time (by for example modifying 
the __SYSCALL() macros in unistd.h), instead of a parallel syscall 
table.

OTOH, as long as it's just one simple nonintrusive line per syscall, 
it's not a big deal - as long as it only traces the parameters as they 
come from the syscall ABI - we wont change them anyway. I.e. hide the 
ugly string portion, just turn them into a simple:

  trace_sys_getpid();
  trace_sys_time(tloc);
  trace_sys_gettimeofday(tz, tv);

(although even such a solution would still need a general policy level 
buy-in i guess - we dont want half of the syscalls traced, the other 
half objected to by maintainers. So it's either all or nothing.)

and the question also arises: why not do this on a debuginfo level? This 
information can be extracted from the debuginfo. We could change 
'asmlinkage' to 'syscall_linkage' to make it clear which functions are 
syscalls.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Monday, May 5, 2008 - 6:59 am

That would work for all system calls that doesn't have parameters like

Hrm, not sure about that :

grep asmlinkage include/linux/syscalls.h|wc -l
321

grep -e asmlinkage.*__user include/linux/syscalls.h | wc -l
162


The goal of my second syscall table was to call callbacks to extract the
__user parameters, nothing else, really. I currently extract the system
call function names by iterating on the real system call table and by
using kallsyms (it could be turned into a built-time extraction since


Given we already have the syscall table, I am not sure we would need any
supplementary debuginfo support. Plus, the syscall table gives us the
mapping between the syscall number and the symbol of the system call
function, which is exactly what we need. The only thing I dislike in my
current approach is that I depend on kallsyms; it would be good to have
the system call symbols without depending on it.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Monday, May 5, 2008 - 7:10 am

what's the problem with them? Do you want to follow those parameters?

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Monday, May 5, 2008 - 7:30 am

Ideally, I'd like to have this kind of high-level information :

event name : kernel syscall
syscall name : open
arg1 (%s) : "somefile"    &lt;-----
arg2 (%d) : flags
arg3 (%d) : mode

However, "somefile" has to be read from userspace. With the protection
involved, it would cause a performance impact to read it a second time
rather than tracing the string once it's been copied to kernel-space.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Monday, May 5, 2008 - 8:28 am

performance is a secondary issue here, and copies are fast anyway _if_ 
someone wants to trace a syscall. (because the first copy brings the 
cacheline into the cache, subsequent copies are almost for free compared 
to the first copy)

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Monday, May 19, 2008 - 11:44 pm

Hrm, a quick benchmark on my pentium 4 comparing a normal open() system
call executed in a loop to a modified open() syscall which executes the
lines added in the following patch adds 450 cycles to each open() system
call. I added a putname/getname on purpose to see the cost of a second
userspace copy and it's not exactly free.

The normal getname correctly nested, re-using the string previously
copied, should not suffer from that kind of performance hit. Also, given
that the string would be copied only once from userspace, it would
eliminate race scenarios where multithreaded applications could change
the string underneath, so the kernel would trace a different string than
the one being really used for the system call.

However, strings are not the only userspace arguments passed to system
calls. For all these other arguments, performance could be an issue as
well as racy user-level data modification which would let the kernel
trace a different paramenter than the one being used in the system call.

For those two reasons, I think extracting these parameters could be
faster/cleaner/safer if done in the system call function, where the
parameters are already copied in kernel space.

Mathieu


Index: linux-2.6-lttng/fs/open.c
===================================================================
--- linux-2.6-lttng.orig/fs/open.c	2008-05-19 22:51:16.000000000 -0400
+++ linux-2.6-lttng/fs/open.c	2008-05-19 23:11:07.000000000 -0400
@@ -1043,6 +1043,8 @@ long do_sys_open(int dfd, const char __u
 	int fd = PTR_ERR(tmp);
 
 	if (!IS_ERR(tmp)) {
+		putname(tmp);
+		tmp = getname(filename);
 		fd = get_unused_fd_flags(flags);
 		if (fd &gt;= 0) {
 			struct file *f = do_filp_open(dfd, tmp, flags, mode);


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Ingo Molnar <mingo@...>, <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Tuesday, May 20, 2008 - 10:18 am

On Mon, 19 May 2008 23:44:53 -0400

the audit subsystem already does all of this... why not use that??

copying twice does mean that if the user wants, he can cheat you. He
can, in another thread, change the string under you. So say you're
doing this for anti-virus purposes, he can make you scan one file and
open another.


The audit subsystem was carefully designed to avoid this trap... how
about using that?
--
To: Arjan van de Ven <arjan@...>
Cc: Ingo Molnar <mingo@...>, <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Thursday, May 22, 2008 - 8:47 am

Hrm, given tracing will have to grab __user * parameters passed to
various system calls, not limited to strings, the getname/putname
infrastructure would need to be expanded a lot. I doubt it's worth
adding such complexity (copy to temporary memory buffers and reference
counting) in those system calls to support kernel-wide tracing.

On the other hand, adding a marker in the traced function, at a code
location where the data copied into the kernel is accessible, won't add
such complexity and will help to keep good locality of reference (the
stack is meant to be a good cache-hot memory region). Because a dormant
marker does not have a significant performance hit (actually, my
benchmarks shows a small acceleration of the overall system, probably
due to cache line code layout modifications), I think it's legitimate to
add this kind of instrumentation in the existing kernel system call
functions.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: <linux-kernel@...>, <systemtap@...>, Frank Ch. Eigler <fche@...>
Date: Tuesday, May 6, 2008 - 4:52 pm

Hi Ingo and Mathieu,


I think that the code duplication is also an issue.
If we introduce functions which copy userspace strings same as
original syscall functions, we have to maintain similar codes.
So, I think Mathieu's approach (separating syscall parameters from
syscall entering event) is enough reasonable.

BTW, it seems that using KERNEL_TRACE per thread flag might be a bit
tricky to trace all threads. Out of curiosity:-), why did not you use a
global percpu flag to do it?

Thanks,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


--
Previous thread: build issue #487 for v2.6.26-rc1-6-gafa26be :undefined reference to `es7000_plat' by Toralf on Sunday, May 4, 2008 - 9:35 am. (3 messages)

Next thread: none
speck-geostationary