This patchset implements Uprobes, which enables you to dynamically break into
any routine in a user space application and collect information
non-disruptively.

This patchset is a rework based on suggestions from discussions on lkml in
January this year (http://lkml.org/lkml/2010/1/11/92 and
http://lkml.org/lkml/2010/1/27/19). This implementation of uprobes doesn't
depend on utrace.

When a uprobe is registered, Uprobes makes a copy of the probed instruction
and replaces the first byte(s) of the probed instruction with a breakpoint
instruction. (Uprobes uses a background page replacement mechanism and
ensures that the breakpoint affects only that process.)

When a CPU hits the breakpoint instruction, Uprobes gets notified of the trap
and finds the associated uprobe. It then executes the associated handler.
Uprobes single-steps its copy of the probed instruction and resumes execution
of the probed process at the instruction following the probepoint.
Instruction copies to be single-stepped are stored in a per-process
"execution out of line (XOL) area". Currently the XOL area is allocated as a
one-page vma.

Advantages of uprobes over conventional debugging include:
1. Non-disruptive.
2. Much better handling of multithreaded programs because of XOL.
3. No context switch between tracer and tracee.
4. Allows multiple processes to trace the same tracee.

Here is the list of TODO items:
- Provide a perf interface to uprobes. (coming in next version)
- Allowing probes across fork/exec.
- Allowing probes on per-executable/per dso.
- Allow multiple probes to share a probepoint.
- Support for other architectures.
- Return probes.
- Uprobes booster.

This patchset is based on 2.6.34-rc2.

Please do provide your valuable comments.

Thanks in advance.
Srikar

Srikar Dronamraju (10):
 1. X86 instruction analysis: Move Macro W to insn.h
 2. mm: Move replace_page() to mm/memory.c
 3. mm: Enhance replace_page() to support pagecache
 4. user_bkpt: User Space Breakpoint Assistance Layer
 5. ...
Move Macro W to asm/insn.h
Macro W is used to determine whether instructions are valid for
user space/kernel space. This macro is used by kprobes and
user_bkpt (i.e., the user space breakpoint assistance layer), so move it
to a common header file, asm/insn.h.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
arch/x86/include/asm/insn.h | 7 +++++++
arch/x86/kernel/kprobes.c | 7 -------
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/insn.h b/arch/x86/include/asm/insn.h
index 96c2e0a..8586820 100644
--- a/arch/x86/include/asm/insn.h
+++ b/arch/x86/include/asm/insn.h
@@ -23,6 +23,13 @@
/* insn_attr_t is defined in inat.h */
#include <asm/inat.h>
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
+
struct insn_field {
union {
insn_value_t value;
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index b43bbae..4379b40 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -66,12 +66,6 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
#define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))
-#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
- (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
- (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
- (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
- (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
- << (row % 32))
/*
* Undefined/reserved opcodes, conditional jump, Opcode Extension
* Groups, and some special opcodes can not boost.
@@ ...

Hmm, I don't think such a short macro name is good to expose commonly...
And also, since we already have the inat (instruction attribute) table, we'd
better add an inat bit to indicate which instructions can be
probed/boosted.

Thank you,
--
Masami Hiramatsu
e-mail: mhiramat@redhat.com
--
Guess we would need three bits:
- Instruction can be probed in kernel.
- Instruction can be probed in user space.
- Instruction can be boosted.

Or do you have other ideas?

--
Thanks and Regards
Srikar
--
Other two bits are ok for me :) -- Masami Hiramatsu e-mail: mhiramat@redhat.com --
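
For readers following the W() discussion, here is a minimal user-space sketch of how a W()-style table packs one bit per opcode and how a lookup reads it back. The W() macro is copied from the patch above; everything else (the `demo_table` contents, the `demo_opcode_flagged()` helper, and which opcodes are flagged) is made up for illustration and does not reflect the real kprobes or user_bkpt tables.

```c
#include <assert.h>
#include <stdint.h>

/* W() as in the patch: packs 16 one-bit flags for one row of opcodes. */
#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
	 << (row % 32))

#define DEMO_NR_OPCODES 64

/* Hypothetical table: two 16-opcode rows share each 32-bit word. */
const uint32_t demo_table[DEMO_NR_OPCODES / 32] = {
	W(0x00, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 0x05 */
	W(0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),  /* 0x1f */
	W(0x20, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 0x20 */
	W(0x30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
};

/* Returns 1 if the bit for 'opcode' is set in demo_table, else 0. */
int demo_opcode_flagged(unsigned int opcode)
{
	assert(opcode < DEMO_NR_OPCODES);
	return (demo_table[opcode / 32] >> (opcode % 32)) & 1;
}
```

The `row % 32` shift is what places the second row of each pair in the upper 16 bits of the word, so the lookup only needs the opcode value itself as a bit index.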
Move replace_page() to mm/memory.c
Move replace_page from mm/ksm.c to mm/memory.c.
user_bkpt will use the background page replacement approach to insert/delete
breakpoints. Background page replacement will be based on
replace_page(), so replace_page() loses its static attribute.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
include/linux/mm.h | 2 ++
mm/ksm.c | 59 ----------------------------------------------------
mm/memory.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 59 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..0f43355 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -854,6 +854,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
+int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte);
extern unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/ksm.c b/mm/ksm.c
index a93f1b7..fd123de 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -766,65 +766,6 @@ out:
return err;
}
-/**
- * replace_page - replace page in vma by new ksm page
- * @vma: vma that holds the pte pointing to page
- * @page: the page we are replacing by kpage
- * @kpage: the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
- struct page *kpage, pte_t orig_pte)
-{
- struct mm_struct *mm = vma->vm_mm;
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep;
- spinlock_t *ptl;
- unsigned long addr;
- int err = -EFAULT;
-
- addr ...

Enhance replace_page() to support pagecache

Currently replace_page() works only for anonymous pages.
This patch enhances replace_page() to work for pagecache pages.

This enhancement is useful for user_bkpt's replace_page-based background
page replacement for insertion and removal of breakpoints.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
mm/memory.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 8b3ca1b..cd5541c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2616,7 +2616,10 @@ int replace_page(struct vm_area_struct *vma, struct page *page,
}
get_page(kpage);
- page_add_anon_rmap(kpage, vma, addr);
+ if (PageAnon(kpage))
+ page_add_anon_rmap(kpage, vma, addr);
+ else
+ page_add_file_rmap(kpage);
flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush(vma, addr, ptep);
--
User Space Breakpoint Assistance Layer (USER_BKPT)

Currently there is no mechanism in the kernel to insert/remove breakpoints.

This patch implements a user space breakpoint assistance layer, which
provides kernel subsystems with an architecture-independent interface to
establish breakpoints in user applications. This patch provides the core
implementation of user_bkpt and also wrappers for architecture-dependent
methods.

USER_BKPT currently supports both single stepping inline and execution out
of line strategies. Two different probepoints in the same process can have
two different strategies. It handles pre-processing and post-processing of
singlestep after a breakpoint hit.

Single stepping inline strategy is the traditional method where original
instructions replace the breakpointed instructions on a breakpoint hit. This
method works well with single-threaded applications. However, it's racy with
multithreaded applications.

Execution out of line strategy single-steps on a copy of the instruction.
This method works well for both single-threaded and multithreaded
applications.

There could be other strategies like emulating an instruction. However, they
are currently not implemented.

Insertion and removal of breakpoints is by "background page replacement",
i.e., make a copy of the page, modify its contents, set the pagetable and
flush the TLBs. This uses the enhanced replace_page() to COW the page. The
modified page is only reflected for the interested process. Others sharing
the page will still see the old copy.

You need to follow this up with the USER_BKPT patch for your architecture.

Uprobes uses this facility to insert/remove breakpoints.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
arch/Kconfig | 14 +
include/linux/user_bkpt.h | 296 +++++++++++++++++++++++
kernel/Makefile | 1
kernel/user_bkpt.c | 572 ...
copy_from_user() takes and returns an unsigned long arg but this function
is converting these to and from ints. That's OK if we're 100% sure that
we'll never get or return an arg >2G. Otherwise things could get ghastly.
Please have a think. (Dittoes for some other functions

This looks like it has the wrong interface. It should take a `void __user
*vaddr'. If any casting is to be done, it should be done at the highest
level so that sparse can check that the thing is used correctly

It might be smarter to allocate this page outside the mmap_sem region.

kmap_atomic() is preferred - it's faster. kmap() is still deadlockable I
guess if a process ever kmaps two pages at the same time. This code

It used to be the case that the above linebreak is "wrong". (Nobody ever
tests their kerneldoc output!) I have a vague feeling that this

If this BUG_ON triggers, we won't know which of these pointers was NULL,

ditto.

Really, there's never much point in

	BUG_ON(!some_pointer);

Just go ahead and dereference the pointer. If it's NULL then we'll get
an oops which gives all the information which the BUG_ON would have
--
Yes, that's OK now. Not a problem. -- ~Randy --
nbytes would not be greater than the maximum size of an instruction for
that architecture. Hence I don't see it going above 2G. However, I will
take a relook.

I will rework the rest of the comments as suggested by you.
It would be part of the next version.

--
Thanks and Regards
Srikar
--
x86 support for user breakpoint Infrastructure

This patch provides the x86-specific userspace breakpoint assistance
implementation details. It uses the "x86: instruction decoder API" patch to
validate and analyze the instructions. This analysis is used at the time
of post-processing of a breakpoint hit to do the necessary fix-ups.
Almost all instructions are handled for the traditional strategy and the
execution out of line strategy. Instructions handled include the RIP-relative
instructions.

This patch requires the "x86: instruction decoder API" patch.
http://lkml.org/lkml/2009/6/1/459

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
arch/x86/Kconfig | 1
arch/x86/include/asm/user_bkpt.h | 43 +++
arch/x86/kernel/Makefile | 2
arch/x86/kernel/user_bkpt.c | 574 ++++++++++++++++++++++++++++++++++++++
4 files changed, 620 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/user_bkpt.h
create mode 100644 arch/x86/kernel/user_bkpt.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0eacb1f..851cedc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,6 +53,7 @@ config X86
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
select HAVE_HW_BREAKPOINT
+ select HAVE_USER_BKPT
select PERF_EVENTS
select ANON_INODES
select HAVE_ARCH_KMEMCHECK
diff --git a/arch/x86/include/asm/user_bkpt.h b/arch/x86/include/asm/user_bkpt.h
new file mode 100644
index 0000000..df8a4a0
--- /dev/null
+++ b/arch/x86/include/asm/user_bkpt.h
@@ -0,0 +1,43 @@
+#ifndef _ASM_USER_BKPT_H
+#define _ASM_USER_BKPT_H
+/*
+ * User-space BreakPoint support (user_bkpt) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is ...
Slot allocation for Execution out of line strategy (XOL)

This patch provides a slot allocation mechanism for the execution out of
line strategy for use with the user space breakpoint infrastructure.
The traditional method of replacing the original instructions on a
breakpoint hit is racy when used on multithreaded applications.

Alternatives for the traditional method include:
- Emulating the breakpointed instruction.
- Execution out of line.

Emulating the instruction:
This approach would use an in-kernel instruction emulator to emulate the
breakpointed instruction. This approach could be looked into at a later
point in time.

Execution out of line:
In the execution out of line strategy, a new vma is injected into the
target process, and a copy of the instructions which are breakpointed is
stored in one of the slots. On a breakpoint hit, the copy of the
instruction is single-stepped, leaving the breakpoint instruction as is.
This method is architecture independent.

This method is useful while handling multithreaded processes.

This patch allocates one page per process for slots to be used to copy the
breakpointed instructions.

Current slot allocation mechanism:
1. Allocate one dedicated slot per user breakpoint. Each slot is big enough
   to accommodate the biggest instruction for that architecture (16 bytes
   for x86).
2. We currently allocate only one page for slots. Hence the number of slots
   is limited to active breakpoint hits on that process.
3. Bitmap to track used slots.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
arch/Kconfig | 4 +
include/linux/user_bkpt_xol.h | 61 +++++++++
kernel/Makefile | 1
kernel/user_bkpt_xol.c | 290 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 356 insertions(+), 0 deletions(-)
create mode 100644 include/linux/user_bkpt_xol.h
create mode 100644 kernel/user_bkpt_xol.c
diff --git a/arch/Kconfig b/arch/Kconfig
index ...
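
As a rough illustration of the bitmap-based slot tracking described in this patch, here is a minimal user-space sketch. It is not the patch's code: the names `struct xol_area`, `xol_take_slot()` and `xol_free_slot()` are hypothetical, the 4 KiB page size is an assumption, and the 16-byte slot size is taken from the patch description for x86.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define XOL_AREA_SIZE   4096UL   /* one page; 4 KiB is an assumption */
#define XOL_SLOT_BYTES  16UL     /* slot size quoted for x86 in the patch text */
#define XOL_NR_SLOTS    (XOL_AREA_SIZE / XOL_SLOT_BYTES)
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS    ((XOL_NR_SLOTS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical per-process XOL bookkeeping. */
struct xol_area {
	unsigned long base;                  /* start address of the slot page */
	unsigned long bitmap[BITMAP_WORDS];  /* one bit per slot, 1 = in use */
};

/* Claim the first free slot; returns its address, or 0 if the page is full. */
unsigned long xol_take_slot(struct xol_area *area)
{
	for (size_t i = 0; i < XOL_NR_SLOTS; i++) {
		unsigned long mask = 1UL << (i % BITS_PER_WORD);
		unsigned long *word = &area->bitmap[i / BITS_PER_WORD];

		if (!(*word & mask)) {
			*word |= mask;
			return area->base + i * XOL_SLOT_BYTES;
		}
	}
	return 0;
}

/* Release a slot previously returned by xol_take_slot(). */
void xol_free_slot(struct xol_area *area, unsigned long slot_addr)
{
	size_t i = (slot_addr - area->base) / XOL_SLOT_BYTES;

	assert(slot_addr >= area->base && i < XOL_NR_SLOTS);
	area->bitmap[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}
```

Under these assumptions a single page holds 4096 / 16 = 256 slots, which is the hard limit on concurrently allocated slots that later replies in this thread object to.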
Uprobes Implementation

The uprobes infrastructure enables a user to dynamically establish
probepoints in user applications and collect information by executing a
handler function when a probepoint is hit.

The user specifies the virtual address and the pid of the process of
interest along with the action to be performed (handler). The handle

Uprobes is implemented on the user-space breakpoint assistance layer and
uses the execution out of line strategy. Uprobes follows lazy slot
allocation, i.e., on the first probe hit for that process, a new vma (to
hold the probed instructions for execution out of line) is allocated. Once
allocated, this vma remains for the life of the process, and is reused as
needed for subsequent probes. A slot in the vma is allocated for a
probepoint when it is first hit.

A slot is marked for reuse when the probe gets unregistered and no threads
are using that slot.

In a multithreaded process, a probepoint once registered is active for all
threads of a process. If a thread-specific action for a probepoint is
required, then the handler should be implemented to do the same.

If a breakpoint already exists at a particular address (irrespective of who
inserted the breakpoint, including uprobes), uprobes will refuse to register
any more probes at that address.

You need to follow this up with the uprobes patch for your architecture.

For more information, please refer to Documentation/uprobes.txt

TODO:
1. Perf/trace events interface for uprobes.
2. Allow multiple probes at a probepoint.
3. Booster probes.
4. Allow probes to be inherited across fork.
5. Probing function returns.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
---
arch/Kconfig | 13 +
include/linux/sched.h | 3
include/linux/tracehook.h | 18 +
include/linux/uprobes.h | 178 ++++++++++
kernel/Makefile | 1
kernel/fork.c | 3
kernel/uprobes.c | ...
I would still prefer to see something like:
vma:offset, instead of tid:vaddr
You want to probe a symbol in a DSO, filtering per-task comes after that
if desired.
Also, like we discussed in person, I think we can do away with the
handler_in_interrupt thing by letting the handler have an error return
value and doing something like:
do_int3:

  uprobe = find_probe_point(addr);

  pagefault_disable();
  err = uprobe->handler(uprobe, regs);
  pagefault_enable();

  if (err == -EFAULT) {
    /* set TIF flag and call the handler again from
       task context */
  }
This should allow the handler to optimistically access memory from the
trap handler, but in case it does need to fault pages in we'll call it
Everybody else simply places callbacks in kernel/fork.c and
kernel/exit.c, but as it is I don't think you want per-task state like
this.
One thing I would like to see is a slot per task, that has a number of
advantages over the current patch-set in that it doesn't have one page
limit in number of probe sites, nor do you need to insert vmas into each
and every address space that happens to have your DSO mapped.
Also, I would simply kill the user_bkpt stuff and merge it into uprobes,
we don't have a kernel_bkpt thing either, only kprobes.
--
To clarify: like I discussed with Jim in person. --
If I would want to trace malloc in a process

$ objdump -T /lib64/libc.so.6 | grep malloc
000000357c274b80 g DF .text 0000000000000224 GLIBC_2.2.5 __libc_malloc
000000357c271000 w DF .text 0000000000000273 GLIBC_2.2.5 malloc_stats
000000357c275570 w DF .text 00000000000001fb GLIBC_2.2.5 malloc_get_state
000000357c5514f8 w DO .data 0000000000000008 GLIBC_2.2.5 __malloc_hook
000000357c274b80 g DF .text 0000000000000224 GLIBC_2.2.5 malloc
000000357c26f570 w DF .text 0000000000000033 GLIBC_2.2.5 malloc_usable_size
000000357c271420 w DF .text 000000000000024e GLIBC_2.2.5 malloc_trim
000000357c5529a0 w DO .bss 0000000000000008 GLIBC_2.2.5 __malloc_initialize_hook
000000357c271670 w DF .text 00000000000003c2 GLIBC_2.2.5 malloc_set_state
$
$ cat /proc/9069/maps
...............
357c200000-357c34d000 r-xp 00000000 08:03 6115979 /lib64/libc-2.5.so
357c34d000-357c54d000 ---p 0014d000 08:03 6115979 /lib64/libc-2.5.so
357c54d000-357c551000 r--p 0014d000 08:03 6115979 /lib64/libc-2.5.so
357c551000-357c552000 rw-p 00151000 08:03 6115979 /lib64/libc-2.5.so
...............
$

do you mean the user should be specifying 357c200000:74b80 to denote
000000357c274b80? or /lib64/libc.so.6:74b80

where are the per task slots stored?

We had uprobes as one single layer. However it was suggested that
breaking it up into two layers was useful because it would help code
reuse. Esp it was felt that a generic user_bkpt layer would be far more
useful than being used for just uprobes.
Here are links where these discussions happened.
http://sourceware.org/ml/systemtap/2007-q1/msg00570.html
http://sourceware.org/ml/systemtap/2007-q1/msg00571.html

--
Thanks and Regards
Srikar
--
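
The arithmetic behind the 357c200000:74b80 question can be written out as a small sketch. It assumes the usual relationship between a line of /proc/<pid>/maps and the backing file (file offset = vaddr - mapping start + mapping offset); `struct map_entry` and `vaddr_to_file_offset()` are hypothetical names used only for this illustration, not part of any proposed interface.

```c
#include <assert.h>
#include <stdint.h>

/* One line of /proc/<pid>/maps, reduced to the fields needed here. */
struct map_entry {
	uint64_t start;   /* first address of the mapping */
	uint64_t end;     /* one past the last address of the mapping */
	uint64_t offset;  /* file offset that 'start' corresponds to, in bytes */
};

/* Translate a virtual address inside 'm' into an offset within the file. */
uint64_t vaddr_to_file_offset(const struct map_entry *m, uint64_t vaddr)
{
	assert(vaddr >= m->start && vaddr < m->end);
	return vaddr - m->start + m->offset;
}
```

With the r-xp mapping quoted above (start 357c200000, offset 0), malloc at 357c274b80 comes out at file offset 74b80, which is the dso:offset form being asked about.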
Well userspace would simply specify something like: /lib/libc.so:malloc,
we'd probably communicate that to the kernel using a filedesc and
offset.

And yes, all processes that share that DSO, consumers can install

Don't do that ;-)

What reason would you have to sleep from an int3 anyway? You want to log
bits and get on with life, right? The only interesting case is faulting
when some memory references you want are not currently available, and

The per task slot (note the singular, each task needs only ever have a
single slot since a task can only ever hit one trap at a time) would

I'm so not going to read ancient emails on a funky list.

What re-use? uprobe should be the only interface to this, there's no
second interface to kprobes either is there?
--
Hmm, for low-level interface, it will be good. If we provide
a user interface(trace_uprobe.c), we'd better add pid filter
Out of curiosity, what does the task-context mean? ('current' is probed
task in int3, isn't it?). I think, uprobe handler can cause page fault
Hmm, I just worried about whether TLS/task stack can be executable
It will be good when we start working on 'ptrace2' :)
Anyway, the patch order looks a bit odd, because user_bkpt uses XOL
but XOL patch is introduced after user_bkpt patch...
Thank you,
--
Masami Hiramatsu
e-mail: mhiramat@redhat.com
--
Task context means the regular kernel task stack where we can schedule;
int3 has its own exception stack and we cannot schedule from that.

And yes, the fault thing is the one case where sleeping makes sense and
is dealt with in my proposal, you don't need two handlers for that, just
call it from trap context with pagefault_disable() and when it fails
with -EFAULT set a TIF flag to deal with it later when we're back in
task context.

There is a very good probability that the memory you want to reference
is mapped (because typically the program itself will want to access it
as well) so doing the optimistic access with pagefault_disable() will
work most of the time and you only end up taking the slow path when it

But why would ptrace2 use a different interface?

Also, why introduce some abstraction layer now without having a user for
it, you could always restructure things and/or add interfaces later when
you have a clear idea what it is you need.
--
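
The try-optimistically, retry-in-task-context flow proposed here can be modelled in plain user-space C. The sketch below is only a simulation of the control flow: `struct probe`, `struct task`, `on_trap()` and `on_return_to_user()` are hypothetical names, the `deferred` field stands in for a TIF flag, and `page_resident` stands in for whether the handler's memory access would fault.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct probe;
typedef int (*probe_handler_t)(struct probe *p, bool atomic);

struct probe {
	probe_handler_t handler;
	bool page_resident;  /* stand-in for "the needed memory is mapped in" */
	int hits;            /* number of completed handler runs */
};

struct task {
	bool deferred;          /* stand-in for a TIF flag */
	struct probe *pending;  /* probe whose handler must be rerun */
};

/* Example handler: refuses to fault when called from atomic context. */
int sample_handler(struct probe *p, bool atomic)
{
	if (!p->page_resident) {
		if (atomic)
			return -EFAULT;
		p->page_resident = true;  /* task context: may fault the page in */
	}
	p->hits++;
	return 0;
}

/* Trap path: run the handler with faults disallowed; defer on -EFAULT. */
void on_trap(struct task *t, struct probe *p)
{
	int err = p->handler(p, true);

	if (err == -EFAULT) {
		t->deferred = true;
		t->pending = p;
	}
}

/* Return-to-user path: rerun a deferred handler where sleeping is fine. */
void on_return_to_user(struct task *t)
{
	if (!t->deferred)
		return;
	t->deferred = false;
	t->pending->handler(t->pending, false);
	t->pending = NULL;
}
```

In this model a single handler serves both paths; only the first hit on a non-resident page takes the slow path, and later hits complete directly from the trap.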
Ah, I see. so it will be done later. Actually, since int3 handler will

hm, similar technique can be applied to kprobe-tracer too (for getting

Because 'ptrace' doesn't have any breakpoint insertion helper.
Programs which use ptrace must set up a single-stepping buffer and modify
the target code by themselves. This causes problems when multiple
debuggers/tracers attach to the same process and try to modify the same
address. The first program can see the original instruction, but the next
one will see int3!

I think we'd better provide some abstraction interface for breakpoint
setting in next generation ptrace (of course, we also need to provide a
memory peek interface which returns original instructions).

But anyway, I agree with you, we don't need it *now*, but someday.

Thank you,
--
Masami Hiramatsu
e-mail: mhiramat@redhat.com
--
user_bkpt provides the xol strategy. The user_bkpt_xol patch only provides
slot allocation for the execution out of line strategy. It doesn't
implement the execution out of line strategy itself.

The current implementation assumes that we pass the user_bkpt structure as
an argument while allocating/freeing a slot.

user_bkpt knows how to handle execution out of line. Its working is
independent of how and where the slot is allocated. The field xol_vaddr
points to a location which holds the copy of the instruction to be
single-stepped/executed.

Hence the user_bkpt patch was followed by the user_bkpt_xol patch.

--
Thanks and Regards
--
Well, rewind back to 2006 to the first edition of uprobes; it had the 'global' tracing feature like what you indicate here, although Andrew wouldn't want to be reminded of *how* that was done (hooking readpages()) :-) At the time, global tracing was vehemently vetoed in favour of a per-process approach. With the TIF method, you get to the probed process' task context in do_notify_resume(), and have sufficient flexibility for non-perf users, like gdb, 'cos what uprobes provides now, is close to what Tom Tromey asked for gdb's usage. Ananth --
Both in-tree consumers of uprobes (ftrace and perf) are capable of task filters. But the thing is, dso:sym is very much not a task property, adding task filters afterwards sure makes sense in some cases but it should not be the primary mode. If people really want to optimize this we can easily add a few bits of task state which could tell the trap handler to not even bother looking up things but restart as fast as possible. --
If you wish this new uprobes to be useful to tools such as gdb, remember the value of preserving the property that processes not being debugged are not to be interfered with. You don't want a DoS due to some guy setting ten thousand breakpoints on glibc. Such considerations should overrule perf/ftrace's simplifying assumptions that after-the-fact event filtering is surely always sufficient. - FChE --
Are you suggesting we have global tracing as the default and then have
task filters? I've already alluded to this being vetoed earlier, by people
including Andrew Morton, Hugh Dickins, Arjan, Christoph Hellwig, Nick
Piggin, etc. It's a route we'd prefer not to go down again...

Aside, what are the mechanisms to do this? The current breakpoint
insertion and removal, even for shared libraries, is process local since
only the page tables of the process being traced are modified.

In order to have global visibility of dso probes, one obvious method is
to put in the probes before the text hits pagecache. This approach works
for 'yet-to-start' processes that would map the dso too. The series at
http://lkml.org/lkml/2006/5/9/25 did that and was suitably junked, for
very valid reasons. Even Hugh Dickins thumbed down the pagecache approach
(http://lkml.org/lkml/2006/5/9/209).

Given the current design has enough flexibility to accommodate non-perf
users like gdb, a simple pid-based approach for the lowest layer makes
the most sense. I'd rather prefer a higher level entity (say, perf) do
the difficult job of filtering down individual requests only for
processes of interest, then the lower layer can iteratively do the probe
insertions for all processes of interest.

I am not sure if there is a better method to do probes with 'global'
visibility. Did you have an easier approach in mind?

Ananth
--
I think perf would be using uprobes in one of the four ways.
- Trace a particular process.
- Trace a particular session.
- Trace all instances of an executable.
- Trace all programs in the system.

If we use the global approach, filtering would still be part of the
handler. So even if we want to probe just one process, we would still take
a hit for all processes that map the DSO and hit that vaddr. Other
processes could be hitting the probepoint more often while the probed
process could rarely be hitting the probepoint. This could place
significant overhead on the system.

Also with KSM, the page we are probing could be part of the stable tree
and mapped by different virtual machines. Can this lead to interrupting
work on an unrelated virtual machine? If yes, is it okay to interrupt an
unrelated VM? If not, what measures need to be taken?

Currently perf can be used by privileged users. However when perf gets
to trace user space programs, would it still be limited to privileged
users? Do we have plans to allow users to trace their owned

Though one of the USPs of uprobes is non-disruptive tracing, applications
like debuggers which do disruptive tracing can benefit from uprobes.
Debuggers could use uprobes as a feature to implement inserting/removing
breakpoints and get the out of line single-stepping.

In an earlier discussion http://lkml.org/lkml/2010/1/26/344 Tom Tromey
did say that if a facility was given, it could be used in gdb.

What I expect is the tracee to inform the tracer that it has hit the
breakpoint and "wait" for the tracer to give indication to continue.

Benefits could be
- Debuggers can benefit from execution out of line and can debug
  multithreaded processes much better.
- Two debuggers/tracers could trace the same process. One of the tracers
  could be strace, while the other one could be gdb.
- perf and debugger could be interested in the same vaddr for that
  process and still continue to work.

Let's say debugger and perf are interested in a ...
I'm not sure, currently all the tracing bits require root. One of the
complications is that dynamic trace events (kprobes and uprobes) share a
global namespace, so making that accessible to users might be
interesting.

So one thing we can do to avoid some of the trap overhead is to
de-couple the trace event creation from trace event enable (pretty much
already so for existing implementations), so while you define a dynamic
trace event as dso:sym, you provide ways to enable it globally and per
task. We'd basically need a global and per-task refcount on enable and
make sure the breakpoint is installed properly for (global || task).

That way a perf per-cpu event will do the global enable, and a perf

A double scribble will be an issue for the current generation of
debuggers anyway, right?

But yes, I suppose if you want to use uprobes for debuggers then yes it
makes sense to allow to put the task to sleep. One way would be to
provide means for the handler to detect the context and simply always

Before NX there simply was no option, anyway, I guess the writable
requirement comes from being stack, and I'm not sure how TLS is done,
but I guess that has similar constraints on being writable, right?

I've heard from people that some other OS does indeed have the
trampoline in TLS.
--
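
The "(global || task)" rule suggested above can be sketched as two kinds of reference counts on one event. This is only a user-space model under stated assumptions: `struct dyn_event`, the `event_get_*`/`event_put_*` helpers, `breakpoint_wanted()` and the fixed `MAX_TASKS` array are all hypothetical and exist purely to make the rule concrete.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 8   /* arbitrary bound, only for this sketch */

/* Hypothetical dynamic trace event with separate enable counts. */
struct dyn_event {
	unsigned int global_refs;            /* enables that apply to all tasks */
	unsigned int task_refs[MAX_TASKS];   /* enables scoped to a single task */
};

void event_get_global(struct dyn_event *ev)
{
	ev->global_refs++;
}

void event_put_global(struct dyn_event *ev)
{
	assert(ev->global_refs > 0);
	ev->global_refs--;
}

void event_get_task(struct dyn_event *ev, unsigned int task)
{
	assert(task < MAX_TASKS);
	ev->task_refs[task]++;
}

void event_put_task(struct dyn_event *ev, unsigned int task)
{
	assert(task < MAX_TASKS && ev->task_refs[task] > 0);
	ev->task_refs[task]--;
}

/* The breakpoint should be present for 'task' iff (global || task). */
bool breakpoint_wanted(const struct dyn_event *ev, unsigned int task)
{
	assert(task < MAX_TASKS);
	return ev->global_refs > 0 || ev->task_refs[task] > 0;
}
```

Defining the event (dso:sym) would then be separate from enabling it, and a task with no enable of either kind never needs the breakpoint installed in its address space.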
Ulrich,

Can you please comment on whether a slot in TLS can be used for storing
and executing an instruction? Are there any additional issues that we need
to take care of? Are there architectures that don't support TLS?

--
Thanks and Regards
Srikar
--
Yes, when we allow two or more probes to co-exist at a probepoint, we

double scribble as in two apps writing to the same address?
Uprobes handles this by failing to insert probes at a location where
there is a breakpoint already inserted. So if both apps were to use the
uprobes interface, then they could co-operate and co-exist. (This would
need the feature in uprobes to have multiple probes per probepoint which is

Yes, that's certainly possible. However let's consider the case when we
allow multiple probes per probepoint and one handler faults (handler
detects it could be sleeping) while the other handler may or may not
fault (handler could be doing a copy_from_user). When the thread switches
to task context and runs the first handler, it has no state information
about the second handler having run in the interrupt context. So here we
may be unable to decide if we should run the second handler or not.

--
Thanks and Regards
Srikar
--
X86 support for Uprobes

This patch provides x86 specific details for uprobes.
This includes interrupt notifier for uprobes, enabling/disabling
singlestep.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/uprobes.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 89 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/uprobes.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 851cedc..a860a9b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select HAVE_KERNEL_LZO
select HAVE_HW_BREAKPOINT
select HAVE_USER_BKPT
+ select HAVE_UPROBES
select PERF_EVENTS
select ANON_INODES
select HAVE_ARCH_KMEMCHECK
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 98c74b4..bfa48f0 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -118,6 +118,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o
obj-$(CONFIG_USER_BKPT) += user_bkpt.o
+obj-$(CONFIG_UPROBES) += uprobes.o
###
# 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..1acce22
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,87 @@
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You ...
Uprobes documentation.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Documentation/uprobes.txt | 244 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 244 insertions(+), 0 deletions(-)
create mode 100644 Documentation/uprobes.txt
diff --git a/Documentation/uprobes.txt b/Documentation/uprobes.txt
new file mode 100644
index 0000000..08bbf24
--- /dev/null
+++ b/Documentation/uprobes.txt
@@ -0,0 +1,244 @@
+Title : User-Space Probes (Uprobes)
+Authors : Jim Keniston <jkenisto@us.ibm.com>
+ : Srikar Dronamraju <srikar@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Concepts: Uprobes
+2. Architectures Supported
+3. Configuring Uprobes
+4. API Reference
+5. Uprobes Features and Limitations
+6. Probe Overhead
+7. TODO
+8. Uprobes Team
+9. Uprobes Example
+
+1. Concepts: Uprobes
+
+Uprobes enables you to dynamically break into any routine in a
+user application and collect debugging and performance information
+non-disruptively. You can trap at any code address, specifying a
+kernel handler routine to be invoked when the breakpoint is hit.
+
+A uprobe can be inserted on any instruction in the application's
+virtual address space. The registration function register_uprobe()
+specifies which process is to be probed, where the probe is to be
+inserted, and what handler is to be called when the probe is hit.
+
+Uprobes-based instrumentation can be packaged as a kernel
+module. In the simplest case, the module's init function installs
+("registers") one or more probes, and the exit function unregisters
+them.
+
+1.1 How Does a Uprobe Work?
+
+When a uprobe is registered, Uprobes makes a copy of the probed
+instruction, stops the probed application, replaces the first byte(s)
+of the probed instruction with a breakpoint instruction (e.g., int3
+on i386 and x86_64), and allows the probed application to continue.
+(When inserting the breakpoint, Uprobes uses background page
+replacement ...
no space after "archs/", just: -- ~Randy --
Thanks for the review. Your comment, however, made me realize that I had used user-bkpt here rather than user_bkpt. user_bkpt is a layer that provides breakpoint insertion and removal. I wanted to mention that uprobes depends on the user_bkpt layer.

I think "This user_bkpt based version" is probably better than "This user-breakpoint based version".

--
Thanks and Regards
Srikar
--
Uprobes Samples
This provides an example uprobes module in the samples directory.
To run this module, run (as root):
insmod uprobe_example.ko vaddr=<vaddr> pid=<pid>
where <vaddr> is the address at which to place the probe and
<pid> is the pid of the process to be probed.
Example:
# cd samples/uprobes
[Get the virtual address at which to place the probe.]
# vaddr=0x$(objdump -T /bin/bash |awk '/echo_builtin/ {print $1}')
[Run a bash shell in the background; have it echo 4 lines.]
# (sleep 10; echo 1; echo 2; echo 3; echo 4) &
[Probe calls to echo_builtin() in the background bash process.]
# insmod uprobe_example.ko vaddr=$vaddr pid=$!
# sleep 10
# rmmod uprobe_example
# dmesg | tail -n 3
Registering uprobe on pid 10875, vaddr 0x45aa30
Unregistering uprobe on pid 10875, vaddr 0x45aa30
Probepoint was hit 4 times
#
[ Output shows that echo_builtin function was hit 4 times. ]
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
samples/Kconfig | 7 +++
samples/uprobes/Makefile | 17 ++++++++
samples/uprobes/uprobe_example.c | 83 ++++++++++++++++++++++++++++++++++++++
3 files changed, 107 insertions(+), 0 deletions(-)
create mode 100644 samples/uprobes/Makefile
create mode 100644 samples/uprobes/uprobe_example.c
diff --git a/samples/Kconfig b/samples/Kconfig
index 8924f72..50b8b1c 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -44,4 +44,11 @@ config SAMPLE_HW_BREAKPOINT
help
This builds kernel hardware breakpoint example modules.
+config SAMPLE_UPROBES
+ tristate "Build uprobes example -- loadable module only"
+ depends on UPROBES && m
+ help
+ This builds uprobes example module.
+
+
endif # SAMPLES
diff --git a/samples/uprobes/Makefile b/samples/uprobes/Makefile
new file mode 100644
index 0000000..f535f6f
--- /dev/null
+++ b/samples/uprobes/Makefile
@@ -0,0 +1,17 @@
+# builds the uprobes example kernel modules;
+# then to use one (as root):
+# insmod ...

What's missing here is a description of why all this is useful. Presumably much of the functionality which this feature offers can be done wholly in userspace. So I think it would be useful if you were to carefully explain the thinking here - what the value is, how people will use it, why it needs to be done in-kernel, etc.

Right now if I was asked "why did you merge that", I'd say "gee, I dunno". I say that a lot.

Knowing all of this would perhaps help me to understand your thinking regarding ftrace integration.

The code itself is positioned as non-x86-specific, but the implementation is x86-only. It would be nice to get some confirmation that other architectures can successfully use the core code. But that will be hard to arrange, so probably crossing our fingers is the best approach here.

The code scares me a bit from the "how can malicious people exploit it" point of view. Breaking into other users' programs/memory, causing the kernel to scribble on itself, causing unbounded memory consumption, etc. No specific issues that I can point at, just vague fear.

Do we know that existing userspace will never ever already be using int3?

What happens if I run this code in 2016 on a CPU which has new opcodes which this code didn't know about?

When uprobes was being pushed five-odd years ago, it did all sorts of hair-raising things to avoid COWing shared pages. Lots of reasons were given why it *had* to avoid COW. But now it COWs. What were those reasons why COW was unacceptable, and what changed?
--
Main motivations for uprobes:

- Non-disruptive tracing: Current ptrace based mechanisms generally involve signals and stopped threads. They also involve context switching between the tracer and tracee. The delays and the involvement of signals can mean that problems seen on production systems do not show up while tracing. Uprobes tracing wouldn't involve signals or context switches between tracer and tracee.

- Multithreaded support: Current ptrace based mechanisms for tracing apps use single stepping inline, i.e. they copy back the original instruction on hitting a breakpoint. With such mechanisms, tracers have to stop all the threads on a breakpoint hit, or they will not be able to handle all hits to the location of interest. Uprobes uses execution out of line, where the instruction to be traced is analysed at the time of breakpoint insertion and a copy of the instruction is stored at a different location. On a breakpoint hit, uprobes jumps to that copied location, singlesteps the copied instruction and does the necessary fixups after singlestepping.

- Tracing multiple applications: A uprobe based tracer would be able to trace multiple (similar or different) applications. This could be very useful in understanding how different applications are interacting with each other.

- Multiple tracers for an application: Multiple uprobes based tracers could work in unison to trace an application. There could be one tracer that is interested in generic events for a particular set of processes, while another tracer is interested only in one specific event of a particular process that's part of that set.

- Correlating events from kernel and userspace: Uprobes could be used with other tools like kprobes and tracepoints, or as part of higher level tools like perf, to give a consolidated set of events from kernel and userspace. In future we could look at a single backtrace showing application, library and kernel calls. We are looking at providing a perf interface for ...
