Re: [PATCH v1 7/10] Uprobes Implementation

From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:24 am

This patchset implements Uprobes which enables you to dynamically break
into any routine in a user space application and collect information
non-disruptively.

This patchset is a rework based on suggestions from discussions on lkml
in January this year (http://lkml.org/lkml/2010/1/11/92 and
http://lkml.org/lkml/2010/1/27/19).  This implementation of uprobes
doesn't depend on utrace.

When a uprobe is registered, Uprobes makes a copy of the probed
instruction and replaces the first byte(s) of the probed instruction with
a breakpoint instruction. (Uprobes uses a background page replacement
mechanism and ensures that the breakpoint affects only that process.)

When a CPU hits the breakpoint instruction, Uprobes gets notified of
trap and finds the associated uprobe. It then executes the associated
handler. Uprobes single-steps its copy of the probed instruction and
resumes execution of the probed process at the instruction following the
probepoint. Instruction copies to be single-stepped are stored in a
per-process "execution out of line (XOL) area". Currently the XOL area is
allocated as a one-page vma.
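
The register/unregister step described above can be sketched in user
space. This is only an illustration, assuming a byte buffer stands in for
the probed text; the names toy_probe, toy_probe_arm and toy_probe_disarm
are hypothetical and are not taken from the patchset. Refusing to arm
over an existing breakpoint byte is a simplification chosen for this
sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BKPT_INSN 0xCC  /* x86 int3 opcode */
#define TOY_MAX_INSN  16    /* large enough for any x86 instruction */

/* Hypothetical probe record; not the structure used by the patchset. */
struct toy_probe {
	size_t offset;               /* where the probed instruction starts */
	uint8_t saved[TOY_MAX_INSN]; /* copy of the original instruction */
	size_t len;                  /* number of bytes copied */
	int armed;
};

/* Copy the original instruction, then overwrite its first byte with int3. */
int toy_probe_arm(struct toy_probe *p, uint8_t *text, size_t text_len,
		  size_t offset, size_t insn_len)
{
	if (insn_len == 0 || insn_len > TOY_MAX_INSN)
		return -1;
	if (offset > text_len || insn_len > text_len - offset)
		return -1;
	if (text[offset] == TOY_BKPT_INSN)
		return -1; /* something already put a breakpoint here */

	memcpy(p->saved, text + offset, insn_len);
	p->offset = offset;
	p->len = insn_len;
	text[offset] = TOY_BKPT_INSN;
	p->armed = 1;
	return 0;
}

/* Put the saved first byte back, removing the breakpoint. */
int toy_probe_disarm(struct toy_probe *p, uint8_t *text)
{
	if (!p->armed)
		return -1;
	text[p->offset] = p->saved[0];
	p->armed = 0;
	return 0;
}
```

The saved copy is what would be single-stepped out of line; the sketch
stops short of that, since executing the copy needs the real kernel
machinery.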

Advantages of uprobes over conventional debugging include:
1. Non-disruptive.
2. Much better handling of multithreaded programs because of XOL.
3. No context switch between tracer, tracee.
4. Allows multiple processes to trace same tracee.

Here is the list of TODO Items.

- Provide a perf interface to uprobes. (coming in next version)
- Allowing probes across fork/exec.
- Allowing probes on per-executable/per dso.
- Allow multiple probes to share a probepoint.
- Support for other architectures.
- Return probes.
- Uprobes booster.

This patchset is based on 2.6.34-rc2.

Please do provide your valuable comments.

Thanks in advance.
Srikar


Srikar Dronamraju (10):
 1.  X86 instruction analysis: Move Macro W to insn.h
 2.  mm: Move replace_page() to mm/memory.c
 3.  mm: Enhance replace_page() to support pagecache
 4.  user_bkpt: User Space Breakpoint Assistance Layer
 5.  ...
From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:25 am

Move Macro W to asm/insn.h

Macro W is used to determine whether instructions are valid for
user space or kernel space.  It is used by both kprobes and
user_bkpt (i.e. the user space breakpoint assistance layer), so move it
to a common header file, asm/insn.h.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 arch/x86/include/asm/insn.h |    7 +++++++
 arch/x86/kernel/kprobes.c   |    7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)


diff --git a/arch/x86/include/asm/insn.h b/arch/x86/include/asm/insn.h
index 96c2e0a..8586820 100644
--- a/arch/x86/include/asm/insn.h
+++ b/arch/x86/include/asm/insn.h
@@ -23,6 +23,13 @@
 /* insn_attr_t is defined in inat.h */
 #include <asm/inat.h>
 
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
+	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
+	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
+	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
+	 << (row % 32))
+
 struct insn_field {
 	union {
 		insn_value_t value;
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index b43bbae..4379b40 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -66,12 +66,6 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 
 #define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))
 
-#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
-	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
-	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
-	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
-	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
-	 << (row % 32))
 	/*
 	 * Undefined/reserved opcodes, conditional jump, Opcode Extension
 	 * Groups, and some special opcodes can not boost.
@@ ...
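
For readers unfamiliar with W(): each invocation packs sixteen 0/1 flags
for one opcode row into a 16-bit field, shifted by (row % 32), so two
consecutive rows OR together into one 32-bit word. The sketch below
reuses the macro verbatim from the patch; the table contents and the
names toy_table and toy_opcode_allowed are made up for illustration and
do not reflect any real opcode table.

```c
#include <stdint.h>

/* Same bit-packing macro as in the patch: 16 flags for one opcode row. */
#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
	 << (row % 32))

/* Hypothetical table covering opcodes 0x00..0x3f; the flags are made up. */
static const uint32_t toy_table[64 / 32] = {
	/*      0 1 2 3 4 5 6 7 8 9 a b c d e f */
	W(0x00, 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1) |
	W(0x10, 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0),
	W(0x20, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) |
	W(0x30, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1),
};

/* Look up the flag for one opcode: word index, then bit within the word. */
int toy_opcode_allowed(unsigned int opcode)
{
	if (opcode >= 64)
		return 0;
	return (toy_table[opcode / 32] >> (opcode % 32)) & 1;
}
```

With this table, toy_opcode_allowed(0x11) is 1 and toy_opcode_allowed(0x10)
is 0, matching the second column of the 0x10 row.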
From: Masami Hiramatsu
Date: Saturday, March 20, 2010 - 8:50 am

Hmm, I don't think this shortest macro name is good to expose
commonly... And also, since we already have inat (instruction
attribute) table, we'd better expand an inat bit to indicate
which instruction can be probed/boosted.

Thank you,

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

--

From: Srikar Dronamraju
Date: Sunday, March 21, 2010 - 11:24 pm

Guess we would need three bits, 
- Instruction can be probed in kernel.
- Instruction can be probed in user space.
- Instruction can be boosted.

Or do you have other ideas?

--
Thanks and Regards
Srikar
--

From: Masami Hiramatsu
Date: Monday, March 22, 2010 - 7:11 am

Other two bits are ok for me :)


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

--

From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:25 am

Move replace_page() to mm/memory.c

Move replace_page from mm/ksm.c to mm/memory.c.
User bkpt will use background page replacement approach to insert/delete
breakpoints. Background page replacement approach will be based on
replace_page.  Now replace_page() loses its static attribute.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

 include/linux/mm.h |    2 ++
 mm/ksm.c           |   59 ----------------------------------------------------
 mm/memory.c        |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 59 deletions(-)


diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..0f43355 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -854,6 +854,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
+int replace_page(struct vm_area_struct *vma, struct page *page,
+		struct page *kpage, pte_t orig_pte);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/ksm.c b/mm/ksm.c
index a93f1b7..fd123de 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -766,65 +766,6 @@ out:
 	return err;
 }
 
-/**
- * replace_page - replace page in vma by new ksm page
- * @vma:      vma that holds the pte pointing to page
- * @page:     the page we are replacing by kpage
- * @kpage:    the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
-			struct page *kpage, pte_t orig_pte)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *ptep;
-	spinlock_t *ptl;
-	unsigned long addr;
-	int err = -EFAULT;
-
-	addr ...
From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:25 am

Enhance replace_page() to support pagecache

Currently replace_page() works only for anonymous pages.
This patch enhances replace_page() to work for pagecache pages as well.

This enhancement is useful for user_bkpt's replace_page based
background page replacement for insertion and removal of breakpoints.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

 mm/memory.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)


diff --git a/mm/memory.c b/mm/memory.c
index 8b3ca1b..cd5541c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2616,7 +2616,10 @@ int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	get_page(kpage);
-	page_add_anon_rmap(kpage, vma, addr);
+	if (PageAnon(kpage))
+		page_add_anon_rmap(kpage, vma, addr);
+	else
+		page_add_file_rmap(kpage);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
--

From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:25 am

User Space Breakpoint Assistance Layer (USER_BKPT)

Currently there is no mechanism in the kernel to insert/remove breakpoints.

This patch implements the user space breakpoint assistance layer, which
provides kernel subsystems with an architecture independent interface to
establish breakpoints in user applications. This patch provides the core
implementation of user_bkpt and also wrappers for architecture dependent
methods.

USER_BKPT currently supports both single stepping inline and execution
out of line strategies. Two different probepoints in the same process
can have two different strategies. It handles pre-processing and
post-processing of singlestep after a breakpoint hit.

Single stepping inline strategy is the traditional method where original
instructions replace the breakpointed instructions on a breakpoint hit.
This method works well with single threaded applications. However, it is
racy with multithreaded applications.

Execution out of line strategy single steps on a copy of the
instruction. This method works well for both single-threaded and
multithreaded applications.

There could be other strategies like emulating an instruction. However
they are currently not implemented.

Insertion and removal of breakpoints is by "Background page
replacement", i.e. make a copy of the page, modify its contents, set
the pagetable and flush the TLBs. This approach uses the enhanced
replace_page to COW the page. The modified page is only reflected for the
interested process. Others sharing the page will still see the old copy.
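
The copy/modify/switch sequence can be modelled in user space as
follows. This is a toy model only: a pointer stands in for the page
table entry, there is no TLB, and toy_mm and toy_replace_page are
hypothetical names unrelated to the kernel's replace_page().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy "address space": a single page-table entry pointing at a page. */
struct toy_mm {
	uint8_t *page;
};

/*
 * Give @mm a private, modified copy of its page.
 * Returns the new page (caller frees it) or NULL on failure.
 */
uint8_t *toy_replace_page(struct toy_mm *mm, size_t off, uint8_t byte)
{
	uint8_t *copy;

	if (off >= TOY_PAGE_SIZE)
		return NULL;
	copy = malloc(TOY_PAGE_SIZE);
	if (!copy)
		return NULL;

	memcpy(copy, mm->page, TOY_PAGE_SIZE); /* 1. copy the page */
	copy[off] = byte;                      /* 2. modify the copy */
	mm->page = copy;                       /* 3. point this mm at the copy */
	return copy;
}
```

A second toy_mm that still points at the original page keeps seeing the
old byte, which is the property the commit message relies on.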

You need to follow this up with the USER_BKPT patch for your
architecture.

Uprobes uses this facility to insert/remove breakpoint.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

 arch/Kconfig              |   14 +
 include/linux/user_bkpt.h |  296 +++++++++++++++++++++++
 kernel/Makefile           |    1 
 kernel/user_bkpt.c        |  572 ...
From: Andrew Morton
Date: Monday, March 22, 2010 - 6:40 pm

copy_from_user() takes and returns an unsigned long arg but this
function is converting these to and from ints.  That's OK if we're 100%
sure that we'll never get or return an arg >2G.  Otherwise things could
get ghastly.  Please have a think.  (Dittoes for some other functions
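
One way to make the narrowing explicit is to range-check before
converting. The helper below is a hypothetical user-space sketch, not
code from the patch; it only shows the check that keeps an unsigned long
length from silently turning into a wrong int.

```c
#include <limits.h>

/*
 * Hypothetical helper: store @nbytes in *@out only when it fits in an int.
 * Returns 0 on success, -1 when the value would not survive the conversion.
 */
int toy_len_to_int(unsigned long nbytes, int *out)
{
	if (nbytes > (unsigned long)INT_MAX)
		return -1;
	*out = (int)nbytes;
	return 0;
}
```

For instruction-sized lengths the check always passes; it only matters
if a caller ever hands in a larger count.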

This looks like it has the wrong interface.  It should take a `void
__user *vaddr'.  If any casting is to be done, it should be done at the
highest level so that sparse can check that the thing is used correctly


It might be smarter to allocate this page outside the mmap_sem region. 

kmap_atomic() is preferred - it's faster.  kmap() is still deadlockable
I guess if a process ever kmaps two pages at the same time.  This code

It used to be the case that the above linebreak is "wrong".  (Nobody
ever tests their kerneldoc output!) I have a vague feeling that this

If this BUG_ON triggers, we won't know which of these pointers was NULL,

ditto.

Really, there's never much point in

	BUG_ON(!some_pointer);

Just go ahead and dereference the pointer.  If it's NULL then we'll get
an oops which gives all the information which the BUG_ON would have


--

From: Randy Dunlap
Date: Monday, March 22, 2010 - 9:48 pm

Yes, that's OK now.  Not a problem.

-- 
~Randy
--

From: Srikar Dronamraju
Date: Tuesday, March 23, 2010 - 4:26 am

nbytes would not be greater than the maximum size of an instruction for
that architecture. Hence I don't see it going above 2G. However I will
take a relook.


I will rework the rest of the comments as suggested by you.
It would be part of the next version.

--
Thanks and Regards
Srikar
--

From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:25 am

x86 support for user breakpoint Infrastructure

This patch provides x86 specific userspace breakpoint assistance
implementation details. It uses the "x86: instruction decoder API" patch
to validate and analyze the instructions. This analysis is used at
the time of post-processing of breakpoint hit to do the necessary
fix-ups.

Almost all instructions are handled for traditional strategy and
execution out of line strategy. Instructions handled include the RIP
relative instructions.

This patch requires "x86: instruction decoder API" patch.
http://lkml.org/lkml/2009/6/1/459

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 arch/x86/Kconfig                 |    1 
 arch/x86/include/asm/user_bkpt.h |   43 +++
 arch/x86/kernel/Makefile         |    2 
 arch/x86/kernel/user_bkpt.c      |  574 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 620 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/user_bkpt.h
 create mode 100644 arch/x86/kernel/user_bkpt.c


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0eacb1f..851cedc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,6 +53,7 @@ config X86
 	select HAVE_KERNEL_LZMA
 	select HAVE_KERNEL_LZO
 	select HAVE_HW_BREAKPOINT
+	select HAVE_USER_BKPT
 	select PERF_EVENTS
 	select ANON_INODES
 	select HAVE_ARCH_KMEMCHECK
diff --git a/arch/x86/include/asm/user_bkpt.h b/arch/x86/include/asm/user_bkpt.h
new file mode 100644
index 0000000..df8a4a0
--- /dev/null
+++ b/arch/x86/include/asm/user_bkpt.h
@@ -0,0 +1,43 @@
+#ifndef _ASM_USER_BKPT_H
+#define _ASM_USER_BKPT_H
+/*
+ * User-space BreakPoint support (user_bkpt) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is ...
From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:26 am

Slot allocation for Execution out of line strategy(XOL)

This patch provides slot allocation mechanism for execution out of
line strategy for use with user space breakpoint infrastructure.

The traditional method of replacing the original instructions on
breakpoint hit is racy when used on multithreaded applications.

Alternatives for the traditional method include:
	- Emulating the breakpointed instruction.
	- Execution out of line.

Emulating the instruction:
	This approach would use an in-kernel instruction emulator to
emulate the breakpointed instruction. This approach could be looked
into at a later point of time.

Execution out of line:
	In execution out of line strategy, a new vma is injected into
the target process, a copy of the instructions which are breakpointed is
stored in one of the slots. On breakpoint hit, the copy of the
instruction is single-stepped leaving the breakpoint instruction as is.
This method is architecture independent.

This method is useful while handling multithreaded processes.

This patch allocates one page per process for slots to be used to copy the
breakpointed instructions.

Current slot allocation mechanism:
1. Allocate one dedicated slot per user breakpoint. Each slot is big
enough to accommodate the biggest instruction for that architecture. (16
bytes for x86).
2. We currently allocate only one page for slots. Hence the number of
slots is limited to active breakpoint hits on that process.
3. Bitmap to track used slots.
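
The bitmap scheme in the list above could look roughly like this in user
space. The 16-byte slot size comes from the description; the 4096-byte
page size and all toy_ names are assumptions of this sketch, and locking
is left out entirely.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_PAGE_SIZE     4096
#define TOY_SLOT_BYTES    16
#define TOY_NSLOTS        (TOY_PAGE_SIZE / TOY_SLOT_BYTES) /* 256 slots */
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_NLONGS \
	((TOY_NSLOTS + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* Hypothetical one-page XOL area: base address plus a used-slot bitmap. */
struct toy_xol_area {
	unsigned long base;
	unsigned long bitmap[TOY_NLONGS];
};

/* Claim the first free slot; returns its address, or 0 if all are used. */
unsigned long toy_xol_alloc(struct toy_xol_area *area)
{
	size_t slot;

	for (slot = 0; slot < TOY_NSLOTS; slot++) {
		unsigned long *word = &area->bitmap[slot / TOY_BITS_PER_LONG];
		unsigned long mask = 1UL << (slot % TOY_BITS_PER_LONG);

		if (!(*word & mask)) {
			*word |= mask;
			return area->base + slot * TOY_SLOT_BYTES;
		}
	}
	return 0;
}

/* Release a slot by address; returns -1 for addresses that are not in use. */
int toy_xol_free(struct toy_xol_area *area, unsigned long addr)
{
	unsigned long *word, mask;
	size_t slot;

	if (addr < area->base || addr >= area->base + TOY_PAGE_SIZE)
		return -1;
	if ((addr - area->base) % TOY_SLOT_BYTES)
		return -1;
	slot = (addr - area->base) / TOY_SLOT_BYTES;
	word = &area->bitmap[slot / TOY_BITS_PER_LONG];
	mask = 1UL << (slot % TOY_BITS_PER_LONG);
	if (!(*word & mask))
		return -1;
	*word &= ~mask;
	return 0;
}
```

Under these assumptions one page yields 256 slots, so the 257th
allocation fails until a slot is freed.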

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 arch/Kconfig                  |    4 +
 include/linux/user_bkpt_xol.h |   61 +++++++++
 kernel/Makefile               |    1 
 kernel/user_bkpt_xol.c        |  290 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 356 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/user_bkpt_xol.h
 create mode 100644 kernel/user_bkpt_xol.c


diff --git a/arch/Kconfig b/arch/Kconfig
index ...
From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:26 am

Uprobes Implementation

The uprobes infrastructure enables a user to dynamically establish
probepoints in user applications and collect information by executing a
handler function when a probepoint is hit.

The user specifies the virtual address and the pid of the process of
interest along with the action to be performed (handler). The handle
Uprobes is implemented on the user-space breakpoint assistance layer
and uses the execution out of line strategy. Uprobes follows lazy slot
allocation. I.e, on the first probe hit for that process, a new vma (to
hold the probed instructions for execution out of line) is allocated.
Once allocated, this vma remains for the life of the process, and is
reused as needed for subsequent probes.  A slot in the vma is allocated
for a probepoint when it is first hit.

A slot is marked for reuse when the probe gets unregistered and no
threads are using that slot.

In a multithreaded process, a probepoint once registered is active for
all threads of a process. If a thread specific action for a probepoint
is required then the handler should be implemented to do the same.

If a breakpoint already exists at a particular address (irrespective of
who inserted the breakpoint including uprobes), uprobes will refuse to
register any more probes at that address.

You need to follow this up with the uprobes patch for your
architecture.

For more information: please refer to Documentation/uprobes.txt

TODO:
1. Perf/trace events interface for uprobes.
2. Allow multiple probes at a probepoint.
3. Booster probes.
4. Allow probes to be inherited across fork.
5. probing function returns.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
---

 arch/Kconfig              |   13 +
 include/linux/sched.h     |    3 
 include/linux/tracehook.h |   18 +
 include/linux/uprobes.h   |  178 ++++++++++
 kernel/Makefile           |    1 
 kernel/fork.c             |    3 
 kernel/uprobes.c          |  ...
From: Peter Zijlstra
Date: Tuesday, March 23, 2010 - 4:01 am

I would still prefer to see something like:

 vma:offset, instead of tid:vaddr

You want to probe a symbol in a DSO, filtering per-task comes after that
if desired.

Also, like we discussed in person, I think we can do away with the
handler_in_interrupt thing by letting the handler have an error return
value and doing something like:

do_int3:

  uprobe = find_probe_point(addr);

  pagefault_disable();
  err = uprobe->handler(uprobe, regs);
  pagefault_enable();

  if (err == -EFAULT) {
    /* set TIF flag and call the handler again from
       task context */
  }

This should allow the handler to optimistically access memory from the
trap handler, but in case it does need to fault pages in we'll call it


Everybody else simply places callbacks in kernel/fork.c and
kernel/exit.c, but as it is I don't think you want per-task state like
this.

One thing I would like to see is a slot per task, that has a number of
advantages over the current patch-set in that it doesn't have one page
limit in number of probe sites, nor do you need to insert vmas into each
and every address space that happens to have your DSO mapped.

Also, I would simply kill the user_bkpt stuff and merge it into uprobes,
we don't have a kernel_bkpt thing either, only kprobes.


--

From: Peter Zijlstra
Date: Tuesday, March 23, 2010 - 4:04 am

To clarify: like I discussed with Jim in person.
--

From: Srikar Dronamraju
Date: Tuesday, March 23, 2010 - 5:23 am

If I would want to trace malloc in a process 
$ objdump -T /lib64/libc.so.6 | grep malloc 
000000357c274b80 g    DF .text  0000000000000224  GLIBC_2.2.5 __libc_malloc
000000357c271000  w   DF .text  0000000000000273  GLIBC_2.2.5 malloc_stats
000000357c275570  w   DF .text  00000000000001fb  GLIBC_2.2.5 malloc_get_state
000000357c5514f8  w   DO .data  0000000000000008  GLIBC_2.2.5 __malloc_hook
000000357c274b80 g    DF .text  0000000000000224  GLIBC_2.2.5 malloc
000000357c26f570  w   DF .text  0000000000000033  GLIBC_2.2.5 malloc_usable_size
000000357c271420  w   DF .text  000000000000024e  GLIBC_2.2.5 malloc_trim
000000357c5529a0  w   DO .bss   0000000000000008  GLIBC_2.2.5 __malloc_initialize_hook
000000357c271670  w   DF .text  00000000000003c2  GLIBC_2.2.5 malloc_set_state
$
$ cat /proc/9069/maps
...............
357c200000-357c34d000 r-xp 00000000 08:03 6115979 /lib64/libc-2.5.so
357c34d000-357c54d000 ---p 0014d000 08:03 6115979 /lib64/libc-2.5.so
357c54d000-357c551000 r--p 0014d000 08:03 6115979 /lib64/libc-2.5.so
357c551000-357c552000 rw-p 00151000 08:03 6115979 /lib64/libc-2.5.so
...............
$

do you mean the user should be specifying 357c200000:74b80 to denote
000000357c274b80? or /lib64/libc.so.6:74b80
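
For reference, the arithmetic behind the two notations, using the
mapping lines quoted above: the file offset is the address minus the
mapping start plus the mapping's file offset. The struct and function
names here are hypothetical, for illustration only.

```c
#include <stdint.h>

/* One line of /proc/<pid>/maps, reduced to the fields needed here. */
struct toy_mapping {
	uint64_t start;       /* first address of the mapping */
	uint64_t end;         /* one past the last address */
	uint64_t file_offset; /* offset of the mapping within the file */
};

/*
 * Translate a virtual address into an offset within the mapped file.
 * Returns 0 on success, -1 if @vaddr is outside the mapping.
 */
int toy_vaddr_to_offset(const struct toy_mapping *m, uint64_t vaddr,
			uint64_t *off)
{
	if (vaddr < m->start || vaddr >= m->end)
		return -1;
	*off = vaddr - m->start + m->file_offset;
	return 0;
}
```

With the r-xp mapping 357c200000-357c34d000 (file offset 0), address
357c274b80 translates to offset 0x74b80.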


where are the per task slots stored?

We had uprobes as one single layer. However it was suggested that
breaking it up into two layers was useful because it would help code
reuse. Esp it was felt that a generic user_bkpt layer would be far more
useful than being used for just uprobes.
Here are links where these discussion happened.
http://sourceware.org/ml/systemtap/2007-q1/msg00570.html
http://sourceware.org/ml/systemtap/2007-q1/msg00571.html

--
Thanks and Regards
Srikar
 
--

From: Peter Zijlstra
Date: Tuesday, March 23, 2010 - 6:46 am

Well userspace would simply specify something like: /lib/libc.so:malloc,
we'd probably communicate that to the kernel using a filedesc and
offset.

And yes, all processes that share that DSO, consumers can install

Don't do that ;-)

What reason would you have to sleep from a int3 anyway? You want to log
bits and get on with life, right? The only interesting case is faulting
when some memory references you want are not currently available, and

The per task slot (note the singular, each task needs only ever have a
single slot since a task can only ever hit one trap at a time) would

I'm so not going to read ancient emails on a funky list. What re-use?
uprobe should be the only interface to this, there's no second interface
to kprobes either is there?
--

From: Masami Hiramatsu
Date: Tuesday, March 23, 2010 - 7:20 am

Hmm, for low-level interface, it will be good. If we provide
a user interface(trace_uprobe.c), we'd better add pid filter

Out of curiosity, what does the task-context mean? ('current' is probed
task in int3, isn't it?). I think, uprobe handler can cause page fault

Hmm, I just worried about whether TLS/task stack can be executable

It will be good when we start working on 'ptrace2' :)
Anyway, the patch order looks a bit odd, because user_bkpt uses XOL
but XOL patch is introduced after user_bkpt patch...

Thank you,

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

--

From: Peter Zijlstra
Date: Tuesday, March 23, 2010 - 8:15 am

Task context means the regular kernel task stack where we can schedule,
int3 has its own exception stack and we cannot schedule from that.

And yes, the fault thing is the one case where sleeping makes sense and
is dealt with in my proposal, you don't need two handlers for that, just
call it from trap context with pagefault_disable() and when it fails
with -EFAULT set a TIF flag to deal with it later when we're back in
task context.
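
The control flow of this proposal can be modelled in plain C. This is a
user-space sketch under loud assumptions: the faults_disabled field
stands in for pagefault_disable()/pagefault_enable(), retry_pending
stands in for the TIF flag, and every toy_ name is hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy per-task state; none of these fields exist in the patchset. */
struct toy_task {
	bool faults_disabled; /* stand-in for pagefault_disable() */
	bool retry_pending;   /* stand-in for the TIF flag */
	bool page_resident;   /* is the memory the handler wants mapped? */
	int handler_runs;     /* how many times the handler was called */
};

typedef int (*toy_handler_t)(struct toy_task *);

/* Example handler: needs one page; may only fault it in when allowed. */
int toy_handler(struct toy_task *t)
{
	t->handler_runs++;
	if (!t->page_resident) {
		if (t->faults_disabled)
			return -EFAULT;  /* cannot sleep in trap context */
		t->page_resident = true; /* "sleep" and fault the page in */
	}
	return 0;
}

/* Trap context: try optimistically, defer to task context on -EFAULT. */
void toy_trap(struct toy_task *t, toy_handler_t handler)
{
	int err;

	t->faults_disabled = true;
	err = handler(t);
	t->faults_disabled = false;

	if (err == -EFAULT)
		t->retry_pending = true;
}

/* Task context, on the way back to user space: run the deferred call. */
void toy_return_to_user(struct toy_task *t, toy_handler_t handler)
{
	if (t->retry_pending) {
		t->retry_pending = false;
		handler(t);
	}
}
```

When the page is already resident the handler runs once, in trap
context; otherwise it runs twice, and only the second run may fault.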

There is a very good probability that the memory you want to reference
is mapped (because typically the program itself will want to access it
as well) so doing the optimistic access with pagefault_disabled() will
work most of the times and you only end up taking the slow path when it


But why would ptrace2 use a different interface? Also, why introduce
some abstraction layer now without having a user for it, you could
always restructure things and or add interfaces later when you have a
clear idea what it is you need.
--

From: Masami Hiramatsu
Date: Tuesday, March 23, 2010 - 10:36 am

Ah, I see. so it will be done later. Actually, since int3 handler will

hm, similar technique can be applied to kprobe-tracer too (for getting


Because 'ptrace' doesn't have any breakpoint insertion helper.
Programs which uses ptrace must setup single-stepping buffer and
modify target code by themselves. This causes problems when
multiple debuggers/tracers attach to the same process and
try to modify same address. First program can see the original
instruction, but next one will see int3! I think we'd better
provide some abstraction interface for breakpoint setting in
next generation ptrace (of course, we also need to provide
memory peek interface which returns original instructions).

But anyway, I agree with you, we don't need it *now*, but someday.

Thank you,

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com
--

From: Srikar Dronamraju
Date: Wednesday, March 24, 2010 - 3:22 am

user_bkpt provides xol strategy.

user_bkpt_xol patch only provides slot allocation for Execution out of
line strategy.  It doesn't implement execution out of line strategy.
The current implementation assumes that we pass the user_bkpt structure
as an argument while allocating/freeing a slot.

user_bkpt knows how to handle execution out of line. Its working is
independent of how and where the slot is allocated.  The field xol_vaddr
points to a location which holds the copy of the instruction to be
single-stepped/executed.

Hence user_bkpt patch was followed by user_bkpt_xol patch.

--
Thanks and Regards
--

From: Ananth N Mavinakayanahalli
Date: Tuesday, March 23, 2010 - 8:05 am

Well, rewind back to 2006 to the first edition of uprobes; it had the
'global' tracing feature like what you indicate here, although Andrew
wouldn't want to be reminded of *how* that was done (hooking
readpages()) :-)

At the time, global tracing was vehemently vetoed in favour of a per-process
approach.


With the TIF method, you get to the probed process' task context in 
do_notify_resume(), and have sufficient flexibility for non-perf users, 
like gdb, 'cos what uprobes provides now, is close to what Tom Tromey
asked for gdb's usage.

Ananth
--

From: Peter Zijlstra
Date: Tuesday, March 23, 2010 - 8:15 am

Both in-tree consumers of uprobes (ftrace and perf) are capable of task
filters.

But the thing is, dso:sym is very much not a task property, adding task
filters afterwards sure makes sense in some cases but it should not be
the primary mode.

If people really want to optimize this we can easily add a few bits of
task state which could tell the trap handler to not even bother looking
up things but restart as fast as possible.
--

From: Frank Ch. Eigler
Date: Tuesday, March 23, 2010 - 8:26 am

If you wish this new uprobes to be useful to tools such as gdb,
remember the value of preserving the property that processes not being
debugged are not to be interfered with.  You don't want a DoS due to
some guy setting ten thousand breakpoints on glibc.  Such
considerations should overrule perf/ftrace's simplifying assumptions
that after-the-fact event filtering is surely always sufficient.

- FChE
--

From: Ananth N Mavinakayanahalli
Date: Tuesday, March 23, 2010 - 10:59 pm

Are you suggesting we have the global tracing as default and
then have task filters? I've already alluded to this being vetoed
earlier, by people including Andrew Morton, Hugh Dickins, Arjan,
Christoph Hellwig, Nick Piggin, etc. It's a route we'd prefer not to
go down again...

Aside, what are the mechanisms to do this?

The current breakpoint insertion and removal, even for shared libraries,
is process local since only the page tables of the process being traced
are modified.

In order to have a global visibility of dso probes, one obvious method
is to put in the probes before the text hits pagecache. This approach
works for 'yet-to-start' processes that would map the dso too. The
series at http://lkml.org/lkml/2006/5/9/25 did that and was suitably
junked, for very valid reasons. Even Hugh Dickins thumbed down the
pagecache approach (http://lkml.org/lkml/2006/5/9/209).

Given the current design has enough flexibility to accommodate non perf
users like gdb, a simple pid based approach for the lowest layer makes
the most sense. I'd rather prefer a higher level entity (say, perf) do
the difficult job of filtering down individual requests only for
processes of interest, then the lower layer can iteratively do the probe
insertions for all processes of interest.

I am not sure if there is a better method to do probes with 'global'
visibility. Did you have an easier approach in mind?

Ananth
--

From: Srikar Dronamraju
Date: Wednesday, March 24, 2010 - 12:58 am

I think perf would be using uprobes in one of the four ways.
- Trace a particular process.
- Trace a particular session.
- Trace all instances of an executable. 
- Trace all programs in the system.

If we use global approach, filtering would still be part of the handler.
So even if we want to probe just one process, we would still take hit
for all processes that map the DSO and hit that vaddr.
Other process could be hitting the probepoint more often while the
probed process could rarely be hitting the probepoint. This could
place significant overhead on the system.

Also with KSM, the page we are probing could be part of the stable tree
and mapped by different virtual machines. Can this lead to interrupting
work on an unrelated virtual machine? If yes, is it okay to interrupt an
unrelated VM? If not, what measures need to be taken?

Currently perf can be used by privileged users. However when perf gets
to trace user space programs, would it still be limited to privileged
users? Do we have plans to allow users to trace their owned

Though one of the USPs of uprobes is non-disruptive tracing, applications
like debuggers that do disruptive tracing can benefit from uprobes. 

Debuggers could use uprobes as a feature to implement inserting/removing
breakpoints and get the out of line single-stepping. In an earlier
discussion http://lkml.org/lkml/2010/1/26/344 Tom Tromey did say that if
a facility was given, it could be used in gdb.

What I expect is the tracee to inform the tracer that it has hit the
breakpoint and "wait" for the tracer to give indication to continue.

Benefits could be 
- Debuggers can benefit from execution out of line and can debug
  multithreaded processes much better. 

- Two debuggers/tracers could trace the same process. One of the tracers
  could be strace, while the other one could be gdb.

- perf and debugger could be interested in the same vaddr for that
process and still continue to work. 
Lets say debugger and perf are interested in a ...
From: Peter Zijlstra
Date: Wednesday, March 24, 2010 - 6:00 am

I'm not sure, currently all the tracing bits require root. One of the
complications is that dynamic trace events (kprobes and uprobes) share a
global namespace, so making that accessible to users might be
interesting.

So one thing we can do to avoid some of the trap overhead is to
de-couple the trace event creation from trace event enable (pretty much
already so for existing implementations), so while you define a dynamic
trace event as dso:sym, you provide ways to enable it globally and per
task.

We'd basically need a global and per-task refcount on enable and make
sure the breakpoint is installed properly for (global || task).
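
A minimal sketch of that (global || task) rule, assuming two plain
counters; the structure and function names are hypothetical and say
nothing about how perf or ftrace would actually hook in.

```c
#include <stdbool.h>

/* Hypothetical enable counts: one per event, one per (event, task) pair. */
struct toy_event {
	unsigned int global_refs;
};

struct toy_task_refs {
	unsigned int refs;
};

void toy_enable_global(struct toy_event *ev)
{
	ev->global_refs++;
}

void toy_disable_global(struct toy_event *ev)
{
	if (ev->global_refs)
		ev->global_refs--;
}

void toy_enable_task(struct toy_task_refs *t)
{
	t->refs++;
}

void toy_disable_task(struct toy_task_refs *t)
{
	if (t->refs)
		t->refs--;
}

/* The breakpoint must be present for a task when (global || task). */
bool toy_bkpt_needed(const struct toy_event *ev,
		     const struct toy_task_refs *t)
{
	return ev->global_refs > 0 || t->refs > 0;
}
```

A per-task enable then leaves other tasks untouched, while a global
enable covers every task until the last global user drops its reference.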

That way a perf per-cpu event will do the global enable, and a perf

A double scribble will be an issue for the current generation of
debuggers anyway, right?

But yes, I suppose if you want to use uprobes for debuggers then yes it
makes sense to allow to put the task to sleep. One way would be to
provide means for the handler to detect the context and simply always

Before NX there simply was no option, anyway, I guess the writable
requirement comes from being stack, and I'm not sure how TLS is done,
but I guess that has similar constraints on being writable, right?

I've heard from people that some other OS does indeed have the
trampoline in TLS.


--

From: Srikar Dronamraju
Date: Thursday, March 25, 2010 - 12:56 am

Ulrich, 

Can you please comment on whether a slot in TLS can be used for storing
and executing an instruction? Are there any additional issues that we
need to take care of? Are there architectures that don't support TLS?

--
Thanks and Regards
Srikar
--

From: Srikar Dronamraju
Date: Thursday, March 25, 2010 - 1:41 am

Yes, when we allow two or more probes to co-exist at a probepoint, we

Double scribble as in two apps writing to the same address? Uprobes
handles this by failing to insert probes at a location where there is a
breakpoint already inserted. So if both apps were to use the uprobes
interface, then they could co-operate and co-exist. (This would need the
feature in uprobes to have multiple probes per probepoint which is

Yes, that's certainly possible. However, let's consider the case where we
allow multiple probes per probepoint and one handler faults (the handler
detects it could be sleeping) while the other handler may or may not
fault (the handler could be doing a copy_from_user).
When the thread switches to task context and runs the first handler, it
has no state information about whether the second handler has already
run in interrupt context. So here we may be unable to decide whether we
should run the second handler or not.

--
Thanks and Regards
Srikar
--

From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:26 am

X86 support for Uprobes

This patch provides x86 specific details for uprobes.
This includes interrupt notifier for uprobes, enabling/disabling
singlestep.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

 arch/x86/Kconfig          |    1 +
 arch/x86/kernel/Makefile  |    1 +
 arch/x86/kernel/uprobes.c |   87 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 89 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kernel/uprobes.c


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 851cedc..a860a9b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select HAVE_KERNEL_LZO
 	select HAVE_HW_BREAKPOINT
 	select HAVE_USER_BKPT
+	select HAVE_UPROBES
 	select PERF_EVENTS
 	select ANON_INODES
 	select HAVE_ARCH_KMEMCHECK
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 98c74b4..bfa48f0 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -118,6 +118,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
 obj-$(CONFIG_SWIOTLB)			+= pci-swiotlb.o
 
 obj-$(CONFIG_USER_BKPT)			+= user_bkpt.o
+obj-$(CONFIG_UPROBES)			+= uprobes.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..1acce22
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,87 @@
+/*
+ *  Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You ...
From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:26 am

Uprobes documentation.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 Documentation/uprobes.txt |  244 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 244 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/uprobes.txt


diff --git a/Documentation/uprobes.txt b/Documentation/uprobes.txt
new file mode 100644
index 0000000..08bbf24
--- /dev/null
+++ b/Documentation/uprobes.txt
@@ -0,0 +1,244 @@
+Title	: User-Space Probes (Uprobes)
+Authors	: Jim Keniston <jkenisto@us.ibm.com>
+	: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Concepts: Uprobes
+2. Architectures Supported
+3. Configuring Uprobes
+4. API Reference
+5. Uprobes Features and Limitations
+6. Probe Overhead
+7. TODO
+8. Uprobes Team
+9. Uprobes Example
+
+1. Concepts: Uprobes
+
+Uprobes enables you to dynamically break into any routine in a
+user application and collect debugging and performance information
+non-disruptively. You can trap at any code address, specifying a
+kernel handler routine to be invoked when the breakpoint is hit.
+
+A uprobe can be inserted on any instruction in the application's
+virtual address space.  The registration function register_uprobe()
+specifies which process is to be probed, where the probe is to be
+inserted, and what handler is to be called when the probe is hit.
+
+Uprobes-based instrumentation can be packaged as a kernel
+module.  In the simplest case, the module's init function installs
+("registers") one or more probes, and the exit function unregisters
+them.
+
+1.1 How Does a Uprobe Work?
+
+When a uprobe is registered, Uprobes makes a copy of the probed
+instruction, stops the probed application, replaces the first byte(s)
+of the probed instruction with a breakpoint instruction (e.g., int3
+on i386 and x86_64), and allows the probed application to continue.
+(When inserting the breakpoint, Uprobes uses background page
+replacement ...
From: Randy Dunlap
Date: Sunday, March 21, 2010 - 8:00 pm

no space after "archs/", just:


-- 
~Randy
--

From: Srikar Dronamraju
Date: Sunday, March 21, 2010 - 10:34 pm

Thanks for the review, 

Your comment however made me realize that I had used user-bkpt here
rather than user_bkpt.

user_bkpt is a layer that provides breakpoint insertion and removal. 
I wanted to mention that uprobes depends on user_bkpt layer.
I think "This user_bkpt based version" is probably better than
"This user-breakpoint based version"

--
Thanks and Regards
Srikar
--

From: Randy Dunlap
Date: Monday, March 22, 2010 - 7:51 am

I see.  Sure, that's fine.

-- 
~Randy
--

From: Srikar Dronamraju
Date: Saturday, March 20, 2010 - 7:26 am

Uprobes Samples

This provides an example uprobes module in the samples directory.

To run this module run (as root)
 insmod uprobe_example.ko vaddr=<vaddr> pid=<pid>
	 Where <vaddr> is the address where we want to place the probe.
		<pid> is the pid of the process we are interested to probe.

 example: -
# cd samples/uprobes

[get the virtual address to place the probe.]
# vaddr=0x$(objdump -T /bin/bash |awk '/echo_builtin/ {print $1}')

[Run a bash shell in the background; have it echo 4 lines.]
# (sleep 10; echo 1; echo 2; echo 3; echo 4) &
[Probe calls echo_builtin() in the background bash process.]

# insmod uprobe_example.ko vaddr=$vaddr pid=$!
# sleep 10
# rmmod uprobe_example
# dmesg | tail -n 3
Registering uprobe on pid 10875, vaddr 0x45aa30
Unregistering uprobe on pid 10875, vaddr 0x45aa30
Probepoint was hit 4 times
#
[ Output shows that echo_builtin function was hit 4 times. ]

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 samples/Kconfig                  |    7 +++
 samples/uprobes/Makefile         |   17 ++++++++
 samples/uprobes/uprobe_example.c |   83 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 samples/uprobes/Makefile
 create mode 100644 samples/uprobes/uprobe_example.c


diff --git a/samples/Kconfig b/samples/Kconfig
index 8924f72..50b8b1c 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -44,4 +44,11 @@ config SAMPLE_HW_BREAKPOINT
 	help
 	  This builds kernel hardware breakpoint example modules.
 
+config SAMPLE_UPROBES
+	tristate "Build uprobes example -- loadable module only"
+	depends on UPROBES && m
+	help
+	  This builds uprobes example module.
+
+
 endif # SAMPLES
diff --git a/samples/uprobes/Makefile b/samples/uprobes/Makefile
new file mode 100644
index 0000000..f535f6f
--- /dev/null
+++ b/samples/uprobes/Makefile
@@ -0,0 +1,17 @@
+# builds the uprobes example kernel modules;
+# then to use one (as root):
+# insmod ...
From: Andrew Morton
Date: Monday, March 22, 2010 - 6:38 pm

What's missing here is a description of why all this is useful. 
Presumably much of the functionality which this feature offers can be
done wholly in userspace.  So I think it would be useful if you were to
carefully explain the thinking here - what the value is, how people
will use it, why it needs to be done in-kernel, etc.  Right now if I
was asked "why did you merge that", I'd say "gee, I dunno".  I say that
a lot.  Knowing all of this would perhaps help me to understand your
thinking regarding ftrace integration.

The code itself is positioned as non-x86-specific, but the
implementation is x86-only.  It would be nice to get some confirmation
that other architectures can successfully use the core code.  But that
will be hard to arrange, so probably crossing our fingers is the best
approach here.

The code scares me a bit from the "how can malicious people exploit it"
point of view.  Breaking into other users' programs/memory, causing the
kernel to scribble on itself, causing unbound memory consumption, etc. 
No specific issues that I can point at, just vague fear.

Do we know that existing userspace will never ever already be using int3?

What happens if I run this code in 2016 on a CPU which has new opcodes
which this code didn't know about?

When uprobes was being pushed five-odd years ago, it did all sorts of
hair-raising things to avoid COWing shared pages.  Lots of reasons were
given why it *had* to avoid COW.  But now it COWs.  What were those
reasons why COW was unacceptable, and what changed?

--

From: Srikar Dronamraju
Date: Tuesday, March 23, 2010 - 3:55 am

Main motivations for uprobes:
- Non-disruptive tracing.
Current ptrace-based mechanisms generally involve signals and stopped
threads. They also involve context switching between the tracer and
tracee. The delay and involvement of signals can result in problems seen
in production systems not being seen while tracing. Uprobes tracing
wouldn't involve signals or context switches between tracer and tracee.

- Multithreaded support.
Current ptrace-based mechanisms for tracing apps use single stepping
inline, i.e., they copy back the original instruction on hitting a
breakpoint. In such mechanisms tracers have to stop all the threads on a
breakpoint hit, or tracers will not be able to handle all hits to the
location of interest. Uprobes uses execution out of line, where the
instruction to be traced is analysed at the time of breakpoint insertion
and a copy of the instruction is stored at a different location.  On a
breakpoint hit, uprobes jumps to that copied location, singlesteps the
copied instruction and does the necessary fixups post singlestepping.

- Tracing multiple applications:
A uprobe based tracer would be able to trace multiple (similar or
different) applications. This could be very useful in understanding how
different applications are interacting with each other.

- Multiple tracers for an application:
Multiple uprobes-based tracers could work in unison to trace an
application. There could be one tracer that is interested in generic
events for a particular set of processes, while another tracer is
interested in just one specific event of a particular process that's
part of that set of processes.

- Correlating events from kernel and userspace.
Uprobes could be used with other tools like kprobes, tracepoints or as
part of higher level tools like perf to give a consolidated set of
events from kernel and userspace.
In future we could look at a single backtrace showing application,
library and kernel calls.
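
[Illustrative sketch, not part of the original message. The execution
out of line idea in the multithreaded bullet above can be modelled in
user-space C as a fixed pool of instruction slots carved out of one
page. The one-page size comes from the cover letter; the 16-byte slot
size and all names here are assumptions made for illustration (x86
instructions are at most 15 bytes long), not the patchset's actual
layout or API.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XOL_AREA_SIZE	4096	/* one page, as in the cover letter */
#define XOL_SLOT_SIZE	16	/* assumption: fits any x86 instruction */
#define XOL_NSLOTS	(XOL_AREA_SIZE / XOL_SLOT_SIZE)

/* Hypothetical per-process area holding copies of probed instructions. */
struct xol_area {
	unsigned char mem[XOL_AREA_SIZE];
	unsigned char used[XOL_NSLOTS];	/* 1 if the slot is taken */
};

/*
 * Copy a probed instruction into the first free slot.
 * Returns the slot index, or -1 if the area is full.
 */
static int xol_get_slot(struct xol_area *a, const unsigned char *insn,
			size_t len)
{
	assert(len > 0 && len <= XOL_SLOT_SIZE);
	for (int i = 0; i < XOL_NSLOTS; i++) {
		if (!a->used[i]) {
			a->used[i] = 1;
			memcpy(&a->mem[i * XOL_SLOT_SIZE], insn, len);
			return i;
		}
	}
	return -1;
}

/* Release a slot once its probe is gone. */
static void xol_free_slot(struct xol_area *a, int slot)
{
	assert(slot >= 0 && slot < XOL_NSLOTS && a->used[slot]);
	a->used[slot] = 0;
}

/* Address a thread would be redirected to for singlestepping. */
static const unsigned char *xol_slot_addr(const struct xol_area *a, int slot)
{
	assert(slot >= 0 && slot < XOL_NSLOTS);
	return &a->mem[slot * XOL_SLOT_SIZE];
}
```

[Because the copy is stepped from its slot, the breakpoint at the
probepoint never has to be removed, so other threads reaching the same
address still trap. That is the property inline single stepping lacks.]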


We are looking at providing a perf interface for ...