Re: [PATCHv11 2.6.36-rc2-tip 5/15] 5: uprobes: Uprobes (un)registration and exception handling.

From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:41 am

Changelog from V10:
 * Replaced function pointers in user_bkpt structure with weak
   functions as suggested by Peter Zijlstra.
 * CONFIG_PROBE_EVENTS now selects uprobe-tracer and kprobe-tracer
   as suggested by Frederic.
 * Split perf-probe listing patches into smaller patches.

Changelog from V9:
 * Resolved comments from Arnaldo on perf support for uprobes.
 * perf probe -S will now list only global binding functions as
   requested by Christoph Hellwig.
 * Moved Changelog to below Signed-off-by: line, so that it's not part
   of the patch description. (Suggested by Christoph.)

Changelog from V8:
 * Fix build issues reported by Christoph.
 * List available probes in a file without need to specify pid.

Changelog from V7:
 * New feature: perf probe lists available probes.
 * Fix perf probes for uprobes to exit with an error message on DWARF
   based probes.
 * Merge changes to kprobes traceevent infrastructure.
 * Merge changes to perf.

Changelog from V6:
 * Remove perf adjust symbols patch.

Changelog from V5:
 * Merged user_bkpt and user_bkpt_xol into uprobes.
 * Addressed comments till now.

Changelog from V4:
 * Rebased to tip tree. (2.6.35-rc3-tip)

Changelog from v3:
 * Reverted to background page replacement as suggested by Peter Zijlstra.
 * Dso in 'perf probe' can be either a short name or an absolute path.
 * Addressed comments from Masami, Frederic, Steven on traceevents and perf

Changelog from v2:
 * Addressed comments from Oleg, including removal of interrupt context
    handlers, reverting background page replacement in favour of
    access_process_vm().

 * Provides perf interface for uprobes.

Changelog from v1:
 * Added trace_event interface for uprobes.
 * Addressed comments from Andrew Morton and Randy Dunlap.

For previous postings, please refer to:
http://lkml.org/lkml/2010/7/27/121, http://lkml.org/lkml/2010/7/12/67,
http://lkml.org/lkml/2010/7/8/239, http://lkml.org/lkml/2010/6/29/299,
http://lkml.org/lkml/2010/6/14/41, ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:41 am

User bkpt will use background page replacement approach to insert/delete
breakpoints. Background page replacement approach will be based on
replace_page and write_protect_page.
Now replace_page() loses its static attribute.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

 include/linux/mm.h |    4 ++
 mm/ksm.c           |  112 -------------------------------------------------
 mm/memory.c        |  120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 124 insertions(+), 112 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 831c693..3f014e4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -863,6 +863,10 @@ void account_page_dirtied(struct page *page, struct address_space *mapping);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
+int replace_page(struct vm_area_struct *vma, struct page *page,
+					struct page *kpage, pte_t orig_pte);
+int write_protect_page(struct vm_area_struct *vma, struct page *page,
+						      pte_t *orig_pte);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/ksm.c b/mm/ksm.c
index e2ae004..8a792d0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -694,118 +694,6 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 	return !memcmp_pages(page1, page2);
 }
 
-static int write_protect_page(struct vm_area_struct *vma, struct page *page,
-			      pte_t *orig_pte)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr;
-	pte_t *ptep;
-	spinlock_t *ptl;
-	int swapped;
-	int err = -EFAULT;
-
-	addr = page_address_in_vma(page, vma);
-	if (addr == -EFAULT)
-		goto out;
-
-	ptep = page_check_address(page, mm, addr, &ptl, 0);
-	if (!ptep)
-		goto out;
-
-	if (pte_write(*ptep)) {
-		pte_t entry;
-
-		swapped ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:41 am

Provides a mechanism in kernel to insert/remove breakpoints in
user space applications including
   - architecture independent mechanism to establish breakpoints in
     userspace applications.
   - helper functions for reading/writing/validating data/opcodes from
     target process's address space.
   - wrappers and default implementation (wherever possible) of
     architecture dependent functions (setting breakpoint)
   - preprocessing and postprocessing of singlestep on breakpoint hit

Single stepping inline is the traditional method where original
instructions replace the breakpointed instructions on a breakpoint
hit.  This method works well with single-threaded applications.
However, it is racy with multithreaded applications.

In execution out of line, threads single-step on a copy of the
instruction. This method works well for both single-threaded and
multithreaded applications.

Uprobes uses execution out of line method.

There could be other strategies like emulating an instruction. However
they are currently not implemented.

Insertion and removal of breakpoints is by "Background page
replacement", i.e. make a copy of the page, modify its contents,
set the pagetable and flush the TLBs. This uses an enhanced
replace_page to COW the page. The modified page is only visible to the
process of interest. Others sharing the page will still see the old
copy.
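
A minimal userspace sketch of the copy, modify, switch idea described
above. It is not the kernel code: struct sk_mapping, SK_PAGE_SIZE and
insert_bkpt_by_page_replacement() are hypothetical names, a plain pointer
stands in for the page table entry, and 0xcc (x86 int3) is used as the
breakpoint byte only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SK_PAGE_SIZE 4096
#define SK_BKPT_INSN 0xcc /* x86 int3; other architectures differ */

/* Hypothetical stand-in for one process's mapping of a single text page. */
struct sk_mapping {
	unsigned char *page;
};

/*
 * Insert a breakpoint at 'offset' for one mapping only: copy the page,
 * patch the copy, then switch the mapping to the copy. Mappings that
 * still point at the old page are unaffected.
 * Returns 0 on success, -1 on a bad offset or allocation failure.
 */
static int insert_bkpt_by_page_replacement(struct sk_mapping *m, size_t offset)
{
	unsigned char *copy;

	if (offset >= SK_PAGE_SIZE)
		return -1;

	copy = malloc(SK_PAGE_SIZE);
	if (!copy)
		return -1;

	memcpy(copy, m->page, SK_PAGE_SIZE);
	copy[offset] = SK_BKPT_INSN;
	m->page = copy; /* stands in for updating the pte and flushing TLBs */
	return 0;
}
```

After the call, the traced mapping sees the breakpoint byte while any
other mapping of the original page still sees the original instruction.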

You need to follow this up with the uprobes patch for your
architecture to define architecture specific functionality for
reading/writing/validating data/opcodes.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V10: (replace function pointers with weak functions)
  * Removed architecture specific function pointers in user_bkpt
    structure and replaced them with weak functions as suggested by
    Peter Zijlstra.

Changelog from V5: (Merge user_bkpt into uprobes)
  * Merged user_bkpt into uprobes as suggested by Christoph ...
From: Peter Zijlstra
Date: Wednesday, September 1, 2010 - 12:38 pm

That really wants to be static, 'arch' is a way too generic a name to

either: s/uprobes_read_vm/uprobes_read_data/ or


Something like:

  /* private, read-only, executable maps only */
  if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) != (VM_READ|VM_EXEC))
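
The same test can be tried outside the kernel. This is only a sketch:
the SK_VM_* constants are local stand-ins for the kernel's vm_flags bits
and is_probeable_mapping() is a hypothetical name.

```c
#include <assert.h>

/* Local stand-ins for the kernel's vm_flags bits; values are illustrative. */
#define SK_VM_READ   0x1UL
#define SK_VM_WRITE  0x2UL
#define SK_VM_EXEC   0x4UL
#define SK_VM_SHARED 0x8UL

/* Returns 1 only for private, read-only, executable mappings. */
static int is_probeable_mapping(unsigned long vm_flags)
{
	const unsigned long mask =
		SK_VM_READ | SK_VM_WRITE | SK_VM_EXEC | SK_VM_SHARED;

	/* The parentheses around '&' matter: '==' binds tighter than '&'. */
	return (vm_flags & mask) == (SK_VM_READ | SK_VM_EXEC);
}
```

Bits outside the mask are ignored, so unrelated flags on the mapping do
not change the answer.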


This assumes user_bkpt_opcode_t is a scalar value, but there's no
assertion of that, if someone were to define it like char[5] or somesuch

you fail to check vma->vm_end


So here check_vma() is the default implementation of validate_address(),

I hope not,.. the pte swizzle we do above does not require any such


Why would we even consider calling this function on something that would
fail the validate_address() test? If that fails we would not have
installed the breakpoint to begin with, hence there would be no reason

Again, assumes the instruction thing is a scalar.
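
One way to avoid the scalar assumption is to compare opcodes byte-wise.
A small sketch, assuming a hypothetical two-byte opcode type; the names
sketch_opcode_t, sketch_bkpt_insn and is_bkpt_insn() and the byte values
are made up for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical opcode type; an architecture could equally pick a scalar. */
typedef unsigned char sketch_opcode_t[2];

/* Arbitrary bytes standing in for an architecture's breakpoint opcode. */
static const sketch_opcode_t sketch_bkpt_insn = { 0x0f, 0x0b };

/*
 * Byte-wise comparison works for scalar and array opcode types alike,
 * unlike '==', which would compare addresses if the operands are arrays.
 */
static int is_bkpt_insn(const unsigned char *insn)
{
	return memcmp(insn, sketch_bkpt_insn, sizeof(sketch_bkpt_insn)) == 0;
}
```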



The big thing I'm missing in this patch is generic code handling the
actual breakpoint.. but maybe that's somewhere in the next patches.. /me
goes look.


--

From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:42 am

Provides x86 specific functions for instruction analysis and
instruction validation and x86 specific pre-processing and
post-processing of singlestep especially for RIP relative
instructions. Uses "x86: instruction decoder API" for validation and
analysis of user space instructions. This analysis is used at the time
of post-processing of breakpoint hit to do the necessary fix-ups.
There is support for breakpointing RIP relative instructions. However
there are still few instructions that cannot be singlestepped.

Also defines TIF_UPROBE flag for x86.

This patch requires "x86: instruction decoder API"
http://lkml.org/lkml/2009/6/1/459

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V10: (replace function pointers with weak functions)
      * Removed architecture specific function pointers in user_bkpt
        structure and replaced them with weak functions as suggested
        by Peter Zijlstra.


Changelog from V5: Merged into uprobes layer.

Changelog from V1:
   set UPROBES_FIX_SLEEPY if post_xol might sleep.
---
 arch/x86/Kconfig                   |    1 
 arch/x86/include/asm/thread_info.h |    2 
 arch/x86/include/asm/uprobes.h     |   43 +++
 arch/x86/kernel/Makefile           |    2 
 arch/x86/kernel/uprobes.c          |  561 ++++++++++++++++++++++++++++++++++++
 5 files changed, 609 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/uprobes.h
 create mode 100644 arch/x86/kernel/uprobes.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f0ee331..4710268 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select HAVE_KERNEL_LZO
 	select HAVE_HW_BREAKPOINT
 	select HAVE_MIXED_BREAKPOINTS_REGS
+	select ARCH_SUPPORTS_UPROBES
 	select PERF_EVENTS
 	select HAVE_PERF_EVENTS_NMI
 	select ANON_INODES
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f0b6e5d..5b9c9f0 100644
--- ...
From: Andi Kleen
Date: Friday, September 3, 2010 - 3:26 am

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

Quick high level review. I did not attempt to validate the basic

One general comment here: since with uprobes the instruction
decoder becomes security critical did you do any fuzz tests
on it (e.g. like using it on crashme or on code that has 



Shouldn't all this stuff be in the instruction decoder? 


These functions that just do a single printk seem weird. I would
do that in the caller. Also the message could be shortened I guess

This check is not fully correct because it's valid to have
32bit code in 64bit programs and vice versa.  The only good
way to check that is to look at the code segment at runtime
though (and it gets complicated if you want to handle LDTs,
but that could be optional). May be difficult to do though.

Also the compat bit is not necessarily set if no system call is

goto is automatically unlikely and unlikely is deprecated anyways.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Srikar Dronamraju
Date: Friday, September 3, 2010 - 10:48 am

I haven't tried any fuzz tests with the instruction decoder, and I am
not sure whether Masami has tried some of these.
One question: Do you want to test uprobes with crashme or test

Even Peter wasn't comfortable with user_bkpt. How about user_bp?
i.e. the above field would be user_bp_opcode_t. I felt
user_breakpoint_opcode_t might look long. Also we would have to
rename other structures accordingly like user_bkpt_task_arch_info
would become user_breakpoint_task_arch_info. Do let me know your



Okay, I can move the printk to the caller, I will try to shorten the
message, Would something like "uprobes: no support for 2-byte

validate_insn_32bit is able to identify all valid instructions in a 32
bit app and validate_insn_64bits is a superset of
validate_insn_32bits; i.e it considers valid 32 bit codes as valid too.

Did you get a chance to look at
validate_insn_32bit/validate_insn_64bits? If you feel that
validate_insn_32bit/validate_insn_64bits are unable to detect


Okay, shall remove unlikely from the above.

--
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Friday, September 3, 2010 - 11:00 am

My main objection was the uprobe.c and user_bkpt.c split-up; it's all
about uprobes, but as to this name, you can simply name it
uprobe_opcode_t, no need to preserve the whole user breakpoint thing at
all.
--

From: Andi Kleen
Date: Monday, September 6, 2010 - 12:53 am

On Fri, 3 Sep 2010 23:18:32 +0530
Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:


Ideally both, but as a minimum the part that is exposed
to user space, that is uprobes.

BTW if you test it I would test it both with real crashme


Yes that's fine. Optionally you could supply a short
script like scripts/decodecode that feeds it through objdump -d

How can this be? e.g. 32bit has 1 byte INC/DEC but on 64bit
these are REX prefixes and can be in front of nearly anything.
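
The overlap can be made concrete with a small classifier. In 32-bit code
the bytes 0x40-0x47 are one-byte INC and 0x48-0x4f are one-byte DEC; in
64-bit code the whole 0x40-0x4f range is REX prefix space. The function
and enum names below are hypothetical and the sketch looks only at a
single byte.

```c
#include <assert.h>

enum sk_byte_meaning { SK_OTHER_BYTE, SK_INC_REG, SK_DEC_REG, SK_REX_PREFIX };

/*
 * Classify one opcode byte.
 * long_mode != 0 means the byte is decoded as 64-bit code.
 */
static enum sk_byte_meaning sk_classify_byte(unsigned char b, int long_mode)
{
	if (b < 0x40 || b > 0x4f)
		return SK_OTHER_BYTE;
	if (long_mode)
		return SK_REX_PREFIX;
	return b < 0x48 ? SK_INC_REG : SK_DEC_REG;
}
```

The same byte therefore decodes differently depending on the mode, which
is why a table of valid 32-bit opcodes cannot simply be reused for 64-bit
validation.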

I don't think you can do a 100% solution because for 100%
you would need to know the code segment the CPU is going
to use later, and that's not possible in advance.

A heuristic is reasonable (and leave out applications
that generate 64bit code from 32bit executables or vice versa)

Hmm actually I double checked and this is a separate bit.
So scratch that, TIF_32BIT is ok to test.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Srikar Dronamraju
Date: Monday, September 6, 2010 - 6:44 am

You are right, the validate_insn_32bits refers to good_insns_32 and
validate_insn_64bits refers to good_insns_64 to decode 1 byte
instructions. Some instructions like 0x06 and 0x0e seem to be valid in

I think you are referring to RIP related instructions; this is how we
handle them.
Please correct us if we are wrong, but here is what we do:
- While analyzing the instruction, take into account which register acts
  as the code segment register.

- When interrupted (but before singlestep), copy the contents of the
  register which we think acts as code segment register in our
  above analysis into per-task scratch variable.

- After singlestepping we retrieve the saved per-task scratch

Okay, Thanks for confirming this.

--
Thanks and Regards
Srikar
--

From: Andi Kleen
Date: Monday, September 6, 2010 - 7:16 am

On Mon, 6 Sep 2010 19:14:07 +0530

crashme and valid 1/2 bit corrupted code please if possible. I'm 

I just meant regarding long mode vs compat mode which defines
whether REX prefixes are valid or not. Because this can
change any time (if the application does a long jump) you
cannot know in advance what it is going to use. But 
it's also very rare to use long jumps at all, so this
can be probably ignored (but should be documented somewhere),
and just guess based on the executable. I just wanted
to point out that it's not a 100% solution.

I don't think you need to care about segment bases either. While
they can be used (16bit Wine or dosemu) it's quite rare
and not supporting uprobes for this is totally reasonable.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Masami Hiramatsu
Date: Monday, September 6, 2010 - 5:56 pm

As you can see in kernel tree, x86 insn decoder has a test
which decodes vmlinux and compares results with objdump.
Similar tests had been done for glibc etc. by Jim.

Hmm, if you need to validate all instructions, you'd better
enhance the x86 decoder to check for bad instructions.
I think it can be done mostly by adding inat bitflags.

--

From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:42 am

The uprobes infrastructure enables a user to dynamically establish
probepoints in user applications and collect information by executing
a handler function when a probepoint is hit.

The user specifies the virtual address and the pid of the process of
interest along with the action to be performed.  Uprobes uses the
execution out of line strategy and follows lazy slot allocation. I.e,
on the first probe hit for that process, a new vma (to hold the probed
instructions for execution out of line) is allocated.  Once allocated,
this vma remains for the life of the process, and is reused as needed
for subsequent probes.  A slot in the vma is allocated for a
probepoint when it is first hit.

A slot is marked for reuse only when the probe gets unregistered and
there are no threads in the vicinity.
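
A rough userspace sketch of the lazy slot allocation described above.
The area is created on the first request and slots are handed out from a
bitmap. All names and sizes here (struct sk_xol_area, SK_NR_SLOTS,
SK_SLOT_SIZE, sk_xol_take_slot(), sk_xol_free_slot()) are assumptions for
illustration, not the patch's data structures, and the "no threads in the
vicinity" check is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_NR_SLOTS  8   /* illustrative; not the patch's value */
#define SK_SLOT_SIZE 16  /* illustrative; not the patch's value */

/* Hypothetical per-process XOL area: backing memory plus a use bitmap. */
struct sk_xol_area {
	unsigned char *mem;    /* NULL until the first probe hit */
	unsigned int   in_use; /* bit i set => slot i is taken */
};

/*
 * Return a slot for a probepoint, creating the area on first use.
 * Returns NULL if allocation fails or every slot is taken.
 */
static unsigned char *sk_xol_take_slot(struct sk_xol_area *area)
{
	int i;

	if (!area->mem) {
		area->mem = calloc(SK_NR_SLOTS, SK_SLOT_SIZE);
		if (!area->mem)
			return NULL;
	}
	for (i = 0; i < SK_NR_SLOTS; i++) {
		if (!(area->in_use & (1u << i))) {
			area->in_use |= 1u << i;
			return area->mem + i * SK_SLOT_SIZE;
		}
	}
	return NULL;
}

/* Mark a slot reusable, e.g. after the probe has been unregistered. */
static void sk_xol_free_slot(struct sk_xol_area *area, unsigned char *slot)
{
	unsigned int i = (unsigned int)((slot - area->mem) / SK_SLOT_SIZE);

	assert(i < SK_NR_SLOTS);
	area->in_use &= ~(1u << i);
}
```

The area itself is never released in this sketch, mirroring the statement
that the vma stays for the life of the process.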

In a multithreaded process, a probepoint once registered is active for
all threads of a process. If a thread specific action for a probepoint
is required then the handler should be implemented to do the same.

If a breakpoint already exists at a particular address (irrespective
of who inserted the breakpoint including uprobes), uprobes will refuse
to register any more probes at that address.

You need to follow this up with the uprobes patch for your
architecture.

For more information: please refer to Documentation/uprobes.txt

TODO:
1. Allow multiple probes at a probepoint.
2. Booster probes.
3. Allow probes to be inherited across fork.
4. probing function returns.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
---

Changelog from V5:
   - Merged user_bkpt and user_bkpt_xol layers into uprobes.

Changelog from V2:
   - Introduce TIF_UPROBE flag.
   - uprobes hooks now in fork/exec/exit paths instead of tracehooks.
   - uprobe_process is now part of the mm struct and is shared between
     processes that share the mm.
   - per thread information is now allocated on the fly.
     * Hence allocation and ...
From: Peter Zijlstra
Date: Wednesday, September 1, 2010 - 2:43 pm

I wouldn't worry about that, focus on inode attached probes and you get

Seems like a weird place for this hunk, does this want to live


I find the _process postfix a bit weird in this context, how about
something like:

  struct mm_uprobes *mm_uprobes;


Make that: 

	struct uprobe_task_state uprobe_state;



Like previously said, I would much rather see an inode/offset based
interface and do the pid/fork/pgroup/cgroup etc.. stuff as filters on



Its customary to write it like:

	spinlock_t	 list_lock; /* protects uprobe_list, nr_uprobes */
	struct list_head uprobe_list;

Why void * and not simply:

	struct uprobes_xol_area xol_area;

That struct is small enough and you only get one per mm and saves you an


So this thing is a link between the process and the probe, I'm not quite
sure what you need the refcount for, it seems to me you can only have one
of these per process/probe combination.

If you had used inode/offset based probes they would have been unique in
the system and you could have had an {inode,offset} indexed global tree
(or possibly a tree per inode, but that would mean adding to the inode
structure, which I think is best avoided).
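
To illustrate the {inode,offset} key, here is a sketch of the ordering
such an index would need. A sorted array with bsearch() stands in for the
tree, the inode is represented by a plain number, and struct sk_probe_key,
sk_probe_key_cmp() and sk_probe_find() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical key: the inode is represented by its number for simplicity. */
struct sk_probe_key {
	unsigned long ino;
	unsigned long offset;
};

/* Order by inode first, then by offset; usable with qsort() and bsearch(). */
static int sk_probe_key_cmp(const void *a, const void *b)
{
	const struct sk_probe_key *x = a, *y = b;

	if (x->ino != y->ino)
		return x->ino < y->ino ? -1 : 1;
	if (x->offset != y->offset)
		return x->offset < y->offset ? -1 : 1;
	return 0;
}

/* Look up an exact {inode, offset} pair in a sorted array of keys. */
static const struct sk_probe_key *
sk_probe_find(const struct sk_probe_key *keys, size_t n,
	      unsigned long ino, unsigned long offset)
{
	struct sk_probe_key want = { ino, offset };

	return bsearch(&want, keys, n, sizeof(*keys), sk_probe_key_cmp);
}
```

Because the key does not mention any process, a probe found this way is
unique system-wide, which is the property the paragraph above relies on.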

That would also reduce the mm state to purely the xol area, no need to

I would be thinking you can obtain the active probe point from the
address the task is stuck at and the state seems fairly redundant. Which
leaves you with the arch state, which afaict is exactly as large as the

All that can be replaced by unconditional functions, simply stub them

Wandering hunks, these seem to want to get folded back to wherever the



The grand thing about not having any of this process state is that you

You can replace this with:

  addr = instruction_pointer(task_pt_regs(current)) -
         ip_advancement_by_brkpt_insn;

and then proceed from there like described below to obtain the struct
uprobe.
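
The arithmetic is simple enough to check in isolation. On x86 the
breakpoint instruction (int3, opcode 0xcc) is one byte and the reported
instruction pointer is already past it, so the advancement is 1; other
architectures may differ, which is why it is a parameter here. The names
SK_X86_BKPT_INSN_SIZE and sk_probe_addr() are hypothetical.

```c
#include <assert.h>

/* x86: int3 is a single byte and the trap reports the address after it. */
#define SK_X86_BKPT_INSN_SIZE 1UL

/* Recover the probed address from the instruction pointer seen on the trap. */
static unsigned long sk_probe_addr(unsigned long ip_after_trap,
				   unsigned long ip_advancement)
{
	return ip_after_trap - ip_advancement;
}
```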

You can infer the SS/HIT state by checking if the user-addr is in the

Its an address, not a struct uprobe_probept ...
From: Peter Zijlstra
Date: Thursday, September 2, 2010 - 1:12 am

Right, so one problem I overlooked is that you need to have the actual
probe to compute the jump address, but that could be fixed by

Except we need to stabilize the vma tree to do the lookup and that
currently requires mmap_sem, I guess until we get Nick's per-pte
vma-tree we could fudge that by adding a spinlock around the rb_tree
--

From: Srikar Dronamraju
Date: Friday, September 3, 2010 - 9:42 am

I am working on file based probing. It compiles, but I haven't gotten to
test it yet; I can post the patch if you are interested. It should
achieve something similar to inode based probing.

However I would have an issue with making inode based probing the
default.
1. Making all probing based on inode can be a performance hog.

2. Unlike kernel space, every process has a different address space, so
why would we have to insert breakpoints in every process's address space
if we are not interested in them?

3. Ingo has a requirement for allowing normal users to use uprobes through
perf. When this feature gets implemented, we have to be careful about a
normal user trying to just trace their application ending up hurting
performance for all other users.

	For example: one user places a probe on /usr/lib/libc.so: malloc
	- Another normal user looks at the current userspace probes and
	  constructs a program that just does malloc/free just to
	  degrade the performance of the system.

	- user could be interested in just one process which could be
	  calling malloc just 10 times. However during the same time
	  there are 1000 processes which could all together call 100000
	  times during the same time.

So even when we allow file based tracing across the system, it should be
restricted to just the root user.

As we discussed previously, inode based tracing wasn't
accepted back in 2006. Maybe the approach was a problem then but what

Unlike kernel probing, uprobes has a disadvantage.
Let's assume a request to remove a probepoint arrives while some of the
threads have actually hit the probe. Because the handlers in uprobes can
sleep, we can't remove the probepoint at the same time as the request to
remove the probe. This is where the refcount steps in and helps us
decide when we can remove the probepoint. Even inode based

Let's assume the thread is about to singlestep (or has singlestepped)
So the instruction pointer is pointing to one of the slots (or it ...
From: Peter Zijlstra
Date: Friday, September 3, 2010 - 10:19 am

You don't have to, but you can. The problem I have with this stuff is
that it makes the pid thing a primary interface, whereas it should be




The to singlestep or not would be implied by the IP pointing to the
start of a slot or not, but yes, I guess that as long as you do
singlestep you need some state.. sucks though. Boosted probes are much
nicer, they don't need that extra arch storage either, they can simply

No particular other lock in mind, you could cmpxchg the pointer if
that's all you need it for. The problem is that if you want inode based

What meta-data? You can find the uprobe itself from inode:offset, and
you know the return address from the trap site + orig ins size.

You don't need the probepoint, and there'd be only a single uprobe
instance.

The Xol area can be found at current->mm->xol_area, I don't think you

if its not the start of a slot, you've already single-stepped. Ideally
you'd directly implement boosted probes, but I realize that's a tad more

Is that because of the singlestep overhead? With boosted probes I would
think it'd be much faster to take 1 trap, deal with it and continue
execution, than to frob tons of kernel code in between.



A bit more about these filter thingies, add a method to struct uprobe,
something like int uprobe::wants_probe(struct task_struct *p) and add a
single bit to task_struct (there's a few bitfields with holes in there).

Then on clone()/mmap() call the relevant wants_probe() methods, if one is
true, set the task_struct::has_uprobe flag and install the probes.

If nothing in the process wants probing, you'll never install the probes
and nothing ever triggers, if only one of many tasks in the process gets
tagged, you'll have to look up the probe anyway to know where to
continue, but you can avoid calling the handler.
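
A userspace sketch of the filter idea above, with cut-down stand-ins for
the kernel types. struct sk_task, struct sk_uprobe, sk_tag_task() and
sk_only_pid_42() are hypothetical; only the wants_probe method and the
has_uprobe bit come from the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down task and probe types for illustration only. */
struct sk_task {
	int      pid;
	unsigned has_uprobe : 1; /* the single bit mentioned above */
};

struct sk_uprobe {
	/* Returns nonzero if this probe wants to fire for task 't'. */
	int (*wants_probe)(const struct sk_task *t);
};

/*
 * Called where clone()/mmap() would call it: tag the task if any probe
 * wants it. Returns the number of interested probes.
 */
static int sk_tag_task(struct sk_task *t,
		       const struct sk_uprobe *probes, size_t n)
{
	size_t i;
	int interested = 0;

	for (i = 0; i < n; i++)
		if (probes[i].wants_probe(t))
			interested++;

	t->has_uprobe = interested > 0;
	return interested;
}

/* Example filter: only the task with pid 42 is of interest. */
static int sk_only_pid_42(const struct sk_task *t)
{
	return t->pid == 42;
}
```

A task that no filter accepts is never tagged, so in the real design no
breakpoint would need to be installed on its behalf.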


--

From: Srikar Dronamraju
Date: Monday, September 6, 2010 - 10:46 am

The breakpoint exception and singlestep account for a substantial time

I think the other way:
why instrument a process and then filter it out if we are not interested in it?
While instrumenting the kernel, we don't have this flexibility. So
having a pid based filter is the right thing to do for kernel
based tracing.

If we can get the per process based tracing right, we can build
higher level stuff including the file based tracing easily.

All tools/debuggers in the past have all worked with process based
tracing.

Tools like gdb can actually use the displaced singlestepping
feature that uprobes provides. Some gdb developers have said on LKML
earlier that they would be willing to use displaced singlestepping if
the kernel provides an API that they can use.

Also, consider the security perspective when allowing normal users to use
perf to trace their applications. Using this model, we don't have to
write extra filters to limit them. These filters might allow uprobe
handlers on only tasks belonging to that user. However it still
interrupts tasks of other users. And as I said earlier, the breakpoint
exception and singlestepping actually make up a very substantial part
of the handling. The actual uprobe handler depending on what it does

Same namespace as the requestor. i.e whichever name space

What if the caller does something like this while one or more
threads are processing the breakpoint?
unregister_uprobe(u);
kfree(u);

In the current implementation, the probepoint structure might be
released much later after the uprobe structure is released.
Unlike the uprobe struct, the probepoint structure is allocated by the
uprobes sub-system and it knows how to release it cleanly. However we don't have
Yes, I agree, we may not need the state after boosted probes.
I am not sure at this time if we can do boosted probes for all


The difference between running handlers in task context and running in
interrupt context is the extra do_notify_resume() that gets called from
task context. But we have more ...
From: Peter Zijlstra
Date: Monday, September 6, 2010 - 11:15 am

That's what atomic_inc_not_zero() and RCU are for.
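
The "take a reference only if the object is still alive" pattern can be
sketched in userspace with C11 atomics. In the kernel the primitive is
atomic_inc_not_zero(); the helpers below, sk_ref_get_not_zero() and
sk_ref_put(), are hypothetical equivalents, and the RCU-deferred free
that keeps the memory valid during the lookup is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is still nonzero. Once the count has
 * dropped to zero the object is on its way to being freed, so a late
 * lookup must fail instead of resurrecting it.
 */
static bool sk_ref_get_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool sk_ref_put(atomic_int *ref)
{
	return atomic_fetch_sub(ref, 1) == 1;
}
```

With this, an unregister followed by the final put cannot race with a
thread in the breakpoint path: that thread either got its reference
before the count hit zero or fails the lookup.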
--

From: Peter Zijlstra
Date: Monday, September 6, 2010 - 11:15 am

You're really not getting it, are you? No, it would result in the exact


Urgh,.. I really oppose the whole pid-centric thing, if that means
process wide and not per task its even worse.


--

From: Srikar Dronamraju
Date: Monday, September 6, 2010 - 11:48 pm

If there is just one instance of traced process for the inode then yes the
number of breakpoints when traced with pid or based on inode would be the
same. However if there are multiple instances of the traced process [example
bash/zsh] (or the inode corresponds to a library that gets mapped into
multiple processes, for example libc), and the user is interested in tracing
just one instance of the process, then won't the inode based tracing

I would disagree. 
Let's consider a user who wants to trace his single threaded app, say bash, for
a few heavily used calls in libc, say the read/select system call stubs. Suppose
this user wants to keep recording at discrete intervals, i.e. record for 5 minutes,
stop for 5 minutes, record again for 5 minutes, ....  Can you list how you

Since breakpoints are shared across the tasks of the same process, we can't do
per-task based tracing. We can only do per process tracing and filter
per-task if the request is for per-task tracing, and that's what I
think you were alluding to in the filter in one of your mails. I am okay with
filtering per-task within a given process.

--
Thanks and Regards
Srikar

--

From: Peter Zijlstra
Date: Tuesday, September 7, 2010 - 2:33 am

Not if your filter function works.

So let me try this again, (assumes boosted probes):

struct uprobe {
	struct inode	*inode;	/* we hold a ref */
	unsigned long	offset;

	int (*handler)(void); /* arguments.. ? */
	int (*filter)(struct task_struct *);

	int		insn_size;		/* size of */
	char		insn[MAX_INSN_SIZE];	/* the original insn */

	int		ret_addr_offset;	/* return addr offset
						   in the slot */
	char		replacement[SLOT_SIZE]; /* replacement
						   instructions */
	
	atomic_t	ref; /* lifetime muck */
	struct rcu_head	rcu;
};

static struct {
	raw_spinlock_t	tree_lock;
	struct rb_root	tree;
} uprobes;

static void uprobes_add(struct uprobe *uprobe)
{
	/* add to uprobes.tree, sorted on inode:offset */
}

static void uprobes_del(struct uprobe *uprobe)
{
	/* delete from uprobes.tree */
}

static struct uprobe *
uprobes_find_get(struct address_space *mapping, unsigned long offset)
{
	unsigned long flags;
	struct uprobe *uprobe;

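	raw_spin_lock_irqsave(&uprobes.tree_lock, flags);
	uprobe = find_in_tree(&uprobes.tree, mapping, offset);
	/* only hand out the probe if it was found and is not already dying */
	if (uprobe && !atomic_inc_not_zero(&uprobe->ref))
		uprobe = NULL;
	raw_spin_unlock_irqrestore(&uprobes.tree_lock, flags);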

	return uprobe;
}

static void __uprobe_free(struct rcu_head *head)
{
	struct uprobe *uprobe = container_of(head, struct uprobe, rcu);

	kfree(uprobe);
}

static void put_uprobe(struct uprobe *uprobe)
{
	if (atomic_dec_and_test(&uprobe->ref))
		call_rcu(&uprobe->rcu, __uprobe_free);
}

static inline int valid_vma(struct vm_area_struct *vma)
{
	if (!vma->vm_file)
		return 0;

	if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
			     (VM_READ|VM_EXEC))
		return 1;

	return 0;
}

int register_uprobe(struct uprobe *uprobe)
{
	struct vm_area_struct *vma;

	inode_get(uprobe->inode);
	atomic_set(&uprobe->ref, 1);

	uprobes_add(uprobe); /* add before the rmap walk, so that 
				new mmap()s will find it too */

	for_each_rmap_vma(vma, uprobe->inode->i_mapping) {
		struct mm_struct *mm = ...
From: Srikar Dronamraju
Date: Tuesday, September 7, 2010 - 4:51 am

struct uprobe is an input structure. Do we want to have

Wouldn't this be a scalability issue on bigger machines?
Every probe hit having to parse a global tree to figure out which
uprobe it was seems overkill.
Consider 5000 uprobes placed on a 128 box with probes placed on

How are we synchronizing put_uprobe and a thread that has hit the
breakpoint and is searching through the global probes list?

One nit: on probe hit we increment the ref only a few times. However
we are decrementing every time. So if two probes occur on two cpus
simultaneously, we have a chance of uprobe being freed after both of

I understand that perf top calls perf record in a loop.
For every perf record, we would be looping through each vma associated with
the inode.
For a probe on a libc, we would iterate through all vmas. If the
Are you looking at listing of uprobes per vma?

For each mmap, we are traversing all elements in the global tree?
What would happen if we have a huge number of uprobes in a system all

uprobe_hit I assume is going to be called in interrupt context.

Again for every probehit, we are going through the list of vmas and
checking if it has a probe which I think is unnecessary.

Nit: In some archs, the instruction pointer might be pointing to the next

What if we were pre-empted after this? Would preemption notifiers also
do a copy of the instruction to the new slot? If yes, can you please
give me more pointers?

And I don't know if we can do boosting for all instructions.
I think even on kprobes we don't do boosting for all instructions.

Yes, I see its advantages and disadvantages. I feel this
implementation wouldn't scale. Just because we don't want to
housekeep some information, we are looping through the global tree to
figure out if there is uprobe specific stuff to be done.

--
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Tuesday, September 7, 2010 - 5:25 am

I didn't consider the user-space interface at all, consuming the uprobe

Use a seqcount, its a read-mostly data structure, its just that the

RCU, see the above atomic_inc_not_zero() it will not obtain a reference
after the final put, the object will stay valid until we pass an rcu



Feh, probe register should be considered an utter slow path.

We do rmap walks on pages all the time, I can't see it being a problem

Yeah, it does a range lookup in the tree [inode:0 - inode:-1). O(log(n))
to find the first entry, O(log(n)) for each consecutive entry, unless we
thread the tree.
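
The range lookup can be sketched on a sorted array instead of a tree: a
binary search finds the first entry for the inode, then a scan walks the
consecutive entries. In an array each following entry costs O(1), which
is roughly what threading the tree would buy. The names struct
sk_range_key, sk_lower_bound() and sk_count_probes_on_inode() are
hypothetical, and the inode is represented by a plain number.

```c
#include <assert.h>
#include <stddef.h>

struct sk_range_key {
	unsigned long ino;
	unsigned long offset;
};

/* First index whose key is >= {ino, 0} in an array sorted by ino, offset. */
static size_t sk_lower_bound(const struct sk_range_key *keys, size_t n,
			     unsigned long ino)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid].ino < ino)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Count the probes registered on one inode: O(log n) search plus a scan. */
static size_t sk_count_probes_on_inode(const struct sk_range_key *keys,
				       size_t n, unsigned long ino)
{
	size_t i = sk_lower_bound(keys, n, ino), count = 0;

	while (i < n && keys[i].ino == ino) {
		count++;
		i++;
	}
	return count;
}
```

For an inode with no probes the scan stops immediately, so the cost is
just the initial search.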


Only mmap() of that particular inode, the range lookup would be the
regular O(log(n)) for an empty range.

But again, mmap() is a relative slow path, and you need something like

I assumed process context here, but it's trivial to make it work from
interrupt context if you want, all we need is a spinlock/seqlock around





It's mostly read-only data (adding/removing probes is rare), it's all
O(log(n)), I really don't see a problem with that.

If you really worry about it you could try a hash lookup for the inode
part and keep a tree per probed inode.

--

From: Mathieu Desnoyers
Date: Monday, September 6, 2010 - 11:25 am

I think we are both partially right in slightly different ways. I think Peter is
right in that the PID should not be mandatory (e.g. specifying a PID of 0 should
apply to all tasks), and you are also right in that being able to apply the
"filter" directly at the executable image level is vital for performance.

So how about this: we can provide both task and inode selection arguments. The
task selection argument can be 0 (apply to all tasks) or non-zero (one task
specifically). The inode argument would be mandatory.
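
A minimal sketch of that selection rule, assuming a made-up structure
uprobe_filter_sketch and function filter_match_sketch (neither exists in
the patchset): the inode must always match, and a pid of 0 acts as a
wildcard for "all tasks".

```c
#include <assert.h>

/* Hypothetical filter: inode is mandatory, pid 0 means "all tasks". */
struct uprobe_filter_sketch {
	unsigned long inode;
	int pid;
};

/* Returns 1 if a hit on (inode, pid) should fire this probe, else 0. */
static int filter_match_sketch(const struct uprobe_filter_sketch *f,
			       unsigned long inode, int pid)
{
	assert(f->inode != 0);	/* the inode argument is mandatory */

	if (f->inode != inode)
		return 0;
	return f->pid == 0 || f->pid == pid;
}
```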

Then, eventually, we can enhance the generic filtering facility so it can be
made aware of filtering shortcuts provided by the instrumentation (in this case,
uprobes would provide a per-tgid filtering shortcut).

Thoughts ?

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Christoph Hellwig
Date: Monday, September 6, 2010 - 1:40 pm

I have the feeling that you guys are at least partially talking past
each other.

For the "perf probe --add" interface the only sane interface is one by
filename and then symbol / line number / etc.

But that is just the interface - these probes don't necessarily have to
be armed and cause global overhead once they are defined.  If the
implementation is smart enough it will defer arming the probe until
we actually use it, and that will be per-process quite often.

Which btw, brings up two more issues, one in uprobes and one in perf.
For one even in userspace I think the dynamic probes will really just
be the tip of the iceberg and we'll get more bang for the buck from
static traces, which is something that's not supported in uprobes yet.
As a start supporting the dtrace-style sdt.h header would be a great
help, and then we can decide if we need something even better on top.

The other thing is that perf currently only supports per-kernel pid
recording, while we'd really need per Posix process, which may contain
multiple threads for useful tracing of complex userspace applications.
I also suspect that this will fit the uprobes model much better given
that the probes will be in any given address space.

--

From: Peter Zijlstra
Date: Monday, September 6, 2010 - 2:06 pm

The implementation I outlined a few messages ago, would in fact, as you

perf does report both:

         *      { u32                   pid, tid; } && PERF_SAMPLE_TID

the pid is the process id (thread group leader like) and tid is the
task/thread id.
--

From: Christoph Hellwig
Date: Monday, September 6, 2010 - 2:12 pm

It records both, but I haven't found a way to only record samples
or trace things in a Posix Process.

E.g. perf record -p seems to be only per-thread, not per-process.
If that has changed recently everything is fine of course.

--

From: Peter Zijlstra
Date: Monday, September 6, 2010 - 2:18 pm

Hrm, the record code seems to look up all threads for -p and use only a
single thread for -t, didn't actually try it though so it could be
borken.


--

From: Srikar Dronamraju
Date: Tuesday, September 7, 2010 - 5:02 am

Agree, probing by file name is a requirement and I am working

Agree, that's why I am trying to build file-based probing on

Yes, static tracing using dtrace-style sdt.h is a cool thing to do.
SystemTap already has this facility. However I think it's probably
better done at perf user interface level.

The way I look at it is perf probe decodes the static markers and asks
uprobes to place probepoints over there.
Do you see a different approach? If yes can you tell what you were
looking at?


--
Thanks and Regards
Srikar
--

From: Mathieu Desnoyers
Date: Tuesday, September 7, 2010 - 9:47 am

We currently have this feature in UST. We're adding "markers" into the
applications, and a UST daemon talks with an in-process library helper thread to
enable/disable markers and control tracing over unix sockets.

We're currently in the process of moving from markers to the
TRACE_EVENT()+tracepoints infrastructure.

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Friday, September 3, 2010 - 10:27 am

For that I guess we should have a way to tie a uprobe to a filedesc or
somesuch, that way whenever the owner dies, the probe goes away.

Such probes would also obviously get a filter that limits it to tasks of
its own user etc..
--

From: Peter Zijlstra
Date: Wednesday, September 1, 2010 - 2:46 pm

Right, so in short, I think that if you rework this to be inode:offset
based you'll end up with a much simpler codebase.


--

From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:42 am

Provides x86 specific details for uprobes.
This includes interrupt notifier for uprobes, enabling/disabling
singlestep.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

Changelog from V5: Using local_irq_enable() instead of
    native_irq_enable and no more disabling irqs as suggested by Oleg
    Nesterov.

 arch/x86/kernel/signal.c  |   13 +++++++++++
 arch/x86/kernel/uprobes.c |   52 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4fd173c..3657563 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -848,6 +848,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 	if (thread_info_flags & _TIF_SIGPENDING)
 		do_signal(regs);
 
+	if (thread_info_flags & _TIF_UPROBE) {
+		clear_thread_flag(TIF_UPROBE);
+#ifdef CONFIG_X86_32
+		/*
+		 * On x86_32, do_notify_resume() gets called with
+		 * interrupts disabled. Hence enable interrupts if they
+		 * are still disabled.
+		 */
+		local_irq_enable();
+#endif
+		uprobe_notify_resume(regs);
+	}
+
 	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
 		clear_thread_flag(TIF_NOTIFY_RESUME);
 		tracehook_notify_resume(regs);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ceaedc9..6985b4c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -26,6 +26,7 @@
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
 
+#include <linux/kdebug.h>
 #include <asm/insn.h>
 
 #ifdef CONFIG_X86_32
@@ -559,3 +560,54 @@ struct user_bkpt_arch_info user_bkpt_arch_info = {
 	.ip_advancement_by_bkpt_insn = 1,
 	.max_insn_bytes = MAX_UINSN_BYTES,
 };
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int uprobes_exception_notify(struct notifier_block *self,
+				       unsigned long val, void *data)
+{
+	struct die_args *args = ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:42 am

Uprobes Documentation.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V5: Removed references to Modules, Samples, and
   probe Overhead.

Changelog from v3: Updated measurements.

Changelog from v2: Updated measurements.

Changelog from v1: Addressed comments from Randy Dunlap.
		 : Updated measurements.

 Documentation/uprobes.txt |  188 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 188 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/uprobes.txt

diff --git a/Documentation/uprobes.txt b/Documentation/uprobes.txt
new file mode 100644
index 0000000..5b620d8
--- /dev/null
+++ b/Documentation/uprobes.txt
@@ -0,0 +1,188 @@
+Title	: User-Space Probes (Uprobes)
+Authors	: Jim Keniston <jkenisto@us.ibm.com>
+	: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Concepts: Uprobes
+2. Architectures Supported
+3. API Reference
+4. Uprobes Features and Limitations
+5. TODO
+6. Uprobes Team
+
+1. Concepts: Uprobes
+
+Uprobes enables you to dynamically break into any routine in a
+user application and collect debugging and performance information
+non-disruptively. You can trap at any code address, specifying a
+kernel handler routine to be invoked when the breakpoint is hit.
+
+A uprobe can be inserted on any instruction in the application's
+virtual address space.  The registration function register_uprobe()
+specifies which process is to be probed, where the probe is to be
+inserted, and what handler is to be called when the probe is hit.
+
+1.1 How Does a Uprobe Work?
+
+When a uprobe is registered, Uprobes makes a copy of the probed
+instruction, stops the probed application, replaces the first byte(s)
+of the probed instruction with a breakpoint instruction (e.g., int3
+on i386 and x86_64), and allows the probed application to continue.
+(When inserting the breakpoint, Uprobes uses background page
+replacement mechanism, so ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:42 am

Move parts of trace_kprobe.c that can be shared with upcoming
trace_uprobe.c. The common code moves to kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.

TODO: Merge both events to a single probe event.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V10: delete references to is_kprobe and make
         is_return a bool.

Changelog from V7: Merge changes due to string support in kprobes
	traceevent.

Changelog from V5: Addressed comments from Masami Hiramatsu
	and Steven Rostedt. Also shared lot more code from kprobes
        traceevents.

 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/trace_kprobe.c |  752 +------------------------------------------
 kernel/trace/trace_probe.c  |  648 +++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_probe.h  |  155 +++++++++
 5 files changed, 822 insertions(+), 738 deletions(-)
 create mode 100644 kernel/trace/trace_probe.c
 create mode 100644 kernel/trace/trace_probe.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..d709697 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -353,6 +353,7 @@ config KPROBE_EVENT
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
 	bool "Enable kprobes-based dynamic events"
 	select TRACING
+	select PROBE_EVENTS
 	default y
 	help
 	  This allows the user to add tracing events (similar to tracepoints)
@@ -365,6 +366,9 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config PROBE_EVENTS
+	def_bool n
+
 config DYNAMIC_FTRACE
 	bool "enable/disable ftrace tracepoints dynamically"
 	depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 53f3381..95d2043 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -56,5 +56,6 @@ obj-$(CONFIG_EVENT_TRACING) += power-traces.o
 ifeq ($(CONFIG_TRACING),y)
 ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:41 am

Provides slot allocation mechanism for execution out of line for use
with user space breakpointing.

The traditional method of replacing the original instructions on
breakpoint hit is racy when used on multithreaded applications.

Alternatives for the traditional method include:
	- Emulating the breakpointed instruction.
	- Execution out of line.

Emulating the instruction:
	This approach would use an in-kernel instruction emulator to
emulate the breakpointed instruction. This approach could be looked
into at a later point of time.

Execution out of line:
	In execution out of line strategy, a new vma is injected into
the target process, a copy of the instructions which are breakpointed
is stored in one of the slots. On breakpoint hit, the copy of the
instruction is single-stepped leaving the breakpoint instruction as
is.  This method is architecture independent.

This method is useful while handling multithreaded processes.

This patch allocates one page per process for slots to be used to copy
the breakpointed instructions.

Current slot allocation mechanism:
1. Allocate one dedicated slot per user breakpoint. Each slot is big
enough to accommodate the biggest instruction for that architecture. (16
bytes for x86).
2. We currently allocate only one page for slots. Hence the number of
slots is limited to active breakpoint hits on that process.
3. Bitmap to track used slots.
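
As a rough illustration of points 1-3, the following userspace sketch
tracks 16-byte slots in a single 4096-byte page with a bitmap. The names
xol_sketch_area, xol_sketch_take_slot and xol_sketch_free_slot are made up
for this example and are not the functions in the patch; the page size is
an assumption, and the real code additionally needs locking around the
bitmap.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed page size */
#define SKETCH_SLOT_BYTES	16u	/* biggest x86 instruction, per the text */
#define SKETCH_NR_SLOTS		(SKETCH_PAGE_SIZE / SKETCH_SLOT_BYTES)
#define SKETCH_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NR_WORDS \
	((SKETCH_NR_SLOTS + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

struct xol_sketch_area {
	unsigned long base;			/* non-zero address of the slot page */
	unsigned long bitmap[SKETCH_NR_WORDS];	/* bit set => slot in use */
};

/* Returns the address of a free slot, or 0 if every slot is in use. */
static unsigned long xol_sketch_take_slot(struct xol_sketch_area *area)
{
	size_t i;

	for (i = 0; i < SKETCH_NR_SLOTS; i++) {
		unsigned long *word = &area->bitmap[i / SKETCH_BITS_PER_WORD];
		unsigned long mask = 1UL << (i % SKETCH_BITS_PER_WORD);

		if (!(*word & mask)) {
			*word |= mask;
			return area->base + i * SKETCH_SLOT_BYTES;
		}
	}
	return 0;
}

/* Marks the slot that starts at vaddr as free again. */
static void xol_sketch_free_slot(struct xol_sketch_area *area,
				 unsigned long vaddr)
{
	size_t i;

	assert(vaddr >= area->base);
	i = (vaddr - area->base) / SKETCH_SLOT_BYTES;
	assert(i < SKETCH_NR_SLOTS);
	area->bitmap[i / SKETCH_BITS_PER_WORD] &=
		~(1UL << (i % SKETCH_BITS_PER_WORD));
}
```

With these numbers one page holds 4096 / 16 = 256 slots, which is the upper
bound on slots that the description refers to.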

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V5: Merged into uprobes.

Changelog from V3:
 * Added a memory barrier after the slot gets initialized.

Changelog from V2: (addressing Oleg's comments)
 * Removed code in !CONFIG_UPROBES_XOL
 * Functions now pass pointer to uprobes_xol_area instead of pointer
   to void.

 include/linux/uprobes.h |    2 
 kernel/uprobes.c        |  283 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 285 insertions(+), 0 deletions(-)

diff --git ...
From: Peter Zijlstra
Date: Wednesday, September 1, 2010 - 1:13 pm

Since you have a static sized bitmap, why not simply declare it here?


Naughty kernel modules we don't care about, but yeah, it appears vma's
installed using install_special_mapping() can be unmapped by the process
itself,.. curious. 

Anyway, you could install your own vm_ops and provide a close method to

Seems interesting,.. why not use install_special_mapping(), that's what

It doesn't actually do that, xol_add_vma() does that, this allocates the

There's a nice way to not have to write that:



I would call that allocate, find would imply a constant operation, but



if (!xol_vaddr)
  goto bail;

gives nicer code, and saves an indent level.

Also, why would we ever get here with !user_bkpt->vaddr.




funny code flow,.. s/found = 1/return/ and lose the conditional and

This doesn't actually appear used in this patch,.. does it want to live
elsewhere?
--

From: Srikar Dronamraju
Date: Friday, September 3, 2010 - 9:40 am

Okay, I hadn't looked at install_special_mapping earlier so I will take a
look and incorporate it. However I am not clear at this point what
install_special_mapping is giving us here.  Also install_special_mapping
is already defining its own vm_ops, esp. a close method that doesn't seem
to be doing anything. So at this point I am not clear how we are link






For now, user_bkpt->vaddr will always be set when we are here. 
However when we add uretprobe support, we would then get here with
user_bkpt->vaddr being NULL. 

I would drop the check for now, but add it later when we add the return

I had renamed the structure from ubp to user_bkpt based on your
comments. I had actually mentioned this in the summary mail that I had
sent on Jan 22 this year. I am fine to rename it to user_bp if that

I don't want the compiler to reorder the instructions so that the
assignment to user_bkpt is done before we complete the copy above.

If the assignment happens before we copy the content into the slot,
some other thread that might hit the same probe actually thinks the slot
is ready and tries to jump to that slot even before the slot is
initialized.



Yes, xol_validate_vaddr gets used in the next patch.  So probably it can
be moved to the next patch.

--
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Friday, September 3, 2010 - 9:51 am

If you want a compiler barrier, use barrier(), but here you seem to
describe a multi-threaded situation, in which case the observer thread
needs at least a rmb() in order for that mb() to mean anything other
than the compiler barrier it implies.

Also, use smp_* barriers.
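
The pairing can be illustrated in userspace with C11 atomics, where a
release store on the writer side and an acquire load on the reader side
play the roles that smp_wmb() and smp_rmb() would play in the kernel. This
is a sketch only; bkpt_sketch, bkpt_sketch_publish and bkpt_sketch_lookup
are hypothetical names, not code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define SKETCH_SLOT_BYTES 16

struct bkpt_sketch {
	unsigned char slot[SKETCH_SLOT_BYTES];	/* stands in for the XOL slot */
	_Atomic(unsigned char *) xol_ptr;	/* NULL until the slot is ready */
};

/*
 * Writer: fill the slot first, then publish the pointer. The release
 * store keeps the copy from being reordered after the pointer store.
 */
static void bkpt_sketch_publish(struct bkpt_sketch *b,
				const unsigned char *insn)
{
	assert(insn != NULL);
	memcpy(b->slot, insn, SKETCH_SLOT_BYTES);
	atomic_store_explicit(&b->xol_ptr, b->slot, memory_order_release);
}

/*
 * Reader: the acquire load pairs with the release store above, so a
 * non-NULL result guarantees the slot contents are visible too.
 */
static const unsigned char *bkpt_sketch_lookup(struct bkpt_sketch *b)
{
	return atomic_load_explicit(&b->xol_ptr, memory_order_acquire);
}
```

This covers ordering only. It does nothing about two threads that both
find the pointer unset and both go on to set up a slot.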


--

From: Srikar Dronamraju
Date: Friday, September 3, 2010 - 10:26 am

Okay,  would something like this suffice?


static unsigned long xol_get_insn_slot(struct user_bkpt *user_bkpt,
				struct uprobes_xol_area *xol_area)
{
	unsigned long flags, xol_vaddr = 0;
	int len;

	if (unlikely(!xol_area))
		return 0;

	smp_rmb();
	if (user_bkpt->xol_vaddr)
		return user_bkpt->xol_vaddr;

	spin_lock_irqsave(&xol_area->lock, flags);
	xol_vaddr = xol_take_insn_slot(xol_area);
	spin_unlock_irqrestore(&xol_area->lock, flags);

	/*
	 * Initialize the slot if user_bkpt->vaddr points to valid
	 * instruction slot.
	 */
	if (!xol_vaddr)
		return 0;

	len = access_process_vm(current, xol_vaddr, user_bkpt->insn,
					UPROBES_XOL_SLOT_BYTES, 1);
	if (unlikely(len < UPROBES_XOL_SLOT_BYTES))
		printk(KERN_ERR "Failed to copy instruction at %#lx "
				"len = %d\n", user_bkpt->vaddr, len);

	/*
	 * Update user_bkpt->xol_vaddr after giving a chance for the slot to
	 * be initialized.
	 */
	smp_mb();
	user_bkpt->xol_vaddr = xol_vaddr;
	return user_bkpt->xol_vaddr;
}

-- 
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Friday, September 3, 2010 - 10:41 am

Racy like you won't believe..

Suppose multiple threads hitting the trap at the same time, every thread
will end up failing the check and allocating a new slot for it, at the
end the slowest thread will end up setting the value.


--

From: Srikar Dronamraju
Date: Sunday, September 5, 2010 - 10:38 pm

Agree, I shall fix this up.
Since set_bit and clear_bit are atomic, I shall change the
area->lock from a spinlock to a mutex, and have the mutex released
after the slot has been updated with the "single-stepping
instruction".
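
A userspace sketch of that idea, using a pthread mutex in place of the
kernel mutex: the check, the slot allocation, the copy and the publication
all happen under one lock, so a second thread hitting the same probe
either waits or sees the finished slot. The names area_sketch,
probe_sketch and get_slot_sketch are made up for this example, and the
trivial next_free counter stands in for the bitmap.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define SKETCH_SLOT_BYTES	16
#define SKETCH_NR_SLOTS		4

struct area_sketch {
	pthread_mutex_t lock;
	unsigned char slots[SKETCH_NR_SLOTS][SKETCH_SLOT_BYTES];
	int next_free;			/* stand-in for the slot bitmap */
};

struct probe_sketch {
	unsigned char insn[SKETCH_SLOT_BYTES];
	unsigned char *xol_slot;	/* NULL until assigned; guarded by lock */
};

/* Returns the probe's slot, allocating and filling it on first use. */
static unsigned char *get_slot_sketch(struct area_sketch *area,
				      struct probe_sketch *p)
{
	unsigned char *slot;

	pthread_mutex_lock(&area->lock);
	slot = p->xol_slot;
	if (!slot && area->next_free < SKETCH_NR_SLOTS) {
		slot = area->slots[area->next_free++];
		memcpy(slot, p->insn, SKETCH_SLOT_BYTES);
		p->xol_slot = slot;	/* published only after the copy */
	}
	pthread_mutex_unlock(&area->lock);
	return slot;			/* NULL if no slot was available */
}
```

The cost of this simple form is that every probe hit takes the lock; a
lock-free fast path would need paired barriers on top.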

--
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Friday, September 3, 2010 - 10:25 am

What you're doing might well be the right thing, I was just wondering.

I think, after thinking about it more, that the shmem file thing
you're doing has the added benefit that the thing gets auto-magic
paging, which is a good thing.


--

From: Peter Zijlstra
Date: Thursday, September 2, 2010 - 1:23 am

An alternative method would be to have 1 slot per cpu, and manage the
slot content using preemption notifiers. That gives you a fixed number
of slots and an unlimited number of probe points.

If the preemption happens to be a migration you need to rewrite the
userspace IP to point to the new slot -- if indeed the task was inside
one when it got preempted -- but that all should be doable.


--

From: Srikar Dronamraju
Date: Thursday, September 2, 2010 - 10:47 am

Certainly doable but it has its share of drawbacks.
1. On every probe hit we have to copy the instruction into the
slot, so there is a performance penalty. 

2. This might complicate booster probe, because the jump
instruction that follows the original instruction now actually has to
be coded every time.

3. Yes migration is an issue esp
- if a thread of the same process that hit a breakpoint is scheduled
  into the same cpu and that newly scheduled thread hits a breakpoint.
- Something similar can happen if a multithreaded process runs on a
  uniprocessor machine.

4. I don't see a need for clearing slots after post processing, but if
we need to clear we then are adding more penalties because not only are
we clearing the slots but the post processing then can't happen in
interrupt context.

5. I think we are covered on the cpu hotplug too, (i.e not sure if we have
to make uprobes cpu hot plug aware.).

6. We would still be allocating a page for the slots. Unless we want
to expand to more slots than available in one page, I don't see the
disadvantages with the current approach.

--
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Friday, September 3, 2010 - 12:26 am

Yeah, although I imagine it's nearly free since you need to pay the

Why can't you keep the whole replacement sequence intact? Simply copy


post-processing? you mean the probe handler? Why couldn't that be done

Not if you use a slot per cpu and use preemption notifiers, the

The current approach limits the number of probes to what fits in a page.
The slot per cpu approach will have no such limit.
--

From: Srikar Dronamraju
Date: Monday, September 6, 2010 - 10:59 am

Let's say the thread while singlestepping the process gets
pre-empted. Eventually the cpu might run some other thread of the same
process before picking the first run thread. Or the first run
thread could after migration due to load balancing or whatever end up


Yes, the limit on the number of probes is a limitation. For now the
implementation would be straightforward and easy. We could either rework on the

Yes, if we use jump absolute then the replacement sequence stays
intact.

--
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Monday, September 6, 2010 - 11:20 am

So assuming we're preempted while the IP is inside the slot:

On the preempt-out we store the slot relative ip (ip - start_of_slot),
on preempt-in we write the replacement instructions in our cpu slot
(could be the same cpu, could be another) and re-position the ip to
point to the same relative position inside that slot, then go!

It really doesn't matter what happens in between.
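
A small sketch of that bookkeeping, under the assumption of one
fixed-size slot per cpu laid out back to back in a single page. The
structure task_sketch and the two hook functions are hypothetical; a real
preempt-in hook would also rewrite the replacement instructions into the
new slot before fixing up the ip.

```c
#include <assert.h>

#define SKETCH_SLOT_BYTES 128UL	/* assumed per-cpu slot size */

struct task_sketch {
	unsigned long ip;	/* user instruction pointer */
	unsigned long slot_off;	/* ip - start_of_slot, valid if in_slot */
	int in_slot;		/* was the task executing inside a slot? */
};

static unsigned long slot_base_sketch(unsigned long page, unsigned int cpu)
{
	return page + cpu * SKETCH_SLOT_BYTES;
}

/* Preempt-out: remember the slot-relative ip if we were inside the slot. */
static void preempt_out_sketch(struct task_sketch *t, unsigned long page,
			       unsigned int cpu)
{
	unsigned long start = slot_base_sketch(page, cpu);

	t->in_slot = (t->ip >= start && t->ip < start + SKETCH_SLOT_BYTES);
	if (t->in_slot)
		t->slot_off = t->ip - start;
}

/* Preempt-in: point the ip at the same offset in the new cpu's slot. */
static void preempt_in_sketch(struct task_sketch *t, unsigned long page,
			      unsigned int cpu)
{
	if (!t->in_slot)
		return;
	assert(t->slot_off < SKETCH_SLOT_BYTES);
	/* (the replacement instructions would be copied to the slot here) */
	t->ip = slot_base_sketch(page, cpu) + t->slot_off;
}
```

Whether the task comes back on the same cpu or a different one makes no
difference to the fix-up, since only the relative offset is kept.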
--

From: Peter Zijlstra
Date: Monday, September 6, 2010 - 11:28 am

Right, but with the proposed slot-per-cpu we'd be able to have unlimited
active probes within that single page, even with boosted probes,
assuming 16 bytes per instruction:

 push reg
 mov reg,foo
 insn
 pop reg
 jmp

and cacheline alignment we'd end up with 128 bytes per slot, we can
service 32 cpus per page. Which, for now, means that all my machines
need but a single page.
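
The arithmetic can be checked with a few lines of C. The 64-byte cacheline
and the 4096-byte page are assumptions (the mail does not state them), but
they reproduce the quoted figures: five entries of up to 16 bytes make 80
bytes, which rounds up to 128, and 4096 / 128 gives 32 slots. The function
names are made up for this sketch.

```c
#include <assert.h>

/* Round n up to the next multiple of align. */
static unsigned long round_up_sketch(unsigned long n, unsigned long align)
{
	assert(align != 0);
	return (n + align - 1) / align * align;
}

/* Bytes needed for one per-cpu slot holding nr_insns instructions. */
static unsigned long slot_bytes_sketch(unsigned long nr_insns,
				       unsigned long max_insn_bytes,
				       unsigned long cacheline)
{
	return round_up_sketch(nr_insns * max_insn_bytes, cacheline);
}

/* Number of per-cpu slots that fit in one page. */
static unsigned long cpus_per_page_sketch(unsigned long page_bytes,
					  unsigned long slot_bytes)
{
	assert(slot_bytes != 0);
	return page_bytes / slot_bytes;
}
```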
--

From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:43 am

Implements trace_event support for uprobes. In its
current form it can be used to put probes at a specified text address
in a process and dump the required registers when the code flow reaches
the probed address.

TODO: Documentation/trace/uprobetrace.txt

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from v5: Addressed comments from Masami Hiramatsu and Steven
      Rostedt. Some changes because of changes in common probe events.

Changelog from v4: (Merged to 2.6.35-rc3-tip)

Changelog from v2/v3: (Addressing comments from Steven Rostedt
					and Frederic Weisbecker)
	* removed pit field from uprobe_trace_entry.
	* share common parts with kprobe trace events.
	* use trace_create_file instead of debugfs_create_file.


The following example shows how to dump the instruction pointer and the %ax
register at the probed text address.

Start a process to trace. Get the address to trace.
  [Here pid is assumed to be 6016]
  [Address to trace is 0x0000000000446420]
  [Registers to be dumped are %ip and %ax]

# cd /sys/kernel/debug/tracing/
# echo 'p 6016:0x0000000000446420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_6016_0x0000000000446420 6016:0x0000000000446420 %ip=%ip %ax=%ax
# cat events/uprobes/p_6016_0x0000000000446420/enable
0
[enable the event]
# echo 1 > events/uprobes/p_6016_0x0000000000446420/enable
# cat events/uprobes/p_6016_0x0000000000446420/enable
1
# #### do some activity on the program so that it hits the breakpoint
# cat uprobe_profile
  6016 p_6016_0x0000000000446420                                234
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-6016  [004] 227931.093579: p_6016_0x0000000000446420: (0x446420) %ip=446421 %ax=79
             zsh-6016  [005] 227931.097541: p_6016_0x0000000000446420: (0x446420) %ip=446421 %ax=79
             zsh-6016  [000] 227931.124909: p_6016_0x0000000000446420: (0x446420) %ip=446421 ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:43 am

Given a dso, list the symbols in ascending order. Needed for listing
available symbols from perf-probe.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 tools/perf/util/symbol.c |   14 ++++++++++++++
 tools/perf/util/symbol.h |    1 +
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 1a36773..ca22032 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -388,6 +388,20 @@ size_t dso__fprintf_buildid(struct dso *self, FILE *fp)
 	return fprintf(fp, "%s", sbuild_id);
 }
 
+size_t dso__fprintf_symbols(struct dso *self, enum map_type type, FILE *fp)
+{
+	size_t ret = 0;
+	struct rb_node *nd;
+	struct symbol_name_rb_node *pos;
+
+	for (nd = rb_first(&self->symbol_names[type]); nd; nd = rb_next(nd)) {
+		pos = rb_entry(nd, struct symbol_name_rb_node, rb_node);
+		fprintf(fp, "%s\n", pos->sym.name);
+	}
+
+	return ret;
+}
+
 size_t dso__fprintf(struct dso *self, enum map_type type, FILE *fp)
 {
 	struct rb_node *nd;
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index b7a8da4..72ef973 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -181,6 +181,7 @@ size_t machines__fprintf_dsos(struct rb_root *self, FILE *fp);
 size_t machines__fprintf_dsos_buildid(struct rb_root *self, FILE *fp, bool with_hits);
 
 size_t dso__fprintf_buildid(struct dso *self, FILE *fp);
+size_t dso__fprintf_symbols(struct dso *self, enum map_type type, FILE *fp);
 size_t dso__fprintf(struct dso *self, enum map_type type, FILE *fp);
 
 enum dso_origin {
--

From: Arnaldo Carvalho de Melo
Date: Wednesday, August 25, 2010 - 4:21 pm

Applied after renaming it to 'dso__fprintf_symbols_by_name', as at first
I was scratching my head to figure out if we could reuse it in
dso__fprintf() to then notice that it is in ascending _name_ order, not
the default that is ordered by addr :-)

Please fixup the users, i.e. perf probe in your patchset.

- Arnaldo
--

From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 9:32 pm

Yes, Shall send the refreshed patch now

-- 
Thanks and Regards
Srikar


--

From: tip-bot for Srikar Dronamraju
Date: Monday, August 30, 2010 - 1:35 am

Commit-ID:  90f18e63fbd005133624bf18a5e8b75c92e90f4d
Gitweb:     http://git.kernel.org/tip/90f18e63fbd005133624bf18a5e8b75c92e90f4d
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 25 Aug 2010 19:13:29 +0530
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 25 Aug 2010 17:28:59 -0300

perf symbols: List symbols in a dso in ascending name order

Given a dso, list the symbols in ascending name order. Needed for
listing available symbols from perf probe.

Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Mark Wielaard <mjw@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Naren A Devaiah <naren.devaiah@in.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100825134329.5447.92261.sendpatchset@localhost6.localdomain6>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/symbol.c |   14 ++++++++++++++
 tools/perf/util/symbol.h |    1 +
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 1a36773..a08e1cb 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -388,6 +388,20 @@ size_t dso__fprintf_buildid(struct dso *self, FILE *fp)
 	return fprintf(fp, "%s", sbuild_id);
 }
 
+size_t dso__fprintf_symbols_by_name(struct dso *self, enum map_type type, FILE *fp)
+{
+	size_t ret = 0;
+	struct rb_node *nd;
+	struct symbol_name_rb_node *pos;
+
+	for (nd = ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:43 am

Selecting CONFIG_PROBE_EVENTS enables both kprobe-based and
uprobes-based dynamic events. However kprobe-tracer or uprobe-tracer
can still be individually selected or disabled.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/trace/Kconfig |   51 +++++++++++++++++++++++++++++---------------------
 1 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 55ba474..205c12b 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -76,7 +76,7 @@ config RING_BUFFER_ALLOW_SWAP
 # All tracer options should select GENERIC_TRACER. For those options that are
 # enabled by all tracers (context switch and event tracer) they select TRACING.
 # This allows those options to appear when no other tracer is selected. But the
-# options do not appear when something else selects it. We need the two options
+# options do not appear when something else selects it. We need the two option
 # GENERIC_TRACER and TRACING to avoid circular dependencies to accomplish the
 # hiding of the automatic options.
 
@@ -162,7 +162,7 @@ config IRQSOFF_TRACER
 	  This option measures the time spent in irqs-off critical
 	  sections, with microsecond accuracy.
 
-	  The default measurement method is a maximum search, which is
+	  The default measurement method is a maximum search, which i
 	  disabled by default and can be runtime (re-)started
 	  via:
 
@@ -184,7 +184,7 @@ config PREEMPT_TRACER
 	  This option measures the time spent in preemption-off critical
 	  sections, with microsecond accuracy.
 
-	  The default measurement method is a maximum search, which is
+	  The default measurement method is a maximum search, which i
 	  disabled by default and can be runtime (re-)started
 	  via:
 
@@ -228,7 +228,7 @@ choice
 	prompt "Branch Profiling"
 	default BRANCH_PROFILE_NONE
 	help
-	 The branch profiling is a software profiler. It will add hooks
+	 The ...
From: Masami Hiramatsu
Date: Wednesday, August 25, 2010 - 11:02 pm

Hmm, without this series, KPROBE_EVENT is set "y" by default.
(PROBE_EVENTS is introduced by 8/15)
I'd like to set this "y" by default, because it doesn't
affect other parts.


--

From: Srikar Dronamraju
Date: Friday, August 27, 2010 - 2:31 am

Okay will correct them.


This is based on what we discussed here
http://lkml.org/lkml/2010/8/2/86.

To recollect, 
Frederic wanted that there should be one option to select both
UPROBE_EVENT and KPROBE_EVENT. 

However if we make PROBE_EVENTS (which is the option to enable both
events) default "Y", then both UPROBE_EVENT and KPROBE_EVENT will be
selected.

Also if we look at http://lkml.org/lkml/2010/6/21/160, Steven
Rostedt didn't want UPROBE_EVENT to be selected by default.

I agree that we should keep UPROBE_EVENT to be 'default n' till it gets
tested. Hence we have two choices. Either set the common knob to be
'default n' or don't have the common knob for now (i.e. drop this
patch for now).

I think we should go with the first one, i.e. have a common knob that's
by default unselected.

-- 
Thanks and Regards
Srikar
--

From: Masami Hiramatsu
Date: Friday, August 27, 2010 - 4:04 am

Yeah, I'm OK to have a common knob, but I just don't like to set
KPROBE_EVENT unselected by default. I think there is no reason
to change default selecting (currently, KPROBE_EVENT=y by default.)

So, I think we should have below selecting list;
--- Tracers
...
[*] Enable dynamic events
   [ ] Enable user-space dynamic events (EXPERIMENTAL)
...

What would you think about this ? :)

Thank you,


-- 
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com
--

From: Srikar Dronamraju
Date: Friday, August 27, 2010 - 5:17 am

Wouldn't it negate the purpose of the common knob?
Because people would still have to go and select UPROBE_EVENTS,

I think when Frederic asked for a common knob, he was looking at
enabling both or disabling both and an option to selectively
select one of the tracers. 

--
Thanks and Regards
Srikar
--

From: Masami Hiramatsu
Date: Friday, August 27, 2010 - 8:37 am

Hmm, I think this is just an enhancement of dynamic events,
and you can also enable it by default at some point.
I mean, eventually there will be only "Enable dynamic events".

Yeah, but I'd like to ask Frederic whether he expected KPROBE_EVENT
to be disabled by default too, even though that changes the current
default config.

Thank you,
--

From: Srikar Dronamraju
Date: Friday, August 27, 2010 - 7:10 am

Selecting CONFIG_PROBE_EVENTS enables both kprobe-based and
uprobes-based dynamic events. However kprobe-tracer or uprobe-tracer
can still be individually selected or disabled.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
---

Changelog from V10: Fixed few erroneous changes: missing s at eol.
	reported by Masami Hiramatsu.
 
 kernel/trace/Kconfig |   21 +++++++++++++++------
 1 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 55ba474..77e04b0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -351,9 +351,8 @@ config BLK_DEV_IO_TRACE
 config KPROBE_EVENT
 	depends on KPROBES
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
+	depends on PROBE_EVENTS
 	bool "Enable kprobes-based dynamic events"
-	select TRACING
-	select PROBE_EVENTS
 	default y
 	help
 	  This allows the user to add tracing events (similar to tracepoints)
@@ -370,10 +369,9 @@ config UPROBE_EVENT
 	bool "Enable uprobes-based dynamic events"
 	depends on ARCH_SUPPORTS_UPROBES
 	depends on MMU
+	depends on PROBE_EVENTS
 	select UPROBES
-	select PROBE_EVENTS
-	select TRACING
-	default n
+	default y
 	help
 	  This allows the user to add tracing events on top of userspace dynamic
 	  events (similar to tracepoints) on the fly via the traceevents interface.
@@ -383,7 +381,18 @@ config UPROBE_EVENT
 	  tools on user space applications.
 
 config PROBE_EVENTS
-	def_bool n
+	bool "Enable kprobes and uprobe based dynamic events"
+	select TRACING
+	default n
+	help
+	  This allows a user to add dynamic tracing events in
+	  kernel using kprobe-tracer and in userspace using
+	  uprobe-tracer. However users can still selectively
+	  disable one of these events.
+
+	  For more information on kprobe-tracer and uprobe-tracer
+	  please refer help under KPROBE_EVENT and UPROBE_EVENT
+	  respectively.
 
 config DYNAMIC_FTRACE
 	bool "enable/disable ftrace ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:43 am

Introduces the -S/--show_functions option for perf-probe.
This lists function names in a file. If no file is specified, it lists
functions in the currently running kernel.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V10: As suggested by Arnaldo, filtering is now
 based on sym.binding.

Changelog from V9: Filter labels, weak, and local binding functions
from listing as suggested by Christoph Hellwig.
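
The binding-based filter described in these changelog entries can be illustrated with a short sketch. This is not the code from the patch: it assumes raw ELF64 symbol entries and the standard <elf.h> macros, whereas perf keeps its own symbol structure with a binding field. The helper name sym_is_listable is hypothetical.

```c
#include <elf.h>
#include <stdbool.h>

/*
 * Hypothetical helper: keep a symbol only if it is a function with
 * global binding. Local and weak symbols, and non-function symbols
 * such as objects or sections, are filtered out of the listing.
 */
static bool sym_is_listable(const Elf64_Sym *sym)
{
	return ELF64_ST_BIND(sym->st_info) == STB_GLOBAL &&
	       ELF64_ST_TYPE(sym->st_info) == STT_FUNC;
}
```

A caller walking a symbol table would print the name only when the helper returns true, which is roughly what restricts the output to global binding functions.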

Show last 10 functions in /bin/zsh.

# perf probe -S -D /bin/zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam

Show first 10 functions in /lib/libc.so.6

# perf probe -S -D /lib/libc.so.6 | head
_IO_adjust_column
_IO_adjust_wcolumn
_IO_default_doallocate
_IO_default_finish
_IO_default_pbackfail
_IO_default_uflow
_IO_default_xsgetn
_IO_default_xsputn
_IO_do_write@@GLIBC_2.2.5
_IO_doallocbuf

Show last 10 functions in kernel.

# perf probe -S | tail
zlib_inflateInit2
zlib_inflateReset
zlib_inflate_blob
zlib_inflate_table
zlib_inflate_workspacesize
zone_pcp_update
zone_reclaim
zone_reclaimable_pages
zone_statistics
zone_watermark_ok

 tools/perf/builtin-probe.c    |   43 ++++++++++++++++++++++++
 tools/perf/util/probe-event.c |   72 +++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/probe-event.h |    1 +
 tools/perf/util/symbol.c      |    8 +++++
 tools/perf/util/symbol.h      |    1 +
 5 files changed, 124 insertions(+), 1 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 199d5e1..fa63245 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -50,9 +50,11 @@ static struct {
 	bool list_events;
 	bool force_add;
 	bool show_lines;
+	bool list_functions;
 	int nevents;
 	struct perf_probe_event events[MAX_PROBES];
 	struct strlist *dellist;
+	struct strlist *limitlist;
 	struct line_range line_range;
 	int max_probe_points;
 } params;
@@ -132,6 +134,19 @@ static int opt_show_lines(const struct ...
From: Srikar Dronamraju
Date: Friday, August 27, 2010 - 7:21 am

Introduces the -S/--show_functions option for perf-probe.
This lists function names in a file. If no file is specified, it lists
functions in the currently running kernel.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V11: Accommodate the name change to dso_fprintf_symbols_by_name.

Changelog from V10: As suggested by Arnaldo, filtering is now
 based on sym.binding.

Changelog from V9: Filter labels, weak, and local binding functions
from listing as suggested by Christoph Hellwig.

Show last 10 functions in /bin/zsh.

# perf probe -S -D /bin/zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam

Show first 10 functions in /lib/libc.so.6

# perf probe -S -D /lib/libc.so.6 | head
_IO_adjust_column
_IO_adjust_wcolumn
_IO_default_doallocate
_IO_default_finish
_IO_default_pbackfail
_IO_default_uflow
_IO_default_xsgetn
_IO_default_xsputn
_IO_do_write@@GLIBC_2.2.5
_IO_doallocbuf

Show last 10 functions in kernel.

# perf probe -S | tail
zlib_inflateInit2
zlib_inflateReset
zlib_inflate_blob
zlib_inflate_table
zlib_inflate_workspacesize
zone_pcp_update
zone_reclaim
zone_reclaimable_pages
zone_statistics
zone_watermark_ok
---
 tools/perf/builtin-probe.c    |   43 ++++++++++++++++++++++++
 tools/perf/util/probe-event.c |   72 +++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/probe-event.h |    1 +
 tools/perf/util/symbol.c      |    8 +++++
 tools/perf/util/symbol.h      |    1 +
 5 files changed, 124 insertions(+), 1 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 199d5e1..fa63245 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -50,9 +50,11 @@ static struct {
 	bool list_events;
 	bool force_add;
 	bool show_lines;
+	bool list_functions;
 	int nevents;
 	struct perf_probe_event events[MAX_PROBES];
 	struct strlist *dellist;
+	struct strlist *limitlist;
 	struct line_range line_range;
 	int ...
From: Masami Hiramatsu
Date: Wednesday, October 20, 2010 - 2:56 am

Hi Srikar,


Hmm, I think the basic functionality of this patch (I mean listing
functions in the running kernel) could be merged separately.

However, I'd rather use --funcs/-F and --dso/-D instead of the
above. :)

Thank you,

-- 
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com
--

From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:43 am

Introduces map_groups__for_each_map, which iterates over the maps of a
given type in a map_groups. This is useful while listing functions
through perf-probe.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Arnaldo Carvalho de Melo <acme@infradead.org>
---
 tools/perf/util/map.h |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/tools/perf/util/map.h b/tools/perf/util/map.h
index 7857579..45b5f50 100644
--- a/tools/perf/util/map.h
+++ b/tools/perf/util/map.h
@@ -54,6 +54,33 @@ struct map_groups {
 	struct machine	 *machine;
 };
 
+/* For map_groups iteration */
+static inline struct map *map__first(struct map_groups *self,
+						enum map_type type)
+{
+	struct rb_node *rn = rb_first(&self->maps[type]);
+	return rn ? rb_entry(rn, struct map, rb_node) : NULL;
+}
+
+static inline struct map *map__next(struct map *map)
+{
+	struct rb_node *rn;
+	if (!map)
+		return NULL;
+	rn = rb_next(&map->rb_node);
+	return rn ? rb_entry(rn, struct map, rb_node) : NULL;
+}
+
+/**
+ * map_groups__for_each_map      - iterate over a map_group
+ * @pos:	the &struct map to use as a loop cursor.
+ * @type:	the map type.
+ * @self:	the &struct map_groups for loop.
+ */
+#define map_groups__for_each_map(pos, type, self)	\
+	for (pos = map__first(self, type); pos;		\
+				pos = map__next(pos))
+
 /* Native host kernel uses -1 as pid index in machine */
 #define	HOST_KERNEL_ID			(-1)
 #define	DEFAULT_GUEST_KERNEL_ID		(0)
--
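
For readers unfamiliar with the first/next/for_each idiom used in the patch above, here is a minimal user-space sketch of the same pattern. It is an assumption-laden illustration: a singly linked list stands in for the rbtree so that it compiles without perf's headers, and the names item, group, item__first, item__next, group__for_each_item and group__count are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct map and struct map_groups. */
struct item {
	const char *name;
	struct item *next;
};

struct group {
	struct item *head;
};

/* First element of the group, or NULL if the group is empty. */
static inline struct item *item__first(struct group *self)
{
	return self->head;
}

/* Successor of an element; tolerates NULL like map__next() in the patch. */
static inline struct item *item__next(struct item *it)
{
	return it ? it->next : NULL;
}

/* Same shape as map_groups__for_each_map: cursor first, container last. */
#define group__for_each_item(pos, self)		\
	for (pos = item__first(self); pos;	\
				pos = item__next(pos))

/* Example user of the macro: count the elements in a group. */
static int group__count(struct group *self)
{
	struct item *pos;
	int n = 0;

	group__for_each_item(pos, self)
		n++;
	return n;
}
```

The macro hides the traversal details, so the backing container can later change without touching the callers.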

From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:44 am

Introduces an option to list potential probe points with the perf probe
command. Also introduces an option to limit the listing to a given dso.
The listing is sorted by dso and then alphabetically.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V9:
	Filter labels, weak, and local binding functions from listing
as suggested by Christoph Hellwig.
	Incorporated comments from Arnaldo on Version 9 of patchset.

Show all potential probes in the current running kernel and limit to
the last 10 functions.
# perf probe -S | tail
zlib_inflateInit2
zlib_inflateReset
zlib_inflate_blob
zlib_inflate_table
zlib_inflate_workspacesize
zone_pcp_update
zone_reclaim
zone_reclaimable_pages
zone_statistics
zone_watermark_ok

Show all potential probes in a process by pid 3104 across all dsos
and limit to the last 10 functions.
# perf probe -S -p 3104 | tail
_nss_files_setgrent
_nss_files_sethostent
_nss_files_setnetent
_nss_files_setnetgrent
_nss_files_setprotoent
_nss_files_setpwent
_nss_files_setrpcent
_nss_files_setservent
_nss_files_setspent
_nss_netgroup_parseline

Show all potential probes in a process by pid 3104, limited to the zsh dso
and to the last 10 functions.
# perf probe -S -p 3104 -D zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam

 tools/perf/builtin-probe.c    |    2 +
 tools/perf/util/probe-event.c |   68 +++++++++++++++++++++++++++++++++--------
 tools/perf/util/probe-event.h |    4 +-
 3 files changed, 56 insertions(+), 18 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index afca6ae..f5893d9 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -274,7 +274,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 					" --line.\n");
 			usage_with_options(probe_usage, options);
 		}
-		ret = show_possible_probes(params.limitlist);
+		ret = ...
From: Srikar Dronamraju
Date: Wednesday, August 25, 2010 - 6:44 am

Enhances perf probe to accept pid and user vaddr.
Provides very basic support for uprobes.

TODO:
Update perf-probes.txt.
Global tracing.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog from V10: split few independent hunks into different
 patches as suggested by Arnaldo.

Changelog from v9: Renaming common fields/functions to refer to
probe instead of kprobe. This was suggested by Arnaldo.

Changelog from v8: Fixed a build break reported by Christoph Hellwig.

Changelog from v6: Fixed a bug reported by Masami,
  i.e. throw an error message and exit if perf probe is used for
  dwarf-based probes.

Changelog from v4: Merged to 2.6.35-rc3-tip.

Changelog from v3: (addressed comments from Masami Hiramatsu)
	* Every process id has a different group name.
	* event name starts with function name.
	* If vaddr is specified, event name has vaddr appended
	  along with function name, (this is to avoid subsequent probes
	  using same event name.)
	* warning if -p and --list options are used together.

	Also dso can either be a short name or absolute path.

Here is a terminal snapshot of placing, using and removing a probe on a
process with pid 3591 (corresponding to zsh)

[ Probing a function in the executable using function name  ]
-------------------------------------------------------------
[root@ABCD]# perf probe -p 3591 zfree@zsh
Added new event:
  probe_3591:zfree                       (on 0x446420)

You can now use it on all perf tools, such as:

	perf record -e probe_3591:zfree -a sleep 1
[root@ABCD]# perf probe --list
probe_3591:zfree                       (on 3591:0x0000000000446420)
[root@ABCD]# cat /sys/kernel/debug/tracing/uprobe_events
p:probe_3591/zfree 3591:0x0000000000446420
[root@ABCD]# perf record -f -e probe_3591:zfree -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.039 MB perf.data (~1716 samples) ]
[root@ABCD]# perf probe -p 3591 --del ...
From: Christoph Hellwig
Date: Friday, October 29, 2010 - 2:23 am

It's been a while since the last posting.  Did you make any progress on
uprobes, especially allowing to define probes based on a file name?

--

From: Srikar Dronamraju
Date: Friday, October 29, 2010 - 3:48 am

Thanks for checking. I discussed this with Peter offline and ironed out most
of the issues. I am thankful to Peter for all the suggestions.

I am still getting the inode-based uprobes into shape.
Here is a brief summary of the discussion.

Significant differences from the previous patchset are:

- All probes would be maintained in a global rbtree sorted by inode and
  offset.
- There can be one or more consumers per probe. Each consumer has one
  handler and one (optional) filter.
- The filter restricts the processes/tasks for which the handler is
  active.
- The uprobe structure is created dynamically when the first consumer
  registers for the probe. It is deallocated when all consumers have
  unregistered from the probe.
- While registering a probe, we walk through the list of vmas that are
  mapped to the inode, check whether the consumer wants to probe the
  task corresponding to each vma, and insert the breakpoint.
- Unregistering a probe does something similar, except that the probe
  is deleted.
- There will be a hook in mmap/unmap to install probes as and when the
  vma gets loaded into the process address space. This hook would walk
  through the tree of probes for that inode and, for each probe, walk
  through the list of consumers and insert/delete breakpoints
  accordingly.
- There will be a hook in fork to install probes in newly created
  processes. This hook would walk through the tree of probes for that
  inode and, for each probe, walk through the list of consumers and
  insert/delete breakpoints accordingly.
- Slots will still hang off mm_struct.
- Instead of a per-probe slot, we would have to use a per-thread slot.
  (This slot is for single-stepping out of line.) On every probe hit,
  the slot has to be refreshed with the correct contents.
- Since probe information is stored as inode:offset, probe
  identification on a breakpoint hit can only happen in task context.
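
To make the bookkeeping in the summary above concrete, here is a rough user-space model of probes keyed by (inode, offset), each carrying a list of consumers with a handler and an optional filter. It is only a sketch under stated assumptions: a plain binary search tree stands in for the rbtree, an unsigned long stands in for the inode, a pid stands in for the task, and every name in it (consumer, probe, probe_register, probe_hit and the example handler and filter) is hypothetical rather than kernel API. Unregistration and the mmap/fork hooks are left out.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space model; none of these names are kernel API. */
struct consumer {
	void (*handler)(int pid);	/* run on a probe hit */
	int (*filter)(int pid);		/* optional; NULL accepts every task */
	struct consumer *next;
};

struct probe {
	unsigned long inode;		/* stand-in for a struct inode pointer */
	unsigned long offset;		/* offset of the probed instruction */
	struct consumer *consumers;
	struct probe *left, *right;	/* plain BST links, not an rbtree */
};

/* Order probes by inode first, then by offset. */
static int probe_cmp(unsigned long inode, unsigned long offset,
		     const struct probe *p)
{
	if (inode != p->inode)
		return inode < p->inode ? -1 : 1;
	if (offset != p->offset)
		return offset < p->offset ? -1 : 1;
	return 0;
}

/* Return the link where (inode, offset) lives or would be inserted. */
static struct probe **probe_link(struct probe **root, unsigned long inode,
				 unsigned long offset)
{
	while (*root) {
		int c = probe_cmp(inode, offset, *root);

		if (!c)
			break;
		root = c < 0 ? &(*root)->left : &(*root)->right;
	}
	return root;
}

/* The probe is allocated when its first consumer registers. */
static int probe_register(struct probe **root, unsigned long inode,
			  unsigned long offset, struct consumer *con)
{
	struct probe **link = probe_link(root, inode, offset);

	if (!*link) {
		struct probe *p = calloc(1, sizeof(*p));

		if (!p)
			return -1;
		p->inode = inode;
		p->offset = offset;
		*link = p;
	}
	con->next = (*link)->consumers;
	(*link)->consumers = con;
	return 0;
}

/* On a hit, run each handler whose filter accepts the task. */
static int probe_hit(struct probe *root, unsigned long inode,
		     unsigned long offset, int pid)
{
	struct probe *p = *probe_link(&root, inode, offset);
	struct consumer *c;
	int ran = 0;

	for (c = p ? p->consumers : NULL; c; c = c->next) {
		if (c->filter && !c->filter(pid))
			continue;
		c->handler(pid);
		ran++;
	}
	return ran;	/* number of handlers that ran */
}

/* Example handler and filter, for illustration only. */
static int hits;
static void count_hit(int pid) { (void)pid; hits++; }
static int only_pid_3591(int pid) { return pid == 3591; }
```

Registering two consumers on the same (inode, offset), one with only_pid_3591 as its filter and one without a filter, would run both handlers for a hit in pid 3591 and only the unfiltered one for any other pid.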

Current issues: given a vma, finding all tasks that have this vma
mapped. The current solution seems to walk through ...
From: Christoph Hellwig
Date: Thursday, November 4, 2010 - 11:45 am

I don't see a way around that if we have to find the task by the vma.

You'll have to start with vma->vm_mm->owner and then walk the list

The performance numbers are pretty drastic.  But I'll let Peter comment
on the desire in more detail.  I'm really not in enough touch with this

I really prefer the new interface.  But as said before I'm just a user
here and I don't care how it's implemented underneath.  I'll defer to
Peter and others knowing the code in more detail to make the trade offs
between the different low level implementations.

--
