Changelog from V10:
* Replaced function pointers in user_bkpt structure with weak
functions as suggested by Peter Zijlstra.
* CONFIG_PROBE_EVENTS now selects uprobe-tracer and kprobe-tracer
as suggested by Frederic.
* Split perf-probe listing patches into smaller patches.
Changelog from V9:
* Resolved comments from Arnaldo on perf support for uprobes.
* perf probe -S will now list only global binding functions as
requested by Christoph Hellwig.
* Moved Changelog to below Signed-off-by: line, so that it's not part
of the patch description. (Suggested by Christoph.)
Changelog from V8:
* Fix build issues reported by Christoph.
* List available probes in a file without need to specify pid.
Changelog from V7:
* New feature: perf probe lists available probes.
* Fix perf probes for uprobes to exit with an error message on
DWARF-based probes.
* Merge changes to kprobes traceevent infrastructure.
* Merge changes to perf.
Changelog from V6:
* Remove perf adjust symbols patch.
Changelog from V5:
* Merged user_bkpt and user_bkpt_xol into uprobes.
* Addressed comments till now.
Changelog from V4:
* Rebased to tip tree. (2.6.35-rc3-tip)
Changelog from v3:
* Reverted to background page replacement as suggested by Peter Zijlstra.
* Dso in 'perf probe' can be either a short name or an absolute path.
* Addressed comments from Masami, Frederic, Steven on traceevents and perf
Changelog from v2:
* Addressed comments from Oleg, including removal of interrupt context
handlers, reverting background page replacement in favour of
access_process_vm().
* Provides perf interface for uprobes.
Changelog from v1:
* Added trace_event interface for uprobes.
* Addressed comments from Andrew Morton and Randy Dunlap.
For previous postings, please refer to:
http://lkml.org/lkml/2010/7/27/121, http://lkml.org/lkml/2010/7/12/67,
http://lkml.org/lkml/2010/7/8/239, http://lkml.org/lkml/2010/6/29/299,
http://lkml.org/lkml/2010/6/14/41, ...
User bkpt will use background page replacement approach to insert/delete
breakpoints. Background page replacement approach will be based on
replace_page and write_protect_page.
Now replace_page() loses its static attribute.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
include/linux/mm.h | 4 ++
mm/ksm.c | 112 -------------------------------------------------
mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 124 insertions(+), 112 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 831c693..3f014e4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -863,6 +863,10 @@ void account_page_dirtied(struct page *page, struct address_space *mapping);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
+int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte);
+int write_protect_page(struct vm_area_struct *vma, struct page *page,
+ pte_t *orig_pte);
extern unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/ksm.c b/mm/ksm.c
index e2ae004..8a792d0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -694,118 +694,6 @@ static inline int pages_identical(struct page *page1, struct page *page2)
return !memcmp_pages(page1, page2);
}
-static int write_protect_page(struct vm_area_struct *vma, struct page *page,
- pte_t *orig_pte)
-{
- struct mm_struct *mm = vma->vm_mm;
- unsigned long addr;
- pte_t *ptep;
- spinlock_t *ptl;
- int swapped;
- int err = -EFAULT;
-
- addr = page_address_in_vma(page, vma);
- if (addr == -EFAULT)
- goto out;
-
- ptep = page_check_address(page, mm, addr, &ptl, 0);
- if (!ptep)
- goto out;
-
- if (pte_write(*ptep)) {
- pte_t entry;
-
- swapped ...
Provides a mechanism in kernel to insert/remove breakpoints in
user space applications including
- architecture independent mechanism to establish breakpoints in
userspace applications.
- helper functions for reading/writing/validating data/opcodes from
target process's address space.
- wrappers and default implementation (wherever possible) of
architecture dependent functions (setting breakpoint)
- preprocessing and postprocessing of singlestep on breakpoint hit
Single stepping inline is the traditional method where original
instructions replace the breakpointed instructions on a breakpoint
hit. This method works well with single threaded applications.
However, it's racy with multithreaded applications.
In execution out of line, a thread single-steps on a copy of the
instruction. This method works well for both single-threaded and
multithreaded applications.
Uprobes uses execution out of line method.
There could be other strategies like emulating an instruction. However
they are currently not implemented.
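The difference between the two strategies can be sketched in userspace. This is only an illustration under stated assumptions: struct sim_probe and the sim_* helpers are hypothetical names, plain byte arrays stand in for the probed text and the out-of-line slot, and the caller is assumed to keep len within MAX_INSN.

```c
#include <string.h>

#define BREAKPOINT_INSN 0xCC	/* x86 int3 */
#define MAX_INSN 16

/* Hypothetical model of one probepoint; not a kernel structure. */
struct sim_probe {
	unsigned char *text;		/* probed code, shared by all threads */
	size_t offset;			/* where the breakpoint sits */
	unsigned char orig[MAX_INSN];	/* saved original instruction bytes */
	size_t len;			/* instruction length, <= MAX_INSN */
};

/* Arm: save the original bytes, then write the breakpoint over the first. */
static void sim_arm(struct sim_probe *p)
{
	memcpy(p->orig, p->text + p->offset, p->len);
	p->text[p->offset] = BREAKPOINT_INSN;
}

/*
 * Inline single-stepping: put the original bytes back into the shared
 * text. Until the breakpoint is re-armed, other threads can run past
 * the probepoint unnoticed, which is the race described above.
 */
static void sim_prepare_inline(struct sim_probe *p)
{
	memcpy(p->text + p->offset, p->orig, p->len);
}

/*
 * Execution out of line: copy the original bytes to a separate slot and
 * step there. The breakpoint in the shared text is never removed.
 */
static void sim_prepare_xol(const struct sim_probe *p, unsigned char *slot)
{
	memcpy(slot, p->orig, p->len);
}
```

After sim_prepare_xol() the text still holds the breakpoint byte, whereas after sim_prepare_inline() it does not; that window is what makes the inline method unsafe for multithreaded targets.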
Insertion and removal of breakpoints is by "Background page
replacement", i.e., make a copy of the page, modify its contents,
set the pagetable and flush the tlbs. This page uses enhanced
replace_page to cow the page. Modified page is only reflected for the
interested process. Others sharing the page will still see the old
copy.
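The copy, modify and remap sequence can be modelled in userspace as follows. This is a minimal sketch, not the replace_page() path: struct sim_mm, SIM_PAGE_SIZE and sim_insert_breakpoint() are hypothetical, a pointer stands in for the page table entry, and there is no TLB to flush.

```c
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096
#define BREAKPOINT_INSN 0xCC	/* x86 int3 */

/* Hypothetical stand-in for an mm: one "page table entry" per process. */
struct sim_mm {
	unsigned char *page;	/* page currently mapped at the probed address */
};

/*
 * Insert a breakpoint at 'offset' for one process only: copy the page,
 * patch the copy, then point this process's mapping at the copy.
 * The shared original is never written.
 * Returns 0 on success, -1 on a bad offset or allocation failure.
 */
static int sim_insert_breakpoint(struct sim_mm *mm, size_t offset)
{
	unsigned char *copy;

	if (offset >= SIM_PAGE_SIZE)
		return -1;
	copy = malloc(SIM_PAGE_SIZE);
	if (!copy)
		return -1;
	memcpy(copy, mm->page, SIM_PAGE_SIZE);
	copy[offset] = BREAKPOINT_INSN;
	mm->page = copy;	/* the kernel would also flush the TLB here */
	return 0;
}
```

If two sim_mm instances start out pointing at the same page, only the one passed to sim_insert_breakpoint() sees the 0xCC byte afterwards; the other still reads the original contents.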
You need to follow this up with the uprobes patch for your
architecture to define architecture specific functionality for
reading/writing/validating data/opcodes.
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V10: (replace function pointers with weak functions)
* Removed architecture specific function pointers in user_bkpt
structure and replaced them with weak functions as suggested by
Peter Zijlstra.
Changelog from V5: (Merge user_bkpt into uprobes)
* Merged user_bkpt into uprobes as suggested by Christoph ...
That really wants to be static, 'arch' is a way too generic a name to
either: s/uprobes_read_vm/uprobes_read_data/ or
Something like:
/* private, read-only, executable maps only */
if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
(VM_READ|VM_EXEC))
This assumes user_bkpt_opcode_t is a scalar value, but there's no
assertion of that, if someone were to define it like char[5] or somesuch
you fail to check vma->vm_end
So here check_vma() is the default implementation of validate_address(),
I hope not,.. the pte swizzle we do above does not require any such
Why would we even consider calling this function on something that would
fail the validate_address() test? If that fails we would not have
installed the breakpoint to begin with, hence there would be no reason
Again, assumes the instruction thing is a scalar.
The big thing I'm missing in this patch is generic code handling the
actual breakpoint.. but maybe that's somewhere in the next patches..
/me goes look.
--
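For illustration, the flags test suggested in the review can be exercised in userspace. This is a minimal sketch: the VM_* values are defined locally (they mirror the kernel's, but no kernel headers are involved), and flags_ok() and flags_unparenthesized() are hypothetical helper names. The second helper spells out how the compiler would parse the expression if the parentheses around the mask were dropped, since == binds tighter than & in C.

```c
/* Local copies for the sketch; assumed to mirror the kernel's values. */
#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL
#define VM_SHARED 0x8UL

/* Private, read-only, executable: READ and EXEC set, WRITE and SHARED clear. */
static int flags_ok(unsigned long vm_flags)
{
	return (vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)) ==
	       (VM_READ | VM_EXEC);
}

/*
 * What "vm_flags & MASK == WANT" means without the inner parentheses:
 * the comparison is evaluated first and yields 0, so the whole
 * expression is vm_flags & 0 and the test can never succeed.
 */
static int flags_unparenthesized(unsigned long vm_flags)
{
	return vm_flags & ((VM_READ | VM_WRITE | VM_EXEC | VM_SHARED) ==
			   (VM_READ | VM_EXEC));
}
```

flags_ok() accepts only the READ|EXEC combination, while flags_unparenthesized() rejects every input, including the one it is meant to accept.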
Provides x86 specific functions for instruction analysis and instruction
validation and x86 specific pre-processing and post-processing of
singlestep especially for RIP relative instructions.
Uses "x86: instruction decoder API" for validation and analysis of user
space instructions. This analysis is used at the time of post-processing
of breakpoint hit to do the necessary fix-ups. There is support for
breakpointing RIP relative instructions. However there are still a few
instructions that cannot be singlestepped.
Also defines TIF_UPROBE flag for x86.
This patch requires "x86: instruction decoder API"
http://lkml.org/lkml/2009/6/1/459
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V10: (replace function pointers with weak functions)
* Removed architecture specific function pointers in user_bkpt
structure and replaced them with weak functions as suggested by
Peter Zijlstra.
Changelog from V5: Merged into uprobes layer.
Changelog from V1: set UPROBES_FIX_SLEEPY if post_xol might sleep.
---
arch/x86/Kconfig | 1
arch/x86/include/asm/thread_info.h | 2
arch/x86/include/asm/uprobes.h | 43 +++
arch/x86/kernel/Makefile | 2
arch/x86/kernel/uprobes.c | 561 ++++++++++++++++++++++++++++++++++++
5 files changed, 609 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/uprobes.h
create mode 100644 arch/x86/kernel/uprobes.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f0ee331..4710268 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select HAVE_KERNEL_LZO
select HAVE_HW_BREAKPOINT
select HAVE_MIXED_BREAKPOINTS_REGS
+ select ARCH_SUPPORTS_UPROBES
select PERF_EVENTS
select HAVE_PERF_EVENTS_NMI
select ANON_INODES
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f0b6e5d..5b9c9f0 100644
--- ...
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes: Quick high level review. I did not attempt to validate the basic One general comment here: since with uprobes the instruction decoder becomes security critical did you do any fuzz tests on it (e.g. like using it on crashme or on code that has Shouldn't all this stuff be in the instruction decoder? These functions that just do a single printk seem weird. I would do that in the caller. Also the message could be shortened I guess This check is not fully correct because it's valid to have 32bit code in 64bit programs and vice versa. The only good way to check that is to look at the code segment at runtime though (and it gets complicated if you want to handle LDTs, but that could be optional). May be difficult to do though. Also the compat bit is not necessarily set if no system call is goto is automatically unlikely and unlikely is deprecated anyways. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
I haven't tried any fuzz tests with the instruction decoder. But I am not
sure if Masami has tried out some of these.
One question: Do you want to test uprobes with crashme or test
Even Peter wasn't comfortable with user_bkpt. How about user_bp? i.e. the
above field would be user_bp_opcode_t. I felt user_breakpoint_opcode_t
might look long. Also we would have to rename other structures
accordingly, like user_bkpt_task_arch_info would become
user_breakpoint_task_arch_info. Do let me know your
Okay, I can move the printk to the caller. I will try to shorten the
message. Would something like "uprobes: no support for 2-byte
validate_insn_32bit is able to identify all valid instructions in a 32
bit app and validate_insn_64bits is a superset of validate_insn_32bits;
i.e. it considers valid 32 bit codes as valid too.
Did you get a chance to look at validate_insn_32bit/validate_insn_64bits?
If you feel that validate_insn_32bit/validate_insn_64bits are unable to
detect
Okay, shall remove unlikely from the above.
--
Thanks and Regards
Srikar
--
My main objection was the uprobe.c and user_bkpt.c splitup, it's all
about uprobes, but as to this name, you can simply name it
uprobe_opcode_t, no need to preserve the whole user breakpoint thing at
all.
--
On Fri, 3 Sep 2010 23:18:32 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: Ideally both, but as a minimum the part that is exposed to user space, that is uprobes. BTW if you test it I would test it both with real crashme Yes that's fine. Optionally you could supply a short script like scripts/decodecode that feeds it through objdump -d How can this be? e.g. 32bit has 1 byte INC/DEC but on 64bit these are REX prefixes and can be in front of nearly anything. I don't think you can do a 100% solution because for 100% you would need to know the code segment the CPU is going to use later, and that's not possible in advance. A heuristic is reasonable (and leave out applications that generate 64bit code from 32bit executables or vice versa) Hmm actually I double checked and this is a separate bit. So scratch that, TIF_32BIT is ok to test. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
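The INC/DEC versus REX point can be made concrete with a small classifier. This is a hedged sketch, not the decoder's logic: enum sim_mode and both helpers are hypothetical names, and the mode is simply passed in by the caller, which sidesteps the real problem raised here that the mode cannot be known for certain in advance.

```c
#include <stdbool.h>

/* Hypothetical mode selector; the real decoder has its own notion. */
enum sim_mode { SIM_MODE_32, SIM_MODE_64 };

/* In 64-bit mode, bytes 0x40..0x4f are REX prefixes, not opcodes. */
static bool sim_is_rex_prefix(unsigned char byte, enum sim_mode mode)
{
	return mode == SIM_MODE_64 && (byte & 0xf0) == 0x40;
}

/*
 * In 32-bit mode the same bytes are the one-byte INC (0x40..0x47)
 * and DEC (0x48..0x4f) instructions.
 */
static bool sim_is_short_inc_dec(unsigned char byte, enum sim_mode mode)
{
	return mode == SIM_MODE_32 && (byte & 0xf0) == 0x40;
}
```

The same byte, for example 0x48, is a complete instruction under one mode and only a prefix under the other, so a validator that guesses the mode wrong mis-sizes the instruction that follows.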
You are right, the validate_insn_32bits refers to good_insns_32 and
validate_insn_64bits refers to good_insns_64 to decode 1 byte
instructions. Some instructions like 0x06 and 0x0e seem to be valid in
I think you are referring to RIP related instructions, this is how we
handle them. Please correct us if we are wrong, but here is what we do
- While analyzing the instruction, take into account which register
acts as the code segment register.
- When interrupted (but before singlestep), copy the contents of the
register which we think acts as code segment register in our above
analysis into per-task scratch variable.
- After singlestepping we retrieve the saved per-task scratch
Okay, Thanks for confirming this.
--
Thanks and Regards
Srikar
--
On Mon, 6 Sep 2010 19:14:07 +0530 crashme and valid 1/2 bit corrupted code please if possible. I'm I just meant regarding long mode vs compat mode which defines whether REX prefixes are valid or not. Because this can change any time (if the application does a long jump) you cannot know in advance what it is going to use. But it's also very rare to use long jumps at all, so this can be probably ignored (but should be documented somewhere), and just guess based on the executable. I just wanted to point out that it's not a 100% solution. I don't think you need to care about segment bases either. While they can be used (16bit Wine or dosemu) it's quite rare and not supporting uprobes for this is totally reasonable. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
As you can see in kernel tree, x86 insn decoder has a test which decodes vmlinux and compares results with objdump. Similar tests had been done for glibc etc. by Jim. Hmm, if you need to validate all instructions, you'd better to enhance x86 decoder for checking bad instructions. I think it can be done mostly by adding inat bitflags. --
The uprobes infrastructure enables a user to dynamically establish
probepoints in user applications and collect information by executing
a handler function when a probepoint is hit.
The user specifies the virtual address and the pid of the process of
interest along with the action to be performed. Uprobes uses the
execution out of line strategy and follows lazy slot allocation. I.e.,
on the first probe hit for that process, a new vma (to hold the probed
instructions for execution out of line) is allocated. Once allocated,
this vma remains for the life of the process, and is reused as needed
for subsequent probes. A slot in the vma is allocated for a
probepoint when it is first hit.
A slot is marked for reuse only when the probe gets unregistered and
there are no threads in the vicinity.
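The lazy allocation and reuse rules can be sketched as a small userspace allocator. Everything here is an assumption made for illustration: struct sim_xol_area, struct sim_probept, NR_SLOTS and the sim_* helpers are hypothetical, a flag stands in for creating the vma, and the "no threads in the vicinity" check is left to the caller.

```c
#include <stdbool.h>

#define NR_SLOTS 8

/* Hypothetical per-process area holding the out-of-line slots. */
struct sim_xol_area {
	bool allocated;		/* area is created on the first probe hit */
	bool in_use[NR_SLOTS];
};

/* Hypothetical probepoint; slot is -1 until the probepoint is first hit. */
struct sim_probept {
	int slot;
};

/* Called on a probe hit: returns the slot index, or -1 if the area is full. */
static int sim_get_slot(struct sim_xol_area *area, struct sim_probept *pp)
{
	int i;

	if (pp->slot >= 0)
		return pp->slot;	/* assigned on an earlier hit */
	area->allocated = true;		/* lazily "map" the area */
	for (i = 0; i < NR_SLOTS; i++) {
		if (!area->in_use[i]) {
			area->in_use[i] = true;
			pp->slot = i;
			return i;
		}
	}
	return -1;
}

/* Called on unregister, once no thread is still executing in the slot. */
static void sim_put_slot(struct sim_xol_area *area, struct sim_probept *pp)
{
	if (pp->slot >= 0) {
		area->in_use[pp->slot] = false;
		pp->slot = -1;
	}
}
```

A probepoint keeps its slot across hits, and a slot released by sim_put_slot() is handed to the next probepoint that is hit for the first time.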
In a multithreaded process, a probepoint once registered is active for
all threads of a process. If a thread specific action for a probepoint
is required then the handler should be implemented to do the same.
If a breakpoint already exists at a particular address (irrespective
of who inserted the breakpoint including uprobes), uprobes will refuse
to register any more probes at that address.
You need to follow this up with the uprobes patch for your
architecture.
For more information, please refer to Documentation/uprobes.txt
TODO:
1. Allow multiple probes at a probepoint.
2. Booster probes.
3. Allow probes to be inherited across fork.
4. probing function returns.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
---
Changelog from V5:
- Merged user_bkpt and user_bkpt_xol layers into uprobes.
Changelog from V2:
- Introduce TIF_UPROBE flag.
- uprobes hooks now in fork/exec/exit paths instead of tracehooks.
- uprobe_process is now part of the mm struct and is shared between
processes that share the mm.
- per thread information is now allocated on the fly.
* Hence allocation and ...
I wouldn't worry about that, focus on inode attached probes and you get
Seems like a weird place for this hunk, does this want to live
I find the _process postfix a bit weird in this context, how about
something like:
struct mm_uprobes *mm_uprobes;
Make that:
struct uprobe_task_state uprobe_state;
Like previously said, I would much rather see an inode/offset based
interface and do the pid/fork/pgroup/cgroup etc.. stuff as filters on
It's customary to write it like:
spinlock_t list_lock; /* protects uprobe_list, nr_uprobes */
struct list_head uprobe_list;
Why void * and not simply:
struct uprobes_xol_area xol_area;
That struct is small enough and you only get one per mm and saves you an
So this thing is a link between the process and the probe, I'm not quite
sure what you need the refcount for, it seems to me you can only have one
of these per process/probe combination.
If you had used inode/offset based probes they would have been unique in
the system and you could have had an {inode,offset} indexed global tree
(or possibly a tree per inode, but that would mean adding to the inode
structure, which I think is best avoided).
That would also reduce the mm state to purely the xol area, no need to
I would be thinking you can obtain the active probe point from the
address the task is stuck at and the state seems fairly redundant. Which
leaves you with the arch state, which afaict is exactly as large as the
All that can be replaced by unconditional functions, simply stub them
Wandering hunks, these seem to want to get folded back to wherever the
The grand thing about not having any of this process state is that you
You can replace this with:
addr = instruction_pointer(task_pt_regs(current)) -
ip_advancement_by_brkpt_insn;
and then proceed from there like described below to obtain the struct
uprobe.
You can infer the SS/HIT state by checking if the user-addr is in the
It's an address, not a struct uprobe_probept ...
Right, so one problem I overlooked is that you need to have the actual
probe to compute the jump address, but that could be fixed by
Except we need to stabilize the vma tree to do the lookup and that
currently requires mmap_sem, I guess until we get Nick's per-pte
vma-tree we could fudge that by adding a spinlock around the rb_tree
--
I am working on the file based probing. It compiles but I haven't got to
test it yet. I can post the patch if you are interested. It should
achieve something similar to inode probing.
However I would have an issue with making inode based probing the
default.
1. Making all probing based on inode can be a performance hog.
2. Unlike kernel space, every process has a different address space, so
why would we have to insert breakpoints in each process's address space
if we are not interested in them?
3. Ingo has a requirement for allowing normal users to use uprobes
through perf. When this feature gets implemented, we have to be careful
about a normal user just tracing their own application and thereby
hurting the performance of all other users.
For example, one user places a probe on /usr/lib/libc.so: malloc
- Another normal user looks at the current userspace probes and
constructs a program that just does malloc/free to degrade the
performance of the system.
- The user could be interested in just one process which could be
calling malloc just 10 times. However, during the same time there are
1000 processes which could all together call it 100000 times.
So even when we allow file based tracing across the system, it should be
restricted to just the root user.
As we discussed in previous discussions, inode based tracing wasn't
accepted back in 2006. Maybe the approach was a problem then but what
Unlike kernel probing, uprobes has a disadvantage. Let's assume that the
request for removing a probepoint arrives when some of the threads have
actually hit the probe. Because the handlers in uprobes can sleep, we
can't remove the probepoint at the same time as the request for removing
the probe. This is where refcount steps in and helps us to decide when we
can remove the probepoint. Even inode based
Let's assume the thread is about to singlestep (or has singlestepped).
So the instruction pointer is pointing to one of the slots (or it ...
You don't have to, but you can. The problem I have with this stuff is
that it makes the pid thing a primary interface, whereas it should be
The to singlestep or not would be implied by the IP pointing to the start
of a slot or not, but yes, I guess that as long as you do singlestep you
need some state.. sucks though.
Boosted probes are much nicer, they don't need that extra arch storage
either, they can simply
No particular other lock in mind, you could cmpxchg the pointer if
that's all you need it for.
The problem is that if you want inode based
What meta-data? You can find the uprobe itself from inode:offset, and
you know the return address from the trap site + orig ins size. You
don't need the probepoint, and there'd be only a single uprobe instance.
The Xol area can be found at current->mm->xol_area, I don't think you
if it's not the start of a slot, you've already single-stepped.
Ideally you'd directly implement boosted probes, but I realize that's a
tad more
Is that because of the singlestep overhead? With boosted probes I would
think it'd be much faster to take 1 trap, deal with it and continue
execution, than to frob tons of kernel code in between.
A bit more about these filter thingies, add a method to struct uprobe,
something like int uprobe::wants_probe(struct task_struct *p) and add a
single bit to task_struct (there's a few bitfields with holes in there).
Then on clone()/mmap() call the relevant wants_probe() methods, if one is
true, set the task_struct::has_uprobe flag and install the probes.
If nothing in the process wants probing, you'll never install the probes
and nothing ever triggers, if only one of many tasks in the process gets
tagged, you'll have to look up the probe anyway to know where to
continue, but you can avoid calling the handler.
--
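The filter idea can be sketched in plain C. This is an illustration under stated assumptions: struct sim_task and struct sim_filter_probe are hypothetical stand-ins for task_struct and struct uprobe, the method is modelled as a function pointer that takes the probe explicitly, and sim_same_user() is just one example policy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct. */
struct sim_task {
	int pid;
	int uid;
	bool has_uprobe;	/* the single bit suggested for task_struct */
};

/* Hypothetical stand-in for struct uprobe with a filter method. */
struct sim_filter_probe {
	/* returns nonzero if this probe wants to fire for task 't' */
	int (*wants_probe)(const struct sim_filter_probe *u,
			   const struct sim_task *t);
	int owner_uid;
};

/* Example filter: only trace tasks that belong to the probe's owner. */
static int sim_same_user(const struct sim_filter_probe *u,
			 const struct sim_task *t)
{
	return u->owner_uid == t->uid;
}

/* Would run at clone()/mmap() time: tag the task if any probe wants it. */
static void sim_tag_task(struct sim_task *t,
			 struct sim_filter_probe *const *probes, size_t n)
{
	size_t i;

	t->has_uprobe = false;
	for (i = 0; i < n; i++) {
		if (probes[i]->wants_probe(probes[i], t)) {
			t->has_uprobe = true;
			return;
		}
	}
}
```

A task that no filter accepts never gets has_uprobe set, which corresponds to the case where the probes are never installed for it.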
The breakpoint exception and singlestep account for a substantial time
I think the other way: why instrument a process and filter it out, if we
are not interested in it? While instrumenting the kernel, we don't have
this flexibility. So having a pid based filter is the right thing to do
for kernel based tracing.
If we can get the per process based tracing right, we can build higher
level stuff including the file based tracing easily. All tools/debuggers
in the past have worked with process based tracing. Tools like gdb can
actually use the displaced singlestepping feature that uprobes provides.
Some gdb developers have said on LKML earlier that they would be willing
to use displaced singlestepping if the kernel provides an API that they
can use.
Also about the security perspective when allowing normal users to use
perf to trace their applications. Using this model, we don't have to
write extra filters to limit them. These filters might allow uprobe
handlers on only tasks belonging to that user. However it still
interrupts tasks of other users. And as I said earlier, breakpoint
exception and singlestepping actually make a very substantial part of
the handling. The actual uprobe handler depending on what it does
Same namespace as the requestor. i.e. whichever name space
What if the caller does something like this when one or more threads
are processing the breakpoint.
unregister_uprobe(u);
kfree(u);
In the current implementation, the probepoint structure might be
released much later after the uprobe structure is released. Unlike
uprobe struct, probepoint structure is allocated by uprobes sub-system
and it knows how to release it cleanly. However we don't have
Yes, I agree, we may not need the state after boosted probes. I am not
sure at this time if we can do boosted probes for all
The difference between running handlers in task context and running in
interrupt context is the extra do_notify_resume() that gets called from
task context. But we have more ...
That's what atomic_inc_not_zero() and RCU are for. --
You're really not getting it, are you? No, it would result in the exact
Urgh,.. I really oppose the whole pid-centric thing, if that means
process wide and not per task it's even worse.
--
If there is just one instance of traced process for the inode then yes
the number of breakpoints when traced with pid or based on inode would be
the same. However if there are multiple instances of the traced process
[example bash/zsh] (or the inode corresponds to a library that gets
mapped into multiple processes, for example libc), and the user is
interested in tracing just one instance of the process, then won't the
inode based tracing
I would disagree. Let's consider a user who wants to trace his single
threaded app, say bash, for a few heavily used calls in libc, say the
read/select system call stubs. This user wants to keep recording at
discrete intervals, i.e. record for 5 minutes, stop for 5 minutes,
record again for 5 minutes, .... Can you list how you
Since breakpoints are shared across the tasks of the same process, we
can't do per-task based tracing. We can only do per process tracing and
filter per-task if the request is for per-task tracing, and that's what
I think you were alluding to in the filter in one of your mails. I am
okay with filtering per-task within a given process.
--
Thanks and Regards
Srikar
--
Not if your filter function works.
So let me try this again, (assumes boosted probes):
struct uprobe {
	struct inode *inode; /* we hold a ref */
	unsigned long offset;
	int (*handler)(void); /* arguments.. ? */
	int (*filter)(struct task_struct *);
	int insn_size; /* size of */
	char insn[MAX_INSN_SIZE]; /* the original insn */
	int ret_addr_offset; /* return addr offset
				in the slot */
	char replacement[SLOT_SIZE]; /* replacement
					instructions */
	atomic_t ref; /* lifetime muck */
	struct rcu_head rcu;
};
static struct {
	raw_spinlock_t tree_lock;
	struct rb_root tree;
} uprobes;
static void uprobes_add(struct uprobe *uprobe)
{
	/* add to uprobes.tree, sorted on inode:offset */
}
static void uprobes_del(struct uprobe *uprobe)
{
	/* delete from uprobes.tree */
}
static struct uprobe *
uprobes_find_get(struct address_space *mapping, unsigned long offset)
{
	unsigned long flags;
	struct uprobe *uprobe;
	raw_spin_lock_irqsave(&uprobes.tree_lock, flags);
	uprobe = find_in_tree(&uprobes.tree);
	/* assuming find_in_tree() returns NULL on a miss, skip ->ref then */
	if (uprobe && !atomic_inc_not_zero(&uprobe->ref))
		uprobe = NULL;
	raw_spin_unlock_irqrestore(&uprobes.tree_lock, flags);
	return uprobe;
}
static void __uprobe_free(struct rcu_head *head)
{
	struct uprobe *uprobe = container_of(head, struct uprobe, rcu);
	kfree(uprobe);
}
static void put_uprobe(struct uprobe *uprobe)
{
	if (atomic_dec_and_test(&uprobe->ref))
		call_rcu(&uprobe->rcu, __uprobe_free);
}
static inline int valid_vma(struct vm_area_struct *vma)
{
	if (!vma->vm_file)
		return 0;
	/* '==' binds tighter than '&', so the mask needs its own parentheses */
	if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
	    (VM_READ|VM_EXEC))
		return 1;
	return 0;
}
int register_uprobe(struct uprobe *uprobe)
{
	struct vm_area_struct *vma;
	inode_get(uprobe->inode);
	atomic_set(&uprobe->ref, 1);
	uprobes_add(uprobe); /* add before the rmap walk, so that
				new mmap()s will find it too */
	for_each_rmap_vma(vma, uprobe->inode->i_mapping) {
		struct mm_struct *mm = ...
struct uprobe is an input structure. Do we want to have
Wouldn't this be a scalability issue on bigger machines? Every probe hit
having to parse a global tree to figure out which uprobe it was seems
overkill. Consider 5000 uprobes placed on a 128 box with probes placed on
How are we synchronizing put_uprobe and a thread that has hit the
breakpoint and is searching through the global probes list?
One Nit: On probe hit we increment the ref only a few times. However we
are decrementing every time. So if two probes occur on two cpus
simultaneously, we have a chance of uprobe being freed after both of
I understand that perf top calls perf record in a loop. For every perf
record, we would be looping through each vma associated with the inode.
For a probe on libc, we would iterate through all vmas. If the
Are you looking at listing of uprobes per vma?
For each mmap, we are traversing all elements in the global tree?
What would happen if we have a huge number of uprobes in a system all
uprobe_hit I assume is going to be called in interrupt context.
Again for every probe hit, we are going through the list of vmas and
checking if it has a probe, which I think is unnecessary.
Nit: In some archs, the instruction pointer might be pointing to the next
What if we were pre-empted after this. Would preemption notifiers also
do a copy of instruction to the new slot? If yes, can you please update
me with more pointers.
And I don't know if we can do boosting for all instructions. I think
even on kprobes we don't do boosting for all instructions.
Yes, I see its advantages and disadvantages. I feel this implementation
wouldn't scale. Just because we don't want to housekeep some
information, we are looping through the global tree to figure out if
there is uprobe specific stuff to be done.
--
Thanks and Regards
Srikar
--
I didn't consider the user-space interface at all, consuming the uprobe
Use a seqcount, it's a read-mostly data structure, it's just that the
RCU, see the above atomic_inc_not_zero() it will not obtain a reference
after the final put, the object will stay valid until we pass an rcu
Feh, probe register should be considered an utter slow path.
We do rmap walks on pages all the time, I can't see it being a problem
Yeah, it does a range lookup in the tree [inode:0 - inode:-1).
O(log(n)) to find the first entry, O(log(n)) for each consecutive entry,
unless we thread the tree.
Only mmap() of that particular inode, the range lookup would be the
regular O(log(n)) for an empty range.
But again, mmap() is a relative slow path, and you need something like
I assumed process context here, but it's trivial to make it work from
interrupt context if you want, all we need is a spinlock/seqlock around
It's mostly read-only data (adding/removing probes is rare), it's all
O(log(n)), I really don't see a problem with that.
If you really worry about it you could try a hash lookup for the inode
part and keep a tree per probed inode.
--
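The reference-counting rule being relied on here (no new reference once the count has reached zero) can be shown with C11 atomics in userspace. This is a sketch of the idea only: struct sim_uprobe, sim_get() and sim_put() are hypothetical names, and a flag stands in for the RCU-deferred free, so no grace period is modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object. */
struct sim_uprobe {
	atomic_int ref;
	bool freed;	/* stands in for the RCU-deferred kfree() */
};

/* Take a reference unless the count has already dropped to zero. */
static bool sim_get(struct sim_uprobe *u)
{
	int old = atomic_load(&u->ref);

	while (old != 0) {
		/* on failure 'old' is reloaded with the current value */
		if (atomic_compare_exchange_weak(&u->ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last put releases the object. */
static void sim_put(struct sim_uprobe *u)
{
	if (atomic_fetch_sub(&u->ref, 1) == 1)
		u->freed = true;
}
```

Once the final sim_put() has run, sim_get() fails instead of resurrecting the object, so a thread that races with unregistration either holds a valid reference or gets nothing.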
I think we are both partially right in slightly different ways. I think Peter is right in that the PID should not be mandatory (e.g. specifying a PID of 0 should apply to all tasks), and you are also right in that being able to apply the "filter" directly at the executable image level is vital for performance. So how about this: we can provide both task and inode selection arguments. The task selection argument can be 0 (apply to all tasks) or non-zero (one task specifically). The inode argument would be mandatory. Then, eventually, we can enhance the generic filtering facility so it can be made aware of filtering shortcuts provided by the instrumentation (in this case, uprobes would provide a per-tgid filtering shortcut). Thoughts ? Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
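The proposed selection semantics fit in a few lines. This is a hedged sketch of the proposal as worded here, not of any existing interface: struct sim_selector and sim_selects() are hypothetical, and an unsigned long stands in for the inode.

```c
#include <stdbool.h>

/* Hypothetical selection arguments for registering a probe. */
struct sim_selector {
	unsigned long inode;	/* mandatory: which executable image */
	int tgid;		/* 0 = all tasks, otherwise one specific task */
};

/* Does a probe registered with 's' apply to this inode and task? */
static bool sim_selects(const struct sim_selector *s,
			unsigned long inode, int tgid)
{
	if (s->inode != inode)
		return false;
	return s->tgid == 0 || s->tgid == tgid;
}
```

With tgid set to 0 the selector matches every task that maps the inode; with a non-zero tgid it narrows to that one task, and the inode always has to match.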
I have the feeling that you guys are at least partially talking past
each other.
For the "perf probe --add" interface the only sane interface is one by
filename and then symbol / line number / etc.
But that is just the interface - these probes don't necessarily have to
be armed and cause global overhead once they are defined. If the
implementation is smart enough it will defer arming the probe until we
actually use it, and that will be per-process quite often.
Which btw, brings up two more issues, one in uprobes and one in perf.
For one, even in userspace I think the dynamic probes will really just be
the tip of the iceberg and we'll get more bang for the buck from static
traces, which is something that's not supported in uprobes yet. As a
start supporting the dtrace-style sdt.h header would be a great help, and
then we can decide if we need something even better on top.
The other thing is that perf currently only supports per-kernel pid
recording, while we'd really need per Posix process, which may contain
multiple threads for useful tracing of complex userspace applications. I
also suspect that this will fit the uprobes model much better given that
the probes will be in any given address space.
--
The implementation I outlined a few messages ago, would in fact, as you
perf does report both:
* { u32 pid, tid; } && PERF_SAMPLE_TID
the pid is the process id (thread group leader like) and tid is the
task/thread id.
--
It records both, but I haven't found a way to only record samples or trace things in a POSIX process. E.g. perf record -p seems to be only per-thread, not per-process. If that has changed recently everything is fine of course. --
Hrm, the record code seems to look up all threads for -p and use only a single thread for -t, didn't actually try it though so it could be borken. --
Agree, probing by file name is a requirement and I am working
Agree, that's why I am trying to build file-based probing on
Yes, static tracing using dtrace-style sdt.h is a cool thing to do. SystemTap already has this facility. However I think it's probably better done at the perf user interface level. The way I look at it, perf probe decodes the static markers and asks uprobes to place probepoints over there. Do you see a different approach? If yes, can you tell what you were looking at? -- Thanks and Regards Srikar --
We currently have this feature in UST. We're adding "markers" into the applications, and a UST daemon talks with an in-process library helper thread to enable/disable markers and control tracing over unix sockets. We're currently in the process of moving from markers to the TRACE_EVENT()+tracepoints infrastructure. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
For that I guess we should have a way to tie a uprobe to a filedesc or somesuch, that way whenever the owner dies, the probe goes away. Such probes would also obviously get a filter that limits it to tasks of its own user etc. --
Right, so in short, I think that if you rework this to be inode:offset based you'll end up with a much simpler codebase. --
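To make the inode:offset idea concrete: a probe keyed that way needs each user virtual address translated to an offset in the backing file. A minimal sketch follows, assuming the usual relation between a mapping's start address and its page offset into the file; struct sketch_mapping, vaddr_to_file_offset() and the 4 KiB page size are illustrative assumptions, not code from this series.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical, simplified view of one file-backed mapping. */
struct sketch_mapping {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_end;	/* one past the last virtual address */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Translate vaddr to a file offset; returns 0 on success, -1 if outside. */
static int vaddr_to_file_offset(const struct sketch_mapping *m,
				uint64_t vaddr, uint64_t *offset)
{
	if (vaddr < m->vm_start || vaddr >= m->vm_end)
		return -1;
	*offset = (vaddr - m->vm_start) + (m->vm_pgoff << SKETCH_PAGE_SHIFT);
	return 0;
}
```

With such a translation the same (inode, offset) pair identifies the probe point in every process that maps the file, whatever address the mapping lands at.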
Provides x86 specific details for uprobes.
This includes interrupt notifier for uprobes, enabling/disabling
singlestep.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
Changelog from V5: Using local_irq_enable() instead of
native_irq_enable and no more disabling irqs as suggested by Oleg
Nesterov.
arch/x86/kernel/signal.c | 13 +++++++++++
arch/x86/kernel/uprobes.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4fd173c..3657563 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -848,6 +848,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
if (thread_info_flags & _TIF_SIGPENDING)
do_signal(regs);
+ if (thread_info_flags & _TIF_UPROBE) {
+ clear_thread_flag(TIF_UPROBE);
+#ifdef CONFIG_X86_32
+ /*
+ * On x86_32, do_notify_resume() gets called with
+ * interrupts disabled. Hence enable interrupts if they
+ * are still disabled.
+ */
+ local_irq_enable();
+#endif
+ uprobe_notify_resume(regs);
+ }
+
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ceaedc9..6985b4c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -26,6 +26,7 @@
#include <linux/ptrace.h>
#include <linux/uprobes.h>
+#include <linux/kdebug.h>
#include <asm/insn.h>
#ifdef CONFIG_X86_32
@@ -559,3 +560,54 @@ struct user_bkpt_arch_info user_bkpt_arch_info = {
.ip_advancement_by_bkpt_insn = 1,
.max_insn_bytes = MAX_UINSN_BYTES,
};
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int uprobes_exception_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct die_args *args = ...
Uprobes Documentation.
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V5: Removed references to Modules, Samples, and probe
Overhead.
Changelog from v3: Updated measurements.
Changelog from v2: Updated measurements.
Changelog from v1: Addressed comments from Randy Dunlap.
: Updated measurements.
Documentation/uprobes.txt | 188 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 188 insertions(+), 0 deletions(-)
create mode 100644 Documentation/uprobes.txt
diff --git a/Documentation/uprobes.txt b/Documentation/uprobes.txt
new file mode 100644
index 0000000..5b620d8
--- /dev/null
+++ b/Documentation/uprobes.txt
@@ -0,0 +1,188 @@
+Title : User-Space Probes (Uprobes)
+Authors : Jim Keniston <jkenisto@us.ibm.com>
+ : Srikar Dronamraju <srikar@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Concepts: Uprobes
+2. Architectures Supported
+3. API Reference
+4. Uprobes Features and Limitations
+5. TODO
+6. Uprobes Team
+
+1. Concepts: Uprobes
+
+Uprobes enables you to dynamically break into any routine in a
+user application and collect debugging and performance information
+non-disruptively. You can trap at any code address, specifying a
+kernel handler routine to be invoked when the breakpoint is hit.
+
+A uprobe can be inserted on any instruction in the application's
+virtual address space. The registration function register_uprobe()
+specifies which process is to be probed, where the probe is to be
+inserted, and what handler is to be called when the probe is hit.
+
+1.1 How Does a Uprobe Work?
+
+When a uprobe is registered, Uprobes makes a copy of the probed
+instruction, stops the probed application, replaces the first byte(s)
+of the probed instruction with a breakpoint instruction (e.g., int3
+on i386 and x86_64), and allows the probed application to continue.
+(When inserting the breakpoint, Uprobes uses background page
+replacement mechanism, so ...
Move parts of trace_kprobe.c that can be shared with upcoming
trace_uprobe.c. Common code to kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.
TODO: Merge both events to a single probe event.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V10: delete references to is_kprobe and make
is_return a bool.
Changelog from V7: Merge changes due to string support in kprobes
traceevent.
Changelog from V5: Addressed comments from Masami Hiramatsu
and Steven Rostedt. Also shared lot more code from kprobes
traceevents.
kernel/trace/Kconfig | 4
kernel/trace/Makefile | 1
kernel/trace/trace_kprobe.c | 752 +------------------------------------------
kernel/trace/trace_probe.c | 648 +++++++++++++++++++++++++++++++++++++
kernel/trace/trace_probe.h | 155 +++++++++
5 files changed, 822 insertions(+), 738 deletions(-)
create mode 100644 kernel/trace/trace_probe.c
create mode 100644 kernel/trace/trace_probe.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..d709697 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -353,6 +353,7 @@ config KPROBE_EVENT
depends on HAVE_REGS_AND_STACK_ACCESS_API
bool "Enable kprobes-based dynamic events"
select TRACING
+ select PROBE_EVENTS
default y
help
This allows the user to add tracing events (similar to tracepoints)
@@ -365,6 +366,9 @@ config KPROBE_EVENT
This option is also required by perf-probe subcommand of perf tools.
If you want to use perf tools, this option is strongly recommended.
+config PROBE_EVENTS
+ def_bool n
+
config DYNAMIC_FTRACE
bool "enable/disable ftrace tracepoints dynamically"
depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 53f3381..95d2043 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -56,5 +56,6 @@ obj-$(CONFIG_EVENT_TRACING) += power-traces.o
ifeq ($(CONFIG_TRACING),y)
...
Provides a slot allocation mechanism for execution out of line, for use with user space breakpointing.
The traditional method of replacing the original instructions on a breakpoint hit is racy when used on multithreaded applications. Alternatives to the traditional method include:
- Emulating the breakpointed instruction.
- Execution out of line.
Emulating the instruction: This approach would use an in-kernel instruction emulator to emulate the breakpointed instruction. This approach could be looked into at a later point in time.
Execution out of line: In the execution out of line strategy, a new vma is injected into the target process, and a copy of the breakpointed instruction is stored in one of the slots. On a breakpoint hit, the copy of the instruction is single-stepped, leaving the breakpoint instruction as is. This method is architecture independent and is useful when handling multithreaded processes.
This patch allocates one page per process for slots to be used to copy the breakpointed instructions.
Current slot allocation mechanism:
1. Allocate one dedicated slot per user breakpoint. Each slot is big enough to accommodate the biggest instruction for that architecture (16 bytes for x86).
2. We currently allocate only one page for slots. Hence the number of slots is limited to active breakpoint hits on that process.
3. Bitmap to track used slots.
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V5: Merged into uprobes.
Changelog from V3:
* Added a memory barrier after the slot gets initialized.
Changelog from V2: (addressing Oleg's comments)
* Removed code in !CONFIG_UPROBES_XOL
* Functions now pass pointer to uprobes_xol_area instead of pointer to void.
include/linux/uprobes.h | 2
kernel/uprobes.c | 283 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 285 insertions(+), 0 deletions(-)
diff --git ...
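The slot allocation mechanism described in the patch text (one page of fixed-size slots, a bitmap tracking which are in use) can be illustrated with a small user-space sketch. It is not the patch's code: every sketch_* name is hypothetical, the 4096-byte page is an assumption, the 16-byte slot size is taken from the "16 bytes for x86" remark, and locking is left out entirely.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096UL	/* assumption: one 4 KiB page */
#define SKETCH_SLOT_BYTES	16UL	/* from "16 bytes for x86" */
#define SKETCH_NSLOTS		(SKETCH_PAGE_SIZE / SKETCH_SLOT_BYTES)
#define SKETCH_BPW		(sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NWORDS		((SKETCH_NSLOTS + SKETCH_BPW - 1) / SKETCH_BPW)

/* Hypothetical stand-in for the per-process slot area. */
struct sketch_xol_area {
	unsigned long vaddr;			/* user address of the slot page */
	unsigned long bitmap[SKETCH_NWORDS];	/* one bit per slot, 1 = in use */
};

/* Claim the first free slot; returns its address, or 0 if the page is full. */
static unsigned long sketch_take_slot(struct sketch_xol_area *area)
{
	for (size_t i = 0; i < SKETCH_NSLOTS; i++) {
		unsigned long *word = &area->bitmap[i / SKETCH_BPW];
		unsigned long mask = 1UL << (i % SKETCH_BPW);

		if (!(*word & mask)) {
			*word |= mask;
			return area->vaddr + i * SKETCH_SLOT_BYTES;
		}
	}
	return 0;
}

/* Release a slot previously returned by sketch_take_slot(). */
static void sketch_free_slot(struct sketch_xol_area *area,
			     unsigned long slot_vaddr)
{
	size_t i;

	if (slot_vaddr < area->vaddr)
		return;
	i = (slot_vaddr - area->vaddr) / SKETCH_SLOT_BYTES;
	if (i >= SKETCH_NSLOTS)
		return;
	area->bitmap[i / SKETCH_BPW] &= ~(1UL << (i % SKETCH_BPW));
}
```

Under these assumptions a single page yields 256 slots, which is the hard limit point 2 of the description refers to.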
Since you have a static sized bitmap, why not simply declare it here?
Naughty kernel modules we don't care about, but yeah, it appears vma's installed using install_special_mapping() can be unmapped by the process itself,.. curious. Anyway, you could install your own vm_ops and provide a close method to
Seems interesting,.. why not use install_special_mapping(), that's what
It doesn't actually do that, xol_add_vma() does that, this allocates the
There's a nice way to not have to write that:
I would call that allocate, find would imply a constant operation, but
if (!xol_vaddr) goto bail; gives nicer code, and saves an indent level.
Also, why would we ever get here with !user_bkpt->vaddr.
funny code flow,.. s/found = 1/return/ and lose the conditional and
This doesn't actually appear used in this patch,.. does it want to live elsewhere? --
Okay, I hadn't looked at install_special_mapping earlier so I will take a look and incorporate it.
However I am not clear at this point what install_special_mapping is giving us here. Also install_special_mapping is already defining its own vm_ops, especially a close method that doesn't seem to be doing anything. So at this point I am not clear how we are link
For now, user_bkpt->vaddr will always be set when we are here. However when we add uretprobe support, we would then get here with user_bkpt->vaddr being NULL. I would drop the check for now, but add it later when we add the return
I had renamed the structure from ubp to user_bkpt based on your comments. I had actually mentioned this in the summary mail that I had sent on Jan 22 this year. I am fine to rename it to user_bp if that
I don't want the compiler to reorder the instructions so that the assignment to user_bkpt is done before we complete the copy above. If the assignment happens before we copy the content into the slot, some other thread that hits the same probe might think the slot is ready and try to jump to that slot even before the slot is initialized.
Yes, xol_validate_vaddr gets used in the next patch. So probably it can be moved to the next patch. -- Thanks and Regards Srikar --
If you want a compiler barrier, use barrier(), but here you seem to describe a multi-threaded situation, in which case the observer thread needs at least a rmb() in order for that mb() to mean anything other than the compiler barrier it implies. Also, use smp_* barriers. --
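The pairing requirement described above (a write-side barrier only means something if the observer has a matching read-side barrier) can be shown with a user-space analogue. The sketch below uses C11 release/acquire atomics in place of the kernel's smp_wmb()/smp_rmb(); struct sketch_bkpt and both functions are hypothetical stand-ins, and the sketch covers ordering only, not mutual exclusion between threads that both try to set up the slot.

```c
#include <stdatomic.h>
#include <string.h>

#define SKETCH_SLOT_BYTES 16

/* Hypothetical stand-in for a breakpoint and its out-of-line slot. */
struct sketch_bkpt {
	unsigned char insn[SKETCH_SLOT_BYTES];	/* original instruction bytes */
	unsigned char slot[SKETCH_SLOT_BYTES];	/* stand-in for slot memory */
	_Atomic(unsigned long) xol_vaddr;	/* 0 until the slot is ready */
};

/* Writer: fill the slot first, then publish its address. */
static void sketch_publish_slot(struct sketch_bkpt *b, unsigned long vaddr)
{
	memcpy(b->slot, b->insn, sizeof(b->slot));
	/* Release: the copy above is visible before the non-zero address. */
	atomic_store_explicit(&b->xol_vaddr, vaddr, memory_order_release);
}

/* Reader: a non-zero result guarantees the slot contents are visible. */
static unsigned long sketch_lookup_slot(struct sketch_bkpt *b)
{
	/* Acquire: pairs with the release store in sketch_publish_slot(). */
	return atomic_load_explicit(&b->xol_vaddr, memory_order_acquire);
}
```

Note where the read-side ordering sits: on (or after) the load of the published pointer, before the slot contents are used, not ahead of the load.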
Okay, would something like this suffice?
static unsigned long xol_get_insn_slot(struct user_bkpt *user_bkpt,
struct uprobes_xol_area *xol_area)
{
unsigned long flags, xol_vaddr = 0;
int len;
if (unlikely(!xol_area))
return 0;
smp_rmb();
if (user_bkpt->xol_vaddr)
return user_bkpt->xol_vaddr;
spin_lock_irqsave(&xol_area->lock, flags);
xol_vaddr = xol_take_insn_slot(xol_area);
spin_unlock_irqrestore(&xol_area->lock, flags);
/*
* Initialize the slot if user_bkpt->vaddr points to valid
* instruction slot.
*/
if (!xol_vaddr)
return 0;
len = access_process_vm(current, xol_vaddr, user_bkpt->insn,
UPROBES_XOL_SLOT_BYTES, 1);
if (unlikely(len < UPROBES_XOL_SLOT_BYTES))
printk(KERN_ERR "Failed to copy instruction at %#lx "
"len = %d\n", user_bkpt->vaddr, len);
/*
* Update user_bkpt->xol_vaddr after giving a chance for the slot to
* be initialized.
*/
smp_mb();
user_bkpt->xol_vaddr = xol_vaddr;
return user_bkpt->xol_vaddr;
}
--
Thanks and Regards
Srikar
--
Racy like you won't believe.. Suppose multiple threads hitting the trap at the same time, every thread will end up failing the check and allocating a new slot for it, at the end the slowest thread will end up setting the value. --
Agree, I shall fix this up. Since set_bit and clear_bit are atomic, I shall change the area->lock from a spinlock to a mutex, and have the mutex released after the slot has been updated with the "single-stepping instruction". -- Thanks and Regards Srikar --
What you're doing might well be the right thing, I was just wondering. I think that, after thinking about it more, that the shmem file thing you're doing has the added benefit that the things gets auto-magic paging, which is a good thing. --
An alternative method would be to have 1 slot per cpu, and manage the slot content using preemption notifiers. That gives you a fixed number of slots and an unlimited number of probe points. If the preemption happens to be a migration you need to rewrite the userspace IP to point to the new slot -- if indeed the task was inside one when it got preempted -- but that all should be doable. --
Certainly doable but it has its share of drawbacks.
1. On every probe hit we have to copy the instruction into the slot, so there is a performance penalty.
2. This might complicate booster probe, because the jump instruction that follows the original instruction now actually has to be coded every time.
3. Yes, migration is an issue, especially
 - if a thread of the same process that hit a breakpoint is scheduled onto the same cpu and that newly scheduled thread hits a breakpoint.
 - Something similar can happen if a multithreaded process runs on a uniprocessor machine.
4. I don't see a need for clearing slots after post processing, but if we need to clear, we are then adding more penalties because not only are we clearing the slots but the post processing then can't happen in interrupt context.
5. I think we are covered on cpu hotplug too (i.e. not sure if we have to make uprobes cpu hotplug aware).
6. We would still be allocating a page for the slots.
Unless we want to expand to more slots than available in one page, I don't see the disadvantages with the current approach.
-- Thanks and Regards Srikar --
Yeah, although I imagine its nearly free since you need to pay the Why can't you keep the whole replacement sequence in-tact? Simply copy post-processing? you mean the probe handler? Why couldn't that be done Not if you use a slot per cpu and use preemption notifiers, the The current approach limits the number of probes to what fits in a page. The slot per cpu approach will have no such limit. --
Let's say the thread while singlestepping the process gets pre-empted. Eventually the cpu might run some other thread of the same process before picking the first run thread. Or the first run thread could after migration due to load balancing or whatever end up
Yes, the limit on number of probes is a limitation. For now the implementation would be straight and easy. We could either rework on the
Yes, if we use jump absolute then the replacement sequence stays intact. -- Thanks and Regards Srikar --
So assuming we're preempted while the IP is inside the slot: On the preempt-out we store the slot relative ip (ip - start_of_slot), on preempt-in we write the replacement instructions in our cpu slot (could be the same cpu, could be another) and re-position the ip to point to the same relative position inside that slot, then go! It really doesn't matter what happens in between. --
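The preempt-out/preempt-in bookkeeping described above amounts to saving a slot-relative offset and reapplying it to whichever slot the task lands on. A minimal sketch, assuming a hypothetical struct sketch_task and leaving out the rewrite of the replacement instructions into the new slot:

```c
#include <stdint.h>

/* Hypothetical per-task state; names are illustrative only. */
struct sketch_task {
	uint64_t ip;		/* saved user instruction pointer */
	int64_t slot_rel_ip;	/* offset inside the slot, -1 if not in one */
};

/* On preempt-out: remember how far into the slot the task had got. */
static void sketch_preempt_out(struct sketch_task *t,
			       uint64_t slot_start, uint64_t slot_bytes)
{
	if (t->ip >= slot_start && t->ip < slot_start + slot_bytes)
		t->slot_rel_ip = (int64_t)(t->ip - slot_start);
	else
		t->slot_rel_ip = -1;
}

/*
 * On preempt-in: the caller is assumed to have rewritten the replacement
 * instructions into the slot of the cpu we now run on; point the ip at
 * the same relative position inside that slot.
 */
static void sketch_preempt_in(struct sketch_task *t, uint64_t new_slot_start)
{
	if (t->slot_rel_ip < 0)
		return;
	t->ip = new_slot_start + (uint64_t)t->slot_rel_ip;
}
```

Because only the relative offset is carried across, it does not matter whether the task resumes on the same cpu or a different one.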
Right, but with the proposed slot-per-cpu we'd be able to have unlimited active probes within that single page, even with boosted probes. Assuming 16 bytes per instruction for the sequence push reg; mov reg,foo; insn; pop reg; jmp, plus cacheline alignment, we'd end up with 128 bytes per slot, so we can service 32 cpus per page. Which, for now, means that all my machines need but a single page. --
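The arithmetic behind "128 bytes per slot, 32 cpus per page" can be spelled out. The sketch below assumes a 64-byte cacheline and a 4096-byte page (neither is stated in the mail) and uses hypothetical names throughout.

```c
#define SKETCH_PAGE_SIZE	4096u	/* assumption: 4 KiB page */
#define SKETCH_CACHELINE	64u	/* assumption: 64-byte cachelines */
#define SKETCH_MAX_INSN_BYTES	16u	/* "16 bytes per instruction" */
#define SKETCH_INSNS_PER_SLOT	5u	/* push, mov, insn, pop, jmp */

/* Round n up to the next multiple of align. */
static unsigned int sketch_round_up(unsigned int n, unsigned int align)
{
	return (n + align - 1) / align * align;
}

/* 5 * 16 = 80 bytes, rounded up to whole cachelines. */
static unsigned int sketch_slot_bytes(void)
{
	return sketch_round_up(SKETCH_INSNS_PER_SLOT * SKETCH_MAX_INSN_BYTES,
			       SKETCH_CACHELINE);
}

/* Number of per-cpu slots that fit in one page. */
static unsigned int sketch_cpus_per_page(void)
{
	return SKETCH_PAGE_SIZE / sketch_slot_bytes();
}
```

Five instructions of at most 16 bytes need 80 bytes, which rounds up to two cachelines, and a page divided by that slot size gives the per-page cpu count quoted in the mail.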
Implements trace_event support for uprobes. In its
current form it can be used to put probes at a specified text address
in a process and dump the required registers when the code flow reaches
the probed address.
TODO: Documentation/trace/uprobetrace.txt
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from v5: Addressed comments from Masami Hiramatsu and Steven
Rostedt. Some changes because of changes in common probe events.
Changelog from v4: (Merged to 2.6.35-rc3-tip)
Changelog from v2/v3: (Addressing comments from Steven Rostedt
and Frederic Weisbecker)
* removed pit field from uprobe_trace_entry.
* share common parts with kprobe trace events.
* use trace_create_file instead of debugfs_create_file.
The following example shows how to dump the instruction pointer and the %ax
register at the probed text address.
Start a process to trace. Get the address to trace.
[Here pid is assumed to be 6016]
[Address to trace is 0x0000000000446420]
[Registers to be dumped are %ip and %ax]
# cd /sys/kernel/debug/tracing/
# echo 'p 6016:0x0000000000446420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_6016_0x0000000000446420 6016:0x0000000000446420 %ip=%ip %ax=%ax
# cat events/uprobes/p_6016_0x0000000000446420/enable
0
[enable the event]
# echo 1 > events/uprobes/p_6016_0x0000000000446420/enable
# cat events/uprobes/p_6016_0x0000000000446420/enable
1
# #### do some activity on the program so that it hits the breakpoint
# cat uprobe_profile
6016 p_6016_0x0000000000446420 234
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-6016 [004] 227931.093579: p_6016_0x0000000000446420: (0x446420) %ip=446421 %ax=79
zsh-6016 [005] 227931.097541: p_6016_0x0000000000446420: (0x446420) %ip=446421 %ax=79
zsh-6016 [000] 227931.124909: p_6016_0x0000000000446420: (0x446420) %ip=446421 ...
Given a dso, list the symbols in ascending order. Needed for listing
available symbols from perf-probe.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
tools/perf/util/symbol.c | 14 ++++++++++++++
tools/perf/util/symbol.h | 1 +
2 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 1a36773..ca22032 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -388,6 +388,20 @@ size_t dso__fprintf_buildid(struct dso *self, FILE *fp)
return fprintf(fp, "%s", sbuild_id);
}
+size_t dso__fprintf_symbols(struct dso *self, enum map_type type, FILE *fp)
+{
+ size_t ret = 0;
+ struct rb_node *nd;
+ struct symbol_name_rb_node *pos;
+
+ for (nd = rb_first(&self->symbol_names[type]); nd; nd = rb_next(nd)) {
+ pos = rb_entry(nd, struct symbol_name_rb_node, rb_node);
+ fprintf(fp, "%s\n", pos->sym.name);
+ }
+
+ return ret;
+}
+
size_t dso__fprintf(struct dso *self, enum map_type type, FILE *fp)
{
struct rb_node *nd;
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index b7a8da4..72ef973 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -181,6 +181,7 @@ size_t machines__fprintf_dsos(struct rb_root *self, FILE *fp);
size_t machines__fprintf_dsos_buildid(struct rb_root *self, FILE *fp, bool with_hits);
size_t dso__fprintf_buildid(struct dso *self, FILE *fp);
+size_t dso__fprintf_symbols(struct dso *self, enum map_type type, FILE *fp);
size_t dso__fprintf(struct dso *self, enum map_type type, FILE *fp);
enum dso_origin {
--
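One detail of dso__fprintf_symbols() as posted: ret is initialised to 0 and returned, but the loop never adds to it, so the function returns 0 whatever was printed, whereas the neighbouring dso__fprintf_buildid() returns what fprintf() reports. A minimal sketch of accumulating the count follows, using a plain array as a hypothetical stand-in for the rb-tree walk; sketch_fprintf_names() is not a perf function.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Print one name per line and return the number of bytes written.
 * The array stands in for walking self->symbol_names[type].
 */
static size_t sketch_fprintf_names(const char *const *names, size_t n,
				   FILE *fp)
{
	size_t ret = 0;

	for (size_t i = 0; i < n; i++) {
		int printed = fprintf(fp, "%s\n", names[i]);

		/* fprintf() returns a negative value on error; skip those. */
		if (printed > 0)
			ret += (size_t)printed;
	}
	return ret;
}
```

Callers that ignore the return value are unaffected either way; the difference only shows up for code that sums the sizes reported by the *__fprintf helpers.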
Applied after renaming it to 'dso__fprintf_symbols_by_name', as at first I was scratching my head to figure out if we could reuse it in dso__fprintf() to then notice that it is in ascending _name_ order, not the default that is ordered by addr :-) Please fixup the users, i.e. perf probe in your patchset. - Arnaldo --
Yes, Shall send the refreshed patch now -- Thanks and Regards Srikar --
Commit-ID: 90f18e63fbd005133624bf18a5e8b75c92e90f4d Gitweb: http://git.kernel.org/tip/90f18e63fbd005133624bf18a5e8b75c92e90f4d Author: Srikar Dronamraju <srikar@linux.vnet.ibm.com> AuthorDate: Wed, 25 Aug 2010 19:13:29 +0530 Committer: Arnaldo Carvalho de Melo <acme@redhat.com> CommitDate: Wed, 25 Aug 2010 17:28:59 -0300 perf symbols: List symbols in a dso in ascending name order Given a dso, list the symbols in ascending name order. Needed for listing available symbols from perf probe. Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com> Cc: Mark Wielaard <mjw@redhat.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Naren A Devaiah <naren.devaiah@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100825134329.5447.92261.sendpatchset@localhost6.localdomain6> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> --- tools/perf/util/symbol.c | 14 ++++++++++++++ tools/perf/util/symbol.h | 1 + 2 files changed, 15 insertions(+), 0 deletions(-) diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c index 1a36773..a08e1cb 100644 --- a/tools/perf/util/symbol.c +++ b/tools/perf/util/symbol.c @@ -388,6 +388,20 @@ size_t dso__fprintf_buildid(struct dso *self, FILE *fp) return fprintf(fp, "%s", sbuild_id); } +size_t dso__fprintf_symbols_by_name(struct dso *self, enum map_type type, FILE *fp) +{ + size_t ret = 0; + struct rb_node *nd; + struct symbol_name_rb_node *pos; + + for (nd = ...
Selecting CONFIG_PROBE_EVENTS enables both kprobe-based and uprobes-based dynamic events. However kprobe-tracer or uprobe-tracer can still be individually selected or disabled. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Suggested-by: Frederic Weisbecker <fweisbec@gmail.com> --- kernel/trace/Kconfig | 51 +++++++++++++++++++++++++++++--------------------- 1 files changed, 30 insertions(+), 21 deletions(-) diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 55ba474..205c12b 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -76,7 +76,7 @@ config RING_BUFFER_ALLOW_SWAP # All tracer options should select GENERIC_TRACER. For those options that are # enabled by all tracers (context switch and event tracer) they select TRACING. # This allows those options to appear when no other tracer is selected. But the -# options do not appear when something else selects it. We need the two options +# options do not appear when something else selects it. We need the two option # GENERIC_TRACER and TRACING to avoid circular dependencies to accomplish the # hiding of the automatic options. @@ -162,7 +162,7 @@ config IRQSOFF_TRACER This option measures the time spent in irqs-off critical sections, with microsecond accuracy. - The default measurement method is a maximum search, which is + The default measurement method is a maximum search, which i disabled by default and can be runtime (re-)started via: @@ -184,7 +184,7 @@ config PREEMPT_TRACER This option measures the time spent in preemption-off critical sections, with microsecond accuracy. - The default measurement method is a maximum search, which is + The default measurement method is a maximum search, which i disabled by default and can be runtime (re-)started via: @@ -228,7 +228,7 @@ choice prompt "Branch Profiling" default BRANCH_PROFILE_NONE help - The branch profiling is a software profiler. It will add hooks + The ...
Hmm, without this series, KPROBE_EVENT is set "y" by default. (PROBE_EVENTS is introduced by 8/15) I'd like to set this "y" by default, because it doesn't affect other parts. --
Okay will correct them. This is based on what we discussed here http://lkml.org/lkml/2010/8/2/86. To recollect, Frederic wanted that there should be one option to select both UPROBE_EVENT and KPROBE_EVENT. However if we make PROBE_EVENTS (which is the option to enable both events) default "Y", then both UPROBE_EVENT and KPROBE_EVENT will be selected. Also if we look at http://lkml.org/lkml/2010/6/21/160, Steven Rostedt didn't want UPROBE_EVENT to be selected by default. I agree that we should keep UPROBE_EVENT as 'default n' till it gets tested. Hence we have two choices: either set the common knob to be 'default n', or don't have the common knob for now (i.e. drop this patch for now). I think we should go with the first one, i.e. have a common knob that's unselected by default. -- Thanks and Regards Srikar --
Yeah, I'm OK to have a common knob, but I just don't like to set KPROBE_EVENT unselected by default. I think there is no reason to change default selecting (currently, KPROBE_EVENT=y by default.) So, I think we should have below selecting list; --- Tracers ... [*] Enable dynamic events [ ] Enable user-space dynamic events (EXPERIMENTAL) ... What would you think about this ? :) Thank you, -- Masami HIRAMATSU 2nd Dept. Linux Technology Center Hitachi, Ltd., Systems Development Laboratory E-mail: masami.hiramatsu.pt@hitachi.com --
Wouldn't it negate the purpose of the common knob? Because people would still have to go and select UPROBE_EVENT. I think when Frederic asked for a common knob, he was looking at enabling both or disabling both and an option to selectively select one of the tracers. -- Thanks and Regards Srikar --
Hmm, I think this just seems an enhancement of dynamic events, and also you can enable it by default on some point. I mean, eventually, there will be only "Enable dynamic events" Yeah, but I'd like to ask Frederic that he expected disabling KPROBE_EVENT by default too, even though it changes current default config. Thank you, --
Selecting CONFIG_PROBE_EVENTS enables both kprobe-based and uprobes-based dynamic events. However kprobe-tracer or uprobe-tracer can still be individually selected or disabled. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Suggested-by: Frederic Weisbecker <fweisbec@gmail.com> --- Changelog from V10: Fixed few erroneous changes: missing s at eol. reported by Masami Hiramatsu. kernel/trace/Kconfig | 21 +++++++++++++++------ 1 files changed, 15 insertions(+), 6 deletions(-) diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 55ba474..77e04b0 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -351,9 +351,8 @@ config BLK_DEV_IO_TRACE config KPROBE_EVENT depends on KPROBES depends on HAVE_REGS_AND_STACK_ACCESS_API + depends on PROBE_EVENTS bool "Enable kprobes-based dynamic events" - select TRACING - select PROBE_EVENTS default y help This allows the user to add tracing events (similar to tracepoints) @@ -370,10 +369,9 @@ config UPROBE_EVENT bool "Enable uprobes-based dynamic events" depends on ARCH_SUPPORTS_UPROBES depends on MMU + depends on PROBE_EVENTS select UPROBES - select PROBE_EVENTS - select TRACING - default n + default y help This allows the user to add tracing events on top of userspace dynamic events (similar to tracepoints) on the fly via the traceevents interface. @@ -383,7 +381,18 @@ config UPROBE_EVENT tools on user space applications. config PROBE_EVENTS - def_bool n + bool "Enable kprobes and uprobe based dynamic events" + select TRACING + default n + help + This allows a user to add dynamic tracing events in + kernel using kprobe-tracer and in userspace using + uprobe-tracer. However users can still selectively + disable one of these events. + + For more information on kprobe-tracer and uprobe-tracer + please refer help under KPROBE_EVENT and UPROBE_EVENT + respectively. config DYNAMIC_FTRACE bool "enable/disable ftrace ...
Introduces -S/--show_functions option for perf-probe.
This lists function names in a File. If no file is specified, then lists
functions in the current running kernel.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V10: As suggested by Arnaldo, filtering is now
based on sym.binding.
Changelog from V9: Filter labels, weak, and local binding functions
from listing as suggested by Christoph Hellwig.
Show last 10 functions in /bin/zsh.
# perf probe -S -D /bin/zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam
Show first 10 functions in /lib/libc.so.6
# perf probe -S -D /lib/libc.so.6 | head
_IO_adjust_column
_IO_adjust_wcolumn
_IO_default_doallocate
_IO_default_finish
_IO_default_pbackfail
_IO_default_uflow
_IO_default_xsgetn
_IO_default_xsputn
_IO_do_write@@GLIBC_2.2.5
_IO_doallocbuf
Show last 10 functions in kernel.
# perf probe -S | tail
zlib_inflateInit2
zlib_inflateReset
zlib_inflate_blob
zlib_inflate_table
zlib_inflate_workspacesize
zone_pcp_update
zone_reclaim
zone_reclaimable_pages
zone_statistics
zone_watermark_ok
tools/perf/builtin-probe.c | 43 ++++++++++++++++++++++++
tools/perf/util/probe-event.c | 72 +++++++++++++++++++++++++++++++++++++++++
tools/perf/util/probe-event.h | 1 +
tools/perf/util/symbol.c | 8 +++++
tools/perf/util/symbol.h | 1 +
5 files changed, 124 insertions(+), 1 deletions(-)
diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 199d5e1..fa63245 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -50,9 +50,11 @@ static struct {
bool list_events;
bool force_add;
bool show_lines;
+ bool list_functions;
int nevents;
struct perf_probe_event events[MAX_PROBES];
struct strlist *dellist;
+ struct strlist *limitlist;
struct line_range line_range;
int max_probe_points;
} params;
@@ -132,6 +134,19 @@ static int opt_show_lines(const struct ...
Introduces -S/--show_functions option for perf-probe.
This lists function names in a File. If no file is specified, then lists
functions in the current running kernel.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V11: accommodate name change to dso__fprintf_symbols_by_name.
Changelog from V10: As suggested by Arnaldo, filtering is now
based on sym.binding.
Changelog from V9: Filter labels, weak, and local binding functions
from listing as suggested by Christoph Hellwig.
Show last 10 functions in /bin/zsh.
# perf probe -S -D /bin/zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam
Show first 10 functions in /lib/libc.so.6
# perf probe -S -D /lib/libc.so.6 | head
_IO_adjust_column
_IO_adjust_wcolumn
_IO_default_doallocate
_IO_default_finish
_IO_default_pbackfail
_IO_default_uflow
_IO_default_xsgetn
_IO_default_xsputn
_IO_do_write@@GLIBC_2.2.5
_IO_doallocbuf
Show last 10 functions in kernel.
# perf probe -S | tail
zlib_inflateInit2
zlib_inflateReset
zlib_inflate_blob
zlib_inflate_table
zlib_inflate_workspacesize
zone_pcp_update
zone_reclaim
zone_reclaimable_pages
zone_statistics
zone_watermark_ok
---
tools/perf/builtin-probe.c | 43 ++++++++++++++++++++++++
tools/perf/util/probe-event.c | 72 +++++++++++++++++++++++++++++++++++++++++
tools/perf/util/probe-event.h | 1 +
tools/perf/util/symbol.c | 8 +++++
tools/perf/util/symbol.h | 1 +
5 files changed, 124 insertions(+), 1 deletions(-)
diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 199d5e1..fa63245 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -50,9 +50,11 @@ static struct {
bool list_events;
bool force_add;
bool show_lines;
+ bool list_functions;
int nevents;
struct perf_probe_event events[MAX_PROBES];
struct strlist *dellist;
+ struct strlist *limitlist;
struct line_range line_range;
int ...

Hi Srikar,

Hmm, I think the basic functionality of this patch (I mean functions in
running kernel) could be merged separately.
However, I'd rather use --funcs/-F and --dso/-D instead of above. :)

Thank you,
--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com
--
Introduces map_groups__for_each_map, which iterates over the maps of one
type in a struct map_groups.
This is useful while listing functions through perf-probe.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Arnaldo Carvalho de Melo <acme@infradead.org>
---
tools/perf/util/map.h | 27 +++++++++++++++++++++++++++
1 files changed, 27 insertions(+), 0 deletions(-)
diff --git a/tools/perf/util/map.h b/tools/perf/util/map.h
index 7857579..45b5f50 100644
--- a/tools/perf/util/map.h
+++ b/tools/perf/util/map.h
@@ -54,6 +54,33 @@ struct map_groups {
struct machine *machine;
};
+/* For map_groups iteration */
+static inline struct map *map__first(struct map_groups *self,
+ enum map_type type)
+{
+ struct rb_node *rn = rb_first(&self->maps[type]);
+ return rn ? rb_entry(rn, struct map, rb_node) : NULL;
+}
+
+static inline struct map *map__next(struct map *map)
+{
+ struct rb_node *rn;
+ if (!map)
+ return NULL;
+ rn = rb_next(&map->rb_node);
+ return rn ? rb_entry(rn, struct map, rb_node) : NULL;
+}
+
+/**
+ * map_groups__for_each_map - iterate over a map_group
+ * @pos: the &struct map to use as a loop cursor.
+ * @type: the map type.
+ * @self: the &struct map_groups for loop.
+ */
+#define map_groups__for_each_map(pos, type, self) \
+ for (pos = map__first(self, type); pos; \
+ pos = map__next(pos))
+
/* Native host kernel uses -1 as pid index in machine */
#define HOST_KERNEL_ID (-1)
#define DEFAULT_GUEST_KERNEL_ID (0)
--
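For readers unfamiliar with this cursor-style iteration, here is a minimal, self-contained sketch of the same first/next/for_each pattern. Everything prefixed demo_ is hypothetical: a singly linked list stands in for the rb-tree that perf's struct map_groups uses, so this only illustrates how a caller would use such a macro, not the code in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for perf's struct map; a list replaces the rb-tree. */
struct demo_map {
	const char *dso_name;
	struct demo_map *next;
};

enum demo_map_type { DEMO_MAP_FUNCTION, DEMO_MAP_VARIABLE, DEMO_MAP_NR_TYPES };

/* Hypothetical stand-in for struct map_groups: one list head per map type. */
struct demo_map_groups {
	struct demo_map *maps[DEMO_MAP_NR_TYPES];
};

static inline struct demo_map *demo_map__first(struct demo_map_groups *self,
					       enum demo_map_type type)
{
	return self->maps[type];
}

static inline struct demo_map *demo_map__next(struct demo_map *map)
{
	return map ? map->next : NULL;
}

/* Same shape as the macro in the patch: pos is the loop cursor. */
#define demo_map_groups__for_each_map(pos, type, self) \
	for (pos = demo_map__first(self, type); pos; \
	     pos = demo_map__next(pos))

/* Example caller: count the maps of one type in a group. */
static int demo_count_maps(struct demo_map_groups *mg, enum demo_map_type type)
{
	struct demo_map *pos;
	int n = 0;

	assert(mg != NULL);
	demo_map_groups__for_each_map(pos, type, mg)
		n++;
	return n;
}
```

A listing routine would replace the counter with whatever per-map work it needs, for example printing the symbols of each map's dso.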
Introduces an option to list potential probe points using the perf probe
command. Also introduces an option to limit the listing to a given dso.
The listing is sorted by dso and then alphabetically.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V9: Filter labels, weak, and local binding functions
from listing as suggested by Christoph Hellwig.
Incorporated comments from Arnaldo on Version 9 of patchset.
Show all potential probes in the current running kernel and limit to
the last 10 functions.
# perf probe -S | tail
zlib_inflateInit2
zlib_inflateReset
zlib_inflate_blob
zlib_inflate_table
zlib_inflate_workspacesize
zone_pcp_update
zone_reclaim
zone_reclaimable_pages
zone_statistics
zone_watermark_ok
Show all potential probes in a process by pid 3104 across all dsos and
limit to the last 10 functions.
# perf probe -S -p 3104 | tail
_nss_files_setgrent
_nss_files_sethostent
_nss_files_setnetent
_nss_files_setnetgrent
_nss_files_setprotoent
_nss_files_setpwent
_nss_files_setrpcent
_nss_files_setservent
_nss_files_setspent
_nss_netgroup_parseline
Show all potential probes in a process by pid 3104, limit to the zsh dso
and limit to the last 10 functions.
# perf probe -S -p 3104 -D zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam
tools/perf/builtin-probe.c | 2 +
tools/perf/util/probe-event.c | 68 +++++++++++++++++++++++++++++++++--------
tools/perf/util/probe-event.h | 4 +-
3 files changed, 56 insertions(+), 18 deletions(-)
diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index afca6ae..f5893d9 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -274,7 +274,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
" --line.\n");
usage_with_options(probe_usage, options);
}
- ret = show_possible_probes(params.limitlist);
+ ret = ...
Enhances perf probe to accept pid and user vaddr.
Provides very basic support for uprobes.
TODO: Update perf-probes.txt. Global tracing.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog from V10: split a few independent hunks into different patches
as suggested by Arnaldo.
Changelog from v9: Renamed common fields/functions to refer to probe
instead of kprobe. This was suggested by Arnaldo.
Changelog from v8: Fixed a build break reported by Christoph Hellwig.
Changelog from v6: Fixed a bug reported by Masami, i.e. throw an error
message and exit if perf probe is used for a dwarf based probe.
Changelog from v4: Merged to 2.6.35-rc3-tip.
Changelog from v3: (addressed comments from Masami Hiramatsu)
* Every process id has a different group name.
* Event name starts with function name.
* If vaddr is specified, event name has vaddr appended
along with function name. (This is to avoid subsequent probes
using the same event name.)
* Warning if -p and --list options are used together.
Also dso can either be a short name or absolute path.
Here is a terminal snapshot of placing, using and removing a probe on a
process with pid 3591 (corresponding to zsh)
[ Probing a function in the executable using function name ]
-------------------------------------------------------------
[root@ABCD]# perf probe -p 3591 zfree@zsh
Added new event:
probe_3591:zfree (on 0x446420)
You can now use it on all perf tools, such as:
perf record -e probe_3591:zfree -a sleep 1
[root@ABCD]# perf probe --list
probe_3591:zfree (on 3591:0x0000000000446420)
[root@ABCD]# cat /sys/kernel/debug/tracing/uprobe_events
p:probe_3591/zfree 3591:0x0000000000446420
[root@ABCD]# perf record -f -e probe_3591:zfree -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.039 MB perf.data (~1716 samples) ]
[root@ABCD]# perf probe -p 3591 --del ...
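As an illustration of the naming scheme visible in the snapshot above (group probe_<pid>, event named after the function, address printed as 16 hex digits), the following sketch rebuilds the uprobe_events line shown there. The helper name demo_format_uprobe_def and its signature are assumptions made for this example; the format string is inferred only from the single line of output above and is not taken from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: formats "p:probe_<pid>/<func> <pid>:0x<vaddr>" into buf.
 * Returns the snprintf() result, or -1 if the function name is empty.
 */
static int demo_format_uprobe_def(char *buf, size_t len, int pid,
				  const char *func, unsigned long long vaddr)
{
	assert(buf != NULL && func != NULL);

	if (strlen(func) == 0)
		return -1;

	/* Every pid gets its own group; the event name starts with the function name. */
	return snprintf(buf, len, "p:probe_%d/%s %d:0x%016llx",
			pid, func, pid, vaddr);
}
```

Called with pid 3591, function "zfree" and address 0x446420, this produces the same text as the uprobe_events line in the snapshot.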
It's been a while since the last posting. Did you make any progress on uprobes, especially allowing to define probes based on a file name? --
Thanks for checking. I discussed with Peter offline and ironed out most
of the issues. I am thankful to Peter for all the suggestions.
I am still getting the inode based uprobes into shape.

Here is a brief summary of the discussion.
Significant differences from the previous patchset are:
- All probes would be maintained in a global rbtree sorted by inode and
  offset.
- There can be one or more consumers per probe. With each consumer there
  will be one handler and one (optional) filter.
- The filter restricts the processes/tasks for which the handler is active.
- The uprobe structure is dynamically created when the first consumer
  registers to the probe. It gets deallocated when all consumers
  unregister from the probe.
- While registering a probe, we walk through the list of vmas that are
  mapped to the inode, check if the consumer wants to probe the task
  corresponding to the vma, and insert the breakpoint.
- Unregistering a probe also does something similar except for deleting
  the probe.
- There will be a hook in mmap/unmap to install probes as and when the
  vma gets loaded into the process address space. This hook would walk
  through the tree of probes for that inode and for each probe, walk
  through the list of consumers and insert/delete breakpoints accordingly.
- There will be a hook in fork to install probes in newly created
  processes. This hook would walk through the tree of probes for that
  inode and for each probe, walk through the list of consumers and
  insert/delete breakpoints accordingly.
- Slots will still hang off mm_struct.
- Instead of the per-probe slot, we would have to use a per-thread slot.
  (This slot is for single stepping out of line.) On every probe hit, the
  slot has to be refreshed with the correct contents.
- Since probe information is stored as inode:offset, probe identification
  on a breakpoint hit can only happen in task context.

Current issues:
Given a vma; finding all tasks that have this vma mapped.
The current solution seems to walk through ...
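To make the data layout in the summary above a little more concrete, here is a rough user-space sketch of two of its pieces: an ordering on (inode, offset) of the kind an rbtree keyed that way would need, and a consumer list where each consumer has a handler and an optional filter. All names are hypothetical, an integer stands in for the inode, and none of this is the kernel code under discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical probe key: an inode number stands in for the inode itself. */
struct demo_probe_key {
	unsigned long inode_no;
	unsigned long long offset;
};

/* Orders keys by inode first, then by offset within the inode. */
static int demo_probe_key_cmp(const struct demo_probe_key *a,
			      const struct demo_probe_key *b)
{
	if (a->inode_no != b->inode_no)
		return a->inode_no < b->inode_no ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Hypothetical consumer: one handler, one optional filter (NULL = every task). */
struct demo_consumer {
	void (*handler)(int pid, int *hits);
	int (*filter)(int pid);
	struct demo_consumer *next;
};

/* On a probe hit, run the handler of every consumer whose filter accepts pid. */
static int demo_run_consumers(const struct demo_consumer *list, int pid)
{
	int hits = 0;

	for (; list; list = list->next) {
		if (list->filter && !list->filter(pid))
			continue;	/* this consumer is not interested in pid */
		list->handler(pid, &hits);
	}
	return hits;
}

/* Example handler and filter used to exercise the sketch. */
static void demo_count_hit(int pid, int *hits) { (void)pid; (*hits)++; }
static int demo_only_pid_3591(int pid) { return pid == 3591; }
```

With one unfiltered consumer and one consumer filtered to pid 3591 attached to the same probe, a hit in pid 3591 runs both handlers while a hit in any other task runs only the unfiltered one.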
I don't see a way around that if we have to find the task by the vma.
You'll have to start with vma->vm_mm->owner and then walk the list

The performance numbers are pretty drastic. But I'll let Peter comment
on the desire in more detail. I'm really not in enough touch with this

I really prefer the new interface. But as said before I'm just a user
here and I don't care how it's implemented underneath. I'll defer to
Peter and others knowing the code in more detail to make the trade offs
between the different low level implementations.
--
