login
Header Space

 
 

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

Previous thread: [PATCH 1/3] rcutorture: Use ARRAY_SIZE macro when appropriate by Josh Triplett on Wednesday, February 7, 2007 - 2:54 pm. (7 messages)

Next thread: [GIT PATCH] ACPI patches for 2.6.21 by Len Brown on Wednesday, February 7, 2007 - 3:18 pm. (17 messages)
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, February 7, 2007 - 3:22 pm

I wonder where this convention of using lower numbers to indicate higher 
priorities comes from...  It seems to be quite prevalent even though it's 


All right.  However this means thread_struct will have to grow in order to
hold each task's debug-register allocations and priorities.  Would that be
acceptable?  (This might be a good reason to keep the number of bits
down.)

Another question: How can a program using the ptrace API ever give up a
debug-register allocation?  Disabling the register isn't sufficient; a
user program should be able to store a watchpoint address in dr1, enable
it in dr7, disable it in dr7, and then re-enable it in the expectation
that the address stored in dr1 hasn't been overwritten.  As far as I can
see, ptrace-type allocations have to be permanent (until the task exits or
execs, or uses some other to-be-determined API to do the de-allocation.)

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 9, 2007 - 6:21 am

Well, noone seems to mind the wasted debugreg[5..6] words. ;-) 

I'm inclined to make thread_struct smaller than it is now by making it
indirect (debugreg[8] -&gt; one pointer).  It feels like this would be pretty
safe now that we have TIF_DEBUG anyway.  Already nothing should be looking

I hadn't really thought about this before, but it's pretty straightforward.
Each allocation is for one of %dr[0-3] and for its associated bits in %dr7.
%dr0 and bits 0,1,16-19; %dr1 and bits 2,3,20-23; etc.
%dr7 &amp; (0x000f0003 &lt;&lt; N) for %drN, I guess it is.
((((1 &lt;&lt; DR_CONTROL_SIZE) - 1) &lt;&lt; DR_CONTROL_SHIFT) |
 ((1 &lt;&lt; DR_ENABLE_SIZE) - 1)) &lt;&lt; N, I should say.

Each allocation owns those 38 bits (70 bits on x86_64).  When all those
bits are zero, the allocation is not active and might as well not be there
(except for whatever semantics you might want to have about an allocation's
lifetime as distinct from the event of resetting its contents).  

gdb already works this way: when it removes a watchpoint, it clears drN to
zero when it zeros all the corresponding bits in dr7.  (In fact it's in a
separate call after changing dr7, because of the ptrace interface.)

Your question made me think about the %dr6 issue, too, which I also hadn't
thought about before.  It looks like your patch handles this correctly, but
it's a subtle point that I think warrants some comments in the code.  When
userland inserts a watchpoint and it's hit, it gets a SIGTRAP via do_debug
and eventually looks at dr6 (via ptrace) to see what happened.  Kernel
watchpoints that come along after the user signal is generated can clobber
the CPU %dr6 with new hits, but userland should not perceive this.  This
works out because what userland sees is thread.debugreg[6], the only place
that sets it is do_debug, and a kwatch hit bails out before changing it.
Any other new users of the debugreg sharing interface need to be cognizant
of the %dr6 issue to avoid stepping on old ptrace uses.


Thanks,
Roland
...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 9, 2007 - 11:54 am

Yes.  In fact, the current existing code does not handle dr6 correctly.  
It never clears the register, which means you're likely to get into 
trouble when multiple breakpoints (or watchpoints) are enabled.


Here's another complication -- and this is one I can't figure out any easy
solutions for.  The type of API we've been discussing will work well
enough on UP systems, but what about SMP?  I don't see any value in having
a kernel watchpoint enabled on some CPUs and not others.  But then what
should happen when a debug register is in use by kwatch and a ptrace
request on one CPU usurps it (leaving no other register in which to put
it)?  Not to mention the difficulties of keeping track of everything when
the same kwatch watchpoint is stored in different debug registers on
different CPUs.

It's really quite a tricky matter.  Should a register be allocated to
kwatch only when no user process needs it?  Should we really go about
checking the requirements of every single process whenever a kwatch
allocation request comes in?  What if the processes which need a
particular register aren't running -- should the register then be given to
kwatch?  What if one of those processes then does start running on one
CPU?

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 9, 2007 - 7:31 pm

This is a subtle change from the existing ABI, in which userland has to
clear %dr6 via ptrace itself.  But gdb never does that AFAICT.  So it's in
fact subject to confusion when two watchpoints are set and the second hits
after the first.  So gdb ought to be fixed to clear dr6 via ptrace, to work
with existing and older kernels.

I don't think I really object to the ABI change of clearing %dr6 after an
exception so that it does not accumulate multiple results.  But first I'll
have to convince myself that we never actually do want to accumulate
multiple results.  Hmm, I think we can, so maybe I do object.  If you set
two watchpoints inside a user buffer and then do a system call that touches
both those addresses (e.g. read), then you will go through do_debug (to
send_sigtrap) twice before returning to user mode.  When the syscall is
done, you'll have a pending SIGTRAP for the debugger to handle.  By looking
at your %dr6 the debugger can see that both watchpoints hit.  (gdb does not
handle this case, but it should.)  Am I wrong?

So this gets to the more complicated view of %dr6 handling that I had first
had in mind yesterday.  Each allocation "owns" one of the low 4 bits in
%dr6 too.  Only the dr6 bits owned by the userland "raw" allocation
(i.e. ptrace/utrace_regset) should appear nonzero in thread.debugreg[6].
So when kwatch swallows a debug exception, it should mask off its bit from
%dr6 in the CPU, but not clear %dr6 completely.  That way you can have a
sequence of user dr0 hit, kwatch dr3 hit, user dr1 hit, all inside one
system call (including interrupt handlers), and when it gets to the

To "go about checking the requirements of every single process" is not so
hard as it sounds when they're recorded as a single global use count per
slot, as your original code does.  When you mentioned a "your allocation is
available" callback, I was thinking it might come to that being called
inside context switch.  It's all rather tricky, indeed.  

The obvious answer is to start simple.  ...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, February 21, 2007 - 4:35 pm

Going back to something you mentioned earlier...


Yes, you are wrong -- although perhaps you shouldn't be.

The current version of do_debug() clears dr7 when a debug interrupt is 
fielded.  As a result, if a system call touches two watchpoint addresses 
in userspace only the first access will be noticed.

This is probably a bug in do_debug().  It would be better to disable each
individual userspace watchpoint as it is triggered (or even not to disable
it at all).  dr7 would be restored when the SIGTRAP is delivered.  (But
what if the user is blocking or ignoring SIGTRAP?)


Moving on...

I've worked out a plan for implementing combined user/kernel mode 
breakpoints and watchpoints (call them "debugpoints" as a catch-all 
term).  It should work transparently with ptrace and should accomodate 
whatever scheme utrace decides to adopt.  There won't need to be a 
separate kwatch facility on top of it; the new combined facility will 
handle debugpoints in both userspace and kernelspace.

The idea is that callers can register and unregister a struct debugpoint, 
which contains fields for the type, length, address, and priority as well 
as three callback pointers (for installed, uninstalled, and triggered).  
The debug core will keep these structures sorted by priority and will 
allocate the available debug registers based on the priorities of the 
userspace and kernelspace requests.  Each CPU will have its own array of 
pointers to these structures, indicating which debugpoints are currently 
enabled.

To work with ptrace, the new scheme will completely virtualize the debug
registers.  So for example, a ptrace call might request a debugpoint in
dr0, but it could end up that the physical register used is really dr2
instead.  The various bits in dr6 and dr7 will be mapped in such a way
that the entire procedure is transparent to the user.  Debugpoints 
registered in kernelspace or by utrace won't care, of course.

There are two things I am uncertain about: vm86 mode and kprobes.  I...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Thursday, February 22, 2007 - 10:19 pm

The user blocking or ignoring it doesn't come up, because it's a
force_sig_info call.  However, a debugger will indeed swallow the signal
through ptrace/utrace means.  In ptrace, the dr7 is always going to get
reset because there will always be a context switch out and back in that
does it.  But with utrace it's now possible to swallow the signal and keep
going without a context switch (e.g. a breakpoint that is just doing
logging but not stopping).  So perhaps we should have a TIF_RESTORE_DR7
that goes into _TIF_WORK_MASK and gets handled in do_notify_resume
(or maybe it's TIF_HWBKPT).

You should not actually need to disable user watchpoints, because in data
watchpoints the exception comes after the instruction completes.  Only for
instruction watchpoints does the exception come before the instruction
executes, and no user watchpoints can be in the address range containing
kernel code.  

SIGTRAP both doesn't queue, and doesn't give %dr6 values in its siginfo_t.
All user watchpoints will be handled via the signal; this is the only way
ptrace can report them, and is also the utrace way of doing things.
do_debug can happen inside kernel code, and tracing of user-level tasks can
only safely do anything at the point just before returning to user mode,
where signals are handled.  So, getting to send_sigtrap in do_debug is
enough to say "one or more user debug exceptions happened".  The %dr6 value
that collects in the thread state to be seen by ptrace, or by utrace-based
things using your new facility, needs to collect all the %dr6 bits that
were set by the hardware and weren't consumed by kernel-level tracing.  An
eventual utrace-based thing might in fact have some other way to tie in so
that the event details could just be in some call made by do_debug and not
recorded in the thread's virtual %dr6 value.  But at least for ptrace, they
should collect there if it becomes possible for more than one exception to
happen while in kernel mode or in a single user instruction.  (A single

...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 23, 2007 - 12:55 pm

I think the best approach will be not to reset dr7 at all.  Then there 
won't be any need to worry about restoring it.  Leaving a userspace 
watchpoint enabled while running in the kernel ought not to matter; a 
system call shouldn't touch any address in userspace more than once or 


My idea was to put 4 hwbkpt structures in thread_struct, so they would
always be available for use by ptrace.  However it turned out not to be
feasible to replace the debugreg array with something more sophisticated,
because of conflicting declarations and problems with the ordering of
#includes.  So instead I have been forced to replace debugreg[] with a
pointer to a structure which can be allocated as needed.

This raises the possibility that a PTRACE syscall might fail because the 
allocation fails.  Hopefully that won't be an issue?

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 23, 2007 - 8:08 pm

Hmm.  That sounds reasonable.  But I wonder why the old code clears %dr7.

I think that's preferable anyway.  Most tasks most of the time will never
need that storage, so why not make thread_struct a little smaller?

It's not a new issue, anyway, after utrace.  The utrace-based ptrace can
fail for PTRACE_ATTACH because of OOM too, which wasn't possible before.
I think it's survivable.


Thanks,
Roland
-
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, March 2, 2007 - 1:19 pm

Roland and Prasanna:

Here's my first attempt, lightly tested, at an hwbkpt implementation.  It
includes copious comments, so it shouldn't be too hard to figure out (if
you read the files in the right order).  The patch below is meant for
2.6.21-rc2; porting it to -mm shouldn't be very hard.

There are still several loose ends and unanswered questions.

	I pretty much copied the existing code for handling vm86 mode
	and single-step exceptions, without fully understanding it.

	The code doesn't virtualize the BS (single-step) flag in DR6
	for userspace.  It could be added, but I wonder whether it is
	really needed.

	Unlike the existing code, DR7 is re-enabled upon returning from
	a debug interrupt.  That means it doesn't have to be enabled
	when delivering a SIGTRAP.

	Setting user breakpoints on I/O ports should require permissions
	checking.  I haven't tried to figure out how that works or
	how to implement it yet.

	It seems likely that some of the new routines should be marked
	"__kprobes", but I don't know which, or even what that annotation
	is supposed to mean.

	When CPUs go on- or off-line, their debug registers need to be
	initialized or cleared.  I did a little bit of that, but more is
	needed.  In particular, CPU hotplugging and kexec have to take
	this into account.

	The parts relating to kernel breakpoints could be made conditional
	on a Kconfig option.  The amount of code space saved would be
	relatively small; I'm not sure that it would be worthwhile.

Probably there are some more issues I haven't thought of.  Anyway, let me 
know what you think.

Alan Stern



Index: 2.6.21-rc2/include/asm-i386/hwbkpt.h
===================================================================
--- /dev/null
+++ 2.6.21-rc2/include/asm-i386/hwbkpt.h
@@ -0,0 +1,185 @@
+#ifndef	_I386_HWBKPT_H
+#define	_I386_HWBKPT_H
+
+#include &lt;linux/list.h&gt;
+#include &lt;linux/types.h&gt;
+
+/**
+ * struct hwbkpt - unified kernel/user-space hardware breakpoint
+ * @node: inte...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, March 5, 2007 - 3:01 am

There is only one TF flag, it and the DR_STEP bit in dr6 are part of the
unitary thread state.  You should not be doing anything at all on

I would just leave the I/O breakpoint feature out for the first cut.  
See if there is demand for it.  

It requires setting CR4.DE, which we don't do.  The validation is
relatively simple, comparing against the io_bitmap_ptr data (if any).
You can check at insertion time (requiring doing ioperm before inserting
an I/O breakpoint), and then check again at exception time if the hit
was in kernel mode and ignore it for a user-only breakpoint, or perhaps
check again and eject the breakpoint if it's been cleared from the

That annotation is for code that is in the kprobes maintenance code path
(where you cannot insert kprobes).  do_debug has it for the notify_die
path that might lead to kprobes.  The only thing in your code that
should need it is your notifier function, which might get called for an
exception that turns out to be a kprobes single-step.  There shouldn't

In a utrace merge, the user parts can be made conditional on CONFIG_UTRACE.
Then with both turned off, the code goes away completely.  It's unlikely it
will ever be turned off, but it is a clean way to go about things in case

Save space in the struct by having just one function for both installed
and uninstalled, taking an argument.  Probably a caller should be able to
pass a null function here to say that the registration call should fail if
it can't be installed due to higher-priority or no-callback registrations
existing, and that its registration cannot be ejected by another (i.e., an

Leave this out.  Let the caller embed struct hwbkpt in his larger struct

These are the kinds of constraints utrace is there to make doable.
(I'm assuming that guarantee is really "will not start running in
user mode", since SIGKILL is always possible.)  In a utrace merge,
I think we want the only entry points for this to be a utrace-based
interface that explicitly requires utrace's notion o...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, March 5, 2007 - 1:25 pm

Presumably you mean that hw-breakpoint.c shouldn't do anything at all on
single-step exceptions.  do_debug does have to know about them, because it
has to know whether or not to send a SIGTRAP.  Separation of duties; I


It's easier simply to ignore the whole issue.  :-)  Fortunately many other 


So far I've been developing under 2.6.21-rc, which doesn't have utrace.  
But eventually this will be submitted by way of -mm, which does.  The
easiest approach would be to make the whole thing conditional on 


The actual guarantee I need is that nobody will switch_to() the task while

Yes; I had in mind that the user-breakpoint part would be called only by 
utrace and/or ptrace.  That's why the routines aren't EXPORTed.  But I 
wrote it in more general form, to make it easier to understand.  I'll add 

If someone really needs to do that, they can always put their own call to
(un)register_kernel_hwbkpt() at the entry(exit) to the complex subsystems.  
Or perhaps it should be a job for systemtap, which would use hwbkpt to do



That may be so, but the only way to access that part of the state is via
ptrace.  Think of it this way: The debug register settings really should
not be part of the thread's virtual state.  If we had some other, more
logical API for managing breakpoints in a task then ptrace_bps[] wouldn't

That would make things much more complicated.  It's a lot easier to treat 
all breakpoints as though they are the same under the hood.

Besides, the extra memory usage will show up only on threads that are 

Another thing to watch out for is that we would have to allow length-8





That doesn't seem quite right.  What if you encounter _both_ a single-step 
exception and a breakpoint at the same time?  Probably kprobes should 
always return NOTIFY_DONE.

Which implies that do_debug needs to decide whether or not to issue 
SIGTRAP.  Presumably the condition will be that any of the DR_STEP or 
DR_TRAPn bits remain set after the notifier chain has run.  This means ...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, March 5, 2007 - 11:13 pm

You can't get that.  It can always be woken for SIGKILL (which is a good
thing).  What you are guaranteed is that if it does, it will never return
to user mode.  So it has to be ok for switching in to use the bits in any
intermediate state you might get them, meaning any possible garbage state
is harmful only to user mode or is otherwise recoverable (worst case
perhaps the exception handler has to know to ignore some traps).  This is
already true with ptrace and -&gt;thread.debugreg, as well as the normal user
registers.  In your case, if you wanted to be paranoid you could clear
TIF_DEBUG before you touch anything, and set it again only after you're

But you don't have an option to avoid interrupting other CPUs to update,
which is not necessary or desireable for this usage.  That's what I was


As things are in utrace, there will continue to be a utrace method of
setting the (virtual) "raw" debugregs, even if ptrace per se is not involved.
(So all I'm saying really is I'm on a personal campaign against the letter P.)

OTOH, your point is well taken.  Once your stuff is integrated, there is no
real reason that thread-virtualized "raw" debug registers need to be
accessible via utrace_regset.  Perhaps I should drop it.  Then those calls

Yeah, I guess that's right.  It should still return NOTIFY_STOP when
args-&gt;err has no other bits set, so notifiers aren't called with zero.


Thanks,
Roland
-
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, March 6, 2007 - 11:23 am

Guess I'll have to take that approach.  The new additions to __switch_to() 
follow a linked list, and obviously it's not safe to do that while the 

I see your point.  If you want to monitor a certain location in kernel
space but only while in the context of a specific task, the new framework
doesn't provide any way of doing it.  One approach could be to use a
regular kernel breakpoint and return immediately if the task of interest

In practice that might not work.  On my machine, at least, reads of DR6
return ones in all the reserved bit positions.

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, March 6, 2007 - 11:49 pm

Does that mean asm("mov %1,%%dr6; mov %%dr6,%0" : "=r" (mask) : "r" (0)); 
puts in mask the set of reserved bits?  We could collect that value at CPU
startup and mask it off args-&gt;err, then OR it back into vdr6.


Thanks,
Roland
-
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, March 7, 2007 - 3:11 pm

That sounds like a rather fragile approach to avoiding a minimal amount of 
work.  Debug exceptions don't occur very often, and when they do it won't 
matter too much if we go through some extra notifier-chain callouts.



It turns out that this won't work correctly unless I use something
stronger, like a spinlock or RCU.  Either one seems like overkill.

Is there any way to find out from within the
switch_to_thread_hw_breakpoint routine whether the task is in this unusual
state?  (By which I mean the task is being debugged and the debugger
hasn't told it to start running.)  Would (tsk-&gt;exit_code == SIGKILL) work?  
If not, can we add a TIF_DEBUG_STOPPED flag?  Or should I just go with a 
spinlock?

Is SIGKILL the only way this can happen?

In a similar vein, I need a reliable way to know whether a task has gone 
through exit_thread().  If it has, then its hw_breakpoint area has been 
deallocated and a new one must not be allocated.  Will (tsk-&gt;flags &amp; 
PF_EXITING) always be true once that happens?

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, March 9, 2007 - 2:52 am

When single-stepping occurs it happens repeatedly many times, and that

What is the problem with just clearing TIF_DEBUG?  It just means that in
the SIGKILL case, the dying thread won't switch in its local debugregs.
The global kernel allocations will already be set in the processor from the
previous context, and old user-address allocations do no harm since we

That won't necessarily work.  There isn't any cheap check that won't also


If it's really necessary, but it hasn't proved so for any other switched
per-thread state.  As long as you aren't doing per-thread kernel-mode

It should be, but there might be some stray wake_up_process calls in the
kernel that can violate [up]trace's supposed monopoly on TASK_TRACED (or
duopoly with SIGKILL, I suppose I should say).  If there is no SIGKILL,
then the task will just call schedule again nearly immediately to go back
to blocking, which will switch out unless there is a second wakeup right

PF_EXITING it set after there is no possibility of returning to user mode,
but a while before exit_thread, when you might still want kernel-mode
breakpoints.  If the only per-thread allocations you support are for user
mode, then you can certainly refuse to do any when PF_EXITING is set.


Thanks,
Roland

-
To: Roland McGrath <roland@...>, Christoph Hellwig <hch@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, March 9, 2007 - 2:40 pm

Well, I can add in the test for 0, but finding the set of always-on bits
in DR6 will have to be done separately.  Isn't it possible that different

Here's what might happen.  Let's say T is being debugged by D, when
somebody sends T a SIGKILL.  CPU 0 does a __switch_to() to take care of
terminating T, and since TIF_DEBUG is still set, it runs the
switch_to_thread_hw_breakpoint() routine.  The routine begins to install 
the debug registers for T.

At that moment D, running on CPU 1, decides to unregister a breakpoint in
T.  Clearing TIF_DEBUG now doesn't do any good -- it's too late; CPU 0 has
already tested it.  CPU 1 goes in and alters the user breakpoint data,
maybe even deallocating a breakpoint structure that CPU 0 is about to
read.  Not a good situation.

What I need is some way for CPU 0 to know from the start that the current 
task is about to die, so it can avoid installing the user breakpoints 
altogether.  Or else there has to be some sort of mutual exclusion so that 

No way to tell when a task being debugged is started up by anything other 
than its debugger?  Hmmm, in that case maybe it would be better to use 
RCU.  It won't add much overhead to anything but the code for registering 

Okay, that's easy enough.



Below is the current version of the patch, against 2.6.21-rc3.  The source 
file include/asm-i386/hw_breakpoint.h includes an example showing how to 
set up a kernel breakpoint.

Alan Stern



Index: 2.6.21-rc2/include/asm-i386/hw_breakpoint.h
===================================================================
--- /dev/null
+++ 2.6.21-rc2/include/asm-i386/hw_breakpoint.h
@@ -0,0 +1,193 @@
+#ifndef	_I386_HW_BREAKPOINT_H
+#define	_I386_HW_BREAKPOINT_H
+
+#include &lt;linux/list.h&gt;
+#include &lt;linux/types.h&gt;
+
+/**
+ *
+ * struct hw_breakpoint - unified kernel/user-space hardware breakpoint
+ * @node: internal linked-list management
+ * @triggered: callback invoked when the breakpoint is hit
+ * @installed: callback invoked when the b...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, March 13, 2007 - 4:00 am

I don't know, but it seems unlikely.  AFAIK all CPUs are presumed to have


Indeed, it is for this sort of thing.  Still, it feels like a bit too much
is going on in switch_to_thread_hw_breakpoint for the common case.
It seems to me it ought to be something as simple as this:

	if (unlikely((thbi-&gt;want_dr7 &amp;~ chbi-&gt;kdr7) != thbi-&gt;active_tdr7) {
		/* Need to make some installed or uninstalled callbacks.  */
		if (thbi-&gt;active_tdr7 &amp; chbi-&gt;kdr7)
			uninstalled callbacks;
		else
			installed callbacks;
		recompute active_dr7, want_dr7;
	}

	switch (thbi-&gt;active_bkpts) {
	case 4:
		set_debugreg(0, thbi-&gt;tdr[0]);
	case 3:
		set_debugreg(1, thbi-&gt;tdr[1]);
	case 2:
		set_debugreg(2, thbi-&gt;tdr[2]);
	case 1:
		set_debugreg(3, thbi-&gt;tdr[3]);
	}
	set_debugreg(7, chbi-&gt;kdr7 | thbi-&gt;active_tdr7);

Only in the unlikely case do you need to worry about synchronization,
whether it's RCU or spin locks or whatever.  The idea is that breakpoint
installation when doing the fancy stuff would reduce it to "the four
breakpoints I would like, in order" (tdr[3] containing the highest-priority
one), and dr7 masks describing what dr7 you were using last time you were
running (active_tdr7), and that plus the enable bits you would like to have
set (want_dr7).  The unlikely case is when the number of kernel debugregs
consumed changed since the last time you switched in, so you go recompute
active_tdr7.  (Put the body of that if in another function.)

For the masks to work as I described, you need to use the same enable bit
(or both) for kernel and user allocations.  It really doesn't matter which
one you use, since all of Linux is "local" for the sense of the dr7 enable
bits (i.e. you should just use DR_GLOBAL_ENABLE).

It's perfectly safe to access all this stuff while it might be getting
overwritten, and worst case you switch in some user breakpoints you didn't
want.  That only happens in the SIGKILL case, when you never hit user mode

I don't unders...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, March 13, 2007 - 2:56 pm

Yes, the code could be reworked by moving some of the data from the CPU
hw-breakpoint info into the thread's info.  I'll see how much simpler it

It isn't quite that easy.  Even though the number of user breakpoints may
not have changed, their identities may have.  So the unlikely case has to
encompass two possibilities: the number of installable user breakpoints

This shouldn't be necessary.  So long as DR_GLOBAL_ENABLE always belongs
to the kernel's part of DR7 and DR_LOCAL_ENABLE always belongs to the

Maybe.  I always had in the back of my mind the possibility that there
might be a user I/O breakpoint set.  It could be triggered by an interrupt
handler even in the SIGKILL case.  But since we're not supporting I/O

Actually the code _doesn't_ already know what's there; the chbi area
doesn't include any storage for the kernel DR7 value.  I figured it was at
least as easy to read it from the CPU register as to read it from memory.  
But maybe that's not true; according to my ancient processor manual, moves
to/from debug registers take many more clock cycles than moves to/from

Okay.  That test was for situations when it's necessary to install only 
the kernel's debug registers, with no thread registers.  It can easily 

I was being overly cautious.  If interrupts were enabled then it might be 
possible to trip a user I/O breakpoint.  But since they aren't, that code 

No.  If a debugger has removed some user breakpoints since the last time
the thread ran, the chbi-&gt;bps[] entries could still be present.  Likewise
if the previously-running task had more breakpoints than the current one.

Of course it doesn't actually hurt to have stale pointers lying around;
they refer to breakpoints which aren't enabled.  So clearing them isn't

This proposed change is wrong.  Remember that ptrace breakpoints are 
virtualized; they are assigned to different debug registers from what the 


I don't like using DR_LEN_1, because it would force asm/debugreg.h to be 
#included by any user o...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, March 13, 2007 - 11:00 pm

I don't quite understand that characterization of the kind of change I'm
advocating.  If the common case path in context switch has really anything

Why does it matter?  When a new user breakpoint was made the
highest-priority one, it ought to update tdr[0..3] right then before the
registration call returns.  It seems fine to me for it to make an
uninstalled callback right away rather than at the thread's next switch-in.
But even if you wanted to delay it, you could just set active_dr7 to zero

The plan I suggested relies on setting want_dr7 with the enable bits that
do include the ones the kernel uses (for contested slots).  Of course it
works as well to use either bit for this, as long as you're consistent.
But as I've said at least twice already, there is no actual meaning
whatsoever to choosing one enable bit over the other.  It's just confusing
and misleading to have the code make special efforts to set one rather than
the other for different cases.  You talk about them as if they meant
something, which keeps making me wonder if you're confused.  Since the
hardware doesn't care which bit you set, you could overload them to record
a bit and a half of information there if really wanted to, but you're not

How would that happen?  This would mean that some user process has been
allowed to enable ioperm for some io port that kernel drivers also send to

The purpose of the chbi area is to optimize this path.  Make it store
whatever precomputed values are most convenient for the hot paths.  This
path doesn't need num_kbps, it needs kdr7.  So precompute that and do that
one load, instead of a load of chbi-&gt;num_bkps we don't otherwise need plus
a load from kdr7_masks that can be avoided altogether on hot paths.

I don't really know about the slowness of reading debug registers, though I
would guess it is slower than most common operations.  But regardless, you
can avoid it because kdr7 is something you need anyway, so you're not

I don't really get why user breakpoints would be i...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Thursday, March 22, 2007 - 3:44 pm

Roland:

Here's the most recent version of the hw-breakpoint patch.  Unlike the 
version I posted last week, this one actually works with 2.6.21-rc4.

Alan Stern


Index: usb-2.6/include/asm-i386/hw_breakpoint.h
===================================================================
--- /dev/null
+++ usb-2.6/include/asm-i386/hw_breakpoint.h
@@ -0,0 +1,17 @@
+#ifndef	_I386_HW_BREAKPOINT_H
+#define	_I386_HW_BREAKPOINT_H
+
+#include &lt;asm-generic/hw_breakpoint.h&gt;
+
+/* Available HW breakpoint lengths */
+#define HW_BREAKPOINT_LEN_1	1
+#define HW_BREAKPOINT_LEN_2	2
+#define HW_BREAKPOINT_LEN_4	4
+#define HW_BREAKPOINT_LEN_EXECUTE	1
+
+/* Available HW breakpoint types */
+#define HW_BREAKPOINT_EXECUTE	0x80	/* trigger on instruction execute */
+#define HW_BREAKPOINT_WRITE	0x81	/* trigger on memory write */
+#define HW_BREAKPOINT_RW	0x83	/* trigger on memory read or write */
+
+#endif	/* _I386_HW_BREAKPOINT_H */
Index: usb-2.6/arch/i386/kernel/process.c
===================================================================
--- usb-2.6.orig/arch/i386/kernel/process.c
+++ usb-2.6/arch/i386/kernel/process.c
@@ -58,6 +58,7 @@
 #include &lt;asm/tlbflush.h&gt;
 #include &lt;asm/cpu.h&gt;
 #include &lt;asm/pda.h&gt;
+#include &lt;asm/debugreg.h&gt;
 
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
 
@@ -359,9 +360,10 @@ EXPORT_SYMBOL(kernel_thread);
  */
 void exit_thread(void)
 {
+	struct task_struct *tsk = current;
+
 	/* The process may have allocated an io port bitmap... nuke it. */
 	if (unlikely(test_thread_flag(TIF_IO_BITMAP))) {
-		struct task_struct *tsk = current;
 		struct thread_struct *t = &amp;tsk-&gt;thread;
 		int cpu = get_cpu();
 		struct tss_struct *tss = &amp;per_cpu(init_tss, cpu);
@@ -379,15 +381,17 @@ void exit_thread(void)
 		tss-&gt;io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
 		put_cpu();
 	}
+	if (unlikely(tsk-&gt;thread.hw_breakpoint_info))
+		flush_thread_hw_breakpoint(tsk);
 }
 
 void flush_thread(void)
 {
 	struct task_st...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, March 16, 2007 - 5:07 pm

Roland:

Here's the next update.  I haven't tried testing it yet; this is just to
get your opinion.

I implemented most of the changes we discussed.  Ignoring the length for 
execute breakpoints turned out not to be a good idea because it would 
affect the way ptrace works, so the code verifies it just like any other 
kind of breakpoint.

I also decided against adding a .bits member.  It doesn't really gain very 
much; the savings in encoding the breakpoint values is trivial -- one line 
of code on i386.  And it helps to have the original length and type values 
available for use by the ptrace routines.  In fact, I decided to add a 
superfluous bit to the type code.  That's to help disambiguate between 
length and type values; it's easy to mix the two of them up.  Likewise, 
the length macros don't give the encoded values; that's so people can just 
specify the length directly instead of using the macro.

There's a small question about the value of the error_code argument for 
send_sigtrap().  The value passed into do_debug() isn't available in
ptrace_triggered() -- but since it is always 0, that's what I'm using.  
I'm not sure what it's supposed to mean anyway.

Anyway, this version seems to be a fair amount cleaner than the previous.  
See what you think.

Alan Stern


Index: usb-2.6/include/asm-i386/hw_breakpoint.h
===================================================================
--- /dev/null
+++ usb-2.6/include/asm-i386/hw_breakpoint.h
@@ -0,0 +1,17 @@
+#ifndef	_I386_HW_BREAKPOINT_H
+#define	_I386_HW_BREAKPOINT_H
+
+#include &lt;asm-generic/hw_breakpoint.h&gt;
+
+/* Available HW breakpoint lengths */
+#define HW_BREAKPOINT_LEN_1	1
+#define HW_BREAKPOINT_LEN_2	2
+#define HW_BREAKPOINT_LEN_4	4
+#define HW_BREAKPOINT_LEN_EXECUTE	1
+
+/* Available HW breakpoint types */
+#define HW_BREAKPOINT_EXECUTE	0x80	/* trigger on instruction execute */
+#define HW_BREAKPOINT_WRITE	0x81	/* trigger on memory write */
+#define HW_BREAKPOINT_RW	0x83	/* trigger on memory read ...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, March 14, 2007 - 3:11 pm

It's easy to explain: Your sample code had a tdr[] array in the thread's
hw_breakpoint area and my patch did not; adding it amounts to moving some

That's basically what I intend to do.  Although instead of keeping track 
of an active_dr7, I'll keep track of num_kbps as of the last time the 
thread ran.  The unlikely case can be triggered by setting the value to 


No, I'm not confused and neither are you.  I realize there's no functional 
difference between the two sets of enable bits, since Linux doesn't use 
hardware task-switching.  I just like to keep things neatly separated, 

I haven't checked the ioperm code to be certain, but it seems like the 
sort of thing somebody might want to do on occasion.

Another aspect perhaps worth mentioning is that user breakpoints are 
active only when the task is running.  Hence breakpoints for user data 
affected by AIO operations won't necessarily be triggered; with async I/O 
the transfers can occur while a different task is running.  Of course 
there's nothing new about this; the same has been true for ptrace all 

I will endeavor to optimize switch_to_thread_hw_breakpoint.  However it 
will turn out that the hot patch really does to use chbi-&gt;num_kbps and 



Not if you have interrupts disabled.  Debug register settings are 
disseminated from one CPU to the others by means of an IPI.

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, March 28, 2007 - 5:39 pm

Sorry I've been slow in responding to your most recent version.

I don't really know what more to say.  I like things neatly separated too,
most especially between clearly meaningful and misleading yet apparently
meaningful.  Overloading unrelated hardware bits for no material purpose

I checked the ioperm code.  As far as I can tell, if you have CAP_SYS_RAWIO
then you can use ioperm to enable any port you want, so this will always be

But the callback is not per-CPU, it is per-registration.  When a new
registration displaces an old one, or a deregistration stops displacing
one, isn't the status change in the data structure and the callback made
right away, by the thread doing the registration/deregistration call?
Another CPU with interrupts disabled won't have its breakpoints actually


I think this is a mistake.  On other machines, there is no need for more
than one field and .len will go unused.  There is no reason not to encode
ahead of time.  If it's useful for callers to be able to extract the type
and length from a struct hw_breakpoint, it's easy to define macros or

There's no reason __modify_user_hw_breakpoint can't use an encoded value.

It doesn't hurt to add some constant bits to the API values that you just
mask out (after checking) as part of the encoding convsersion.  In fact,

I object to this.  The purpose of having macros is to give a constrained
set of values that can be used at compile time.  Letting people use literal
numbers is just asking for people to write their calls unportably.  For
machines that support only LEN8, I'd planned to define it to some magic
bit pattern just so it could be checked and rule out cavalier users who

task-&gt;thread.trap_no and task-&gt;thread.error_code usually store the values
from the hardware trap frame when a trap is turned into a signal (e.g. see
do_trap).  This is the hardware trap number, which is 1 for do_debug.  The
error code is a word of the hardware trap frame whose meaning is specified
differently for each trap n...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, May 11, 2007 - 11:25 am

Has the same thing happened again?  There hasn't been any feedback on the 
most recent version of hw_breakpoint emailed on April 13:

	http://marc.info/?l=linux-kernel&amp;m=117661223820357&amp;w=2

I think there are probably still a few small things wrong with it.  For
instance, the RF setting isn't right; I misunderstood the Intel manual.  
It should get set only when the latest debug interrupt was for an
instruction breakpoint.

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Sunday, May 13, 2007 - 6:39 am

This makes me think about RF a little more.  If you ever set it, there are
some places we need to clear it too.  That is, when the PC is being changed
before returning to user mode, which is in signals and in ptrace.  If the
PC is changing to other than the breakpoint location hit by the handler
that set RF, we need clear RF so that the first instruction at the changed
PC can be a breakpoint hit of its own and not get masked.  In fact, it may
also be necessary to clear RF when freshly setting a new instruction
breakpoint (when RF is set because the stop was not a debug exception at

I sort of wondered from the beginning why it was there.  The rationale I
can see is to avoid flutter.  That is, when unregistering frees up a slot
for a lower-priority allocation waiting in the wings, and then the new
registration will just displace it again.  The priority list diddling is
wasted work to get back to just how it was before, but more importantly you
don't want to have those callbacks for a momentarily-available slot coming



The documentation I have says that RF is set in the trap frame on the stack
(i.e. pt_regs.eflags) by every other kind of exception.  However, for a
debug exception that is due to an instruction breakpoint, RF=0 in the trap
frame and the manual explicitly says that the handler must set the bit so
that iret will resume and execute it rather than hit the breakpoint again.


Is this really "some CPUs"?  Or is it actually always as I described above

The important thing is that there aren't any difficult races (i.e. what you
get with callbacks).  If register with no callback followed by unregister
on seeing "registered but not installed" return value is simple and cheap,

It's no less meaningful than for a kernel allocation.  In neither case is
there a guarantee you'll keep it forever.  What callers I had in mind want
is a quick answer when the answer is negative at the time of the call, so

It looks right to me.  That is, it preserves the existing behavior for

...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Thursday, May 17, 2007 - 4:39 pm

I took a look at seqlock.h.  It turns out not to be a good match for my 
requirements; the header file specifically says that it won't work with 
data that contains pointers.  But changing over to regular 32-bit 
sequence numbers was straightforward.

The "switching"/"restart" stuff doesn't need memory barriers because
all the communication is between two routines on the same CPU.  Nor are 
memory barriers needed in the rest of the code for the kernel 
breakpoint updates; the IPI mechanism already provides its own.

However there is one oddball case which does seem to require a memory
barrier: when a new CPU comes online (either for the first time or
during return from hibernation).  There's a hook to load the initial
debug register values, and it runs in an atomic context so I can't
grab the mutex.  The hook is called in two places:

	arch/i386/power/cpu.c: fix_processor_context(), and
	arch/i386/kernel/smpboot.c: start_secondary().

A memory barrier is necessary to avoid chaos if another CPU should
happen to update the kernel breakpoint settings at the same time.  If

Okay, if I don't worry about machines with two sets of code &amp; data
debug registers (HB_TYPES_NUM = 2) then yes, quite a lot of the code is
sharable.  There will be a few arch-specific hooks to:

	Store the values into the debug registers;

	Take care of the DR7 calculations;

	Do address limit verification (see whether a pointer
	lies in user space or kernel space).

Nothing more seems to be needed.  Then there will be unsharable code, 
including:

	Dumping the debug registers while creating an aout-type
	core image;

	All the legacy ptrace stuff;

	The notify-handler itself.


How should this be arranged so that it can build okay on all platforms,
even ones where the low-level support code hasn't been written?  Maybe 
an arch-dependent CONFIG_HW_BREAKPOINT option?

Alan Stern

-
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, May 14, 2007 - 11:42 am

It seems to me that signal handlers must run with a copy of the original
EFLAGS stored on the stack.  Otherwise, when the handler returned the
former context wouldn't be fully restored.  But I don't know enough about
the signal handling code to see how to turn off RF in the stored EFLAGS
image.

Also, what if the signal handler was entered as a result of encountering 
an instruction breakpoint?  In that case you would want to keep RF on to 
prevent an infinite loop.

You're right about wanting to clear RF when changing the PC via ptrace or
when setting a new execution breakpoint (provided the new breakpoint's
address is equal to the current PC value).

Do you know how gdb handles instruction breakpoints, and in particular, 

That may be what I originally had in mind; I no longer remember.

But it doesn't matter.  We're up against an API incompatibility here.  
gdb doesn't allow you to modify breakpoints; it forces you to delete the
old one and add a new one.  It's only an artifact of the x86 architecture
that gdb implements this by reusing debug registers.  So even if the 
modify_user_hw_breakpoint() routine were kept, gdb wouldn't really want to 
make use of it.


On the 386, either GE or LE had to be set for DR breakpoints to work 
properly.  Later on (I don't remember if it was in the 486 or the Pentium) 
this restriction was removed.  I don't know whether those bits do anything 

My 80386 Programmer's Reference Manual says:

	... an instruction-address breakpoint exception is a fault.

And:

	When it detects a fault, the processor automatically sets
	RF in the flags image that it pushes onto the stack.

And:

	The processor automatically sets RF in the EFLAGS image
	on the stack before entry into any fault handler.  Upon
	entry into the fault handler for instruction address
	breakpoints, for example, RF is set in the EFLAGS image
	on the stack...

That seems to be pretty clear.  So the behavior can vary according to the 

I suppose you might register a breakp...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, May 14, 2007 - 5:25 pm

Of course.  I'm talking about how the registers get changed to set up the
signal handler to start running, not how the interrupted registers are
saved on the user stack.  There is no issue with the stored eflags image;

This does not happen in reality.  Breakpoints can only be set by the

Starting a signal handler is "warping the PC" equivalent to changing it via
ptrace for purposes of this discussion.  In case the new PC is the site of

AFAICT it never actually uses hardware instruction breakpoints, only data
watchpoints.  I wouldn't be surprised if noone has ever really used
instruction breakpoint settings in x86 hardware debug registers on Linux.
(Frankly, I don't much expect them to start either.  This level of detail
about instruction breakpoints is largely academic.  I am a stickler for
getting the details right if we're going to allow using them at all.  



I'm moderately sure they do nothing on modern CPUs.  Intel says they're
ignored as of Pentium, but recommends setting both bits if you care at all.
In practice, I don't think we'll ever hear about the inexactness on a
pre-Pentium processor from not setting the bits.  But I'd follow the Intel

The earlier quote I gave was from an AMD64 manual.  A 1995 Intel manual I
have says, "All Intel Architecture processors manage the RF flag as follows,"
and proceeds to give the "all faults except instruction breakpoint" behavior
I quoted from the AMD manual earlier.  Hence I sincerely doubt that this
varies among Intel and AMD processors.  Someone else will have to help us
know about other makers' processors.  So far I have no reason to suspect that

Yes, that is really the kind of thing I had in mind.  For user breakpoints it


I think that is what I suggested an iteration or two ago.  Installing new
state means making a fresh data structure and installing a pointer to it,

I guess my main objection to having .type and .len is the false implied
documentation of their presence and names, leading to people thinking they
can l...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, May 16, 2007 - 3:03 pm

Ah, okay.  Yes, clearly the new EFLAGS for the signal handler should 
have RF turned off.  This should always be true, regardless of 

Hmmm.  I put in a little extra code to account for the possibility that
a program might want to set hardware breakpoints in itself.  Should


The fact that they are machine-specific and implementation-specific 

No; I have tested only a couple of systems and I don't have a wide

Allow me to rephrase: When a debug exception occurs, the real DR6 value
should be copied to vdr6, except that kprobes should adjust DR_STEP and
hw_breakpoint should adjust the DR_TRAPn bits appropriately.  There's
some question about what value the debug exception handler should write
back to DR6, if anything.  When switching to a new task, the DR_TRAPn
bits in vdr6 could be de-virtualized somehow and the result loaded into
DR6, but again, it might be safest to leave DR6 alone.

As for what users expect of the low four bits, you are definitely 
wrong.  My tests with gdb show that it relies on the CPU to clear those 
bits whenever a data breakpoint is hit; it doesn't clear them itself 
and it doesn't work properly if the kernel keeps virtualized versions 
of them set.  That's on a Pentium 4 and on an AMD Duron.

I did some testing to see how the CPU behaves when the debug handler
writes different values back to DR6.  The results were:

	Values written back to DR6 were retained in the register until 
	the next debug exception occurred.

	When the exception handler read DR6, the 0xffff0ff0 bits were
	set every time.  The 0x00001000 bit was never set, even if it
	had been turned on before the exception occurred.

	No matter what values were stored in the low four bits
	beforehand, when the exception occurred DR6 had only the
	bit for the debug register which was triggered.

	If the handler wrote back any of BS, BT, or BD to DR6, then
	the system misbehaved.  I don't know exactly what happened,
	but my shell process ended and the debug handler got called
	over and ove...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, May 23, 2007 - 4:47 am

Do you just mean a register_hw_breakpoint call made on current?  That
certainly ought to work.  That's still "the debugger", i.e. in utracespeak
the tracing engine.  My point was that there will never be a facility
intended for a program to use hw_breakpoint to generate a signal that gets
delivered to a handler in the vanilla way.  There's always some "outside"
agent who asked for the breakpoint and who is responsible for responding to
the traps it causes, never the program itself so as it would make sense for

The code in bp_show is exactly the kind of wrong I want to prevent.  When I
say they are machine-specific and implementation-specific, I mean there is
no specified part of the interface to which you can presume they correspond
directly.  The powerpc implementation will not have any field that is set
to HW_BREAKPOINT_LEN_8 and may well have none set to the type macros
either.  If you want to have some machine-specific macros or inlines to


Ok.  We were both going on what the manual said and I was assuming that



Ok.  This makes the users' expectations make sense.  Maybe we can get the
Intel and AMD people to change the manual not to be misleading about this
(it says something terse about "never clears" and without more details I
read it as "never clears any bit, ever").  

What about DR_STEP?  i.e., if DR_STEP was set from a single-step and then
there was a DR_TRAPn debug exception, is DR_STEP still set?  If DR_TRAPn


Agreed.  I suspect clearing it to zero is the right thing (given what the
hardware manuals say), even if it appears that DR_STEP and DR_TRAPn do


I think you should post that little patch (and equivalent for x86_64) by

There is no black magic about that, it's just saying that seqlock/seqcount
does not do any implicit synchronization with your data structure
management.  If the pointers in question are protected by RCU, there is no
problem (if your read_seqcount_retry loop is inside rcu_read_lock).  Since
the caller supplies the pointers, not requir...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Friday, June 1, 2007 - 3:39 pm

I really don't understand your point here.  What's wrong with bp_show?  
Is it all the preprocessor conditionals?  I thought that was how we had 
agreed portable code should determine which types and lengths were 
supported on a particular architecture.

Consider that the definition of struct hw_breakpoint is in
include/asm-generic/.  Hence .type and .len are guaranteed to be
present on all architectures; we can't just leave them out on some
while including them on others.  In particular, .len _will_ always be
equal to HW_BREAKPOINT_LEN_8 on PPC.  (Of course, you're always
free to define HW_BREAKPOINT_LEN_8 as 0 in the arch-specific header
file if you want, so this doesn't mean as much as it might seem.)

Consider also that .type and .len impose no overhead on architectures 
that don't care about them.  The space they use up would be wasted 
otherwise.  It seems that what you want would complicate the x86 
implementations significantly without offering any real benefit to 
others.

The one thing which makes sense to me is that some architectures might 
want to store type and/or length bits in along with the address field.  
So I added documentation explaining that there may be arch-specific 
changes to .address while a breakpoint is registered, and I added 
arch-specific accessors to fetch the true address value.  There are 

I didn't experiment with using DR_STEP.  There wasn't any simple way to
cause a single-step exception.  Perhaps if I were more familiar with

Even more surprising was that it stopped and settled back down to 
normal after a little while!  I'm not accustomed to seeing infinite 


In fact, I don't need the seqcount stuff at all.  Just about everything
it provides is already covered by RCU.  One of the secrets is to move

The other secret is to have shared access only to the global data in
the RCU-protected structure, which means storing an array of pointers
to the highest-priority kernel breakpoints there, as you suggest.  The
data which gets updated in ...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Thursday, June 14, 2007 - 2:48 am

That part is fine.  The problem is fetching the hw_breakpoint.len field
directly and expecting it to contain the API values.  In an implementation
done as I've been referring to, there is no need for any field to contain
the HW_BREAKPOINT_LEN_8 value, and it's a waste to store one.  If it were

Indeed, that is the natural thing (and all the bits needed) on several.
I hadn't raised this before since I was having so much trouble already
convincing you about storing things in machine-dependent fashion so that
users cannot just use the struct fields directly.

I really think it would be cleanest all around to use just:

	struct arch_hw_breakpoint info;

in place of address union, len, type in struct hw_breakpoint.  Then each
arch provides hw_breakpoint_get_{kaddr,uaddr,len,type} inlines.  For
storing, each arch can define hw_breakpoint_init(addr, len, type) (or
maybe k/u variants).  This can be used by callers directly if you want to
keep register_hw_breakpoint to one argument, or could just be internal if
register_hw_breakpoint takes the three more args.  If callers use it
directly, there can also be an INIT_ARCH_HW_BREAKPOINT(addr, len, type)
for use in struct hw_breakpoint_init initializers.

On x86 use:

	struct arch_hw_breakpoint_info {
		union {
			const void		*kernel;
			const void	__user	*user;
			unsigned long		va;
		}		address;
		u8		len;
		u8		type;
	} __attribute__((packed));


It's easy for user mode with gdb.  kprobes is simple to use, and it
always does a single-step to execute (a copy of) the instruction that 
was overwritten with the breakpoint.  So, write a module that does:

	int testvar=0;
	asm(".globl testme; testme: movl $17,testvar; ret");
	void testme();
	testinit() {
		... register kprobe at &amp;testme ...
		... register hw_breakpoint at &amp;testvar ...
		testme()
	}

Your kprobe handlers don't have to actually do anything at all, if you
are just hacking the low-level code so see what %dr6 values you get at

I still think it's the proper...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, June 19, 2007 - 4:35 pm

"A waste to store one"?  Waste of what?  It isn't a waste of space; the 

It is now possible for an implementation to store things in a 
machine-dependent fashion; I have added accessor routines as you 
suggested.  But I also left the fields as they were; the documentation 
mentions that they won't necessarily contain any particular values.

You might want to examine the check in validate_settings() for address 
alignment; it might not be valid if other values get stored in the 
low-order bits of the address.  This is a tricky point; it's not safe 
to mix bits around unless you know that the data values are correct, 

Maybe.  I don't see any reason for the unnecessary encapsulation, 

Yes, of course.  I feel foolish for having forgotten.

Tests show that my CPU does not clear DR_STEP when a data breakpoint is
hit.  Conversely, the DR_TRAPn bits are cleared even when a single-step 
exception occurs.

The bizarre behavior from before is still present; the system gets in a
long loop when the exception handler leaves any of the 0xe000 bits set
in DR6.  And it kills my shell process, probably by sending it a
SIGTRAP.  Oddly enough, this only happens when there's a kernel-space
debug exception -- faults in user-space continue to work normally.  
It's not clear what this means; the behavior indicates a software
problem but the dependency on the DR6 value indicates a hardware
contribution as well...

If you're interested, I can send you the code I used to do this testing

We have three things to consider: ptrace, utrace, and hw-breakpoint.  
Ultimately hw-breakpoint should become part of utrace; we might not
want to bother with a standalone version.

Furthermore, hw-breakpoint takes over the ptrace's mechanism for
breakpoint handling.  If we want to allow a configuration where ptrace
is present and hw-breakpoint isn't, then I would have to add an
alternate implementation containing only support for the legacy
interface.

It doesn't have to be done now, but it is something to bea...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, June 25, 2007 - 7:32 am

I added this on top of your patch to make it compile (and look a little nicer).
With that, bptest worked nicely.

---
 arch/i386/kernel/kprobes.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Index: b/arch/i386/kernel/kprobes.c
===================================================================
--- a/arch/i386/kernel/kprobes.c
+++ b/arch/i386/kernel/kprobes.c
@@ -35,6 +35,7 @@
 #include &lt;asm/cacheflush.h&gt;
 #include &lt;asm/desc.h&gt;
 #include &lt;asm/uaccess.h&gt;
+#include &lt;asm/debugreg.h&gt;
 
 void jprobe_return_end(void);
 
@@ -660,16 +661,16 @@ int __kprobes kprobe_exceptions_notify(s
 			ret = NOTIFY_STOP;
 		break;
 	case DIE_DEBUG:
-
-	/* The DR6 value is stored in args-&gt;err */
-#define DR6	(args-&gt;err)
-
-		if ((DR6 &amp; DR_STEP) &amp;&amp; post_kprobe_handler(args-&gt;regs)) {
-			if ((DR6 &amp; ~DR_STEP) == 0)
-				ret = NOTIFY_STOP;
-		}
+		/*
+		 * The %db6 value is stored in args-&gt;err.
+		 * If DR_STEP is the only bit set and it's ours,
+		 * we should eat this exception.
+		 */
+		if ((args-&gt;err &amp; DR_STEP) &amp;&amp;
+		    post_kprobe_handler(args-&gt;regs) &amp;&amp;
+		    (args-&gt;err &amp; ~DR_STEP) == 0)
+			ret = NOTIFY_STOP;
 		break;
-#undef DR6
 
 	case DIE_GPF:
 	case DIE_PAGE_FAULT:
-
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, June 25, 2007 - 4:51 pm

Roland:

Here's the next iteration.  The arch-specific parts are now completely 
encapsulated.  validate_settings is in a form which should be workable 
on all architectures.  And the address, length, and type are passed as 
arguments to register_{kernel,user}_hw_breakpoint().

I changed the Kprobes single-step routine along the lines you 
suggested, but added a little extra.  See what you think.

I haven't tried to modify Kconfig at all.  To do it properly would
require making ptrace configurable, which is not something I want to
tackle at the moment.

The test for early termination of the exception handler is now back the
way it was.  However I didn't change the test for deciding whether to 
send a SIGTRAP.  Under the current circumstances I don't see how it 
could ever be wrong.  (On the other hand, the code will end up calling 
send_sigtrap() twice when a ptrace exception occurs: once in the ptrace 
trigger routine and once in do_debug.  That won't matter will it?  I 
would expect send_sigtrap() to be idempotent.)

Are you going to the Ottawa Linux Symposium?

Alan Stern



Index: usb-2.6/include/asm-i386/hw_breakpoint.h
===================================================================
--- /dev/null
+++ usb-2.6/include/asm-i386/hw_breakpoint.h
@@ -0,0 +1,49 @@
+#ifndef	_I386_HW_BREAKPOINT_H
+#define	_I386_HW_BREAKPOINT_H
+
+#ifdef	__KERNEL__
+#define	__ARCH_HW_BREAKPOINT_H
+
+struct arch_hw_breakpoint {
+	unsigned long	address;
+	u8		len;
+	u8		type;
+} __attribute__((packed));
+
+#include &lt;asm-generic/hw_breakpoint.h&gt;
+
+/* HW breakpoint accessor routines */
+static inline const void *hw_breakpoint_get_kaddress(struct hw_breakpoint *bp)
+{
+	return (const void *) bp-&gt;info.address;
+}
+
+static inline const void __user *hw_breakpoint_get_uaddress(
+		struct hw_breakpoint *bp)
+{
+	return (const void __user *) bp-&gt;info.address;
+}
+
+static inline unsigned hw_breakpoint_get_len(struct hw_breakpoint *bp)
+{
+	return bp-&gt;info.len;
+}
...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, June 26, 2007 - 2:17 pm

I needed the attached patch on top of the bptest patch for the current
code.  Btw, that is a very nice little tester!

Below that is a patch to go on top of your current patch, with x86-64
support.  I've only tried a few trivial tests with bptest (including an
8-byte bp), which worked great.  It is a pretty faithful copy of your i386
changes.  I'm still not sure we have all that right, but you might as well
incorporate this into your patch.  You should change the x86_64 code in
parallel with any i386 changes we decide on later, and I can test it and
send you any typo fixups or whatnot.


Thanks,
Roland
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, June 26, 2007 - 10:43 pm

I had already made some of those changes (the ones needed to make 

Right.  I may update a few comments...

Alan Stern

-
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, June 25, 2007 - 11:37 am

I'll merge this with the rest of the patch.

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, June 25, 2007 - 6:52 am

People usually read the documentation after the fields named like they can

This is why I didn't bring up encoded addresses earlier on. :-)  

These kinds of issues are why I prefer unambiguously opaque arch-specific
encodings.  validate_settings is indeed wrong for the natural ppc encoding.

The values must be set by a call that can return an error.  That means you
can't really have a static initializer macro, unless it's intended to mean
"unspecified garbage if not used exactly right".  I favor just going back




I was not suggesting that.  CONFIG_PTRACE would require HW_BREAKPOINT on

This is incorrect.  The usage of notify_die in all other cases, at least of
machine exceptions on x86, is to test for == NOTIFY_STOP and when true
short-circuit the normal effect of the exception (signal, oops).  The
notifiers should return NOTIFY_STOP if they consumed the exception wholly.
So if a non-ptrace hw breakpoint consumed the exception and left no
DR_TRAPn bits set, the thread would generate a second SIGTRAP from the
prior single-step.  Currently userland expects to have to clear DR_STEP in
dr6 via ptrace itself, but does not expect it can get a duplicate SIGTRAP
if it doesn't.


Thanks,
Roland
-
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Monday, June 25, 2007 - 11:36 am

Of course, calling register_kernel_hw_breakpoint() with three extra
arguments is a waste of an instruction also, if one of those arguments 
isn't used.

And yet it's not clear that either of these really is a waste.  Suppose
somebody ports code from x86 to PPC64 and leaves a breakpoint length
set to HW_BREAKPOINT_LEN_4.  Clearly we would want to return an error.  
This means that the length value _has_ to be tested, even if it won't
be used for anything.  And this means the length _has_ to be passed

All right, I'll change it.  And I'll encapsulate those fields.  I still 
think it will accomplish nothing more than hiding some implementation 

It's below.  The patch logs the value of DR6 when each debug interrupt 
occurs, and it adds another sysfs attribute to the bptest driver.  The 
attribute is named "test", and it contains the value that the IRQ 
handler will write back to DR6.  Combine this with the Alt-SysRq-P 
change already submitted, and you can get a clear view of what's going 

I see.  So I could add a CONFIG_HW_BREAKPOINT option and make 
CONFIG_PTRACE depend on it.  That will be simple enough.



No, because do_debug always writes a 0 to DR6 after reading it;  
consequently DR_STEP does not remain set on later exceptions.  Unless
we do something like this we would never know whether we entered the
handler because of a single-step exception or not.

But the same effect could occur because of a bogus debug exception 
caused by lazy DR7 switching.  I'll have to add back in code to detect 
that case.

Alan Stern



Index: usb-2.6/arch/i386/kernel/traps.c
===================================================================
--- usb-2.6.orig/arch/i386/kernel/traps.c
+++ usb-2.6/arch/i386/kernel/traps.c
@@ -802,13 +802,17 @@ fastcall void __kprobes do_int3(struct p
  * find every occurrence of the TF bit that could be saved away even
  * by user code)
  */
+unsigned long dr6test;
+EXPORT_SYMBOL(dr6test);
+
 fastcall void __kprobes do_debug(struct pt_regs * regs...
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, June 26, 2007 - 4:49 pm

It makes me a little happier, and I at least consider that a substantial



Sure.  There's no special reason to want to turn hw-breakpoint off, but


You don't need to worry about that.  Under utrace, CONFIG_PTRACE is
already separate and can be turned off.  I don't think we need really to

Calling send_sigtrap twice during the same exception does happen to be
harmless, but I don't think it should be presumed to be.  It is just not
the right way to go about things that you send a signal twice when there
is one signal you want to generate.

Also, send_sigtrap is an i386-only function (not even x86_64 has the
same).  Only x86_64 will share this actual code, but all others will be
modelled on it.  I think it makes things simplest across the board if
the standard form is that when there is a ptrace exception, the notifier
does not return NOTIFY_STOP, so it falls through to the existing SIGTRAP
arch code.

So, hmm.  In the old do_debug code, if a notifier returns NOTIFY_STOP,
it bails immediately, before the db6 value is saved in current-&gt;thread.
This is the normal theory of notify_die use, where NOTIFY_STOP means to
completely swallow the event as if it never happened.  In the event
there were some third party notifier involved, it ought to be able to
swallow its magic exceptions as before and have no user-visible db6
change happen at the time of that exception.  So how about this:

	get_debugreg(condition, 6);
	set_debugreg(0UL, 6);		/* The CPU does not clear it.  */

	if (notify_die(DIE_DEBUG, "debug", regs, condition, error_code,
					SIGTRAP) == NOTIFY_STOP)
		return;

The kprobes notifier uses max priority, so it will run first.  Its
notifier code uses my version.  For a single-step that belongs to it,
it will return NOTIFY_STOP and nothing else happens (noone touches
vdr6).  (I think I'm dredging up old territory by asking what happens
when kprobes steps over an insn that hits a data breakpoint, but I
don't recall atm.)

vdr6 belongs wholly to hw_breakpoint, no o...
To: Roland McGrath <roland@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, June 26, 2007 - 11:26 pm

Good.  My earlier stubbornness was caused by a desire to allow static
initializers, but now I see that specifying the values in the

So far this work has all been based on the vanilla kernel.  Should I 

What happens when there are two ptrace exceptions at different points
during the same system call?  Won't we end up sending the signal twice

In theory we should get an exception with both DR_STEP and DR_TRAPn 
set, meaning that neither notifier will return NOTIFY_STOP.  But if the 
kprobes handler clears DR_STEP in the DR6 image passed to the 

That sounds not quite right.  To a user-space debugger, a system call
should appear as an atomic operation.  If multiple ptrace exceptions
occur during a system call, all the relevant DR_TRAPn bits should be
set in vdr6 together and all the other ones reset.  How can we arrange
that?

There's also the question of whether to send the SIGTRAP.  If
extraneous bits are set in DR6 (e.g., because the CPU always sets some
extra bits) then we will never get NOTIFY_STOP.  Nevertheless, the

I disagree.  kfree() is documented to return harmlessly when passed a
NULL pointer, and lots of places in the kernel have been changed to
remove useless tests for NULL before calls to kfree().  This is just

I figured there was some reason like that.

Alan Stern

-
To: Alan Stern <stern@...>
Cc: Prasanna S Panchamukhi <prasanna@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, June 27, 2007 - 11:02 pm

I did the first crack at a powerpc port.  I'd appreciate your comments on
this patch.  It should not be incorporated, isn't finished, probably breaks
ptrace, etc.  I'm posting it now just to get any thoughts you have raised
by seeing the second machine share the intended arch-independent code.

I just translated your implementation to powerpc terms without thinking
about it much.  If you see anything that you aren't sure is right, please
tell me and don't presume there is some powerpc-specific reason it's
different.  More likely I just didn't think it through.

In the first battle just to make it compile, the only issue was that you
assume every machine has TIF_DEBUG, which is in fact an implementation
detail chosen lately by i386 and x86_64.  AFAIK the only reason for it
there is just to make a cheap test of multiple bits in the hot path
deciding to call __switch_to_xtra.  Do you rely on it meaning something
more precise than just being a shorthand for hw_breakpoint_info!=NULL?

Incidentally, I think it would be nice if kernel/hw_breakpoint.c itself had
all the #include's for everything it uses directly.  arch hw_breakpoint.c
files probably only need &lt;asm/hw_breakpoint.h&gt; and one or two others to
define what they need before #include "../../../kernel/hw_breakpoint.c".

The num_installed/num_kbps stuff feels a little hokey when it's really a
flag because the maximum number is one.  It seems like I could make it
tighter with some more finesse in the arch-specific hook options, so that
chbi and thbi each just store dabr, dabr!=0 means "mine gets installed",
and the switch in is just chbi-&gt;dabr?:thbi-&gt;dabr or something like that.
As we get more machines, more cleanups along these lines will probably make
sense.  (Also, before the next person not me or you tries a port, we could
use for the generic hw_breakpoint.c to get some comments at the top making
explicit what the arch file is expected to define in its types, etc.)

With just the included change to the generic code fo...