ChangesLog: added MAGIC_SYSRQ support. removed notify_die handler and changes to drivers/char/keyboard.c. save or restrive pt_regs for cases where panic is called from non-exception context. handled cases where the per_cpu registers and context are not valid/saved when panic is called from non-exception contexts. removed more checkpatch.pl messages. lots more messages to still work on. volatiles left in the code due to the previously stated (and still present) severe breakage of the GNU compiler with SMP shared data. most of the barrier() functions are just plain broken and do not result in proper compiler behavior in this tree. it helps having a working debugger to actually see these compiler problems. 2.6.27-rc4 1/27 mdb: export genapic and machine_restart functions export genapic and machine_restart functions 2.6.27-rc4 2/27 mdb: add makefile for Merkey's Linux Kernel Debugger module add makefile for Merkey's Linux Kernel Debugger module 2.6.27-rc4 3/27 mdb: add local Makefile for Merkey's Linux Kernel Debugger module add local Makefile for Merkey's Linux Kernel Debugger module 2.6.27-rc4 4/27 mdb: add mdb-base.c architecure independent functions add mdb-base.c architecure independent functions 2.6.27-rc4 5/27 mdb: add mdb-base.h architecure indepedent include add mdb-base.h architecure indepedent include 2.6.27-rc4 6/27 mdb: add mdb.h architecure main include add mdb.h architecure main include 2.6.27-rc4 7/27 mdb: add mdb-ia32.c Intel x86 architecure functions add mdb-ia32.c Intel x86 architecure functions 2.6.27-rc4 8/27 mdb: add mdb-ia32.h Intel x86 architecure includes add mdb-ia32.h Intel x86 architecure includes 2.6.27-rc4 9/27 mdb: add mdb-ia32.c Intel x86 architecure function includes add mdb-ia32.c Intel x86 architecure function includes 2.6.27-rc4 10/27 mdb: add mdb-ia32-support.c disassembler support add mdb-ia32-support.c ...
Can you provide explicit detail? By using barrier() the compiler should clobber all its memory and registers therefore forcing a write/reload of the variable. --
I hope Jeff didn't try mere barrier()s only. smp_wmb() and smp_rmb() are the more relevant barrier variants for mdb, from what I remember when I last looked at it. -- Stefan Richter -=====-==--- =--- =-=-= http://arcgraph.de/sr/ --
Sure, but volatile isn't a replacement for memory barriers. --
Let's face it, the C standard does not support concurrency, so we are all in a state of sin in any case, forced to rely on combinations of gcc-specific non-standard language extensions and assembly language. Could be worse!!! Thanx, Paul --
Hehe, still, a little birdie told me they are working on it and perhaps someone with clue could enlighten us on their direction. Still, I'd like Jeff to show his C, the resulting asm and the intent for the volatile and barrier versions of his code (well, little snippets of his code obviuosly). Either he doesn't understand barriers (nothing to be ashamed about), or we might have more trouble lurking in the rest of the kernel. --
Well, I guess you guys will be the judge of that. Or one of the judges, at least. ;-) One advantage of the current c++0x approach is that it allows extremely weak memory barriers to be used in many cases that would require smp_rmb() in current Linux kernel. If you are crazy enough to want to see a sneak preview in standardese, try all 10MB of: http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf Section 1.10 (physical page 25, logical page 11) describes the memory model. Sections 29 and 30 describe the operations (physical page 1155, logical page 1141). The C and C++ guys got together ahead of time and agreed to work together towards a compatible solution. And rcu_dereference() would be implemented in terms of memory_order_consume, Sounds fair to me! Thanx, Paul --
I have thoroughly reviewed Linux memory barriers and the efficacy of the barriers as defined in Linux are not the issue here. the code segment discussed sits and spins on a variable waiting for a specific state, and its a spinlock which creates a hard barrier, so no amount of barrier usage should nor does matter here. Even if a processor was late in flushing its writes, sooner or later the spinning processor would see the change in the shared memory address -- IF IT WERE ACTUALLY A SHARED REFERENCE. What I am seeing is not an issue of races between processors on load/store operations, but cases where gcc has chosen to optimize away global references entirely. Jeff --
Nevertheless, an analysis of which particular parts of code generation are insufficient if one particular volatile qualification is removed is IMO likely to turn up places in mdb where a clearer or/and more efficient implementation is possible. (Based on what I saw a few revisions ago; I haven't looked at the current one yet.) -- Stefan Richter -=====-==--- =--- =-=-= http://arcgraph.de/sr/ --
I used the smp_wmb() functions. I noted a couple of things. a) some of these macros just emit __asm__ __volatile__ into the code so why not just say "volatile" to begin with b) smp_wmb() in some cases worked and in other cases jut optimized away the global reference. c) I can go back and break the code again by inserting them and building broken assembler d) I ave been doing hardware and software design since the early 1980;s, I invented SMP affinity scheduling, and yes, I understand barriers and this concept of instruction score-boarding and optimization very well -- its not an excuse for a busted C compiler. It did not break all the places in the code, but broke enough for SMP to lock up and fail, It turned global variables into local variables. If you want me to reproduce this I can but it will have to wait til this evening because I have some product releases to get out the door at Omega 8 today. It's simple to reproduce. Take away the volatile declaration for the rlock_t structure in mdb-ia32.c (rlock_t debug_lock) in all code references and watch the thing lock up in SMP with multiple processors in the debugger each stuck with their own local copy of debug_lock. The barrier functions do not appear to work in all cases. Jeff --
Even if you use the smb_wmb() macros around the debug_lock, the compiler still optimizes the debug_lock into a local variable. After watching the broken behavior, the fact that some of these macro's emit __asm__ __volatile__ anyway, I just left the vars declared volatile. Its a debugger, so its probably ok for the kernel debugger to use some volatile data. --
It is not the same as volatile type. What it does is tell the compiler to clobber all registers or temporaries. This something pretty well Linux barriers aren't going to force a load to be emitted, if it can be optimized away. If it optimized away a store, then I'd like to see a The point is not whether it is possible to work with volatile types, but that we tend not to use them in Linux to deal with concurrency. Also, barriers seem to work fine for everybody else, so I think it is likely you either aren't using them correctly, or have other bugs in the You should disable preempt before getting the processor id. Can't see any other possible bugs, but you should be able to see from the disassembly pretty easily. --
debug_lock() is AFAICS only called from contexts which have preemption disabled. Last time around I recommended to Jeff to document this requirement on the calling context. But even though preemption is disabled, debug_lock() is still incorrect as I mentioned in my other post a minute ago. It corrupts its .flags and .count members. (Or maybe it coincidentally doesn't as long as volatile is around.) -- Stefan Richter -=====-==--- =--- =-=-= http://arcgraph.de/sr/ --
I'm not talking about where debug_lock gets called, I'm talking about where the processor id is derived that eventually filters down to I don't think so. And flags should only be restored by the processor that saved it because the spinlock should disable preemption, right? --
You are right, I replied to fast. debug_unlock() retrieves the
OK; the .count seems alright due to restrictions of the calling
contexts. About .flags: Jeff, can the following happen?
- Context A on CPU A has interrupts enabled. Enters debug_lock(),
thus disables its interrupts. (Saves its flags in rlock->flags with
the plan to enable interrupts later when leaving debug_unlock()
provided it does so as last holder.)
- Context B on CPU B happens to have interrupts disabled. Enters
debug_lock(), overwrites rlock->flags with its different value.
(Spins on the rlock which is held by CPU A.)
- Context A on CPU A leaves debug_unlock. Doesn't re-enable its
interrupts as planned, since rlock->flags is the one from CPU B.
--
Stefan Richter
-=====-==--- =--- =-==-
http://arcgraph.de/sr/
--
It should be benign since both procs have ints off here. Stefan, The flags needs fixing, you are right, however, since this case only occurs when two processors both have an int1 exception (or some exception other than an NMI) and ints are disabled here anyway, its benign. That being said, it needs to be perfect. So I moved the flags to a stack variable. I will post a new patch series correcting the flags issue along with code fragments showing the gcc breakage with debug lock later tonight. I have a lot to get done at omega8 on appliance releases this week, and being a Jeff --
jmerkey@wolfmountaingroup.com wrote: Ah, thanks. I missed the bigger picture. -- Stefan Richter -=====-==--- =--- =-==- http://arcgraph.de/sr/ --
Right, asm volatile () means that the asm may not be discarted. Very Not sure - I think all barrier clobber the full register and memory set. So if you access a variable after a barrier it will have to issue a load. Well, there is of course the third option, which is what Jeff claims, that gcc is broken. But in that case we should have more problems elsewhere too. --
Here is one example (which might or might not be what Nick had in mind):
extern int v;
void foo(void)
{
do_something_with(v);
barrier();
do_something_else_with(v - v);
}
The second set of loads from v can be optimized away unless v is
declared volatile. In contrast:
void bar(void)
{
do_something_with(v);
barrier();
do_something_else_with(v);
}
Given the amount of code in gcc, we can reasonably assume that some
aspects of it are broken. Whether that presumed breakage is affecting
Jeff is another question altogether. ;-)
Thanx, Paul
--
I'll create some cases this evening showing how GCC a) optimizes away global references, even when you tell it not to (volatile is the only thing that seems to work) b) takes global defines and turns them into stack variables (WHAT A GREAT THING TO DO WITH ONLY A 4K STACK) when it has bloated out on register aliases from all the global data it is turning into local variables in registers or on the stack. Jeff --
Ah, right. But in that case: v-v := tmp1 = v; tmp2 = v; sub tmp1,tmp2; Which you can of course write out more explicitly in C as well and insert a barrier between the two reads of v, giving the same effect as volatile. Still, point taken. --
IIRC a barrier only clobbers memory. gcc must reload a variable from
memory unless it can prove the variable's address has not escaped anywhere.
So:
void f()
{
int v;
v = g();
barrier();
do_domething_with(v);
}
Need not reload v from memory (indeed, v can be in a register for its
entire lifetime), but
void f()
{
int v;
v = g();
h(&v);
barrier();
do_domething_with(v);
}
Will force v into memory, and reload it after the barrier.
--
error compiling committee.c: too many arguments to function
--
No, that's not what "asm volatile" means. Its *only* meaning is "emit
this, even if it doesn't look like it has side-effects and its results
are not used". An asm() with no outputs is "volatile" by default, which
makes most of the uses of "asm volatile" in the kernel redundant. "asm
volatile" also has no effect on the ordering of the asm with respect to
other code; you must use constraints to do that.
An asm with a "memory" clobber is sufficient to make sure that gcc
doesn't cache memory values in registers; perhaps that's what you mean.
J
--
Well, the gcc docs at _used_ to say that volatile asms are not moved around "significantly". They've removed that. Linus --
Yep, they seem to do that kind of thing specifically to annoy you. But
it does mean we can't rely on it at all any more.
J
--
Results from Analysis of GCC volatile/memory barriers
Use of volatile will produce the results intended in those files which
have shared data elements, but will also result in some cases global data
which is not referenced outside of a file and which has not also been
declared as volatile as being treated as static and optimized
into local variables in some cases.
If volatile is avoided entirely, the compiler appears to make correct
assumptions about whether or not it is in fact global memory references.
My conclusion is that the code generation of gcc appears correct and
in fact does a better job than Microsoft's implementation of shared
data management on SMP systems.
If you use volatile, and use optimization at the same time, the compiler
will take you at your word and potentially optimize global references into
local variables. This is in fact better than MS cl which will ALWAYS
optimize global data into local if volatile is not used with SMP data within
a single file.
If you choose to use volatile, then you had better use it on every variable
you need shared between processors -- or just leave it out entirely -- and
gcc does appear figure it out code references properly (though some of them
are quite odd).
While this may be counter-intuitive, it makes sense. When you are using
volatile, you are telling the compiler anything not declared as volatile
witihin a given file is fair game for local optimization if you turn on
optimization at the same time.
Analysis
Code Generation for two atomic_t variables. One an array and the other
standalone -- macros in kernel includes and their interactions
with the compiler may be the basis of some of these cases.
atomic_inc(&debuggerActive);
atomic_inc(&debuggerProcessors[processor]);
55a0: f0 ff 05 00 00 00 00 lock incl 0x0
55a7: 8d 2c 8d 00 00 00 00 lea 0x0(,%ecx,4),%ebp
55ae: 8d 85 00 00 00 00 lea 0x0(%ebp),%eax
55b4: 89 04 24 mov %eax,(%esp)
55b7: f0 ff ...So the result was that barrier()s actually worked? Great. BTW if you did this work it might be worthwhile to go one further step: gcc has a test suite where they run small programs and have a simple facility to scan the resulting assembler code for specific patterns. If you could bring your examples into that format (as in being stand alone compiled and have a simple "yes/no" pattern to check) and let it be included there that might make sure this always keeps working in the future. -Andi --
I'm having a quick look at mdb-2.6.27-rc4-ia32-08-20-08.patch at the moment. Speaking of debug_lock()...: You use spin_trylock_irqsave((spinlock_t *)&rlock->lock, rlock->flags) in there. Minor nit: The pointer type cast is unnecessary. Major problem: rlock->flags is wrong in this call. Use an on-stack flags variable for the initial spin_trylock_irqsave. Ditto in the following call of spin_trylock_irqsave. Next major problem with debug_lock() and debug_unlock(): The reference counting doesn't work. You need an atomic_t counter. Have a look at the struct kref accessors for example, or even make use of the kref API. Or if it isn't feasible to fix with atomic_t, add a second spinlock to rlock_t to ensure integrity of .count (and of the .processor if necessary). Furthermore, I have doubts about the loop which is entered by CPU B while CPU A holds the rlock. You are fully aware that atomic_read(a) && !atomic_read(b) in its entirety is not atomic, I hope. All this aside, I don't see *anything* in debug_lock and _unlock which would necessitate volatile. Well, volatile might have papered over some of these bugs. PS: Try to cut down on #if/#endif clutter. It should be possible to reduce them at least in .c files; .h are a different matter. For example, #if MDB_DEBUG_DEBUGGER DBGPrint("something"); #endif can be trivially reduced to dbg_mdb_printk("something"); where dbg_mdb_printk() is defined as an inline function which does nothing when MDB_DEBUG_DEBUGGER is false. PS2: Why are there this many debug printks anyway? -- Stefan Richter -=====-==--- =--- =-=-= http://arcgraph.de/sr/ --
The code works in debug lock provided this memory location is actually SHARED between the processors. The various race conditions you describe are valid cases, but he only modifier of .count and .lock is the processor that obtains the spinlock -- the rest are readers. This code works well, of course, when this memory location is actually SHARED between the processors and the read/write operations serialized. Even in SMP, at various times it is necessary for the processors to perform serializing operations. You cannot in checker-scoreboard land for everything. Jeff --
jmerkey@wolfmountaingroup.com wrote: Regarding rlock->count, I stand corrected: It works correctly if the debug_lock()...debug_unlock() region can be entered by up to two contexts and if the second one to enter cannot be preempted by the first one. However, since these regions are enclosed in preempt_disable/enable and since the first one to grab the rlock->lock keeps local interrupts disabled until debug_unlock() or even longer, there is always only a single context in the debug_lock()...debug_unlock() region --- which defeats the whole purpose of the rlock_t. Or what am I missing /now/? Independently of this, you cannot use rlock->flags like you currently do. spin_trylock_irqsave would overwrite the flags of CPU A by the flags of CPU B, making the results of spin_unlock_irqrestore in debug_unlock unpredictable. -- Stefan Richter -=====-==--- =--- =-=-= http://arcgraph.de/sr/ --
Hmmm.... Yep, that may be correct. I'll fix that one since it is outside of the lock. At this point, the flags in the processors are all going to be identical at this point in the code so it would not cause any problems (local debugger state would not care), so the debugger would work with this and does. But I will change that and repost the patch this evening. You have good eyes by friend. --
It _will_ be worse. The C standard will eventually support concurrency (they are working on it), and it will almost inevitably be a horrible pile of stinking sh*t, and we'll continue to use the gcc inline asms instead, but then the gcc people will ignore our complaints when they break the compiler, and say that we should use the stinking-pile-of-sh*t ones that are built in. No, I haven't seen the drafts, and maybe I'm overly pessimistic, but I'm pretty sure that this is an area where (a) the kernel needs more support than most normal pthread-like models and (b) any design-by-committee thing simply won't be very good, because they'll have to try to make everybody happy. Oh, well. Linus --
Ok, I have looked at the draft now, and I don't think I was overly pessimistic. If I read it right, all the memory ordering operations are defined for _single_ objects. So if you want to do the kernel kind of memory ordering where you specify ordering requirements independently of the actual accesses (perhaps because the accesses are in some helper function that doesn't care, but then you want to "finalize" the thing by stating a sequence point), it seems to be impossible with current drafts. Oh, well. Nothing lost. I didn't expect the thing to work. Linus --
You are looking for atomic_fence() on page 1168 (1154 virtual) of the most recent draft. The current semantics are not correct, but this is being worked. And yes, it does currently have a variable associated with it, but it acts as a bare fence nevertheless. There is a proposal to ;-) Thanx, Paul --
And the proposal for variable-free memory-ordering operations was voted into the draft standard: http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2731.html Still not perfect, of course, but hopefully movement in the right direction. Thanx, Paul --
And not only that, it will be worse before the new standard is ratified. Let's see... 1. Yes, the standard is being designed by a committee. 2. No, it does not simply ratify the Linux kernel's memory-barrier API. Yes, I did propose that. No, the proposal did not go very far, but yes, it did get people's attention. Yes, there were a number of people who asserted that Linux's approach was fundamentally broken/buggy/whatever. Yes, they have changed their mind, most of them, anyway. No, there is still zero chance of the standard simply ratifying smp_mb() and friends. 3. Yes, we are working to get things closer to the Linux kernel's memory-barrier API into the spec. No, they will likely not be exact. Yes, getting some prominent committee members to countenance the idea of any sort of raw memory barrier (not connected directly to a load, store, or atomic operation) was a long-term project that finally made headway last May. Again, design by committee. 4. No, the current Linux kernel memory-barrier API is not perfect. Yes, the proposed standard's preferred memory-barrier approach will make code easier to read and understand in many cases, and could potentially allow the compiler to do better optimizations (which, yes, might or might not be a good thing). No, the proposed standard's preferred approach does not apply to all the cases in the Linux kernel. No, I don't know whether or not it will be worthwhile to introduce the standard's preferred approach to the Linux kernel (though I suspect that it would be, but have to wait and see). Either way, yes, the Linux kernel will likely continue to need to resort to non-standard gcc extensions and assembly language in at least some cases. As you say, the kernel is not a garden-variety pthreads application. But even if the Linux kernel never uses this stuff, user applications will need it. And there are a number of reasons why gcc extensions and assembly language are even less welcome in many ...
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm |
