Hi,
There was a lot of bogus stuff that include/asm-i386/bitops.h was doing,
that was unnecessary and not required for the correctness of those APIs.
All that superfluous stuff was also unnecessarily disallowing compiler
optimization possibilities, and making gcc generate code that wasn't as
beautiful as it could otherwise have been. Hence the following series
of cleanups (some trivial, some surprising, in no particular order):
* Bogus extended asm constraints (such as specifying immediate-value
constraint-modifiers to operands constrained to local registers)
* Obsolete / inapplicable-to-x86 / incorrect comments
* Marking "memory" as clobbered for no good reason
* Volatile-casting of memory addresses
(wholly unnecessary, makes gcc generate bad code)
* Unwarranted use of __asm__ __volatile__ even when those semantics
are not required
* Unnecessarily harsh definitions of smp_mb__{before, after}_clear_bit()
(again, this was like *asking* gcc to generate bad code)
These patches fix no bugs, as there were none (or else we'd have known
long before, obviously). This is just an attempt to clean up / bring down
the bogosity level of that file by several microlenats :-)
Almost all the above are also applicable to include/asm-x86_64/bitops.h
so I will send a patch for that also, later. I have zero knowledge of
other arch's so cannot audit them, sorry.
My testbox boots/works fine with all these patches (uptime half an hour)
and the compressed bzImage is smaller by about ~2 KB for my .config --
don't know about savings in modules. But considering these primitives get
inlined from several places in the tree, I'd expect good savings for
allyesconfig or allmodconfig kind of setups. We've also probably lost
a few cycles from kernel codepaths that use these functions, but I haven't
verified that experimentally.
Satyam
-Hi Satyam, [I did read entire thread] Welcome to the minefield. This bitops and barrier stuff is complicated. It's very easy to introduce bugs which are hard to trigger, or happen only with some specific gcc versions, or only on massively parallel SMP boxes. You can also make technically correct changes which relax needlessly strict barrier semantics of some bitops and trigger latent bugs in code which was unknowingly depending on it. How you can proceed: Make a change which you believe is right. Recompile allyesconfig kernel with and without this change. Find a piece of assembly code which become different. Check that new code is correct (and smaller and/or faster). Post your patch together with example(s) of code fragments that got better. Be as verbose as needed. Repeat for each change separately. This can be painfully slow, but less likely to be rejected outright I vaguely remember that "memory" clobbers are needed in some rather obscure, non-obvious situations. Google for it - Linus wrote about it For this kind of things, you really need something more stressing. At least it proves that _something_ changed. -- vda -
Hi Denis, The problem with most of the patches in this series (other than the two related to vague gcc docs), as you know, turned out to be the fact that a lot of code/callers out there rely on the bitops as also being lock/unlock functions -- and hence the need for disallowing both compiler as well as CPU reordering / optimizations. There were few valid changes in the patchset too, and probably I'll try to look at them the way you described above (or probably it also makes some sense to keep the bitops as-is, "it works" today after all, and there are other avenues for optimizations/cleanups too). Thanks again, Satyam -
From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
[4/8] i386: bitops: Kill volatile-casting of memory addresses
All the occurrences of "volatile" that are used to qualify access to the
passed bit-string pointer/address must be removed, because "volatile" is
crazy, doesn't really work anyway, has nothing to do with locking and
atomicity, and is in general wholly unnecessary for these operations:
* In the case of the unlocked variants, we're not providing any guarantees
whatsoever, so we want the caller to do any necessary locking surrounding
the use of that function anyway.
* In case of the primitives where we *do* make locking/atomicity/reordering
(three very different, but related, concepts) guarantees, we're doing that
properly by using the lock instruction-prefix and "__asm__ __volatile__"
anyway (and with "addr" always being constrained with "m" in the assembly,
gcc doesn't bring it into a local register either).
So let's just kill the pointless use of volatile.
Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
---
include/asm-i386/bitops.h | 60 +++++++++++++++++++++------------------------
1 files changed, 28 insertions(+), 32 deletions(-)
diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index d623fd9..0f5634b 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -26,8 +26,6 @@
* on single-dword quantities.
*/
-#define ADDR (*(volatile long *) addr)
-
/**
* set_bit - Atomically set a bit in memory
* @nr: the bit to set
@@ -38,11 +36,11 @@
*
* See __set_bit() if you do not require the atomic guarantees.
*/
-static inline void set_bit(int nr, volatile unsigned long * addr)
+static inline void set_bit(int nr, unsigned long *addr)
{
__asm__ __volatile__( LOCK_PREFIX
"btsl %1,%0"
- :"=m" (ADDR)
+ :"=m" (*addr)
:"r" (nr));
}
@@ -58,11 +56,11 @@ static inline void...This is wrong. The "const volatile" is so that you can pass an arbitrary pointer. The only kind of abritraty pointer is "const volatile". In other words, the "volatile" has nothing at all to do with whether the memory is volatile or not (the same way "const" has nothing to do with it: it's purely a C type *safety* issue, exactly the same way "const" is a type safety issue. A "const" on a pointer doesn't mean that the thing it points to cannot change. When you pass a source pointer to "strlen()", it doesn't have to be constant. But "strlen()" takes a "const" pointer, because it work son constant pointers *too*. Same deal here. Admittedly this may be mostly historic, but regardless - the "volatiles" are right. Using volatile on *data* is generally considered incorrect and bad taste, but using it in situations like this potentially makes sense. Of course, if we remove all "volatiles" in data in the kernel (with the possible exception of "jiffies"), we can then remove them from function declarations too, but it should be done in that order. Linus -
However... What about that: - This "volatile" will allow to pass pointers to volatile data to the bitops. - Most users of "volatile" in the kenrel (except maybe jiffies) are bogus - Thus let's remove it -as a type safety thing- to catch more of those stupid volatile that shouldn't be ? :-) Besides, as Nick pointed out, it prevents some valid optimizations. Cheers, Ben. -
Quite frankly, I'd really really rather kill the "volatile" data structures independently. Once that is done, it doesn't really matter any No it doesn't. Not the ones on the functions that just do an inline asm. The only valid optimization it might break is for "constant_test_bit()", which isn't even using inline asm. And before we remove that "volatile", we'd better make damn sure that there isn't any driver that does /* Wait for the command queue to be cleared by DMA */ while (test_bit(...)) ; or similar. Yes, it's annoying, but this is a scary and subtle area. And we sadly _have_ had code that does things like that. Linus -
The constant case is probably most used (at least for page flags), and is most important for me. constant_test_bit may not be using inline asm, but the volatile pointer target means that it reloads the value and can't do much optimisation over it. BTW. once volatile goes away, i386 really should start using the C versions of __set_bit and __clear_bit as well IMO. (at least for the constant bitnr case), so gcc can potentially optimise better. -- SUSE Labs, Novell Inc. -
I certainly cannot speak for all the grotty drivers out there, but I've
never ever seen anything like the above. I would consider anyone using
kernel bit operations on DMA memory to be more than a little crazy. But
that's just me :)
Usually you will see
while (1) {
rmb();
if (software_index == hardware_index_in_DMA_memory)
break;
... handle a unit of work ...
}
Though ISTR being told that even rmb() was not sufficient in all cases
[nonetheless, that's what I use in net and SATA drivers and email
recommendations, and people seem happy]
Jeff
-Well, regardless, it still forces the function to treat the pointer target as volatile, won't it? It definitely prevents valid optimisations that would be useful for me in mm/page_alloc.c where page flags are being set up or torn down or checked with non-atomic bitops. OK, not the i386 functions as much because they are all in asm anwyay, but in general (btw. why does i386 or any architecture define their own non-atomic bitops? If the version in asm-generic/bitops/non-atomic.h is not good enough then surely it is a bug in gcc or that file?) Anyway by type safety, do you mean it will stop the compiler from warning if a pointer to a volatile is passed to the bitop? If so, then why don't we just kill all the volatiles out of here and fix any warnings that comeup? I doubt there would be many, and of those, some might show up real synchronisation problems. -- SUSE Labs, Novell Inc. -
Yes, and yes. But I think what he meant there is that we'd need to audit the kernel for all users of set_bit and friends and see if callers actually pass in any _data_ that _is_ volatile. So we have to kill them there first, and then in the function declarations here. I think I'll put The compiler would start warning for all those cases (passing in a pointer to volatile data, when the bitops have lost the volatile casting from their function declarations), actually. Something like "passing argument discards qualifiers from pointer type" ... but considering I didn't see *any* of those warnings after these patches, I'm confused as to what exactly Linus meant here ... and what exactly Satyam -
Because even with an allyesconfig, your compile isn't testing the entire kernel. So given the relatively minor benefit of removing the volatiles, I suppose we shouldn't risk slipping a bug in. If you can make gcc throw an error in that case it would be a different story. -- SUSE Labs, Novell Inc. -
From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Extended asm supports input-output or read-write operands. Use the
constraint character `+' to indicate such an operand and list it with
the output operands. You should only use read-write operands when the
constraints for the operand (or the operand in which only some of the
bits are to be changed) allow a register.
So, using the "+" constraint modifier for memory, like "+m" is bogus.
We must simply specify "=m" which handles the case correctly.
Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
---
include/asm-i386/bitops.h | 24 ++++++++++++------------
1 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 17a302d..d623fd9 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -42,7 +42,7 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
{
__asm__ __volatile__( LOCK_PREFIX
"btsl %1,%0"
- :"+m" (ADDR)
+ :"=m" (ADDR)
:"r" (nr));
}
@@ -62,7 +62,7 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
{
__asm__(
"btsl %1,%0"
- :"+m" (ADDR)
+ :"=m" (ADDR)
:"r" (nr));
}
@@ -80,7 +80,7 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
{
__asm__ __volatile__( LOCK_PREFIX
"btrl %1,%0"
- :"+m" (ADDR)
+ :"=m" (ADDR)
:"r" (nr));
}
@@ -100,7 +100,7 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
{
__asm__ __volatile__(
"btrl %1,%0"
- :"+m" (ADDR)
+ :"=m" (ADDR)
:"r" (nr));
}
@@ -127,7 +127,7 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
{
__asm__ __volatile__(
"btcl %1,%0"
- :"+m" (ADDR)
+ :"=m" (ADDR)
:"r" (nr));
}
@@ -145,7 +145,7 @@ static inline void change_bit(int nr, volatile unsigned long * addr)
{
__asm__ __...No. This is wrong. "=m" means that the new value of the memory location is *set*. Which means that gcc will potentially optimize away any previous stores to that memory location. And yes, it happens, and yes, we've seen it. The gcc docs are secondary. They're not updated, they aren't correct, they don't reflect reality, and they don't matter. The only correct thing to use is "+m" for things like this, or alternatively to just use the "memory" specifier at the end and make gcc generate really crappy code. Linus -
I had a lot of "fun" with this when mucking around with the asm-optimised R/W semaphores. David -
I checked with Honza (cc'ed) and he stated that the + are really needed at least in newer gcc. -Andi -
That extract is from the latest (4.2.1) manual, but they could've forgotten to update the documentation, of course. But even then, I'm not sure they could ever make it "really needed", that'll be a step that would break all existing code, otherwise! :-) Satyam -
From: Satyam Sharma <ssatyam@cse.iitk.ac.in> [2/8] i386: bitops: Rectify bogus "Ir" constraints The "I" constraint (on the i386 platform) is used to restrict constants to the 0..31 range, for use with instructions that must deal with bit numbers. However: * The "I" constraint modifier is applicable only to immediate-value operands, and combining it with "r" is bogus. * gcc will just ignore "I" if we specify "Ir" anyway and treat it same as "r". * Bitops in Linux allow @nr to be arbitrarily large anyway, so we don't want to restrict ourselves to @nr < 32 in the first place. So just kill the bogus "I" constraint modifier. Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in> Cc: David Howells <dhowells@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> --- include/asm-i386/bitops.h | 26 +++++++++++++------------- 1 files changed, 13 insertions(+), 13 deletions(-) diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h index ba8e4bb..17a302d 100644 --- a/include/asm-i386/bitops.h +++ b/include/asm-i386/bitops.h @@ -43,7 +43,7 @@ static inline void set_bit(int nr, volatile unsigned long * addr) __asm__ __volatile__( LOCK_PREFIX "btsl %1,%0" :"+m" (ADDR) - :"Ir" (nr)); + :"r" (nr)); } /** @@ -63,7 +63,7 @@ static inline void __set_bit(int nr, volatile unsigned long * addr) __asm__( "btsl %1,%0" :"+m" (ADDR) - :"Ir" (nr)); + :"r" (nr)); } /** @@ -81,7 +81,7 @@ static inline void clear_bit(int nr, volatile unsigned long * addr) __asm__ __volatile__( LOCK_PREFIX "btrl %1,%0" :"+m" (ADDR) - :"Ir" (nr)); + :"r" (nr)); } /** @@ -101,7 +101,7 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr) __asm__ __volatile__( "btrl %1,%0" :"+m" (ADDR) - :"Ir" (nr)); + :"r" (nr)); } /* @@ -128,7 +128,7 @@ static inline void __change_bit(int nr, volatile unsigned long * addr) __asm__ __volatile__( "btcl %1,%0" :"+m" (ADDR) - :"Ir...
This is wrong too. The whole point of a "Ir" modifier is to say that the instruction takes *either* an "I" or an "r". Andrew - the ones I've looked at were all wrong. Please don't take this series. Linus -
Incidentally, I just noticed the x86-64 bitops have "dIr" as their constraint set. "d" would normally be redundant with "r", and as far as I know, gcc doesn't prefer one over the other without having "?" or "!" as part of the constraint, so is is "d" a stray or is there some meaning behind it? -hpa -
Yup, I had noticed that myself. We would need to use "J" if we want to limit the offsets to 0..63, but "d" sounds weird / stray indeed, unless there's yet another undocumented/wrongly-documented mystery behind this one too ... (I'd like to know even if that's the case, obviously). Satyam -
Yup, sorry about this one, Andi pointed this out earlier. But the "I" must still go I think, for the third reason in that changelog -- it unnecessarily limits the bit offset to 0..31, but (at least from the comment up front in that file) we do allow arbitrarily large @nr (upto I think I'll rescind the series anyway, a lot of patches turned out to be wrong -- some due to mis-reading / incorrect gcc docs, others due to other reasons ... this was just something I did thinking of as a cleanup anyway, so I don't intend to push or correct this or anything. Satyam -
As HPA pointed out that would risk not being correctly assembled by at cpumask_t/nodemask_t bitmap optimizations would be useful. -Andi -
It means I or r, not I modified by r. This means either a immediate constant
0..31 or a register, which is correct.
% cat t18.c
f()
{
asm("xxx %0" :: "rI" (10));
asm("yyy %0" :: "rI" (100));
}
% gcc -O2 -S t18.c
% cat t18.s
...
f:
.LFB2:
#APP
xxx $10
#NO_APP
movl $100, %eax
#APP
yyy %eax
#NO_APP
ret
.LFE2:
...
-Andi
-Hi Andi, Whoa, thanks for explaining that to me -- I didn't know, obviously. I had just written a test program that used "Ir" with an automatic variable defined in the inline function (as is the case with these bitops) and observed that even when I gave > 32 values, it would still work -- hence my conclusion. However, the patch still stands, does it not? [ I will modify the changelog, obviously. ] The thing is that we don't want to limit @nr to <= 31 in the first place, or am I wrong again? :-) Thanks, Satyam -
These bit operations only allow 8 bit immediates, so 0..255 would be correct. N might work from the 4.1 docs, but I don't know if it works in all old supported gccs (3.2+) However I is definitely not wrong and most bit numbers are small anyways. -Andi -
"I" is correct. The Intel documentation on this is highly confusing (and has bugs in it), but it does unambiguously state: "Some assemblers support immediate bit offsets larger than 31 by using the immediate bit offset field in combination with the displacement field in the memory operand ... The processor will ignore the high-order bits if they are not zero." AMD processors might be different for all I know. So unless gas is capable of doing this transformation (and it's not as of binutils-2.17.50.0.6) "I" is what's needed here. -hpa -
Just tested it on a K8 machine; AMD behaves the same way. So "I" is correct, and changing it to "N" would introduce a bug. The only way to optimize this is by using __builtin_constant_p() and adjust the offset appropriately. -hpa -
'N' constraint was there pretty much forever, originally intended for in/out operands. It is in 2.95 docs http://gcc.gnu.org/onlinedocs/gcc-2.95.3/gcc_16.html#SEC175 Honza -
From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
memory barriers around clear_bit() must also implement these two interfaces.
However, for i386, clear_bit() is a strict, locked, atomic and
un-reorderable operation and includes an implicit memory barrier already.
But these two functions have been wrongly defined as "barrier()" which is
a pointless _compiler optimization_ barrier, and only serves to make gcc
not do legitimate optimizations that it could have otherwise done.
So let's make these proper no-ops, because that's exactly what we require
these to be on the i386 platform.
Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
---
[ A similar optimization needs to be done in the atomic.h also.
Will submit that patch shortly. ]
include/asm-i386/bitops.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 4f1fda5..42999eb 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -106,8 +106,8 @@ static inline void __clear_bit(int nr, unsigned long *addr)
* Bit operations are already serializing on x86.
* These must still be defined here for API completeness.
*/
-#define smp_mb__before_clear_bit() barrier()
-#define smp_mb__after_clear_bit() barrier()
+#define smp_mb__before_clear_bit() do {} while (0)
+#define smp_mb__after_clear_bit() do {} while (0)
/**
* __change_bit - Toggle a bit in memory
-No. clear_bit is not a compiler barrier on i386, thus smp_mb__before/after -- SUSE Labs, Novell Inc. -
Not so obvious. Why do we require these to be a full compiler barrier
is precisely the question I raised here.
Consider this (the above two functions exist only for clear_bit(),
the atomic variant, as you already know), the _only_ memory reference
we care about is that of the address of the passed bit-string:
(1) The compiler must not optimize / elid it (i.e. we need to disallow
compiler optimization for that reference) -- but we've already taken
care of that with the __asm__ __volatile__ and the constraints on
the memory "addr" operand there, and,
(2) For the i386, it also includes an implicit memory (CPU) barrier
already.
So I /think/ it makes sense to let the compiler optimize _other_ memory
references across the call to clear_bit(). There's a difference. I think
we'd be safe even if we do this, because the synchronization in callers
must be based upon the _passed bit-string_, otherwise _they_ are the
ones who're buggy.
[ However, elsewhere Jeremy Fitzhardinge mentioned the case of
some callers, for instance, doing a memset() on an alias of
the same bit-string. But again, I think that is dodgy/buggy/
extremely border-line usage on the caller's side itself ...
*unless* the caller is doing that inside a higher-level lock
anyway, in which case he wouldn't be needing to use the
locked variants either ... ]
[ For those interested, I've been looking at the code generated
for the test kernel I built with these patches, and I don't
really see anything gcc did that it shouldn't have -- but ok,
that doesn't mean other versions/toolchains for other setups
won't. Also, the test box has been up all night, but I'm only
running Firefox on it anyway, and don't really know how to
verify if I've introduced any correctness issues / bugs. ]
Satyam
-Repeating what has been said before: A CPU memory barrier is not a compiler barrier or vice versa. Seeing as we are talking about the compiler barrier, it is irrelevant as to whether or not the Yes it makes sense to let the compiler move memory operations over clear_bit(), because we have defined the interface to be nice and relaxed. And this is exactly why we do need to have an additional correct output != correct input. Without a barrier there, we _allow_ the compiler to reorder. If it does not reorder, the missing barrier is still a bug :) -- SUSE Labs, Novell Inc. -
[ Compiler barrier, you mean, that's not true of CPU barriers. ]
In any case, I know that, obviously. I asked "why" not "what" :-) i.e.
why should we care about other addresses / why do we want to extend
the compiler barrier to all memory references -- but Jeremy seems to
I think it is quite relevant, in fact. From Documentation/atomic_ops.txt,
smp_mb__{before,after}_clear_bit(), as the name itself suggests, must
be _CPU barriers_ for those arch's that don't have an implicit
_CPU barrier_ in the clear_bit() itself [ which i386 does have already ].
As for a compiler barrier, the asm there already guarantees the compiler
will not optimize references to _that_ address, but there could still be
the memset()/set{clear}_bit() interspersing pitfalls for example, so yeah
the memory clobber would probably protect us there.
Satyam
-For the purpose of this discussion (Linux memory
barrier semantics, on WB memory), it is true of CPU
Obviously because we want some kind of ordering
guarantee at a given point. All the CPU barriers
in the world are useless if the compiler can reorder
One or both of us still fails to understand the other.
bit_spin_lock(LOCK_NR, &word);
var++;
/* this is bit_spin_unlock(LOCK_NR, &word); */
smp_mb__before_clear_bit();
clear_bit(LOCK_NR, &word);
Are you saying that it is OK for the store to var to
be reordered below the clear_bit? If not, what are you
saying?
Yahoo!7 Mail has just got even bigger and better with unlimited storage on all webmail accounts.
http://au.docs.yahoo.com/mail/unlimitedstorage.html
-On later Intel processors, if the memory address range being referenced (and say written to) by the (locked) instruction is in the cache of a CPU, then it will not assert the LOCK signal on the system bus -- thus not assume exclusive use of shared memory. So other CPUs are free to modify (other) memory at that point. Cache coherency will still I might be making a radical turn-around here, but all of a sudden I think it's actually a good idea to put a complete memory clobber in set_bit/clear_bit and friends themselves, and leave the "__" variants as they are. Satyam -
The system bus does not need to be serialised because the CPU already holds the cacheline in exclusive state. That *is* the cache coherency protocol. The memory ordering is enforced by the CPU effectively preventing speculative loads to pass the locked instruction and ensuring all stores reach the cache before it is executed. (I say effectively Why? -- SUSE Labs, Novell Inc. -
Looks like when you said "CPU memory barrier extends to all memory references" you were probably referring to a _given_ CPU ... yes, Well, why not. Callers who don't want/need any guarantees whatsoever can still use __foo() -- for others, it makes sense to just use foo() and get *both* the compiler and CPU barrier semantics -- I think that's the behaviour most callers would want anyway. Satyam -
No. CPU memory barriers extend to all CPU's. End of discussion.
It's not about "that cacheline". The whole *point* of a CPU memory barrier
is that it's about independent memory accesses.
Yes, for a memory barrier to be effective, all CPU's involved in the
transaction have to have the barriers - the same way a lock needs to be
taken by everybody in order for it to make sense - but the point is, CPU
barriers are about *global* behaviour, not local ones.
So there's a *huge* difference between
clear_bit(x,y);
and
clear_bit(x,y);
smp_mb__before_after_clear_bit();
and it has absolutely nothing to do with the particular cacheline that "y"
is in, it's about the *global* memory ordering.
Any write you do after that "smp_mb__before_after_clear_bit()" will be
guaranteed to be visible to _other_ CPU's *after* they have seen the bit
being cleared. Yes, those other CPU's need to have a read barrier between
reading the bit and reading some other thign, but the point is, this hass
*nothing* to do with cache coherency, and the particular cache line that
"y" is in.
And no, "smp_mb__before/after_clear_bit()" must *not* be just an empty "do
{} while (0)". It needs to be a compiler barrier even when it has no
actual CPU meaning, unless clear_bit() itself is guaranteed to be a
compiler barrier (which it isn't, although the "volatile" on the asm in
practice makes it something *close* to that).
Why? Think of the sequence like this:
clear_bit(x,y);
smp_mb__after_clear_bit();
other_variable = 10;
the whole *point* of this sequence is that if another CPU does
x = other_variable;
smp_rmb();
bit = test_bit(x,y)
then if it sees "x" being 10, then the bit *has* to be clear.
And this is why the compiler barrier in "smp_mb__after_clear_bit()" needs
to be a compiler barrier:
- it doesn't matter for the action of the "clear_bit()" itself: that one
is locked, and on x86 it thus also happens to be a serializing
instruction, and the cache cohe...Ok, thanks much (David and Nick too!) for taking the time to explain this out clearly. It does look amazingly obvious now that I see it, with callers using bitops to implement locking schemes who would completely break otherwise -- in fact 6 and 8 in this series look amazingly obtuse now ... Satyam -
Firstly, __foo() is non-atomic, secondly set_bit/clear_bit/etc do not provide any CPU or compiler semantics. It was set up this way because other CPU ISAs don't couple atomicity with memory ordering, and with many implementations of those, the cost to order memory is quite high. -- SUSE Labs, Novell Inc. -
You miss my point. If you have:
memset(&my_bitmask, 0, sizeof(my_bitmask));
set_bit(my_bitmask, 44);
Then unless the set_bit has a constraint argument which covers the whole
of the (multiword) bitmask, the compiler may see fit to interleave the
memset writes with the set_bit in bad ways. In other words, plain "+m"
(*(long *)ptr) won't cut it. You'd need "+m" (my_bitmask), I think.
J
-That's a valid point, and looks like it is a bug in the existing i386 macros, doesn't it? We should be clobbering addr + BITOP_WORD(nr). -- SUSE Labs, Novell Inc. -
From: Satyam Sharma <ssatyam@cse.iitk.ac.in> [6/8] i386: bitops: Don't mark memory as clobbered unnecessarily The goal is to let gcc generate good, beautiful, optimized code. But test_and_set_bit, test_and_clear_bit, __test_and_change_bit, and test_and_change_bit unnecessarily mark all of memory as clobbered, thereby preventing gcc from doing perfectly valid optimizations. The case of __test_and_change_bit() is particularly surprising, given that it's a variant where we don't make any guarantees at all. Even for the other three cases, the (only) instruction that accesses shared memory is btsl/btrl/btcl and requires locking and atomicity. But we handle that properly already by the use of the lock prefix. Also, "=m" is already specified in the output operands of the asm for the passed bit-string pointer, so the gcc optimizer knows what to do and what not to do anyway. So there's no point clobbering all of memory in these functions. Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in> Cc: David Howells <dhowells@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> --- include/asm-i386/bitops.h | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h index 0f5634b..f37b8a2 100644 --- a/include/asm-i386/bitops.h +++ b/include/asm-i386/bitops.h @@ -164,7 +164,7 @@ static inline int test_and_set_bit(int nr, unsigned long *addr) __asm__ __volatile__( LOCK_PREFIX "btsl %2,%1\n\tsbbl %0,%0" :"=r" (oldbit),"=m" (*addr) - :"r" (nr) : "memory"); + :"r" (nr)); return oldbit; } @@ -208,7 +208,7 @@ static inline int test_and_clear_bit(int nr, unsigned long *addr) __asm__ __volatile__( LOCK_PREFIX "btrl %2,%1\n\tsbbl %0,%0" :"=r" (oldbit),"=m" (*addr) - :"r" (nr) : "memory"); + :"r" (nr)); return oldbit; } @@ -254,7 +254,7 @@ static inline int __test_and_change_bit(int nr, unsigned long *addr) __asm__ __volatile__( "btcl %2,%1\n\t...
__test_and_change_bit is one that you could remove the memory clobber -- SUSE Labs, Novell Inc. -
Yes, for the atomic versions we don't care if we're asking gcc to generate trashy code (even though I'd have wanted to only disallow problematic optimizations -- ones involving the passed bit-string address -- there, and allow other memory references to be optimized as and how the compiler feels like it) because the atomic variants are slow anyway and we probably want to be extra-safe there. But for the non-atomic variants, it does make sense to remove the memory clobber (and the unneeded __asm__ __volatile__ that another patch did -- for the non-atomic variants, again). OTOH, as per Linus' review it seems we can drop the "memory" clobber and specify the output operand for the extended asm as "+m". But I must admit I didn't quite understand that at all. [ I should probably start reading gcc sources, the docs are said to be insufficient/out-of-date, as per the reviews of the patches. ] Satyam -
As I understand it, the "+m" indicates to the compiler a restriction on the ordering of things that access that particular memory location, whereas a "memory" indicates a restriction on the orderings of all accesses to memory - precisely what you need to produce a lock. There are a number of things that use test_and_set_bit() and co to implement a lock or other synchronisation. This means that they must exhibit LOCK-class barrier effects or better. LOCK-class barrier effects mean, more or less, that all memory accesses issued before the lock must happen before all memory accesses issued after the lock. But it most happen at both CPU-level and compiler-level. The "memory" constraint instructs the compiler in this regard. Remember also that this is gcc black magic and so some of it has had to be worked out empirically - possibly after sacrificing a goat under a full moon. David -
Hi David, Ok, thanks -- I didn't know gcc's behaviour w.r.t. "+m" at all, but in my Yes, thanks for laying this out so clearly, again. So combined with what you explained above, I think I now fully understand why most of this Satyam -
No. It has nothing to do with atomicity and all to do with ordering. For example test_bit, clear_bit, set_bit, etc are all atomic but none place any restrictions on ordering. __test_and_change_bit has no restriction on ordering, so as long as the correct operands are clobbered, a "memory" clobber to enforce a Yes, but that is for cases where "memory" is being used to say that some otherwise unspecified memory is actually being changed, rather than to provide a compiler barrier as is the case for test_and_set_bit You should read Documentation/atomic_ops.txt and memory-barriers.txt, which are quite useful and should be uptodate. -- SUSE Labs, Novell Inc. -
Speaking of that, why are all the asm functions in arch/i386/lib/string.c
defined as having a memory clobber, even those which don't modify memory
like strcmp, strchr, strlen and so on?
Why are they all volatile too? Even those with no side effects.
It seems like any asm which writes to or _reads from_ memory that isn't one
of the operands must have a memory clobber. And any asm with side effects
other than it's operands (and asm("mov $0, (%0)" : : "r"(some_pointer))
has side effects) must be volatile.
So something like strlen() needs to have a memory clobber, otherwise a
store to the string could be moved to the other side of the strlen(), but
doesn't need to be volatile, since it has no side effects.
-That's because the memory clobber will serialize the inline asm with anything else that reads or writes memory. So even if we don't actually change any memory, if we cannot describe what we *read*, then we need to tell gcc to not re-order us wrt things that could *write*. And the simplest way to do that is to say that you clobber memory, even if you don't. So as a more concrete example: imagine that we're doing a "strlen()" on some local variable. We need to tell gcc that that variable has to be updated in memory before it schedules the asm. The memory clobber does that. (Yes, the "asm volatile" may do so too, but it's very unclear what the "volatile" on the asm actually does, so ..) Linus -
I went a made a test suite to see what really happened, and this isn't how it works. It appears that a memory clobber only tells gcc that the asm writes to memory. It does _not_ tell gcc that the asm reads from memory. It's at http://www.speakeasy.org/~xyzzy/download/opttest.tar.gz It's only 3k, but there are 16 files so I'm not inlining it. The suite has a few test c files, which are compiled with various different functions, inline norm asm, inline volatile asm, inline asm with a memory clobber, a normal function, a __attribute__((const)) function, and so on. They are compiled to asm file, and then a perl script scans the asm files to figure out what optimizations gcc made. "make test" will compile all the tests and run them through the perl scripts. "make test1" will just run test1, etc. It appears that a normal asm, one without volatile or a memory clobber, is treated like a const function, which returns the output via a struct (not using pass-by-address). It has no side-effects, can't read or write global variables, and can't dereference pointer arguments. Adding volatile tells gcc the asm has some hidden side-effect. It still can't r/w globals or dereference inputs. But it won't get elided if there are no used outputs, common subexpression merged, or treated as a loop invariant. Adding a memory clobber tells gcc that the asm modifies memory. It doesn't modify un-aliased local variables in registers. It does modify aliased local variables. It does not read from memory. gcc will move or elide a memory write before an asm with a memory clobber if nothing else (besides the asm) could see the write. A memory clobber doesn't count as a side-effect either, a non-volatile asm without unused outputs will be elided, even if has a memory clobber. Specifically, check test6_memasm.s. The C code looks like this: extern int a; /* keep asm from being elided for having no used output */ static inline void bar(void) { asm("call bar" : "=m"(a) : : "memory"); } /* flo...
Hmm. I really think you should take this up with the gcc people. That looks like a gcc bug - because there really is nothing that guarantees that the asm doesn't change the array that "x" points to, and the asm I can't explain it. I do think you've found a gcc bug. That said, the kernel mostly uses "asm volatile()" _together_ with a memory clobber for these kinds of things, so it sounds like the kernel wouldn't be impacted. But you're definitely right - the above report makes Well, that's likely just a subtle register allocation issue, and understandable. Generating perfect code is impossible, you want to generate good code on average. Linus -
Actually, I take that back. I think gcc does the right thing, and yes, it's explained by the memory clobber being just a blind "write to memory" rather than read memory. My bad. It does leave us with very few ways of saying that an asm can *read* memory, and so it might be good to have it clarified that "volatile" implies that (at least with the memory clobber). Your examples are good, I think. Linus -
Without the volatile they get completely optimized away :/ [tried that, cost a lot of debugging time -- empty string functions cause a lot of strange side effects] -Andi -
Sure, that's *one* thing that "volatile" guarantees (it guarantees that gcc won't optimize away things where the end result isn't actually visibly used). But gcc docs also talk about the other things volatile means, including "not significantly moved". Is that "not significantly moved" already equivalent to "not moved past something that can change memory"? Probably not. Which is why it's a good idea to have the "memory" clobber there too, because the semantics of the "volatile" just aren't very well defined from a code movement standpoint (while a general memory clobber is *very* clear: if gcc moves the inline asm across another thing that uses/changes memory, that's a gcc bug, plain and simple (*)). Linus (*) Other than pure internal gcc spills/reloads. Spilling/reloading local registers that haven't had their address taken obviously cannot be seen as memory accesses. -
Actually, it doesn't. In fact it goes out of its way to say that "asm
volatile" statements can be moved quite a bit, with respect to other
asms, other code, jumps, basic blocks, etc. The only reliable way to
control the placement of an asm is have the right dependencies.
The `volatile' keyword indicates that the instruction has important
side-effects. GCC will not delete a volatile `asm' if it is reachable.
(The instruction can still be deleted if GCC can prove that
control-flow will never reach the location of the instruction.) Note
that even a volatile `asm' instruction can be moved relative to other
code, including across jump instructions.
also:
An `asm' instruction without any output operands will be treated
identically to a volatile `asm' instruction.
So there isn't anything very special about "asm volatile". It's purely
to stop apparently useless code from being removed.
J
-Ahh. That's newer. Historically, gcc manuals used to say "may not be deleted or significantly reordered". So they've weakened what it means, probably exactly because it wasn't well-defined before either. Linus -
The memory clobber, or the volatile asm? There's more than one variable
here ... but still, I don't think either affects _ordering_ in any
But why even for the other operations? Consider (current code of)
test_and_set_bit():
static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
{
int oldbit;
__asm__ __volatile__( LOCK_PREFIX
"btsl %2,%1\n\tsbbl %0,%0"
:"=r" (oldbit),"+m" (ADDR)
:"Ir" (nr) : "memory");
return oldbit;
}
The only memory reference in there is to the passed address, it will
be modified, yes, but that's been made obvious to gcc in the asm
already. So why are we marking all of memory as clobbered, is the
question. (I just read Jeremy's latest reply, but I don't see how
or why the memory clobber helps that case either -- does a memory
I have, of course. Will probably read again, thanks.
Satyam
-Of course, because then the compiler can't assume anything about the contents of memory after the operation. #define barrier() __asm__ __v
