login
Header Space

 
 

Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions

Previous thread: Re: Linus 2.6.23-rc1 by Bob Picco on Monday, July 23, 2007 - 11:47 am. (2 messages)

Next thread: Plan 9 Resource Sharing Support - what is it? by Pavel Machek on Monday, July 23, 2007 - 3:54 am. (4 messages)
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 12:05 pm

Hi,

There was a lot of bogus stuff that include/asm-i386/bitops.h was doing,
that was unnecessary and not required for the correctness of those APIs.
All that superfluous stuff was also unnecessarily disallowing compiler
optimization possibilities, and making gcc generate code that wasn't as
beautiful as it could otherwise have been. Hence the following series
of cleanups (some trivial, some surprising, in no particular order):

* Bogus extended asm constraints (such as specifying immediate-value
  constraint-modifiers to operands constrained to local registers)
* Obsolete / inapplicable-to-x86 / incorrect comments
* Marking "memory" as clobbered for no good reason
* Volatile-casting of memory addresses
  (wholly unnecessary, makes gcc generate bad code)
* Unwarranted use of __asm__ __volatile__ even when those semantics
  are not required
* Unnecessarily harsh definitions of smp_mb__{before, after}_clear_bit()
  (again, this was like *asking* gcc to generate bad code)


These patches fix no bugs, as there were none (or else we'd have known
long before, obviously). This is just an attempt to clean up / bring down
the bogosity level of that file by several microlenats :-)

Almost all the above are also applicable to include/asm-x86_64/bitops.h
so I will send a patch for that also, later. I have zero knowledge of
other arch's so cannot audit them, sorry.


My testbox boots/works fine with all these patches (uptime half an hour)
and the compressed bzImage is smaller by about ~2 KB for my .config --
don't know about savings in modules. But considering these primitives get
inlined from several places in the tree, I'd expect good savings for
allyesconfig or allmodconfig kind of setups. We've also probably lost
a few cycles from kernel codepaths that use these functions, but I haven't
verified that experimentally.


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 30, 2007 - 1:57 pm

Hi Satyam,


[I did read entire thread]

Welcome to the minefield.

This bitops and barrier stuff is complicated. It's very easy to
introduce bugs which are hard to trigger, or happen only with some
specific gcc versions, or only on massively parallel SMP boxes.

You can also make technically correct changes which relax needlessly
strict barrier semantics of some bitops and trigger latent bugs
in code which was unknowingly depending on it.

How you can proceed:

Make a change which you believe is right. Recompile allyesconfig
kernel with and without this change. Find a piece of assembly code
which become different. Check that new code is correct (and smaller
and/or faster). Post your patch together with example(s) of code
fragments that got better. Be as verbose as needed.

Repeat for each change separately.

This can be painfully slow, but less likely to be rejected outright

I vaguely remember that "memory" clobbers are needed in some rather
obscure, non-obvious situations. Google for it - Linus wrote about it




For this kind of things, you really need something more stressing.

At least it proves that _something_ changed.
--
vda
-
To: Denis Vlasenko <vda.linux@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 30, 2007 - 9:07 pm

Hi Denis,




The problem with most of the patches in this series (other than the two
related to vague gcc docs), as you know, turned out to be the fact that
a lot of code/callers out there rely on the bitops as also being
lock/unlock functions -- and hence the need for disallowing both
compiler as well as CPU reordering / optimizations.

There were few valid changes in the patchset too, and probably I'll try
to look at them the way you described above (or probably it also makes
some sense to keep the bitops as-is, "it works" today after all, and
there are other avenues for optimizations/cleanups too).


Thanks again,
Satyam
-
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 12:05 pm

From: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;

[4/8] i386: bitops: Kill volatile-casting of memory addresses

All the occurrences of "volatile" that are used to qualify access to the
passed bit-string pointer/address must be removed, because "volatile" is
crazy, doesn't really work anyway, has nothing to do with locking and
atomicity, and is in general wholly unnecessary for these operations:

* In the case of the unlocked variants, we're not providing any guarantees
  whatsoever, so we want the caller to do any necessary locking surrounding
  the use of that function anyway.

* In case of the primitives where we *do* make locking/atomicity/reordering
  (three very different, but related, concepts) guarantees, we're doing that
  properly by using the lock instruction-prefix and "__asm__ __volatile__"
  anyway (and with "addr" always being constrained with "m" in the assembly,
  gcc doesn't bring it into a local register either).

So let's just kill the pointless use of volatile.

Signed-off-by: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Nick Piggin &lt;nickpiggin@yahoo.com.au&gt;

---

 include/asm-i386/bitops.h |   60 +++++++++++++++++++++------------------------
 1 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index d623fd9..0f5634b 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -26,8 +26,6 @@
  * on single-dword quantities.
  */
 
-#define ADDR (*(volatile long *) addr)
-
 /**
  * set_bit - Atomically set a bit in memory
  * @nr: the bit to set
@@ -38,11 +36,11 @@
  *
  * See __set_bit() if you do not require the atomic guarantees.
  */
-static inline void set_bit(int nr, volatile unsigned long * addr)
+static inline void set_bit(int nr, unsigned long *addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
-		:"=m" (ADDR)
+		:"=m" (*addr)
 		:"r" (nr));
 }
 
@@ -58,11 +56,11 @@ static inline void...
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 1:52 pm

This is wrong.

The "const volatile" is so that you can pass an arbitrary pointer. The 
only kind of abritraty pointer is "const volatile".

In other words, the "volatile" has nothing at all to do with whether the 
memory is volatile or not (the same way "const" has nothing to do with it: 
it's purely a C type *safety* issue, exactly the same way "const" is a 
type safety issue.

A "const" on a pointer doesn't mean that the thing it points to cannot 
change. When you pass a source pointer to "strlen()", it doesn't have to 
be constant. But "strlen()" takes a "const" pointer, because it work son 
constant pointers *too*.

Same deal here.

Admittedly this may be mostly historic, but regardless - the "volatiles" 
are right.

Using volatile on *data* is generally considered incorrect and bad taste, 
but using it in situations like this potentially makes sense.

Of course, if we remove all "volatiles" in data in the kernel (with the 
possible exception of "jiffies"), we can then remove them from function 
declarations too, but it should be done in that order.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 5:49 am

However... What about that:

 - This "volatile" will allow to pass pointers to volatile data to the
bitops.

 - Most users of "volatile" in the kenrel (except maybe jiffies) are
bogus

 - Thus let's remove it -as a type safety thing- to catch more of those
stupid volatile that shouldn't be ? :-)

Besides, as Nick pointed out, it prevents some valid optimizations.

Cheers,
Ben.

-
To: Benjamin Herrenschmidt <benh@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 1:20 pm

Quite frankly, I'd really really rather kill the "volatile" data
structures independently. Once that is done, it doesn't really matter any

No it doesn't. Not the ones on the functions that just do an inline asm.

The only valid optimization it might break is for "constant_test_bit()", 
which isn't even using inline asm.

And before we remove that "volatile", we'd better make damn sure that 
there isn't any driver that does

	/* Wait for the command queue to be cleared by DMA */
	while (test_bit(...))
		;

or similar.

Yes, it's annoying, but this is a scary and subtle area. And we sadly 
_have_ had code that does things like that.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Benjamin Herrenschmidt <benh@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Wednesday, July 25, 2007 - 12:54 am

The constant case is probably most used (at least for page flags), and
is most important for me. constant_test_bit may not be using inline asm,
but the volatile pointer target means that it reloads the value and can't
do much optimisation over it.

BTW. once volatile goes away, i386 really should start using the C
versions of __set_bit and __clear_bit as well IMO. (at least for the
constant bitnr case), so gcc can potentially optimise better.

-- 
SUSE Labs, Novell Inc.
-
To: Linus Torvalds <torvalds@...>
Cc: Benjamin Herrenschmidt <benh@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 1:39 pm

I certainly cannot speak for all the grotty drivers out there, but I've 
never ever seen anything like the above.  I would consider anyone using 
kernel bit operations on DMA memory to be more than a little crazy.  But 
that's just me :)

Usually you will see

	while (1) {
		rmb();
		if (software_index == hardware_index_in_DMA_memory)
			break;
		
		... handle a unit of work ...
	}

Though ISTR being told that even rmb() was not sufficient in all cases 
[nonetheless, that's what I use in net and SATA drivers and email 
recommendations, and people seem happy]

	Jeff


-
To: Linus Torvalds <torvalds@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 12:19 am

Well, regardless, it still forces the function to treat the pointer
target as volatile, won't it? It definitely prevents valid optimisations
that would be useful for me in mm/page_alloc.c where page flags are
being set up or torn down or checked with non-atomic bitops.

OK, not the i386 functions as much because they are all in asm anwyay,
but in general (btw. why does i386 or any architecture define their own
non-atomic bitops? If the version in asm-generic/bitops/non-atomic.h
is not good enough then surely it is a bug in gcc or that file?)

Anyway by type safety, do you mean it will stop the compiler from
warning if a pointer to a volatile is passed to the bitop? If so, then
why don't we just kill all the volatiles out of here and fix any
warnings that comeup? I doubt there would be many, and of those, some
might show up real synchronisation problems.

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 2:23 am

Yes, and yes. But I think what he meant there is that we'd need to
audit the kernel for all users of set_bit and friends and see if callers
actually pass in any _data_ that _is_ volatile. So we have to kill them
there first, and then in the function declarations here. I think I'll put

The compiler would start warning for all those cases (passing in
a pointer to volatile data, when the bitops have lost the volatile
casting from their function declarations), actually. Something like
"passing argument discards qualifiers from pointer type" ... but
considering I didn't see *any* of those warnings after these patches,
I'm confused as to what exactly Linus meant here ... and what exactly




Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 3:16 am

Because even with an allyesconfig, your compile isn't testing the
entire kernel. So given the relatively minor benefit of removing
the volatiles, I suppose we shouldn't risk slipping a bug in. If
you can make gcc throw an error in that case it would be a different
story.

-- 
SUSE Labs, Novell Inc.
-
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 12:05 pm

From: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;


  Extended asm supports input-output or read-write operands. Use the
  constraint character `+' to indicate such an operand and list it with
  the output operands. You should only use read-write operands when the
  constraints for the operand (or the operand in which only some of the
  bits are to be changed) allow a register.

So, using the "+" constraint modifier for memory, like "+m" is bogus.
We must simply specify "=m" which handles the case correctly.

Signed-off-by: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Nick Piggin &lt;nickpiggin@yahoo.com.au&gt;

---

 include/asm-i386/bitops.h |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 17a302d..d623fd9 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -42,7 +42,7 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -62,7 +62,7 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__(
 		"btsl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -80,7 +80,7 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -100,7 +100,7 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__(
 		"btrl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -127,7 +127,7 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__(
 		"btcl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -145,7 +145,7 @@ static inline void change_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __...
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 1:46 pm

No. This is wrong.

"=m" means that the new value of the memory location is *set*.

Which means that gcc will potentially optimize away any previous stores to 
that memory location.

And yes, it happens, and yes, we've seen it.

The gcc docs are secondary. They're not updated, they aren't correct, they 
don't reflect reality, and they don't matter. The only correct thing to 
use is "+m" for things like this, or alternatively to just use the 
"memory" specifier at the end and make gcc generate really crappy code.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 5:22 am

I had a lot of "fun" with this when mucking around with the asm-optimised R/W
semaphores.

David
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <jh@...>
Date: Monday, July 23, 2007 - 12:37 pm

I checked with Honza (cc'ed) and he stated that the + are really needed
at least in newer gcc.

-Andi
-
To: Andi Kleen <ak@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <jh@...>
Date: Monday, July 23, 2007 - 1:15 pm

That extract is from the latest (4.2.1) manual, but they could've
forgotten to update the documentation, of course.

But even then, I'm not sure they could ever make it "really needed",
that'll be a step that would break all existing code, otherwise! :-)

Satyam
-
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 12:05 pm

From: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;

[2/8] i386: bitops: Rectify bogus "Ir" constraints

The "I" constraint (on the i386 platform) is used to restrict constants to
the 0..31 range, for use with instructions that must deal with bit numbers.

However:
* The "I" constraint modifier is applicable only to immediate-value operands,
  and combining it with "r" is bogus.
* gcc will just ignore "I" if we specify "Ir" anyway and treat it same as "r".
* Bitops in Linux allow @nr to be arbitrarily large anyway, so we don't want
  to restrict ourselves to @nr &lt; 32 in the first place.

So just kill the bogus "I" constraint modifier.

Signed-off-by: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Nick Piggin &lt;nickpiggin@yahoo.com.au&gt;

---

 include/asm-i386/bitops.h |   26 +++++++++++++-------------
 1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index ba8e4bb..17a302d 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -43,7 +43,7 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -63,7 +63,7 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
 	__asm__(
 		"btsl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -81,7 +81,7 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -101,7 +101,7 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__(
 		"btrl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /*
@@ -128,7 +128,7 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__(
 		"btcl %1,%0"
 		:"+m" (ADDR)
-		:"Ir...
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 1:57 pm

This is wrong too.

The whole point of a "Ir" modifier is to say that the instruction takes 
*either* an "I" or an "r".

Andrew - the ones I've looked at were all wrong. Please don't take this 
series.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 2:39 pm

Incidentally, I just noticed the x86-64 bitops have "dIr" as their
constraint set.  "d" would normally be redundant with "r", and as far as
I know, gcc doesn't prefer one over the other without having "?" or "!"
as part of the constraint, so is is "d" a stray or is there some meaning
behind it?

	-hpa
-
To: H. Peter Anvin <hpa@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 2:52 pm

Yup, I had noticed that myself. We would need to use "J" if we want
to limit the offsets to 0..63, but "d" sounds weird / stray indeed,
unless there's yet another undocumented/wrongly-documented mystery
behind this one too ... (I'd like to know even if that's the case,
obviously).

Satyam
-
To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 2:14 pm

Yup, sorry about this one, Andi pointed this out earlier. But the "I"
must still go I think, for the third reason in that changelog -- it
unnecessarily limits the bit offset to 0..31, but (at least from the
comment up front in that file) we do allow arbitrarily large @nr (upto

I think I'll rescind the series anyway, a lot of patches turned out to
be wrong -- some due to mis-reading / incorrect gcc docs, others due to
other reasons ... this was just something I did thinking of as a cleanup
anyway, so I don't intend to push or correct this or anything.

Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 2:32 pm

As HPA pointed out that would risk not being correctly assembled by at 

cpumask_t/nodemask_t bitmap optimizations would be useful.

-Andi

-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 12:10 pm

It means I or r, not I modified by r. This means either a immediate constant
0..31 or a register, which is correct.

% cat t18.c 

f()
{
        asm("xxx %0" :: "rI" (10));
        asm("yyy %0" :: "rI" (100));
}
% gcc -O2 -S t18.c
% cat t18.s
...
f:
.LFB2:
#APP
        xxx $10
#NO_APP
        movl    $100, %eax
#APP
        yyy %eax
#NO_APP
        ret
.LFE2:
...

-Andi

-
To: Andi Kleen <ak@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 12:21 pm

Hi Andi,




Whoa, thanks for explaining that to me -- I didn't know, obviously. I had
just written a test program that used "Ir" with an automatic variable
defined in the inline function (as is the case with these bitops) and
observed that even when I gave &gt; 32 values, it would still work -- hence
my conclusion.

However, the patch still stands, does it not? [ I will modify the
changelog, obviously. ] The thing is that we don't want to limit
@nr to &lt;= 31 in the first place, or am I wrong again? :-)

Thanks,
Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <jh@...>
Date: Monday, July 23, 2007 - 12:30 pm

These bit operations only allow 8 bit immediates, so 0..255 would
be correct. N might work from the 4.1 docs, but I don't know if it works 
in all old supported gccs (3.2+)

However I is definitely not wrong and most bit numbers are small anyways.

-Andi
-
To: Andi Kleen <ak@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <jh@...>
Date: Monday, July 23, 2007 - 2:05 pm

"I" is correct.  The Intel documentation on this is highly confusing
(and has bugs in it), but it does unambiguously state:

"Some assemblers support immediate bit offsets larger than 31 by using
the immediate bit offset field in combination with the displacement
field in the memory operand ... The processor will ignore the high-order
bits if they are not zero."  AMD processors might be different for all I
know.

So unless gas is capable of doing this transformation (and it's not as
of binutils-2.17.50.0.6) "I" is what's needed here.

	-hpa

-
To: Andi Kleen <ak@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <jh@...>
Date: Monday, July 23, 2007 - 2:28 pm

Just tested it on a K8 machine; AMD behaves the same way.  So "I" is
correct, and changing it to "N" would introduce a bug.

The only way to optimize this is by using __builtin_constant_p() and
adjust the offset appropriately.

	-hpa
-
To: Andi Kleen <ak@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <jh@...>
Date: Monday, July 23, 2007 - 12:36 pm

'N' constraint was there pretty much forever, originally intended for
in/out operands.  It is in 2.95 docs
http://gcc.gnu.org/onlinedocs/gcc-2.95.3/gcc_16.html#SEC175

Honza
-
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 12:06 pm

From: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;

memory barriers around clear_bit() must also implement these two interfaces.
However, for i386, clear_bit() is a strict, locked, atomic and
un-reorderable operation and includes an implicit memory barrier already.

But these two functions have been wrongly defined as "barrier()" which is
a pointless _compiler optimization_ barrier, and only serves to make gcc
not do legitimate optimizations that it could have otherwise done.

So let's make these proper no-ops, because that's exactly what we require
these to be on the i386 platform.

Signed-off-by: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Nick Piggin &lt;nickpiggin@yahoo.com.au&gt;

---

[ A similar optimization needs to be done in the atomic.h also.
  Will submit that patch shortly. ]

 include/asm-i386/bitops.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 4f1fda5..42999eb 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -106,8 +106,8 @@ static inline void __clear_bit(int nr, unsigned long *addr)
  * Bit operations are already serializing on x86.
  * These must still be defined here for API completeness.
  */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
+#define smp_mb__before_clear_bit()	do {} while (0)
+#define smp_mb__after_clear_bit()	do {} while (0)
 
 /**
  * __change_bit - Toggle a bit in memory
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 11:53 pm

No. clear_bit is not a compiler barrier on i386, thus smp_mb__before/after


-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 3:34 am

Not so obvious. Why do we require these to be a full compiler barrier
is precisely the question I raised here.

Consider this (the above two functions exist only for clear_bit(),
the atomic variant, as you already know), the _only_ memory reference
we care about is that of the address of the passed bit-string:

(1) The compiler must not optimize / elid it (i.e. we need to disallow
    compiler optimization for that reference) -- but we've already taken
    care of that with the __asm__ __volatile__ and the constraints on
    the memory "addr" operand there, and,
(2) For the i386, it also includes an implicit memory (CPU) barrier
    already.

So I /think/ it makes sense to let the compiler optimize _other_ memory
references across the call to clear_bit(). There's a difference. I think
we'd be safe even if we do this, because the synchronization in callers
must be based upon the _passed bit-string_, otherwise _they_ are the
ones who're buggy.

[ However, elsewhere Jeremy Fitzhardinge mentioned the case of
  some callers, for instance, doing a memset() on an alias of
  the same bit-string. But again, I think that is dodgy/buggy/
  extremely border-line usage on the caller's side itself ...
  *unless* the caller is doing that inside a higher-level lock
  anyway, in which case he wouldn't be needing to use the
  locked variants either ... ]

[ For those interested, I've been looking at the code generated
  for the test kernel I built with these patches, and I don't
  really see anything gcc did that it shouldn't have -- but ok,
  that doesn't mean other versions/toolchains for other setups
  won't. Also, the test box has been up all night, but I'm only
  running Firefox on it anyway, and don't really know how to
  verify if I've introduced any correctness issues / bugs. ]


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 4:20 am

Repeating what has been said before: A CPU memory barrier is not a
compiler barrier or vice versa. Seeing as we are talking about
the compiler barrier, it is irrelevant as to whether or not the

Yes it makes sense to let the compiler move memory operations over
clear_bit(), because we have defined the interface to be nice and
relaxed. And this is exactly why we do need to have an additional

correct output != correct input.

Without a barrier there, we _allow_ the compiler to reorder. If it
does not reorder, the missing barrier is still a bug :)

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 5:21 am

[ Compiler barrier, you mean, that's not true of CPU barriers. ]

In any case, I know that, obviously. I asked "why" not "what" :-) i.e.
why should we care about other addresses / why do we want to extend
the compiler barrier to all memory references -- but Jeremy seems to

I think it is quite relevant, in fact. From Documentation/atomic_ops.txt,
smp_mb__{before,after}_clear_bit(), as the name itself suggests, must
be _CPU barriers_ for those arch's that don't have an implicit
_CPU barrier_ in the clear_bit() itself [ which i386 does have already ].

As for a compiler barrier, the asm there already guarantees the compiler
will not optimize references to _that_ address, but there could still be
the memset()/set{clear}_bit() interspersing pitfalls for example, so yeah
the memory clobber would probably protect us there.


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 6:25 am

For the purpose of this discussion (Linux memory
barrier semantics, on WB memory), it is true of CPU

Obviously because we want some kind of ordering
guarantee at a given point. All the CPU barriers
in the world are useless if the compiler can reorder

One or both of us still fails to understand the other.

bit_spin_lock(LOCK_NR, &amp;word);
var++;
/* this is bit_spin_unlock(LOCK_NR, &amp;word); */
smp_mb__before_clear_bit();
clear_bit(LOCK_NR, &amp;word);

Are you saying that it is OK for the store to var to
be reordered below the clear_bit? If not, what are you
saying?



      Yahoo!7 Mail has just got even bigger and better with unlimited storage on all webmail accounts.
http://au.docs.yahoo.com/mail/unlimitedstorage.html



-
To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 7:10 am

On later Intel processors, if the memory address range being referenced
(and say written to) by the (locked) instruction is in the cache of a
CPU, then it will not assert the LOCK signal on the system bus --
thus not assume exclusive use of shared memory. So other CPUs are free
to modify (other) memory at that point. Cache coherency will still




I might be making a radical turn-around here, but all of a
sudden I think it's actually a good idea to put a complete
memory clobber in set_bit/clear_bit and friends themselves,
and leave the "__" variants as they are.


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 7:32 am

The system bus does not need to be serialised because the CPU already
holds the cacheline in exclusive state. That *is* the cache coherency
protocol.

The memory ordering is enforced by the CPU effectively preventing
speculative loads to pass the locked instruction and ensuring all
stores reach the cache before it is executed. (I say effectively

Why?

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 7:45 am

Looks like when you said "CPU memory barrier extends to all memory
references" you were probably referring to a _given_ CPU ... yes,

Well, why not. Callers who don't want/need any guarantees whatsoever
can still use __foo() -- for others, it makes sense to just use
foo() and get *both* the compiler and CPU barrier semantics -- I think
that's the behaviour most callers would want anyway.


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 1:12 pm

No. CPU memory barriers extend to all CPU's. End of discussion.

It's not about "that cacheline". The whole *point* of a CPU memory barrier 
is that it's about independent memory accesses.

Yes, for a memory barrier to be effective, all CPU's involved in the 
transaction have to have the barriers - the same way a lock needs to be 
taken by everybody in order for it to make sense - but the point is, CPU 
barriers are about *global* behaviour, not local ones.

So there's a *huge* difference between

	clear_bit(x,y);

and

	clear_bit(x,y);
	smp_mb__before_after_clear_bit();

and it has absolutely nothing to do with the particular cacheline that "y" 
is in, it's about the *global* memory ordering.

Any write you do after that "smp_mb__before_after_clear_bit()" will be 
guaranteed to be visible to _other_ CPU's *after* they have seen the bit 
being cleared. Yes, those other CPU's need to have a read barrier between 
reading the bit and reading some other thign, but the point is, this hass 
*nothing* to do with cache coherency, and the particular cache line that 
"y" is in.

And no, "smp_mb__before/after_clear_bit()" must *not* be just an empty "do 
{} while (0)". It needs to be a compiler barrier even when it has no 
actual CPU meaning, unless clear_bit() itself is guaranteed to be a 
compiler barrier (which it isn't, although the "volatile" on the asm in 
practice makes it something *close* to that).

Why? Think of the sequence like this:

	clear_bit(x,y);
	smp_mb__after_clear_bit();
	other_variable = 10;

the whole *point* of this sequence is that if another CPU does

	x = other_variable;
	smp_rmb();
	bit = test_bit(x,y)

then if it sees "x" being 10, then the bit *has* to be clear.

And this is why the compiler barrier in "smp_mb__after_clear_bit()" needs 
to be a compiler barrier:

 - it doesn't matter for the action of the "clear_bit()" itself: that one 
   is locked, and on x86 it thus also happens to be a serializing 
   instruction, and the cache cohe...
To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 3:01 pm

Ok, thanks much (David and Nick too!) for taking the time to explain this
out clearly. It does look amazingly obvious now that I see it, with
callers using bitops to implement locking schemes who would completely
break otherwise -- in fact 6 and 8 in this series look amazingly obtuse
now ...


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 8:01 am

Firstly, __foo() is non-atomic, secondly set_bit/clear_bit/etc do not
provide any CPU or compiler semantics.

It was set up this way because other CPU ISAs don't couple atomicity
with memory ordering, and with many implementations of those, the cost
to order memory is quite high.

-- 
SUSE Labs, Novell Inc.
-
To: Satyam Sharma <ssatyam@...>
Cc: Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 3:48 am

You miss my point.  If you have:

	memset(&amp;my_bitmask, 0, sizeof(my_bitmask));
	set_bit(my_bitmask, 44);

Then unless the set_bit has a constraint argument which covers the whole
of the (multiword) bitmask, the compiler may see fit to interleave the
memset writes with the set_bit in bad ways.  In other words, plain "+m"
(*(long *)ptr) won't cut it.  You'd need "+m" (my_bitmask), I think.

    J
-
To: Jeremy Fitzhardinge <jeremy@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 4:31 am

That's a valid point, and looks like it is a bug in the existing i386
macros, doesn't it? We should be clobbering addr + BITOP_WORD(nr).

-- 
SUSE Labs, Novell Inc.
-
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: David Howells <dhowells@...>, Nick Piggin <nickpiggin@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 12:05 pm

From: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;

[6/8] i386: bitops: Don't mark memory as clobbered unnecessarily

The goal is to let gcc generate good, beautiful, optimized code.

But test_and_set_bit, test_and_clear_bit, __test_and_change_bit,
and test_and_change_bit unnecessarily mark all of memory as clobbered,
thereby preventing gcc from doing perfectly valid optimizations.

The case of __test_and_change_bit() is particularly surprising, given
that it's a variant where we don't make any guarantees at all.

Even for the other three cases, the (only) instruction that accesses
shared memory is btsl/btrl/btcl and requires locking and atomicity.
But we handle that properly already by the use of the lock prefix.
Also, "=m" is already specified in the output operands of the asm
for the passed bit-string pointer, so the gcc optimizer knows what
to do and what not to do anyway.

So there's no point clobbering all of memory in these functions.

Signed-off-by: Satyam Sharma &lt;ssatyam@cse.iitk.ac.in&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Nick Piggin &lt;nickpiggin@yahoo.com.au&gt;

---

 include/asm-i386/bitops.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 0f5634b..f37b8a2 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -164,7 +164,7 @@ static inline int test_and_set_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
-		:"r" (nr) : "memory");
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -208,7 +208,7 @@ static inline int test_and_clear_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
-		:"r" (nr) : "memory");
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -254,7 +254,7 @@ static inline int __test_and_change_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__(
 		"btcl %2,%1\n\t...
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, July 23, 2007 - 11:57 pm

__test_and_change_bit is one that you could remove the memory clobber


-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 2:38 am

Yes, for the atomic versions we don't care if we're asking gcc to
generate trashy code (even though I'd have wanted to only disallow
problematic optimizations -- ones involving the passed bit-string
address -- there, and allow other memory references to be optimized
as and how the compiler feels like it) because the atomic variants
are slow anyway and we probably want to be extra-safe there.

But for the non-atomic variants, it does make sense to remove the
memory clobber (and the unneeded __asm__ __volatile__ that another
patch did -- for the non-atomic variants, again).

OTOH, as per Linus' review it seems we can drop the "memory" clobber
and specify the output operand for the extended asm as "+m". But I
must admit I didn't quite understand that at all.

[ I should probably start reading gcc sources, the docs are said to
  be insufficient/out-of-date, as per the reviews of the patches. ]


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 5:44 am

As I understand it, the "+m" indicates to the compiler a restriction on the
ordering of things that access that particular memory location, whereas a
"memory" indicates a restriction on the orderings of all accesses to memory -
precisely what you need to produce a lock.

There are a number of things that use test_and_set_bit() and co to implement a
lock or other synchronisation.  This means that they must exhibit LOCK-class
barrier effects or better.  LOCK-class barrier effects mean, more or less,
that all memory accesses issued before the lock must happen before all memory
accesses issued after the lock.  But it most happen at both CPU-level and
compiler-level.  The "memory" constraint instructs the compiler in this
regard.

Remember also that this is gcc black magic and so some of it has had to be
worked out empirically - possibly after sacrificing a goat under a full moon.

David
-
To: David Howells <dhowells@...>
Cc: Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 6:02 am

Hi David,


Ok, thanks -- I didn't know gcc's behaviour w.r.t. "+m" at all, but in my

Yes, thanks for laying this out so clearly, again. So combined with what
you explained above, I think I now fully understand why most of this


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 3:24 am

No. It has nothing to do with atomicity and all to do with ordering.
For example test_bit, clear_bit, set_bit, etc are all atomic but none
place any restrictions on ordering.

__test_and_change_bit has no restriction on ordering, so as long as
the correct operands are clobbered, a "memory" clobber to enforce a

Yes, but that is for cases where "memory" is being used to say that
some otherwise unspecified memory is actually being changed, rather
than to provide a compiler barrier as is the case for test_and_set_bit

You should read Documentation/atomic_ops.txt and memory-barriers.txt,
which are quite useful and should be uptodate.

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 24, 2007 - 4:38 am

Speaking of that, why are all the asm functions in arch/i386/lib/string.c
defined as having a memory clobber, even those which don't modify memory
like strcmp, strchr, strlen and so on?

Why are they all volatile too?  Even those with no side effects.

It seems like any asm which writes to or _reads from_ memory that isn't one
of the operands must have a memory clobber.  And any asm with side effects
other than it's operands (and asm("mov $0, (%0)" :  :  "r"(some_pointer))
has side effects) must be volatile.

So something like strlen() needs to have a memory clobber, otherwise a
store to the string could be moved to the other side of the strlen(), but
doesn't need to be volatile, since it has no side effects.
-
To: Trent Piepho <xyzzy@...>
Cc: Nick Piggin <nickpiggin@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 3:39 pm

That's because the memory clobber will serialize the inline asm with 
anything else that reads or writes memory.

So even if we don't actually change any memory, if we cannot describe what 
we *read*, then we need to tell gcc to not re-order us wrt things that 
could *write*. And the simplest way to do that is to say that you clobber 
memory, even if you don't.

So as a more concrete example: imagine that we're doing a "strlen()" on 
some local variable. We need to tell gcc that that variable has to be 
updated in memory before it schedules the asm. The memory clobber does 
that.

(Yes, the "asm volatile" may do so too, but it's very unclear what the 
"volatile" on the asm actually does, so ..)

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <nickpiggin@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Wednesday, July 25, 2007 - 9:07 pm

I went a made a test suite to see what really happened, and this isn't how
it works.  It appears that a memory clobber only tells gcc that the asm
writes to memory.  It does _not_ tell gcc that the asm reads from memory.

It's at http://www.speakeasy.org/~xyzzy/download/opttest.tar.gz
It's only 3k, but there are 16 files so I'm not inlining it.

The suite has a few test c files, which are compiled with various different
functions, inline norm asm, inline volatile asm, inline asm with a memory
clobber, a normal function, a __attribute__((const)) function, and so on.

They are compiled to asm file, and then a perl script scans the asm files to
figure out what optimizations gcc made.

"make test" will compile all the tests and run them through the perl scripts.
"make test1" will just run test1, etc.

It appears that a normal asm, one without volatile or a memory clobber, is
treated like a const function, which returns the output via a struct (not
using pass-by-address).  It has no side-effects, can't read or write global
variables, and can't dereference pointer arguments.

Adding volatile tells gcc the asm has some hidden side-effect.  It still can't
r/w globals or dereference inputs.  But it won't get elided if there are no
used outputs, common subexpression merged, or treated as a loop invariant.

Adding a memory clobber tells gcc that the asm modifies memory.  It doesn't
modify un-aliased local variables in registers.  It does modify aliased local
variables.  It does not read from memory.  gcc will move or elide a memory
write before an asm with a memory clobber if nothing else (besides the asm)
could see the write.  A memory clobber doesn't count as a side-effect either,
a non-volatile asm without unused outputs will be elided, even if has a memory
clobber.

Specifically, check test6_memasm.s.  The C code looks like this:

extern int a; /* keep asm from being elided for having no used output */
static inline void bar(void) { asm("call bar" : "=m"(a) : : "memory"); }
/* flo...
To: Trent Piepho <xyzzy@...>
Cc: Nick Piggin <nickpiggin@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Wednesday, July 25, 2007 - 9:18 pm

Hmm. I really think you should take this up with the gcc people. That 
looks like a gcc bug - because there really is nothing that guarantees 
that the asm doesn't change the array that "x" points to, and the asm 

I can't explain it. I do think you've found a gcc bug.

That said, the kernel mostly uses "asm volatile()" _together_ with a 
memory clobber for these kinds of things, so it sounds like the kernel 
wouldn't be impacted. But you're definitely right - the above report makes 

Well, that's likely just a subtle register allocation issue, and 
understandable. Generating perfect code is impossible, you want to 
generate good code on average.

		Linus
-
To: Trent Piepho <xyzzy@...>
Cc: Nick Piggin <nickpiggin@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>
Date: Wednesday, July 25, 2007 - 9:22 pm

Actually, I take that back. I think gcc does the right thing, and yes, 
it's explained by the memory clobber being just a blind "write to memory" 
rather than read memory. My bad.

It does leave us with very few ways of saying that an asm can *read* 
memory, and so it might be good to have it clarified that "volatile" 
implies that (at least with the memory clobber).

Your examples are good, I think.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Trent Piepho <xyzzy@...>, Nick Piggin <nickpiggin@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 4:37 pm

Without the volatile they get completely optimized away :/
[tried that, cost a lot of debugging time -- empty string functions
cause a lot of strange side effects]

-Andi
-
To: Andi Kleen <andi@...>
Cc: Trent Piepho <xyzzy@...>, Nick Piggin <nickpiggin@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 4:08 pm

Sure, that's *one* thing that "volatile" guarantees (it guarantees that 
gcc won't optimize away things where the end result isn't actually visibly 
used).

But gcc docs also talk about the other things volatile means, including 
"not significantly moved".

Is that "not significantly moved" already equivalent to "not moved past 
something that can change memory"? Probably not. Which is why it's a good 
idea to have the "memory" clobber there too, because the semantics of the 
"volatile" just aren't very well defined from a code movement standpoint 
(while a general memory clobber is *very* clear: if gcc moves the inline 
asm across another thing that uses/changes memory, that's a gcc bug, plain 
and simple (*)).

		Linus

(*) Other than pure internal gcc spills/reloads. Spilling/reloading local 
registers that haven't had their address taken obviously cannot be seen as 
memory accesses.
-
To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Trent Piepho <xyzzy@...>, Nick Piggin <nickpiggin@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 5:31 pm

Actually, it doesn't.  In fact it goes out of its way to say that "asm
volatile" statements can be moved quite a bit, with respect to other
asms, other code, jumps, basic blocks, etc.  The only reliable way to
control the placement of an asm is have the right dependencies.

    The `volatile' keyword indicates that the instruction has important
    side-effects.  GCC will not delete a volatile `asm' if it is reachable.
    (The instruction can still be deleted if GCC can prove that
    control-flow will never reach the location of the instruction.)  Note
    that even a volatile `asm' instruction can be moved relative to other
    code, including across jump instructions.

also:

    An `asm' instruction without any output operands will be treated
    identically to a volatile `asm' instruction.

So there isn't anything very special about "asm volatile".  It's purely
to stop apparently useless code from being removed.

    J
-
To: Jeremy Fitzhardinge <jeremy@...>
Cc: Andi Kleen <andi@...>, Trent Piepho <xyzzy@...>, Nick Piggin <nickpiggin@...>, Satyam Sharma <ssatyam@...>, Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andrew Morton <akpm@...>
Date: Tuesday, July 24, 2007 - 5:46 pm

Ahh. That's newer.

Historically, gcc manuals used to say "may not be deleted or significantly 
reordered".

So they've weakened what it means, probably exactly because it wasn't 
well-defined before either.

		Linus
-
To: Nick Piggin <nickpiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Tuesday, July 24, 2007 - 4:29 am

The memory clobber, or the volatile asm? There's more than one variable
here ... but still, I don't think either affects _ordering_ in any


But why even for the other operations? Consider (current code of)
test_and_set_bit():

static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
{
	int oldbit;
	__asm__ __volatile__( LOCK_PREFIX
		"btsl %2,%1\n\tsbbl %0,%0"
		:"=r" (oldbit),"+m" (ADDR)
		:"Ir" (nr) : "memory");

	return oldbit;
}

The only memory reference in there is to the passed address, it will
be modified, yes, but that's been made obvious to gcc in the asm
already. So why are we marking all of memory as clobbered, is the
question. (I just read Jeremy's latest reply, but I don't see how
or why the memory clobber helps that case either -- does a memory


I have, of course. Will probably read again, thanks.


Satyam
-
To: Satyam Sharma <ssatyam@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Howells <dhowells@...>, Andi Kleen <ak@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Tuesday, July 24, 2007 - 4:39 am

Of course, because then the compiler can't assume anything about
the contents of memory after the operation.

   #define barrier() __asm__ __v