In the continuing discussion about how GCC treats the volatile keyword, Linus Torvalds noted, "I just have a strong suspicion that 'volatile' performance is so low down the list of any C compiler persons interest, that it's never going to happen. And quite frankly, I cannot blame the gcc guys for it." He went on to explain, "that's especially as 'volatile' really isn't a very good feature of the C language, and is likely to get *less* interesting rather than more (as user space starts to be more and more threaded, 'volatile' gets less and less useful."
"So I wouldn't expect 'volatile' to ever really generate better code. It might happen as a side effect of other improvements (eg, I might hope that the SSA work would eventually lead to gcc having a much better defined model of valid optimizations, and maybe better code generation for volatile accesses fall out cleanly out of that), but in the end, it's such an ugly special case in C, and so seldom used, that I wouldn't depend on it."
"Quite frankly, I'd like there to be more competition in the open source compiler game, and that might cause some upheavals, but on the whole, gcc actually does a pretty damn good job."
From: Linus Torvalds [email blocked]
Subject: Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Date: Fri, 17 Aug 2007 21:13:35 -0700 (PDT)

On Sat, 18 Aug 2007, Satyam Sharma wrote:
>
> No code does (or would do, or should do):
>
>     x.counter++;
>
> on an "atomic_t x;" anyway.

That's just an example of a general problem.

No, you don't use "x.counter++". But you *do* use

    if (atomic_read(&x) <= 1)

and loading into a register is stupid and pointless, when you could just
do it as a regular memory-operand to the cmp instruction.

And as far as the compiler is concerned, the problem is the 100% same:
combining operations with the volatile memop.

The fact is, a compiler that thinks that

    movl mem,reg
    cmpl $val,reg

is any better than

    cmpl $val,mem

is just not a very good compiler. But when talking about "volatile",
that's exactly what you always get (and always have gotten - this is
not a regression, and I doubt gcc is alone in this).

Linus
From: Denys Vlasenko <vda.linux@googlemail.com>
Subject: Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Date: Fri, 24 Aug 2007 13:19:04 +0100

On Saturday 18 August 2007 05:13, Linus Torvalds wrote:
> On Sat, 18 Aug 2007, Satyam Sharma wrote:
> > No code does (or would do, or should do):
> >
> >     x.counter++;
> >
> > on an "atomic_t x;" anyway.
>
> That's just an example of a general problem.
>
> No, you don't use "x.counter++". But you *do* use
>
>     if (atomic_read(&x) <= 1)
>
> and loading into a register is stupid and pointless, when you could just
> do it as a regular memory-operand to the cmp instruction.

It doesn't mean that (volatile int*) cast is bad, it means that current gcc
is bad (or "not good enough"). IOW: instead of avoiding volatile cast,
it's better to fix the compiler.

> And as far as the compiler is concerned, the problem is the 100% same:
> combining operations with the volatile memop.
>
> The fact is, a compiler that thinks that
>
>     movl mem,reg
>     cmpl $val,reg
>
> is any better than
>
>     cmpl $val,mem
>
> is just not a very good compiler.

Linus, in all honesty gcc has many more cases of suboptimal code,
case of "volatile" is just one of many.

Off the top of my head:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417

unsigned v;
void f(unsigned A) { v = ((unsigned long long)A) * 365384439 >> (27+32); }

gcc-4.1.1 -S -Os -fomit-frame-pointer t.c

f:
        movl    $365384439, %eax
        mull    4(%esp)
        movl    %edx, %eax <===== ?
        shrl    $27, %eax
        movl    %eax, v
        ret

Why is it moving %edx to %eax?

gcc-4.2.1 -S -Os -fomit-frame-pointer t.c

f:
        movl    $365384439, %eax
        mull    4(%esp)
        movl    %edx, %eax <===== ?
        xorl    %edx, %edx <===== ??!
        shrl    $27, %eax
        movl    %eax, v
        ret

Progress... Now we also zero out %edx afterwards for no apparent reason.
--
vda
From: Linus Torvalds [email blocked]
Subject: Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Date: Fri, 24 Aug 2007 10:19:50 -0700 (PDT)

On Fri, 24 Aug 2007, Denys Vlasenko wrote:
>
> > No, you don't use "x.counter++". But you *do* use
> >
> >     if (atomic_read(&x) <= 1)
> >
> > and loading into a register is stupid and pointless, when you could just
> > do it as a regular memory-operand to the cmp instruction.
>
> It doesn't mean that (volatile int*) cast is bad, it means that current gcc
> is bad (or "not good enough"). IOW: instead of avoiding volatile cast,
> it's better to fix the compiler.

I would agree that fixing the compiler in this case would be a good thing,
even quite regardless of any "atomic_read()" discussion.

I just have a strong suspicion that "volatile" performance is so low down
the list of any C compiler persons interest, that it's never going to
happen. And quite frankly, I cannot blame the gcc guys for it.

That's especially as "volatile" really isn't a very good feature of the C
language, and is likely to get *less* interesting rather than more (as
user space starts to be more and more threaded, "volatile" gets less and
less useful.

[ Ie, currently, I think you can validly use "volatile" in a "sigatomic_t"
  kind of way, where there is a single thread, but with asynchronous
  events. In that kind of situation, I think it's probably useful. But
  once you get multiple threads, it gets pointless.

  Sure: you could use "volatile" together with something like Dekker's or
  Peterson's algorithm that doesn't depend on cache coherency (that's
  basically what the C "volatile" keyword approximates: not atomic
  accesses, but *uncached* accesses! But let's face it, that's way past
  insane. ]

So I wouldn't expect "volatile" to ever really generate better code. It
might happen as a side effect of other improvements (eg, I might hope that
the SSA work would eventually lead to gcc having a much better defined
model of valid optimizations, and maybe better code generation for
volatile accesses fall out cleanly out of that), but in the end, it's such
an ugly special case in C, and so seldom used, that I wouldn't depend on
it.

> Linus, in all honesty gcc has many more cases of suboptimal code,
> case of "volatile" is just one of many.

Well, the thing is, quite often, many of those "suboptimal code"
generations fall into two distinct classes:

 - complex C code. I can't really blame the compiler too much for this.
   Some things are *hard* to optimize, and for various scalability
   reasons, you often end up having limits in the compiler where it
   doesn't even _try_ doing certain optimizations if you have excessive
   complexity.

 - bad register allocation. Register allocation really is hard, and
   sometimes gcc just does the "obviously wrong" thing, and you end up
   having totally unnecessary spills.

> Off the top of my head:

Yes, "unsigned long long" with x86 has always generated atrocious code. In
fact, I would say that historically it was really *really* bad. These
days, gcc actually does a pretty good job, but I'm not surprised that it's
still quite possible to find cases where it did some optimization (in this
case, apparently noticing that "shift by >= 32 bits" causes the low
register to be pointless) and then missed *another* optimization (better
register use) because that optimization had been done *before* the first
optimization was done.

That's a *classic* example of compiler code generation issues, and quite
frankly, I think that's very different from the issue of "volatile".

Quite frankly, I'd like there to be more competition in the open source
compiler game, and that might cause some upheavals, but on the whole, gcc
actually does a pretty damn good job.

Linus
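To make the atomic_read() complaint concrete, here is a minimal sketch of the two ways such a read could be written. The names my_atomic_t, my_atomic_read_volatile, my_atomic_read_plain and is_last_reference are hypothetical stand-ins, not the kernel's definitions. The volatile variant is the form the emails argue about: the compiler must emit exactly one load per call, and per Linus's description GCC of that era did it as a separate load into a register instead of folding it into the compare.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's atomic_t; not the real definition. */
typedef struct { int counter; } my_atomic_t;

/* Variant 1: a volatile-qualified access, so every call performs a real load. */
static inline int my_atomic_read_volatile(const my_atomic_t *v)
{
    assert(v != NULL);
    return *(const volatile int *)&v->counter;
}

/* Variant 2: a plain load. The compiler may fold it into a compare
   instruction, but it may also reuse a value it loaded earlier. */
static inline int my_atomic_read_plain(const my_atomic_t *v)
{
    assert(v != NULL);
    return v->counter;
}

/* The pattern from the thread: compare the counter against a small constant. */
int is_last_reference(const my_atomic_t *x)
{
    return my_atomic_read_volatile(x) <= 1;
}
```

Comparing the assembly for the two variants (for example with gcc -S -O2) is one way to see what a given compiler version actually generates; the result is compiler-dependent.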
Quite frankly...
...Linus is quite frank, quite frankly.
--
Program Intellivision and play Space Patrol!
...my dear, I don't give a damn
Sorry, I couldn't resist. ;-)
Frankly I would argue they
Frankly I would argue they have been abusing volatile. I would reject, and have rejected, code which uses volatile the way Linus describes. There are good uses for volatile, but I argue they found the perfect examples of when you should not be using it. And by his own admission they've been using it to hide buggy/crappy code via volatile side effects. Gosh, what a surprise that crappy code, used to hide more crappy/buggy code, results in crappy performance. Who else is shocked?
Long story short...they are finally removing crappy code which should have never been in there in the first place. This in turn may result in additional crappy/buggy code being identified which may ultimately result in yet a more stable kernel with a performance boost on the side.
This should have been done years ago; the code should simply never have been allowed in the first place. Shame on him/them.
Totally agree
There definitely needs to be more competition in the open source compiler space. LLVM was promising but doesn't seem to be getting much uptake. Especially since they use gcc+rtl as a front end.
I want more competition in the open source C++ compiler world. Should I start writing a compiler?
C++ is the language I would least want to write a compiler for...
LLVM is getting a new C/C++
LLVM is getting a new C/C++ frontend written from scratch (clang): http://llvm.org/pubs/2007-07-25-LLVM-2.0-and-Beyond.pdf.
LLVM is a promising compiler infrastructure due to its nicely structured C++ OO code base, which is why I think it's one of the strong competitors for GCC in the future. We'll see...
Are there Debian (PPC/AMD64)
Are there Debian (PPC/AMD64) packages?
http://packages.debian.org/ll
http://packages.debian.org/llvm
Looks like amd64, but no ppc (yet?). (And apparently only in unstable.)
Linus has really missed the
Linus has really missed the point on this. volatile was not made for smp concurrency issues, it was made for access to device memory. Here you have to be even more careful because you may have reads and writes with side-effects.
As far as SMP concurrency goes, it is important to use volatile to avoid caching results in a register. Consider a spinlock implementation in C, using asm for the atomics. If the atomic acquire fails and the compiler caches the subsequent owned value, you'd spin forever waiting for the value in the register to change. It's not going to change.
There are other examples where volatile is required for correctness. He's being short-sighted. It doesn't affect the few trivial cases he's come up with, so in typical Linus fashion he denounces it as useless and moves on.
But it doesn't work.
Volatile may have been intended for access to memory mapped device memory, but it doesn't actually work on modern hardware. You need to use mechanisms outside the language to ensure memory accesses stay in the right order and get committed to the hardware. This is where things like read and write barriers come in handy, as well as explicit cache controls and/or page attributes.
As for spinlocks: Volatile matters there only if you're writing the spinlock within the language, and with modern hardware that's rarely possible. If you bump out to assembly code (as you suggested), there's nothing for the compiler to consider here. You've already dictated the instruction sequence.
Sure, there are places where volatile is useful, but most of the time that's superseded by other mechanisms, such as language extensions and inline assembler. It's easy to imagine replacing C's volatile with some other mechanisms that would be more precise and have clearer semantics. I really think the C committee should consider doing so.
And I happen to agree, it's perfectly acceptable for compiler guys to hold their nose around volatile and not optimize it aggressively. It simply isn't useful in 99.44% of the code out there.
Other barriers are important
Other barriers are important to prevent the processor from re-ordering memory accesses. Volatile is important to prevent the compiler from doing so. That's the fundamental capability it provides. Memory barriers and the like will only prevent the compiler from reordering as a side-effect of inline assembly or a function call. On systems that don't require extra effort to ensure that the processor orders correctly, the compiler must still be instructed to do so.
There is no reason that it's not possible to write spinlocks in C so long as you have inline asm for atomic operations. I find it preferable.
volatile needs only to prevent the compiler from re-ordering or caching. I don't think the standard is so ambiguous, I think it's simply not implemented efficiently in gcc. It doesn't need to be replaced with another mechanism. baby. bathwater.
The only real efficiency lost by gcc is a few extra emitted instructions, which of course add up to a lot over a whole kernel. However, the actual runtime differences are likely quite negligible.
But then it's redundant
There is no good reason that the barriers that prevent the processor from reordering accesses shouldn't prevent the compiler from doing the same. Constraints on processor memory ordering are stronger than constraints on compiler memory ordering. It's not even sensible to constrain the processor's memory ordering if the compiler isn't likewise constrained.
These memory barriers currently exist outside the language. If they were in the language, then you would not need volatile. Volatile is redundant when stronger and more precise ordering primitives are available. It's a non-sequitur that simpler processors don't need stronger barriers. If the order of memory accesses matter, it is useful to describe the constraints to the compiler and the hardware. Just because the current version of the compiler and the CPU don't reorder a particular program doesn't make that program "right." A later version of the compiler and/or CPU may "break" that program.
Inline assembly is not C. What's the C keyword for atomic accesses? Oh, wait, there isn't any. You have to escape the language.
You're beating the same
You're beating the same drum. Yes, I understand inline assembly acts to constrain re-ordering by the compiler. However, there are cases where you need to enforce ordering where there is no inline assembly. Even Linus produced some strawman examples.
The two issues are orthogonal. If you want to slow down your code by using memory barriers where none are required, carry on.
What I would prefer
Since volatile has such ugly semantics, I personally would prefer to see order-dependent memory accesses performed via intrinsic functions. This makes it very clear what accesses there are and where they are. (See the "a = b = c" discussion on the other volatile thread to see just how crazy it can get.)
I'd like something along the lines of:
unsigned int volatile_read(unsigned int *ptr, ...)
void volatile_write(unsigned int *ptr, unsigned int data, ...)

The varargs would allow the programmer to indicate what other accesses this access must be ordered strongly with respect to. For example:
/* set up something in a peripheral. These writes can occur in any order. */
volatile_write(&periph[0], 0xABCD);
volatile_write(&periph[1], 0xDEFA);
volatile_write(&periph[2], 0x1234);
/* the last write needs to happen after the above three. */
volatile_write(&periph[3], 0x5555, &periph[0], &periph[1], &periph[2]);

Now the compiler can schedule things however it likes. For simple code like this example, there may be little advantage. For more complex code, it could be a benefit. I see this a lot w/ device driver code on our statically scheduled VLIW architecture at work.
The volatile_read and volatile_write functions wouldn't have to be actual function calls (though they could be implemented that way). They could be as efficient as pointer dereferencing.
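A rough sketch of how such accessors might look in plain C, under one big assumption: the variadic ordering arguments are dropped, because standard C offers no way to express "order this access only against those addresses". The function names follow the proposal above; the periph array is an ordinary array standing in for a device's register block, and setup_periph is a hypothetical caller.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified versions of the proposed accessors. The variadic ordering
   arguments from the proposal are omitted (an assumption of this sketch). */
static inline unsigned int volatile_read(unsigned int *ptr)
{
    assert(ptr != NULL);
    return *(volatile unsigned int *)ptr;
}

static inline void volatile_write(unsigned int *ptr, unsigned int data)
{
    assert(ptr != NULL);
    *(volatile unsigned int *)ptr = data;
}

/* Ordinary memory standing in for a peripheral's registers. */
static unsigned int periph[4];

/* Hypothetical setup routine mirroring the example in the comment. */
void setup_periph(void)
{
    volatile_write(&periph[0], 0xABCD);
    volatile_write(&periph[1], 0xDEFA);
    volatile_write(&periph[2], 0x1234);
    /* Intended to happen after the three writes above. */
    volatile_write(&periph[3], 0x5555);
}
```

Because every access here is volatile, the compiler has to keep all four writes in program order. That is stricter than the proposal, which would let the first three be scheduled freely, and it says nothing about what the hardware does with the writes afterwards.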
The kernel has read*, write* for reading and writing to IO
extern u8 readb(const volatile void __iomem *addr);
extern u16 readw(const volatile void __iomem *addr);
extern u32 readl(const volatile void __iomem *addr);
extern u64 readq(const volatile void __iomem *addr);
extern void writeb(u8 b, volatile void __iomem *addr);
extern void writew(u16 b, volatile void __iomem *addr);
extern void writel(u32 b, volatile void __iomem *addr);
extern void writeq(u64 b, volatile void __iomem *addr);
Of course they use volatile, but volatile isn't needed anywhere else.
Cool!
Now we just need mechanisms like this to become part of the C standard. :-P
As far as SMP concurrency
As far as SMP concurrency goes, it is important to use volatile to avoid caching results in a register.
volatile is irrelevant in this case because even if the compiler generates the read/write, there's no guarantee that the CPUs will actually read/write to/from main memory. What you need is memory barriers and -- guess what -- they also have the side effect of letting the compiler know that this var may have changed since the last time you checked. IIRC Linus stated there were only 2 acceptable uses of volatile in the kernel:
1) To access hardware registers (and those uses are wrapped in internal API functions, so nobody should use volatile directly for that)
2) The jiffies variable, which is only this way for historical reasons.
That is incorrect. If the
That is incorrect. If the program issues a read from memory into a register, the value reflects what is in memory at that instant on any cache-coherent processor. The processor can't neglect to perform the read. That's ridiculous. If the read is cached it fetches it from cache. If not it may fetch it from a remote node's cache or memory depending on the coherency design.
The compiler can neglect to perform the read if it believes the results can not have changed. The compiler has to assume any function call may have modified the value. It has to assume that assembly could've done something it doesn't understand as well. Otherwise, reading a value in a loop requires no extra synchronization but the read must be emitted if it is to work.
Memory barriers also don't force access to memory, they order access to memory.
Consider this loop in x86_64 assembly:

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
    asm volatile(
        "\n1:\t"
        LOCK_PREFIX " ; decl %0\n\t"
        "jns 2f\n"
        "3:\n"
        "rep;nop\n\t"
        "cmpl $0,%0\n\t"
        "jle 3b\n\t"
        "jmp 1b\n"
        "2:\t" : "=m" (lock->slock) : : "memory");
}

This is the basic spinlock routine from Linux. This could be rewritten in C with the made-up atomic_dec inline as:

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
    for (;;) {
        if (atomic_dec_long(&lock->slock) == 0)
            break;
        while (lock->slock != 0);
    }
}

This C version would hang without volatile on the slock. Clearly here the compiler can cache the value of slock which it believes can not change. Introducing the pause instruction (spelled "rep;nop;" in Linux) into the loop would make it work only as a side-effect of unrelated inline assembly.
There are other perfectly valid constructions which can only work with the proper use of volatile. It sounds as if it has been abused by linux kernel developers. That is unfortunate. That does not mean it is without use.
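The two ways of keeping the compiler from caching a polled value, as argued over in this thread, can be sketched side by side. Everything here is illustrative: ready_flag, poll_with_volatile_cast and poll_with_barrier are hypothetical names, the loops are bounded so the example terminates, and compiler_barrier() relies on the GCC/Clang extended asm extension. Neither variant is a CPU memory barrier; both only constrain the compiler.

```c
#include <assert.h>

/* GCC/Clang extension: an empty asm statement with a "memory" clobber.
   It emits no instructions but makes the compiler assume memory changed. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical flag that another thread or an interrupt handler would set.
   Deliberately NOT declared volatile. */
static int ready_flag;

/* Option 1: make only the polling access volatile, at the point of use.
   Returns the number of failed polls before the flag was seen, or -1. */
int poll_with_volatile_cast(int max_polls)
{
    assert(max_polls >= 0);
    for (int n = 0; n < max_polls; n++) {
        if (*(volatile int *)&ready_flag)
            return n;
    }
    return -1;
}

/* Option 2: a plain load, with a compiler barrier on each iteration so the
   compiler has to reload ready_flag instead of reusing a register copy. */
int poll_with_barrier(int max_polls)
{
    assert(max_polls >= 0);
    for (int n = 0; n < max_polls; n++) {
        if (ready_flag)
            return n;
        compiler_barrier();
    }
    return -1;
}
```

Without either measure, an optimizing compiler is allowed to load ready_flag once before the loop, which is the hang described above.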
Where's your barriers?
Your spinlock is potentially broken unless you're sure atomic_dec_long implies a barrier. At least I'm pretty sure the LOCK_PREFIX on x86-64 does.
That style of code doesn't work correctly in the general case. Linus gave a great example, and it applies to your spinlock code above too:
x86_64 only re-orders loads
x86_64 only re-orders loads with respect to other loads. It will not reorder stores. It will not reorder loads with stores. The lock prefix does prevent loads from being reordered. It also holds the cache line in the exclusive state for the duration of the operation. x86_64 and i386 have very strict memory models. For the most part, explicit memory barriers are only needed when used with uncached memory types. Otherwise lfence is only necessary in cases where a lock prefix is not used.
'that style of code' is exactly the linux spinlock implementation. You don't use an atomic or a barrier for every spin loop. Only those where you try to acquire the lock. While you're testing to see if it's unowned, you simply need to generate a load from memory. Which, incidentally, has the side-effect of forcing the cache-line into the shared state on all processors which have this line cached and generating extra bus traffic.
The 'general case' is actually a straw-man argument that Linus has proposed to support his opinion. The case Linus has produced is obviously broken, but the Linux spinlock implementation relies on exactly the same behavior. It just happens to be implemented in assembly so it avoids the compiler's use of atomics. The code isn't buggy in the presence of atomics if there is some external guarantee that prevents an endless loop, or otherwise a reasonable loop termination condition (like the lock becoming unowned).
Linux strives to support the Alpha model, not the i386 model
Looking at the i386 / x86-64 spinlock implementation is misleading, because that's not the model Linux strives to support throughout the kernel. Linux actually strives to support the Alpha model, in which memory accesses are aggressively reordered.
Sure, the spinlock implementation above might be ok for i386 and x86-64, but that's why it's in the arch-specific directory, and not sprinkled around the arch-independent portions of the tree. If you try to implement Linus' example of buggy code, your code may end up being just fine on your PC and break when compiled for another architecture.
There is a fine document in the Linux Documentation directory that explains the SMP ordering model the general kernel code expects. This is no strawman. It was developed based on what various CPUs do to speed up their memory systems when they don't have the legacy associated with x86. SPARC, Alpha, PowerPC, MIPS and so on all have much more aggressive memory systems than x86-64 and i386, and for good reason.
Take a look at Alpha's spinlock implementation:
static inline void __raw_spin_unlock(raw_spinlock_t * lock)
{
    mb();
    lock->lock = 0;
}

static inline void __raw_spin_lock(raw_spinlock_t * lock)
{
    long tmp;

    __asm__ __volatile__(
    "1: ldl_l %0,%1\n"
    " bne %0,2f\n"
    " lda %0,1\n"
    " stl_c %0,%1\n"
    " beq %0,2f\n"
    " mb\n"
    ".subsection 2\n"
    "2: ldl %0,%1\n"
    " bne %0,2b\n"
    " br 1b\n"
    ".previous"
    : "=&r" (tmp), "=m" (lock->lock)
    : "m"(lock->lock) : "memory");
}

What's that I spy there? mb, our good friend the memory barrier. If I read the section-directive shenanigans correctly, mb is the last instruction of the lock acquisition, and it's quite rightfully the first instruction of the lock release.
(For the uninitiated: The ldl_l and stl_c are "load-link" and "store-conditional." They're a decoupled form of atomic update. I'm pretty sure the "beq" goes back to the spin if the stl_c didn't update memory.)
At any rate, it's worth noting that the LL/SC pair aren't a barrier on Alpha. Otherwise, you wouldn't need the mb. Thus, Linus' "buggy example" would be well and truly buggy on Alpha and probably on other architectures as well. And the really insidious thing is it might work just fine on i386 and x86-64, which means its bugginess could go undetected for quite a while. You don't want that sort of code anywhere outside the arch directory, where it can be vetted for the memory system semantics of that platform.
The memory barrier is to
The memory barrier is to ensure that the memory that the spinlock is protecting is not speculatively accessed before lock acquisition or after release. It has nothing to do with the value you're spinning on.
My example works just fine on Alpha if the atomic operation includes a barrier on that platform. I happen to have developed a fair bit of SMP safe code for the Alpha.
You still don't need a barrier in the spin loop to check for the unowned condition.
We're talking past each other.
The point both Linus and I are making is that the atomic example he gave for synchronization is broken because it lacks a barrier between what the atomic guards and the atomic itself. I wasn't arguing you need a barrier in the poll loop. I was arguing you need a barrier after acquisition and a barrier before release.
The original person I replied to was trying to say volatile is enough on the basis that x86 didn't need a barrier.
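For readers who want to see the barrier placement being argued about (a barrier after acquisition, a barrier before release, none needed in the poll loop), here is a minimal sketch using C11's <stdatomic.h>, which was standardized after this thread was written. The type sketch_lock_t and the functions sketch_lock and sketch_unlock are hypothetical; this is not how the kernel implements its spinlocks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock type: 0 means unowned, 1 means owned. */
typedef struct { atomic_int locked; } sketch_lock_t;

void sketch_lock(sketch_lock_t *l)
{
    assert(l != NULL);
    for (;;) {
        /* Acquire ordering: accesses to the protected data that follow
           in program order cannot be moved before this exchange. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;
        /* Poll loop: only needs a fresh load each time, so relaxed
           ordering is enough here. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;
    }
}

void sketch_unlock(sketch_lock_t *l)
{
    assert(l != NULL);
    /* Release ordering: accesses to the protected data that precede this
       store cannot be moved after it. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The acquire and release orderings play roughly the role of the mb instructions in the Alpha code, while the relaxed load in the inner loop corresponds to the plain ldl spin. The atomic types also guarantee the load is actually performed on each iteration, which is the job volatile was doing in the hand-written C version.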
Yes and the point that I'm
Yes and the point that I'm making is that it's a trivial example that's obviously broken. There are plenty of perfectly legal constructions that are facilitated by volatile. It's not something to be avoided. It's something to be understood and used in the correct circumstances.
I never suggested there was no need for a barrier. Only that it and volatile are orthogonal issues.
This reminds me of the 'goto is always bad!' camp who construct painfully convoluted single-exit logic to avoid it at all costs when using it often simplifies things. They do this just because in the hands of a bad programmer it tends to make things worse.
I'll keep my goto and volatile and use them in the proper cases, thanks.
Volatile has its uses
Volatile has its uses, but I totally see Linus' point that putting it on atomic_read isn't a good idea. It's a "separation of concerns" and "say what you mean" type of issue.
In this case, it is the question of whether atomic_read must always go to memory. Linus' argument is that it's not an SMP synchronization primitive when used by itself, it's only an atomic access primitive. It guarantees you won't receive a mixture of old and new values in the bits returned, and nothing more. It doesn't even guarantee it went to memory to get those bits. You need barriers and volatile both in addition to atomicity in order to build SMP synchronization primitives.
BTW, you should get an account. :-)
the aliasing problem says no
> The compiler can neglect to perform the [memory] read if it believes the
> results can not have changed. The compiler has to assume any function
> call may have modified the value. It has to assume that assembly
> could've done something it doesn't understand as well. Otherwise,
> reading a value in a loop requires no extra synchronization but
> the read must be emitted if it is to work.
False. When the programmer asks the compiler to access memory, it MUST access memory.
The only exceptions are:
1) if the programmer has used the C99 "restrict" keyword
2) if the programmer has been brave enough to run gcc with the -fstrict-aliasing language extension on.
If the compiler could start playing fast and loose with memory reads and writes, almost no multi-threaded programs would work correctly.
What?
The compiler is free to eliminate as many memory accesses as it can determine do not change the meaning of the program. For instance, the array reference x[i] below can be hoisted from the inner loop, because the compiler can plainly see it's independent of all other variables. Furthermore, the accumulation into z[i] can be replaced with a scalar access before and after the inner loop.
int x[8], y[8], z[8], i, j;

/* ... assume code here initializes x, y ... */

for (i = 0; i < 8; i++)
{
    z[i] = 0;
    for (j = 0; j < 8; j++)
        z[i] += x[i] * y[j];
}

Not only that, the compiler can unroll the inner and outer loops, vectorize the loops, change the order of accesses, and so on. So, while the original program appears to read the elements of z[] 64 times and write them 72 times, an optimizing compiler may end up never reading z, and writing it precisely 8 times. That's 100% legal within the C language. In fact, I just tested this assertion with GCC 4.2 here, and that's precisely what it does. Your statement "When the programmer asks the compiler to access memory, it MUST access memory" is plainly false.
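The transformation described here can be written out by hand. In this sketch, accumulate_naive is the loop as posted and accumulate_hoisted is a manual version of what the optimizer is allowed to do: x[i] is loaded once per outer iteration and the sum lives in a scalar, so z is never read and each element is written exactly once. Both function names are hypothetical. One assumption to note: with pointer parameters the compiler cannot prove the arrays are distinct, so the hand-hoisted version is only equivalent when z does not overlap x or y, which is true for the separate global arrays in the original example.

```c
#include <assert.h>
#include <stddef.h>

#define N 8

/* The loop as written in the comment: z[i] is updated inside the inner loop. */
void accumulate_naive(const int x[N], const int y[N], int z[N])
{
    assert(x != NULL && y != NULL && z != NULL);
    for (int i = 0; i < N; i++) {
        z[i] = 0;
        for (int j = 0; j < N; j++)
            z[i] += x[i] * y[j];
    }
}

/* Hand-applied optimization: hoist x[i] and accumulate in a local scalar.
   Assumes z does not alias x or y. */
void accumulate_hoisted(const int x[N], const int y[N], int z[N])
{
    assert(x != NULL && y != NULL && z != NULL);
    for (int i = 0; i < N; i++) {
        const int xi = x[i];   /* one load of x[i] per outer iteration */
        int acc = 0;           /* running sum kept out of memory */
        for (int j = 0; j < N; j++)
            acc += xi * y[j];
        z[i] = acc;            /* the only access to z[i]: a single write */
    }
}
```

Both functions produce the same z for non-overlapping inputs, which is why the compiler is free to pick the second form when it can prove the arrays are independent.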
Also, -fstrict-aliasing is not really a language extension, at least compared to C99. In that context, -fno-strict-aliasing is the language extension. The only real cross-type aliasing C99 permits is between char * and other object types, and that's mainly to permit memcpy(), memmove(), malloc() and friends to work. Take a look at this excerpt from here:
953 — a type compatible with the effective type of the object,
954 — a qualified version of a type compatible with the effective type of the object,
955 — a type that is the signed or unsigned type corresponding to the effective type of the object,
956 — a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
957 — an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
958 — a character type.
So, the standard allows certain innocuous things, such as switching between (int *) and (unsigned int *), casting between a pointer-to-union and a pointer to one of its members, pointer-to-array-element and one of its members, and very specifically char * (signed or unsigned) and back. All but the last one are by and large casts between compatible (or nearly so) types.
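A few of the accesses this list permits, next to the usual workaround for one it does not, might look like the sketch below. The function names first_byte, as_unsigned and float_bits are hypothetical. The float_bits helper assumes float and unsigned int have the same size, which holds on common platforms but is not guaranteed by the standard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Permitted: any object may be examined through a character type. */
unsigned char first_byte(const void *obj)
{
    assert(obj != NULL);
    return *(const unsigned char *)obj;
}

/* Permitted: an int object accessed through the corresponding unsigned type. */
unsigned int as_unsigned(const int *p)
{
    assert(p != NULL);
    return *(const unsigned int *)p;
}

/* Not permitted: reading a float through an unsigned int pointer, as in
   *(unsigned int *)&f. Copying the bytes with memcpy stays within the rules.
   Assumes sizeof(float) == sizeof(unsigned int). */
unsigned int float_bits(float f)
{
    unsigned int u;
    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Compilers commonly turn the memcpy in float_bits into a single register move, so staying within the aliasing rules usually costs nothing at run time.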
GCC does not rely on programmers to adhere to this strict aliasing regime, because historically, C permitted a much wider range of behavior. This article looks like it gives a good overview of the current status quo.
#include #include int
#include <stdio.h>
#include <stdlib.h>

int i;
int *ip;

int main(int argc, char **argv)
{
    int j;

    ip = &i;
    for (i = atoi(argv[1]); *ip; (*ip)--);
    printf("%d\n", i);
    exit(EXIT_SUCCESS);
}

At -O3 the loop becomes:
Notice it writes to (%edx) however it doesn't read from it again. Clearly, the compiler has optimized out the loads because there can be no aliased access because no function calls have happened. Multi-threaded code has to use locks to be correct. No amount of memory ordering or access would negate the need for some form of external synchronization.
Here's the loop with volatile:
Notice it generates a load of the pointer before the dec, and then again afterwards for the compare. This is 1:1 with what the programmer has requested. The compiler has not cached any results.
Compiler optimization
The compiler should just turn the loop into
mov $0,(%edx)
and be done with it!
-O3 implies the -fstrict-aliasing language extension
Go back and read my post again.
-O3 implies the -fstrict-aliasing language extension.
Where did you say that?
You said:
If the compiler could start playing fast and loose with memory reads and writes, almost no multi-threaded programs would work correctly.
So where in there did you say -O3 implies -fstrict-aliasing? That may be true, but the way you worded things, you implied -fstrict-aliasing was a completely out-there thing. (Never mind that it asserts that your code is compliant with C99.)
Edit: According to this page, -fstrict-aliasing gets enabled at -O2, at least under GCC 4.2.0. That means it's a pretty common feature to have enabled.