Hi,
Andi spotted this exchange on the gcc list. I don't think he's
brought it up here yet, but it worries me enough that I'd like
to discuss it.Starts here
http://gcc.gnu.org/ml/gcc/2007-10/msg00266.htmlConcrete example here
http://gcc.gnu.org/ml/gcc/2007-10/msg00275.htmlBasically, what the gcc developers are saying is that gcc is
free to load and store to any memory location, so long as it
behaves as if the instructions were executed in sequence.I guess that dynamically allocated memory and computed pointers
are more difficult for gcc to do anything unsafe with, because
it is harder to tell if a given function has deallocated the
memory. However even that could theoretically happen in future
if the compiler can work out the address comes from a global
variable or is not changed intermediately.Linux makes extensive use of both trylocks and interruptible
locks (ie. which automatically result in divergant code paths,
one of which holds the lock, the other doesn't). However there
are also other code paths which will either hold a particular
lock or will not hold it, depending on context or some flags
etc. barrier() doesn't help.For x86, obviously the example above shows it can be miscompiled,
but it is probably relatively hard to make it happen for a non
trivial sequence. For an ISA with lots of predicated instructions
like ia64, it would seem to be much more likely. But of course
we don't want even the possibility of failures.The gcc guys seem to be saying to mark everything volatile that
could be touched in a critical section. This is insane for Linux.Any thoughts?
-
How is this even an optimization? It looks SLOWER to me. The
conditional read wastes memory bandwidth sometimes, if the condition is
true, and v isn't already in the cache. The unconditional write wastes
memory bandwidth ALL the time, and dirties/flushes caches, in addition
to not being thread safe.This SHOULD be using a conditional write instead of a conditional read
and an unconditional write.-
Are you surprised?
The gcc developers seem to have had a total disregard for what people
want or need, and every time some code generation issue comes up, there's
a lot of people on the list that do language-lawyering, rather than admit
that there might be a problem.It's happened before, it will happen again. I don't think it's true of all
gcc developers (or even most, I hope), but it's common enough. For some
reason, compiler developers seem to be far enough removed from "real life"
that they have a tendency to talk in terms of "this is what the spec says"
rather than "this is a problem".Happily, at least in this kind of situation, threading is a real issue for
other projects than just the kernel, so maybe it gets solved properly.But I have to admit that for the last five years or so, I've really wanted
some other compiler team to come up with a good open-source compiler.
Exactly due to issues like this (Q: "Gcc creates bogus code that doesn't
work!" A: "It's not bogus, it's technically allowed by the language specs
that don't talk about xyz, the fact that it doesn't work isn't our
problem").I think the OpenBSD people decided to actually do something about this,
and I suspect it had *nothing* to do with license issues, and everything
to do with these kinds of problems. I wish them all the luck, although
personally I think LLVM is a much more interesting project.Linus
-
And on the LLVM side all hopes for clang [0] at least for better C++ error
reporting ;-)--
Faith is believing what you know isn't so -- Mark Twain
-
Someone should take 'sparse' and use that as a C language front-end to
LLVM...Jeff
-
Among clang's "features":
"A single unified parser for C/ObjC/C++"bleh. I cannot imagine how ugly a C parser gets, after being taught
C++. IMO since you can basically redefine everything in C++, it's not a
language but a proto-language.Jeff
-
I asked a collection of knowledgeable people I know about the issue. The
consensus is that the optimization is not permitted in POSIX code but that
it is permitted in pure C code. The basic argument goes like this:To make POSIX-compliant code even possible, surely optimizations that add
writes to variables must be prohibited. That is -- if POSIX prohibits
writing to a variable in certain cases only the programmer can detect, then
a POSIX-compliant compiler cannot write to a variable except where
explicitly told to do so. Any optimization that *adds* a write to a variable
that would not otherwise occur *must* be prohibited.Otherwise, it is literally impossible to comply with the POSIX requirement
that concurrent modifications and reads to shared variables take place while
holding a mutex.The simplest solution is simply to ditch the optimization. If it really
isn't even an optimization, then that's an easy way out.DS
-
Can you retain cc list, please?
For some things, I'd expect it will be an optimisation, which is why
they're even doing it. Even on x86 perhaps, where they do tricks with
sbb/adc. If it avoids an unpredictable branch, it could help. Actually
a silly microbenchmark shows it's worth 10% to do a cmov vs an
unpredictable conditional jump, but another 2.5% to do the adc and
unconditional store (which is the problematic bit).And for unshared things like local variables where their address
hasn't escaped, it's fine.Still, I guess that for most non-stack variables, you would actually
_want_ to do a cmov rather than the adc, even in a single threaded
program. Because you don't want to touch the cacheline if possible,
let alone dirty it.
-
[ I had to resend this message. Sorry if you received two copies. ]
Nick Piggin writes:
> Can you retain cc list, please?
>
> On Friday 26 October 2007 07:42, David Schwartz wrote:> > I asked a collection of knowledgeable people I know about the
> > issue. The consensus is that the optimization is not permitted in
> > POSIX code but that it is permitted in pure C code. The basic
> > argument goes like this:
> >
> > To make POSIX-compliant code even possible, surely
> > optimizations that add writes to variables must be
> > prohibited. That is -- if POSIX prohibits writing to a variable
> > in certain cases only the programmer can detect, then a
> > POSIX-compliant compiler cannot write to a variable except where
> > explicitly told to do so. Any optimization that *adds* a write to
> > a variable that would not otherwise occur *must* be prohibited.I don't think that POSIX is quite as explicit as that. See
http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf> > Otherwise, it is literally impossible to comply with the POSIX
> > requirement that concurrent modifications and reads to shared
> > variables take place while holding a mutex.
>
> Now all you have to do is tell this to the gcc developers ;)We're listening, really. It's unacceptable that gcc should break
code.However, fixing it is hard. The best plan is probably to implement
(part of) the proposed standard memory model, and that requires
careful consideration. We could in theory simply disable this
particular optimization, but that probably isn't a good idea on its
own because other optimizations might well break other code in a
similar way. We need to have a very close look at the thread-safe
memory model in order to determine where we do things that might
break.An official standard containing this is still at least two years out.
Andrew.
--
Red Hat UK Ltd, Am...
Nick Piggin writes:
> Can you retain cc list, please?
>
> On Friday 26 October 2007 07:42, David Schwartz wrote:> > I asked a collection of knowledgeable people I know about the
> > issue. The consensus is that the optimization is not permitted in
> > POSIX code but that it is permitted in pure C code. The basic
> > argument goes like this:
> >
> > To make POSIX-compliant code even possible, surely
> > optimizations that add writes to variables must be
> > prohibited. That is -- if POSIX prohibits writing to a variable
> > in certain cases only the programmer can detect, then a
> > POSIX-compliant compiler cannot write to a variable except where
> > explicitly told to do so. Any optimization that *adds* a write to
> > a variable that would not otherwise occur *must* be prohibited.I don't think that POSIX is quite as explicit as that. See
http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf> > Otherwise, it is literally impossible to comply with the POSIX
> > requirement that concurrent modifications and reads to shared
> > variables take place while holding a mutex.
>
> Now all you have to do is tell this to the gcc developers ;)We're listening, really. It's unacceptable that gcc should break
code.However, fixing it is hard. The best plan is probably to implement
(part of) the proposed standard memory model, and that requires
careful consideration. We could in theory simply disable this
particular optimization, but that probably isn't a good idea on its
own because other optimizations might well break other code in a
similar way. We need to have a very close look at the thread-safe
memory model in order to determine where we do things that might
break.An official standard containing this is still at least two years out.
Andrew.
--
Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, UK
R...
In that case a conversion of a conditional branch to an unconditional
write to a visible variable is not an acceptable behaviour. Aside from
the kernel issues, it would break any number of threaded userspace apps.As was mentioned elsewhere, it's akin to sprinkling
int j = i; i = j;
throughout the code. If "i" is accessed by multiple threads, this is
not allowed unless a lock is held.Chris
-
Hi,
The BSD people are adopting pcc [1] which is a rewritten version of
some C compiler originally developed in the late 70s. And yeah, it's
basically because they think gcc is becoming too painful to live with
[2].Pekka
1. http://pcc.ludd.ltu.se/
2. http://www.thejemreport.com/mambo/content/view/369/
-
This case is clearly a bug, a very likely code pessimization.
I guess it wasn't intentional, just an optimization that is usefulOften accesses happen without function calls inbetween.
Also I think newer gcc (not 3.x) can determine if a pointerWe don't have much choice: If such a case is found it has to be marked
volatile or that particular compiler version be unsupported.It might be useful to come up with some kind of assembler pattern
matcher to check if any such code is generated for the kernel
and try it with different compiler versions.-Andi
-
Although there can be cases where it looks much more like an
optimisation (eg. where the branch and increment occurs much
more often), but it would still be a bug. Granted they are
rather constructed cases, but I don't think you want to rely onMarking volatile I think is out of the question. To start with,
volatile creates really poor code (and most of the time we actually
do want the code in critical sections to be as tight as possible).
But also because I don't think these bugs are just going to beHard to know how to do it. If you can, then it would be interesting.
-
Poor code is better than broken code I would say. Besides
the cross CPU synchronization paths are likely dominated by
cache misses anyways; it's unlikely they're actually limited
by the core CPU. So it'll probably not matter all that much
if the code is poor or not.I checked my kernel for "adc" at least (for the trylock/++ pattern)
and couldn't find any (in fact all of them were in
data the compiler thought to be code). That was not a allyesconfig build,
so i missed some code.For cmov it's at first harder because they're much more frequent
and cmov to memory is a multiple instruction pattern (the instruction
does only support memory source operands). Luckily gcc
doesn't know the if (x) mem = a; => ptr = cmov(x, &a, &dummy); *ptr = a;
transformation trick so I don't think there are actually any conditional
stores on current x86.It might be a problem on other architectures which support true
conditional stores though (like IA64 or ARM)Also I'm not sure if gcc doesn't know any other tricks like the
conditional add using carry, although I cannot think of any related
to stores from the top of my hat.Anyways, if it's only conditional add if we ever catch such a case
it could be also handled with inline assembly similar to local_inc()-Andi
-
But we have to mark all memory access inside the critical section
as volatile, even after the CPU synchronisation, and often the
common case where there is no contention or cacheline bouncing.Sure, real code is dominated by cache misses anyway, etc etc.
However volatile prevents a lot of real optimisations that we'dIt might just depend on the instruction costs that gcc knows about.
For example, if ld/st is expensive, they might hoist them out of
loops etc. You don't even need to have one of these predicate orBut we don't actually know what it is, and it could change with
different architectures or versions of gcc. I think the sanest thing
is for gcc to help us out here, seeing as there is this very well
defined requirement that we want.If you really still want the optimisation to occur, I don't think it
is too much to use a local variable for the accumulator (eg. in the
simple example case).
-
I don't think it makes sense for memory. It may may make sense for registers.
But given that my kernel doesn't seem to contain any instances
at all it's probably not too useful for it anyways.So I wouldn't have a problem disabling it, but it would
be better if they fixed their heuristics to only apply it when
it makes sense.Unfortunately it's not possible as far as I know with current gccs.
I didn't think you can disable only ADC/SBB.Disabling all of CMOV would be a pity though. Also I don't think
there is any way to do that except selecting a CPU architecture
that doesn't support it which would have other bad side effects
on the code.-Andi
-
That's what I mean -- disabling it for memory. I mean, I am just
talking about the conditional => unconditional store to a shared
variable optimisation.So for example, adc, sbb, and cmov are all still fine when they
are used for the right things. I don't want to turn them off
because they really are quite useful.As far as it being theoretical... I hope it is. But we should
nip this in the "bud" (gcc 3.x does this too, sigh) before it
causes problems for us (and any and all other threaded programs
and libraries out there). And by that I mean ask them for a
compiler option rather than start adding volatile.-
No. A *working*compiler* is better than broken code.
There's no way to use volatile for these things, since it can hit
*anything*. When the compiler generates buggy code, it's buggy code. We
can't add volatiles to every single data structure. We'd be better off
having a million monkeys on crack try to hand-assemble the thing, than
having a totally buggy compiler do it for us.Linus
-
No it can't (at least not on x86) as I have explained in the rest of the mail
you conveniently snipped.-Andi
-
I "conveniently snipped it" because it was pointless.
"adc" or "cmov" has nothing what-so-ever to do with it. If some routine
returns 0-vs-1 and gcc then turns "if (routine()) x++" into
"x+=routine()", what does that have to do with adc or cmov?The fact is, these kinds of optimizations are *bogus* and they are
dangerous.Now, it's equally true that we probably don't have those kinds of patterns
in the kernel, and we'll probabaly not hit it, but wouldn't it be much
better to make sure that compilers shouldn't do that?Linus
-
That is not what gcc did in that case. I don't think it tracks sets of values
over function calls (or even inside functions) at all.The generated code was
cmpl $1, %eax ; test res
movl acquires_count, %edx ; load
adcl $0, %edx ; maybe add 1
movl %edx, acquires_count ; storeSo it just added the result of a comparison into a variable
by (ab)using carry for this.In theory such things can be done with CMOV too by redirecting
a store into a dummy variable to cancel it, but gcc doesn'tThe conditional add/sub using carry trick is not generally bogus.
It's just bogus for memory addresses not pretty much guaranteed in L1
[aka small stack frame] because there the pipeline benefit is unlikely to
offset the memory costs (and of course poor quality of implementation because of the
missing thread safety).But for registers it's a fine optimization.
-Andi
-
While this is OK in mono-threaded code, it introduces a race condition in
multi-threaded code. The code above tried to acquire a lock, and eax was
set to 1 if it succeeded. And whatever the result, all threads still
happily modify the shared memory area (acquires_count). So the classical
case where two threads perform the same operation at the same time endsEven with a CMOV, it's the memory write which should not be performed
if the lock was not acquired.100% agree.
What would really be needed is an attribute around conditions to
indicate whether they *may* be optimized or not. Something similar
to the likely/unlikely we currently use, we could have something
like __attribute__((unsafe_cond(cond))). I think that it could still
optimize by default but let the user explicitly state that he is
playing with thread-unsafe code. As you pointed out, you did not
find any such mis-optimization in the kernel, which means that it
does not hit too often. That's the reason why I'd let the user be
careful.Willy
-
For registers it's fine. For memory, it's a disaster. It's more than just
dirty cachelines and introducing race conditions, it's also about
protection and dirty pages.So even in user space, to even be correct in the first place, the compiler
would need to make sure that the variable is writable at all (or you might
take a SIGSEGV), but I guess that gcc just assumes it is, at least for
globals (or gcc could depend on seeing *other* writes that are done
unconditionally).More likely, the compiler people don't even care, because "the C standard
doesn't specify that behaviour" - ie things like write-protected memory or
garbage collection based on dirty/accessed bits are outside the scope of
what the language specifies. Much less things like pthreads or other
synchronization primitives in threads.Linus
-
It's actually a fair bit worse for us. We have paths where a false
optimization like this would hyperspace the machine. In fact, this
frightens me so much I've just gone off to investigate whether gcc has
gone and done this to any of our code.Clearly the right solution is to introduce threads and write protected
memory into gcc so that the developers are either motivated to ensure
they work or self-destruct.Zach
-
I don't think it is a BUG, but one should certainly be able
to turn it off. Gcc is correct in that the 'C' language allows
a lot of implimentation details that are not covered by theCheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.59 BogoMips).
My book : http://www.AbominableFirebug.com/
_****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.Thank you.
-
Bug as in an optimization that makes the code slower than it was
before. That is clearly a bug in a compiler.-Andi
-
On Thu, 25 Oct 2007 13:24:49 +1000
this optimization btw is a serious mis-optimization, it makes memory
more dirty and causes cachelines to become unshared.... I'm sure it
works great on microbenchmarks but it sucks bigtime for anything real
-
Well that's exactly right. For threaded programs (and maybe even
real-world non-threaded ones in general), you don't want to be
even _reading_ global variables if you don't need to. Cache misses
and cacheline bouncing could easily cause performance to completely
tank in some cases while only gaining a cycle or two in
microbenchmarks for doing these funny x86 predication things.I'm not sure about ia64 -- I _hope_ that for most of their
predication stuff, they also predicate the stores, rather than
just store unconditionally and rely on the source operand not
changing in the case they didn't intend the memory to change.
-
For some CPUs, replacing an conditional branch with a conditional move is a
*huge* win because it cannot be mispredicted. In general, compilers should
optimize for unshared data since that's much more common in typical code.
Even for shared data, the usual case is that you are going to access the
data few times, so pulling the cache line to the CPU is essentially free
since it will happen eventually.Heuristics may show that the vast majority of such constructs write anyway.
So the optimization may also be valid based on such heuristics.A better question is whether it's legal for a compiler that claims to
support POSIX threads. I'm going to post on comp.programming.threads, where
the threading experts hang out.A very interesting case to be sure.
DS
-
Hi David,
[BTW. can you retain cc lists, please?]
A *conditional* store should no be a problem.
However the funny trick of doing this conditional add (implemented with
unconditional store), is what is going to cause breakage.On the CPUs where predicated instructions are a big win, I'd expect
they should also implement a conditional store for use here. However
they might be slower than an unconditional store (eg. x86's cmov),This is not just a question of data that you were going to use anyway.
gcc generates memory accesses to locations that would never be accessed
Even stores. It is basically impossible to say that this is a real
performance win. Even on single threaded code: consider that cache
misses take the vast majority of time in many loads, which gives aI'd never say the optimisation would always be useless. But it's a nasty
thing to have on by default, and apparently even with no good way toEither way, I think we really need a way to turn it off for Linux.
-
>>>>> "Nick" == Nick Piggin <nickpiggin@yahoo.com.au> writes:
Nick> Either way, I think we really need a way to turn it off for
Nick> Linux.Someone would need to add an option to disable the "cselim" pass in
GCC tree-ssa-phiopt.c as far as I can tell from reading GCC source.Sam
--
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/-
Note the test case was for 3.4.x which doesn't have tree-ssa at all.
It does if-conversion on the RTL level-Andi
-
On Wed, 24 Oct 2007 21:29:56 -0700
please name one...
it's not about pulling it to the CPU, it's pulling it *out* of all the
other cpus AS WELL. (and writing it back to memory, taking away memory
bandwidth)--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
-
| Andrew Morton | -mm merge plans for 2.6.23 |
| Chuck Ebbert | Why do so many machines need "noapic"? |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
| Greg Kroah-Hartman | [PATCH 023/196] MCP_UCB1200: Convert from class_device to device |
git: | |
| David Miller | Re: [BUG] New Kernel Bugs |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 31/37] dccp: Remove manual influence on NDP Count feature |
| Gregory Haskins | [RFC PATCH 00/17] virtual-bus |
