Because atomic operations are generally used for synchronization, which requires volatile behavior. Most such codepaths currently use an inefficient barrier(). Some forget to and we get bugs, because people assume that atomic_read() actually reads something, and atomic_write() actually writes something. Worse, these are architecture-specific, even compiler version-specific bugs that are often difficult to track down. -- Chris -
Looks like we need to have lock and unlock semantics? atomic_read() which has no barrier or volatile implications. atomic_read_for_lock Acquire semantics? atomic_read_for_unlock Release semantics? -
I'm yet to see a single example from the current tree where this patch series is the correct solution. So far the only example has been a buggy piece of code which has since been fixed with a cpu_relax. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -
Btw.: we still have include/asm-i386/mach-es7000/mach_wakecpu.h: while (!atomic_read(deassert)); include/asm-i386/mach-default/mach_wakecpu.h: while (!atomic_read(deassert)); Looks like they need to be fixed as well. -
I don't know if this here is affected:
/* drivers/ieee1394/ieee1394_core.h */
static inline unsigned int get_hpsb_generation(struct hpsb_host *host)
{
return atomic_read(&host->generation);
}
/* drivers/ieee1394/nodemgr.c */
static int nodemgr_host_thread(void *__hi)
{
[...]
for (;;) {
[... sleep until bus reset event ...]
/* Pause for 1/4 second in 1/16 second intervals,
* to make sure things settle down. */
g = get_hpsb_generation(host);
for (i = 0; i < 4 ; i++) {
if (msleep_interruptible(63) ||
kthread_should_stop())
goto exit;
/* Now get the generation in which the node ID's we collect
* are valid. During the bus scan we will use this generation
* for the read transactions, so that if another reset occurs
* during the scan the transactions will fail instead of
* returning bogus data. */
generation = get_hpsb_generation(host);
/* If we get a reset before we are done waiting, then
* start the waiting over again */
if (generation != g)
g = generation, i = 0;
}
[... scan bus, using generation ...]
}
exit:
[...]
}
--
Stefan Richter
-=====-=-=== =--- -====
http://arcgraph.de/sr/
-
Hi Stefan, Yes, I think it is. You're clearly expecting the read to actually happen when you call get_hpsb_generation(). It's clearly not a busy-loop, so cpu_relax() sounds pointless / wrong solution for this case, so I'm now somewhat beginning to appreciate the motivation behind this series :-) But as I said, there are ways to achieve the same goals of this series without using "volatile". I think I'll submit a RFC/patch or two on this myself (will also fix Totally unrelated, but this looks weird. IMHO you actually wanted: msleep_interruptible(63); if (kthread_should_stop()) goto exit; here, didn't you? Otherwise the thread will exit even when kthread_should_stop() != TRUE (just because it received a signal), and it is not good for a kthread to exit on its own if it uses kthread_should_stop() or if some other piece of kernel code could eventually call kthread_stop(tsk) on it. Ok, probably the thread will never receive a signal in the first place because it's spawned off kthreadd which ignores all signals beforehand, but still ... [PATCH] ieee1394: Fix kthread stopping in nodemgr_host_thread The nodemgr host thread can exit on its own even when kthread_should_stop is not true, on receiving a signal (might never happen in practice, as it ignores signals). But considering kthread_stop() must not be mixed with kthreads that can exit on their own, I think changing the code like this is clearer. This change means the thread can cut its sleep short when receive a signal but looking at the code around, that sounds okay (and again, it might never actually recieve a signal in practice). Signed-off-by: Satyam Sharma <satyam@infradead.org> --- drivers/ieee1394/nodemgr.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/ieee1394/nodemgr.c b/drivers/ieee1394/nodemgr.c index 2ffd534..981a7da 100644 --- a/drivers/ieee1394/nodemgr.c +++ b/drivers/ieee1394/nodemgr.c @@ -1721,7 +1721,8 @@ static int nodemgr_host_thread(void ...
Nope, we're calling schedule which is a rather heavy-weight barrier. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -
[...something like]
b = atomic_read(a);
for (i = 0; i < 4; i++) {
msleep_interruptible(63);
c = atomic_read(a);
if (c != b) {
b = c;
i = 0;
}
How does the compiler know that msleep() has got barrier()s?
--
Stefan Richter
-=====-=-=== =--- -====
http://arcgraph.de/sr/
-
Because msleep_interruptible() is in a separate compilation unit, the compiler has to assume that it might modify any arbitrary global. In many cases, the compiler also has to assume that msleep_interruptible() might call back into a function in the current compilation unit, thus possibly modifying global static variables. Thanx, Paul -
Yup, I've just verified this with a testcase. So a call to any function outside of the current compilation unit acts as a compiler barrier. Cool. -
No; compilation units have nothing to do with it, GCC can optimise across compilation unit boundaries just fine, if you tell it to compile more than one compilation unit at once. What you probably mean is that the compiler has to assume any code it cannot currently see can do anything (insofar as allowed by the It most often is smart enough to see what compilation-unit-local variables might be modified that way, though :-) Segher -
Last I checked, the Linux kernel build system did compile each .c file Yep. For example, if it knows the current value of a given such local variable, and if all code paths that would change some other variable cannot be reached given that current value of the first variable. At least given that gcc doesn't know about multiple threads of execution! Thanx, Paul -
[ The Cc: list scares me. Should probably trim it. ] I think this was just terminology confusion here again. Isn't "any code that it cannot currently see" the same as "another compilation unit", and wouldn't the "compilation unit" itself expand if we ask gcc to compile more than one unit at once? Or is there some more specific "definition" for "compilation unit" (in gcc lingo, possibly?) -
This is indeed my understanding -- "compilation unit" is whatever the compiler looks at in one go. I have heard the word "module" used for the minimal compilation unit covering a single .c file and everything that it #includes, but there might be a better name for this. Thanx, Paul -
Yes, that's what's called "compilation unit" :-) [/me double checks] Erm, the C standard actually calls it "translation unit". To be exact, to avoid any more confusion: 5.1.1.1/1: A C program need not all be translated at the same time. The text of the program is kept in units called source files, (or preprocessing files) in this International Standard. A source file together with all the headers and source files included via the preprocessing directive #include is known as a preprocessing translation unit. After preprocessing, a preprocessing translation unit is called a translation unit. Segher -
I am OK with "translation" and "compilation" being near-synonyms. ;-) Thanx, Paul -
It is not; try gcc -combine or the upcoming link-time optimisation "compilation unit" is a C standard term. It typically boils down to "single .c file". Segher -
As you mentioned later, "single .c file with all the other files (headers
or other .c files) that it pulls in via #include" is actually "translation
unit", both in the C standard as well as gcc docs. "Compilation unit"
doesn't seem to be nearly as standard a term, though in most places it
is indeed meant to be same as "translation unit", but with the new gcc
inter-module-analysis stuff that you referred to above, I suspect one may
reasonably want to call a "compilation unit" as all that the compiler sees
at a given instant.
BTW I did some auditing (only inside include/asm-{i386,x86_64}/ and
arch/{i386,x86_64}/) and found a couple more callsites that don't use
cpu_relax():
arch/i386/kernel/crash.c:101
arch/x86_64/kernel/crash.c:97
that are:
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
mdelay(1);
msecs--;
}
where mdelay() becomes __const_udelay() which happens to be in another
translation unit (arch/i386/lib/delay.c) and hence saves this callsite
from being a bug :-)
Curiously, __const_udelay() is still marked as "inline" where it is
implemented in lib/delay.c which is weird, considering it won't ever
be inlined, would it? With the kernel presently being compiled one
translation unit at a time, I don't see how the implementation would
be visible to any callsite out there to be able to inline it.
Satyam
-
That would be a bit confusing, would it not? They'd better find some better name for that if they want to name it at all (remember, none of these optimisations should have any effect on the semantics of the program, you just get fewer .o files etc.). Segher -
I have some patches to use -combine -fwhole-program for Linux. Highly experimental, you need a patched bleeding edge toolchain. If there's interest I'll clean it up and put it online. Or the most common thing: if neither the address of the translation- unit local variable nor the address of any function writing to that variable can "escape" from that translation unit, nothing outside Heh, only about the threads it creates itself (not relevant to the kernel, for sure :-) ) Segher -
But there is usually at least one externally callable function in ;-) Thanx, Paul -
Of course, but often none of those will (indirectly) write a certain static variable. Segher -
But there has to be some path to the static functions, assuming that they are not dead code. Yes, there can be cases where the compiler knows enough about the state of the variables to rule out some of code paths to them, but I have no idea how often this happens in kernel code. Thanx, Paul -
(trimmed Cc) Thanks, committed to linux1394-2.6.git. -- Stefan Richter -=====-=-=== =--- -==== http://arcgraph.de/sr/ -
[PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()
Use cpu_relax() in the busy loops, as atomic_read() doesn't automatically
imply volatility for i386 and x86_64. x86_64 doesn't have this issue because
it open-codes the while loop in smpboot.c:smp_callin() itself that already
uses cpu_relax().
For i386, however, smpboot.c:smp_callin() calls wait_for_init_deassert()
which is buggy for mach-default and mach-es7000 cases.
[ I test-built a kernel -- smp_callin() itself got inlined in its only
callsite, smpboot.c:start_secondary() -- and the relevant piece of
code disassembles to the following:
0xc1019704 <start_secondary+12>: mov 0xc144c4c8,%eax
0xc1019709 <start_secondary+17>: test %eax,%eax
0xc101970b <start_secondary+19>: je 0xc1019709 <start_secondary+17>
init_deasserted (at 0xc144c4c8) gets fetched into %eax only once and
then we loop over the test of the stale value in the register only,
so these look like real bugs to me. With the fix below, this becomes:
0xc1019706 <start_secondary+14>: pause
0xc1019708 <start_secondary+16>: cmpl $0x0,0xc144c4c8
0xc101970f <start_secondary+23>: je 0xc1019706 <start_secondary+14>
which looks nice and healthy. ]
Thanks to Heiko Carstens for noticing this.
Signed-off-by: Satyam Sharma <satyam@infradead.org>
---
include/asm-i386/mach-default/mach_wakecpu.h | 3 ++-
include/asm-i386/mach-es7000/mach_wakecpu.h | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/asm-i386/mach-default/mach_wakecpu.h b/include/asm-i386/mach-default/mach_wakecpu.h
index 673b85c..3ebb178 100644
--- a/include/asm-i386/mach-default/mach_wakecpu.h
+++ b/include/asm-i386/mach-default/mach_wakecpu.h
@@ -15,7 +15,8 @@
static inline void wait_for_init_deassert(atomic_t *deassert)
{
- while (!atomic_read(deassert));
+ while (!atomic_read(deassert))
+ cpu_relax();
return;
}
diff --git ...For less-than-briliant people like me, it's totally non-obvious that cpu_relax() is needed for correctness here, not just to make P4 happy. IOW: "atomic_read" name quite unambiguously means "I will read this variable from main memory". Which is not true and creates potential for confusion and bugs. -- vda -
I find it also non obvious. It would be really better to have a barrier Agreed. -Andi -
To me, "atomic_read" means a read which is synchronized with other changes to the variable (using the atomic_XXX functions) in such a way that I will always only see the "before" or "after" state of the variable - never an intermediate state while a modification is happening. It doesn't imply that I have to see the "after" state immediately after another thread modifies it. Perhaps the Linux atomic_XXX functions work like that, or used to work like that, but it's counter-intuitive to me that "atomic" should imply a memory read. Later, Kenn -
So you are ok with compiler propagating n1 to n2 here: n1 += atomic_read(x); other_variable++; n2 += atomic_read(x); without accessing x second time. What's the point? Any sane coder will say that explicitly anyway: tmp = atomic_read(x); n1 += tmp; other_variable++; n2 += tmp; if only for the sake of code readability. Because first code is definitely hinting that it reads RAM twice, and it's actively *bad* for code readability when in fact it's not the case! Locking, compiler and CPU barriers are complicated enough already, please don't make them even harder to understand. -- vda -
No. This is a common mistake, and it's total crap. Any "sane coder" will often use inline functions, macros, etc helpers to do certain abstract things. Those things may contain "atomic_read()" calls. The biggest reason for compilers doing CSE is exactly the fact that many opportunities for CSE simple *are*not*visible* on a source code level. That is true of things like atomic_read() equally as to things like shared offsets inside structure member accesses. No difference what-so-ever. Yes, we have, traditionally, tried to make it *easy* for the compiler to generate good code. So when we can, and when we look at performance for some really hot path, we *will* write the source code so that the compiler doesn't even have the option to screw it up, and that includes things like doing CSE at a source code level so that we don't see the compiler re-doing accesses unnecessarily. And I'm not saying we shouldn't do that. But "performance" is not an either-or kind of situation, and we should: - spend the time at a source code level: make it reasonably easy for the compiler to generate good code, and use the right algorithms at a higher level (and order structures etc so that they have good cache behaviour). - .. *and* expect the compiler to handle the cases we didn't do by hand pretty well anyway. In particular, quite often, abstraction levels at a software level means that we give compilers "stupid" code, because some function may have a certain high-level abstraction rule, but then on a particular architecture it's actually a no-op, and the compiler should get to "untangle" our stupid code and generate good end results. - .. *and* expect the hardware to be sane and do a good job even when the compiler didn't generate perfect code or there were unlucky cache miss patterns etc. and if we do all of that, we'll get good performance. But you really do want all three levels. It's not enough to be good at any one ...
Hi Denys,
in atomic_{read,set} myself, but frankly, at least personally speaking
(now that I know better), I'm not so much in favour of implicit barriers
(compiler, memory or both) in atomic_{read,set}.
This might sound like an about-turn if you read my own postings to Nick
Piggin from a week back, but I do agree with most his opinions on the
matter now -- separation of barriers from atomic ops is actually good,
beneficial to certain code that knows what it's doing, explicit usage
of barriers stands out more clearly (most people here who deal with it
do know cpu_relax() is an explicit compiler barrier) compared to an
I'd have to disagree here -- atomic ops are all about _atomicity_ of
memory accesses, not _making_ them happen (or visible to other CPUs)
_then and there_ itself. The latter are the job of barriers.
The behaviour (and expectations) are quite comprehensively covered in
atomic_ops.txt -- let alone atomic_{read,set}, even atomic_{inc,dec}
are permitted by archs' implementations to _not_ have any memory
barriers, for that matter. [It is unrelated that on x86 making them
SMP-safe requires the use of the LOCK prefix that also happens to be
an implicit memory barrier.]
An argument was also made about consistency of atomic_{read,set} w.r.t.
the other atomic ops -- but clearly, they are all already consistent!
All of them are atomic :-) The fact that atomic_{read,set} do _not_
require any inline asm or LOCK prefix whereas the others do, has to do
with the fact that unlike all others, atomic_{read,set} are not RMW ops
and hence guaranteed to be atomic just as they are in plain & simple C.
But if people do seem to have a mixed / confused notion of atomicity
and barriers, and if there's consensus, then as I'd said earlier, I
have no issues in going with the consensus (eg. having API variants).
Linus would be more difficult to convince, however, I suspect :-)
Satyam
-
The confusion may be the result of us having barrier semantics in atomic_read. If we take that out then we may avoid future confusions. -
I think better name may help. Nuke atomic_read() altogether. n = atomic_value(x); // doesnt hint as strongly at reading as "atomic_read" n = atomic_fetch(x); // yes, we _do_ touch RAM n = atomic_read_uncached(x); // or this How does that sound? -- vda -
atomic_value() vs. atomic_fetch() should be rather unambiguous. atomic_read_uncached() begs the question of precisely which cache we are avoiding, and could itself cause confusion. So, if I were writing atomic.h from scratch, knowing what I know now, I think I'd use atomic_value() and atomic_fetch(). The problem is that there are a lot of existing users of atomic_read(), and we can't write a script to correctly guess their intent. I'm not sure auditing all uses of atomic_read() is really worth the comparatively miniscule benefits. We could play it safe and convert them all to atomic_fetch(), or we could acknowledge that changing the semantics 8 months ago was not at all disastrous, and make them all atomic_value(), allowing people to use atomic_fetch() where they really care. -- Chris -
Not just P4 ... there are other threaded cpus where it is useful to let the core know that this is a busy loop so it would be a good thing to let other threads have priority. Even on a non-threaded cpu the cpu_relax() could be useful in the future to hint to the cpu that it could drop into a lower power hogging state. But I agree with your main point that the loop without the cpu_relax() looks like it ought to work because atomic_read() ought to actually go out and read memory each time around the loop. -Tony -
Part of the motivation here is to fix heisenbugs. If I knew where they were, I'd be posting patches for them. Unlike most bugs, where we want to expose them as obviously as possible, these can be extremely difficult to track down, and are often due to people assuming that the atomic_* operations have the same semantics they've historically had. Remember that until recently, all SMP architectures except s390 (which very few kernel developers outside of IBM, Red Hat, and SuSE do much work on) had volatile declarations for atomic_t. Removing the volatile declarations from i386 and x86_64 may have created heisenbugs that won't manifest themselves until GCC 6.0 comes out and people start compiling kernels with -O5. We should have consistent semantics for atomic_* operations. The other motivation is to reduce the need for the barriers used to prevent/fix such problems which clobber all your registers, and instead force atomic_* operations to behave in the way they're actually used. After the (resubmitted) patchset is merged, we'll be able to remove a whole bunch of barriers, shrinking our source and our binaries, and improving performance. -- Chris -
By the same token we should probably disable optimisations altogether since that too can create heisenbugs. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -
Precisely the point -- use of volatile (whether in casts or on asms) in these cases are intended to disable those optimizations likely to result in heisenbugs. But they are also intended to leave other valuable optimizations in force. Thanx, Paul -
The only thing volatile on an asm does is create a side effect on the asm statement; in effect, it tells the compiler "do not remove this asm even if you don't need any of its outputs". It's not disabling optimisation likely to result in bugs, heisen- or otherwise; _not_ putting the volatile on an asm that needs it simply _is_ a bug :-) Segher -
Yep. And the reason it is a bug is that it fails to disable the relevant compiler optimizations. So I suspect that we might actually be saying the same thing here. Thanx, Paul -
We're not saying the same thing, but we do agree :-) Segher -
Almost everything is a tradeoff; and so is this. I don't believe most people would find disabling all compiler optimisations an acceptable price to pay for some peace of mind. Segher -
So why is this a good tradeoff? I also think that just adding things to APIs in the hope it might fix up some bugs isn't really a good road to go down. Where do you stop? On the actual proposal to make atomic_operators volatile: I think the better approach in the long term, for both maintainability of the code and education of coders, is to make the use of barriers _more_ explicit rather than sprinkling these "just in case" ones around. You may get rid of a few atomic_read heisenbugs (in noise when compared to all bugs), but if the coder was using a regular atomic load, or a test_bit (which is also atomic), etc. then they're going to have problems. It would be better for Linux if everyone was to have better awareness of barriers than to hide some of the cases where they're required. A pretty large number of bugs I see in lock free code in the VM is due to memory ordering problems. It's hard to find those bugs, or even be aware when you're writing buggy code if you don't have some feel for barriers. -- SUSE Labs, Novell Inc. -
I look at it the other way: keeping the "volatile" semantics in atomic_XXX() (or adding them to it, whatever) helps _prevent_ bugs; certainly most people expect that behaviour, and also that behaviour is *needed* in some places and no other interface provides that functionality. [some confusion about barriers wrt atomics snipped] Segher -
It's easy to be better than something really stupid :) So i386 and x86-64 don't have volatiles there, and it saves them a few K of kernel text. What you need to justify is why it is a good tradeoff to make them volatile (which btw, is much harder to go Yeah, but we could add lots of things to help prevent bugs and would never be included. I would also contend that it helps _hide_ bugs and encourages people to be lazy when thinking about these things. Also, you dismiss the fact that we'd actually be *adding* volatile semantics back to the 2 most widely tested architectures (in terms of test time, number of testers, variety of configurations, and coverage of driver code). This is a very important different from just keeping volatile semantics because it is basically a one-way I don't know that most people would expect that behaviour. Is there any documentation anywhere that would suggest this? Also, barrier() most definitely provides the required functionality. It is overkill in some situations, but volatile is overkill in _most_ situations. If that's what you're worried about, we should add a new What were you confused about? -- SUSE Labs, Novell Inc. -
I'm really surprised it's as much as a few K. I tried it on powerpc and it only saved 40 bytes (10 instructions) for a G5 config. Paul. -
I'm surprised too. Numbers were from the "...use asm() like the other atomic operations already do" thread. According to them, text data bss dec hex filename 3434150 249176 176128 3859454 3ae3fe atomic_normal/vmlinux 3436203 249176 176128 3861507 3aec03 atomic_volatile/vmlinux The first one is a stock kenel, the second is with atomic_read/set cast to volatile. gcc-4.1 -- maybe if you have an earlier gcc it won't optimise as much? -- SUSE Labs, Novell Inc. -
No, see my earlier reply. "volatile" really *is* an incredible piece of
crap.
Just try it yourself:
volatile int i;
int j;
int testme(void)
{
return i <= 1;
}
int testme2(void)
{
return j <= 1;
}
and compile with all the optimizations you can.
I get:
testme:
movl i(%rip), %eax
subl $1, %eax
setle %al
movzbl %al, %eax
ret
vs
testme2:
xorl %eax, %eax
cmpl $1, j(%rip)
setle %al
ret
(now, whether that "xorl + setle" is better than "setle + movzbl", I don't
really know - maybe it is. But that's not thepoint. The point is the
difference between
movl i(%rip), %eax
subl $1, %eax
and
cmpl $1, j(%rip)
and imagine this being done for *every* single volatile access.
Just do a
git grep atomic_read
to see how atomics are actually used. A lot of them are exactly the above
kind of "compare against a value".
Linus
-
gcc bugzilla bug #33102, for whatever that ends up being worth. ;-) Thanx, Paul -
I had totally forgotten that I'd already filed that bug more than six years ago until they just closed yours as a duplicate of mine :) Good luck in getting it fixed! Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -
Well, just got done re-opening it for the third time. And a local gcc community member advised me not to give up too easily. But I must admit that I am impressed with the speed that it was identified as duplicate. Should be entertaining! ;-) Thanx, Paul -
Right. ROTFL... volatile actually breaks atomic_t instead of making it safe. x++ becomes a register load, increment and a register store. Without volatile we can increment the memory directly. It seems that volatile requires that the variable is loaded into a register first and then operated upon. Understandable when you think about volatile being used to access memory mapped I/O registers where a RMW operation could be problematic. See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=3506 -
No code does (or would do, or should do): x.counter++; on an "atomic_t x;" anyway. -
That's just an example of a general problem. No, you don't use "x.counter++". But you *do* use if (atomic_read(&x) <= 1) and loading into a register is stupid and pointless, when you could just do it as a regular memory-operand to the cmp instruction. And as far as the compiler is concerned, the problem is the 100% same: combining operations with the volatile memop. The fact is, a compiler that thinks that movl mem,reg cmpl $val,reg is any better than cmpl $val,mem is just not a very good compiler. But when talking about "volatile", that's exactly what ytou always get (and always have gotten - this is not a regression, and I doubt gcc is alone in this). Linus -
True, but that makes this a bad/poor code generation issue with the compiler, not something that affects the _correctness_ of atomic ops if "volatile" is used for that counter object (as was suggested), because we'd always use the atomic_inc() etc primitives to do increments, which Absolutely, this is definitely a bug report worth opening with gcc. And what you've said to explain this previously sounds definitely correct -- seeing "volatile" for any access does appear to just scare the hell out of gcc and makes it generate such (poor) code. Satyam -
One of the gcc guys claimed that he thought that the two-instruction sequence would be faster on some x86 machines. I pointed out that there might be a concern about code size. I chose not to point out that people might also care about the other x86 machines. ;-) Thanx, Paul -
Some (very few) x86 uarchs do tend to prefer "load-store" like code generation, and doing a "mov [mem],reg + op reg" instead of "op [mem]" can actually be faster on some of them. Not any that are relevant today, though. Also, that has nothing to do with volatile, and should be controlled by optimization flags (like -mtune). In fact, I thought there was a separate flag to do that (ie something like "-mload-store"), but I can't find it, so maybe that's just my fevered brain.. Linus -
Good point, will suggest this if the need arises. Thanx, Paul -
It doesn't mean that (volatile int*) cast is bad, it means that current gcc is bad (or "not good enough"). IOW: instead of avoiding volatile cast, Linus, in all honesty gcc has many more cases of suboptimal code, case of "volatile" is just one of many. Off the top of my head: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417 unsigned v; void f(unsigned A) { v = ((unsigned long long)A) * 365384439 >> (27+32); } gcc-4.1.1 -S -Os -fomit-frame-pointer t.c f: movl $365384439, %eax mull 4(%esp) movl %edx, %eax <===== ? shrl $27, %eax movl %eax, v ret Why is it moving %edx to %eax? gcc-4.2.1 -S -Os -fomit-frame-pointer t.c f: movl $365384439, %eax mull 4(%esp) movl %edx, %eax <===== ? xorl %edx, %edx <===== ??! shrl $27, %eax movl %eax, v ret Progress... Now we also zero out %edx afterwards for no apparent reason. -- vda -
I would agree that fixing the compiler in this case would be a good thing, even quite regardless of any "atomic_read()" discussion. I just have a strong suspicion that "volatile" performance is so low down the list of any C compiler persons interest, that it's never going to happen. And quite frankly, I cannot blame the gcc guys for it. That's especially as "volatile" really isn't a very good feature of the C language, and is likely to get *less* interesting rather than more (as user space starts to be more and more threaded, "volatile" gets less and less useful. [ Ie, currently, I think you can validly use "volatile" in a "sigatomic_t" kind of way, where there is a single thread, but with asynchronous events. In that kind of situation, I think it's probably useful. But once you get multiple threads, it gets pointless. Sure: you could use "volatile" together with something like Dekker's or Peterson's algorithm that doesn't depend on cache coherency (that's basically what the C "volatile" keyword approximates: not atomic accesses, but *uncached* accesses! But let's face it, that's way past insane. ] So I wouldn't expect "volatile" to ever really generate better code. It might happen as a side effect of other improvements (eg, I might hope that the SSA work would eventually lead to gcc having a much better defined model of valid optimizations, and maybe better code generation for volatile accesses fall out cleanly out of that), but in the end, it's such an ugly special case in C, and so seldom used, that I wouldn't depend on Well, the thing is, quite often, many of those "suboptimal code" generations fall into two distinct classes: - complex C code. I can't really blame the compiler too much for this. Some things are *hard* to optimize, and for various scalability reasons, you often end up having limits in the compiler where it doesn't even _try_ doing certain optimizations if you have excessive ...
Yep. The initial reaction was in fact to close my bug as a duplicate of 3506. But I was not asking for atomicity, but rather for smaller code to be generated, so I reopened it. Thanx, Paul -
So, if we want consistent behavior, we're pretty much screwed unless we use inline assembler everywhere? -- Chris -
Nah, this whole argument is flawed -- "without volatile" we still
*cannot* "increment the memory directly". On x86, you need a lock
prefix; on other archs, some other mechanism to make the memory
increment an *atomic* memory increment.
And no, RMW on MMIO isn't "problematic" at all, either.
An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode. But three actual operations.
The advantages of asm code for atomic_{read,set} are:
1) all the other atomic ops are implemented that way already;
2) you have full control over the asm insns selected, in particular,
you can guarantee you *do* get an atomic op;
3) you don't need to use "volatile <data>" which generates
not-all-that-good code on archs like x86, and we want to get rid
of it anyway since it is problematic in many ways;
4) you don't need to use *(volatile <type>*)&<data>, which a) doesn't
exist in C; b) isn't documented or supported in GCC; c) has a recent
history of bugginess; d) _still uses volatile objects_; e) _still_
is problematic in almost all those same ways as in 3);
5) you can mix atomic and non-atomic accesses to the atomic_t, which
you cannot with the other alternatives.
The only disadvantage I know of is potentially slightly worse
instruction scheduling. This is a generic asm() problem: GCC
cannot see what actual insns are inside the asm() block.
Segher
-
Maybe for some CPUs, but not all. ARM for instance can't use the load exclusive and store exclusive instructions to MMIO space. This means placing atomic_t or bitops into MMIO space is a definite no-go on ARM. It breaks. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: -
Sure, your CPU doesn't have RMW instructions -- how to emulate those if you don't have them is a totally different thing. Segher -
I thought that ARM's load exclusive and store exclusive instructions were its equivalent of LL and SC, which RISC machines typically use to build atomic sequences of instructions -- and which normally cannot be applied to MMIO space. Thanx, Paul -
Absolutely correct. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: -
Let me say it more clearly: On ARM, it is impossible to perform atomic operations on MMIO space. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: -
Actually, no one is suggesting that we try to do that at all. The discussion about RMW ops on MMIO space started with a comment attributed to the gcc developers that one reason why gcc on x86 doesn't use instructions that do RMW ops on volatile variables is that volatile is used to mark MMIO addresses, and there was some uncertainty about whether (non-atomic) RMW ops on x86 could be used on MMIO. This is in regard to the question about why gcc on x86 always moves a volatile variable into a register before doing anything to it. So the whole discussion is irrelevant to ARM, PowerPC and any other architecture except x86[-64]. Paul. -
It's even irrelevant on x86 because all modifying operations on atomic_t are coded in inline assembler and will always be RMW no matter if atomic_t is volatile or not. [ignoring atomic_set(x, atomic_read(x) + 1) which nobody should do] The only issue is if atomic_t should have a implicit barrier or not. My personal opinion is yes -- better safe than sorry. And any code impact it may have is typically dwarved by the next cache miss anyways, so it doesn't matter much. -Andi -
This question is GCC PR33102, which was incorrectly closed as a duplicate of PR3506 -- and *that* PR was closed because its reporter seemed to claim the GCC generated code for an increment on a volatile (namely, three machine instructions: load, modify, store) was incorrect, and it has to And even there, it's not something the kernel can take advantage of before GCC 4.4 is in widespread use, if then. Let's move on. Segher -
I agree that instant gratification is hard to come by when synching up compiler and kernel versions. Nonetheless, it should be possible to create APIs that are are conditioned on the compiler version. Thanx, Paul -
We've tried that, sort of. See the mess surrounding the whole extern/static/inline/__whatever boondogle, which seems to have changed semantics in every single gcc release since 2.95 or so. And recently mention was made that gcc4.4 will have *new* semantics in this area. Yee. Hah.
There is exactly one semantics change in gcc in this area, and that is
the change of the "extern inline" semantics in gcc 4.3 to the
C99 semantics.
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
-
It's all completely beside the point, see the other subthread, but... Yeah, you can't do LL/SC to MMIO space; ARM isn't alone in that. You could still implement atomic operations on MMIO space by taking a lock elsewhere, in normal cacheable memory space. Why you would do this is a separate question, you probably don't want it :-) Segher -
One of the things that "volatile" generally screws up is a simple
volatile int i;
i++;
which a compiler will generally get horribly, horribly wrong.
In a reasonable world, gcc should just make that be (on x86)
addl $1,i(%rip)
on x86-64, which is indeed what it does without the volatile. But with the
volatile, the compiler gets really nervous, and doesn't dare do it in one
instruction, and thus generates crap like
movl i(%rip), %eax
addl $1, %eax
movl %eax, i(%rip)
instead. For no good reason, except that "volatile" just doesn't have any
good/clear semantics for the compiler, so most compilers will just make it
be "I will not touch this access in any way, shape, or form". Including
even trivially correct instruction optimization/combination.
This is one of the reasons why we should never use "volatile". It
pessimises code generation for no good reason - just because compilers
don't know what the heck it even means!
Now, people don't do "i++" on atomics (you'd use "atomic_inc()" for that),
but people *do* do things like
if (atomic_read(..) <= 1)
..
On ppc, things like that probably don't much matter. But on x86, it makes
a *huge* difference whether you do
movl i(%rip),%eax
cmpl $1,%eax
or if you can just use the value directly for the operation, like this:
cmpl $1,i(%rip)
which is again a totally obvious and totally safe optimization, but is
(again) something that gcc doesn't dare do, since "i" is volatile.
In other words: "volatile" is a horribly horribly bad way of doing things,
because it generates *worse*code*, for no good reason. You just don't see
it on powerpc, because it's already a load-store architecture, so there is
no "good code" for doing direct-to-memory operations.
Linus
-
Blech. Sounds like a chat with some gcc people is in order. Will see what I can do. -
Hi Linus,
[ and others; I think there's a communication gap in a lot of this
thread, and a little summary would be useful. Hence this posting. ]
True, and I bet *everybody* on this list has already agreed for a very
long time that using "volatile" to type-qualify the _declaration_ of an
object itself as being horribly bad (taste-wise, code-generation-wise,
often even buggy for sitations where real CPU barriers should've been
used instead).
However, the discussion on this thread (IIRC) began with only "giving
volatility semantics" to atomic ops. Now that is different, and may not
require the use the "volatile" keyword (at least not in the declaration
of the object) itself.
Sadly, most arch's *still* do type-qualify the declaration of the
"counter" member of atomic_t as "volatile". This is probably a historic
hangover, and I suspect not yet rectified because of lethargy.
Anyway, some of the variants I can think of are:
[1]
#define atomic_read_volatile(v) \
({ \
forget((v)->counter); \
((v)->counter); \
})
where:
#define forget(a) __asm__ __volatile__ ("" :"=m" (a) :"m" (a))
[ This is exactly equivalent to using "+m" in the constraints, as recently
explained on a GCC list somewhere, in response to the patch in my bitops
series a few weeks back where I thought "+m" was bogus. ]
[2]
#define atomic_read_volatile(v) (*(volatile int *)&(v)->counter)
This is something that does work. It has reasonably good semantics
guaranteed by the C standard in conjunction with how GCC currently
behaves (and how it has behaved for all supported versions). I haven't
checked if generates much different code than the first variant above,
(it probably will generate similar code to just declaring the object
as volatile, but would still be better in terms of code-clarity and
taste, IMHO), but in any case, we should pick whichever of these variants
works for us and generates good code.
[3]
static inline int atomic_read_volatile(atomic_t ...*vomit* :) Not only do I hate the keyword volatile, but the barrier is only a one-sided affair so its probable this is going to have slightly different allowed reorderings than a real volatile access. Also, why would you want to make these insane accessors for atomic_t types? Just make sure everybody knows the basics of barriers, and they can apply that knowledge to atomic_t and all other lockless memory I like order(x) better, but it's not the most perfect name either. -- SUSE Labs, Novell Inc. -
I wonder if this'll generate smaller and better code than _both_ the
other atomic_read_volatile() variants. Would need to build allyesconfig
Code that looks like:
while (!atomic_read(&v)) {
...
cpu_relax_no_barrier();
forget(v.counter);
^^^^^^^^
}
would be uglier. Also think about code such as:
a = atomic_read();
if (!a)
do_something();
forget();
a = atomic_read();
... /* some code that depends on value of a, obviously */
forget();
a = atomic_read();
...
So much explicit sprinkling of "forget()" looks ugly.
atomic_read_volatile()
on the other hand, looks neater. The "_volatile()" suffix makes it also
no less explicit than an explicit barrier-like macro that this primitive
forget(x) is just a stupid-placeholder-for-a-better-name. order(x) sounds
good but we could leave quibbling about function or macro names for later,
this thread is noisy as it is :-)
-
I think they would both be equally ugly, but the atomic_read_volatile variant would be more prone to subtle bugs because of the weird implementation. And it would be more ugly than introducing an order(x) statement for all memory operations, and adding an order_atomic() wrapper for it Firstly, why is it ugly? It's nice because of those nice explicit statements there that give us a good heads up and would have some comments attached to them (also, lack of the word "volatile" is always a plus). Secondly, what sort of code would do such a thing? In most cases, it is probably riddled with bugs anyway (unless it is doing a really specific sequence of interrupts or something, but in that case it is very likely to either require locking or busy waits Just don't use the word volatile, and have barriers both before and after the memory operation, and I'm OK with it. I don't see the point though, when you could just have a single barrier(x) barrier function defined for all memory locations, rather than this odd thing that only works for atomics (and would have to be duplicated for atomic_set. -
You think both these are equivalent in terms of "looks":
|
while (!atomic_read(&v)) { | while (!atomic_read_xxx(&v)) {
... | ...
cpu_relax_no_barrier(); | cpu_relax_no_barrier();
order_atomic(&v); | }
} |
(where order_atomic() is an atomic_t
specific wrapper as you mentioned below)
?
Oh, that order() / forget() macro [forget() was named such by Chuck Ebbert
earlier in this thread where he first mentioned it, btw] could definitely
atomic_read_xxx (where xxx = whatever naming sounds nice to you) would
obviously also give a heads up, and could also have some comments
See the nodemgr_host_thread() that does something similar, though not
That sounds amazingly frivolous, but hey, why not. As I said, ok,
How could that lead to bugs? (if you can point to existing code,
Why would it work only for atomics? You could use that generic macro
#define atomic_set_xxx for something similar. Big deal ... NOT.
-
I think the LHS is better if your atomic_read_xxx primitive is using the crazy one-sided barrier, because the LHS code you immediately know what barriers are happening, and with the RHS you have to look at the atomic_read_xxx definition. If your atomic_read_xxx implementation was more intuitive, then both are You can't think for yourself? Your atomic_read_volatile contains a compiler barrier to the atomic variable before the load. 2 such reads from different locations look like this: asm volatile("" : "+m" (v1)); atomic_read(&v1); asm volatile("" : "+m" (v2)); atomic_read(&v2); Which implies that the load of v1 can be reordered to occur after the load I'm sorry, all this waffling about made up code which might do this and that is just a waste of time. Seriously, the thread is bloated enough and never going to get anywhere with all this handwaving. If someone is saving up all the really real and actually good arguments for why we Typo. I meant: defined for a single memory location (ie. order(x)). -
^^^^^ No. As I said, the _xxx (whatever the heck you want to name it as) should First, you could try looking at the code. And by the way, as I've already said (why do *require* people to have to repeat things to you?) this isn't even about only existing code. -
I'm sure it would be a tiny effect. This whole thread is arguing about effects that are quite insignificant. On the one hand we have the non-volatile proponents, who want to let the compiler do extra optimizations - which amounts to letting it elide maybe a dozen loads in the whole kernel, loads which would almost always be L1 cache hits. On the other hand we have the volatile proponents, who are concerned that some code somewhere in the kernel might be buggy without the volatile behaviour, and who also want to be able to remove some barriers and thus save a few bytes of code and a few loads here and there (and possibly some stores too). Either way the effect on code size and execution time is miniscule. In the end the strongest argument is actually that gcc generates unnecessarily verbose code on x86[-64] for volatile accesses. Even then we're only talking about ~2000 bytes, or less than 1 byte per instance of atomic_read on average, about 0.06% of the kernel text size. The x86[-64] developers seem to be willing to bear the debugging cost involved in having the non-volatile behaviour for atomic_read, in order to save the 2kB. That's fine with me. Either way I think somebody should audit all the uses of atomic_read, not just for missing barriers, but also to find the places where it's used in a racy manner. Then we can work out where the races matter and fix them if they do. Paul. -
Hmm, the fact that this thread became what it did, probably means that most developers on this list do not mind thinking/arguing about effects or optimizations that are otherwise "tiny". But yeah, they are tiny nonetheless. -
[It wasn't explained on a GCC list in response to your patch, as far as I can see -- if I missed it, please point me to an archived version of it]. One last time: it isn't equivalent on older (but still supported by Linux) versions of GCC. Current versions of GCC allow it, but it has no documented behaviour at all, so use it at your own risk. Segher -
http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01758.html is a follow-up in the thread on the gcc-patches@gcc.gnu.org mailing list, which began with: http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01677.html that was posted by Jan Kubicka, as he quotes in that initial posting, after I had submitted: http://lkml.org/lkml/2007/7/23/252 which was a (wrong) patch to "rectify" what I thought was the "bogus" "+m" constraint, as per the quoted extract from gcc docs (that was given in that (wrong) patch's changelog). That's when _I_ came to know how GCC interprets "+m", but probably this has been explained on those lists multiple times. Who cares, True. -
I just couldn't find the thread you meant, I thought I missed have it, that's all :-) Segher -
Apart from having to fetch more bytes for the instructions (which does
matter), execution time is probably the same on modern processors, as they
convert the single instruction to RISC-style load, modify, store anyway.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
-
But for atomic_t people use atomic_inc() anyways which does this correctly. It shouldn't really matter for atomic_t. I'm worrying a bit that the volatile atomic_t change caused subtle code breakage like these delay read loops people here pointed out. Wouldn't it be safer to just re-add the volatile to atomic_read() for 2.6.23? Or alternatively make it asm(), but volatile seems more proven. -Andi -
The problem with volatile is not just trashy code generation (which also definitely is a major problem), but definition holes, and implementation inconsistencies. Making it asm() is not the only other alternative to volatile either (read another reply to this mail), but considering most of the thread has been about people not wanting even a atomic_read_volatile() variant, making atomic_read() itself have volatile semantics sounds ... strange :-) PS: http://lkml.org/lkml/2007/8/15/407 was submitted a couple days back, any word if you saw that? I have another one for you: [PATCH] i386, x86_64: __const_udelay() should not be marked inline Because it can never get inlined in any callsite (each translation unit is compiled separately for the kernel and so the implementation of __const_udelay() would be invisible to all other callsites). In fact it turns out, the correctness of callsites at arch/x86_64/kernel/crash.c:97 and arch/i386/kernel/crash.c:101 explicitly _depends_ upon it not being inlined, and also it's an exported symbol (modules may want to call mdelay() and udelay() that often becomes __const_udelay() after some macro-ing in various headers). So let's not mark it as "inline" either. Signed-off-by: Satyam Sharma <satyam@infradead.org> --- arch/i386/lib/delay.c | 2 +- arch/x86_64/lib/delay.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/i386/lib/delay.c b/arch/i386/lib/delay.c index f6edb11..0082c99 100644 --- a/arch/i386/lib/delay.c +++ b/arch/i386/lib/delay.c @@ -74,7 +74,7 @@ void __delay(unsigned long loops) delay_fn(loops); } -inline void __const_udelay(unsigned long xloops) +void __const_udelay(unsigned long xloops) { int d0; diff --git a/arch/x86_64/lib/delay.c b/arch/x86_64/lib/delay.c index 2dbebd3..d0cd9cd 100644 --- a/arch/x86_64/lib/delay.c +++ b/arch/x86_64/lib/delay.c @@ -38,7 +38,7 @@ void __delay(unsigned long loops) } EXPORT_SYMBOL(__delay); -inline void ...
It's just a (target-specific, perhaps) missed-optimisation kind Just nobody taught it it can do this; perhaps no one wanted to add optimisations like that, maybe with a reasoning like "people who hit the go-slow-in-unspecified-ways button should get what they deserve" ;-) Segher -
My point is that people *already* made those assumptions. There
are two ways to clean up this mess:
1) Have the "volatile" semantics by default, change the users
that don't need it;
2) Have "non-volatile" semantics by default, change the users
that do need it.
Option 2) randomly breaks stuff all over the place, option 1)
doesn't. Yeah 1) could cause some extremely minor speed or
code size regression, but only temporarily until everything has
Sure. We aren't _adding_ anything here though, not on the platforms
I'm not dismissing that. x86 however is one of the few architectures
where mistakenly leaving out a "volatile" will not easily show up on
user testing, since the compiler will very often produce a memory
That's a good point. Maybe we should create _two_ new APIs, one
Not really I think, no. But not the other way around, either.
Me? Not much.
Segher
-
