2.6.32-stable review patch. If anyone has any objections, please let us know. ------------------ From: Glauber Costa <glommer@redhat.com> In recent stress tests, it was found that pvclock-based systems could seriously warp in smp systems. Using ingo's time-warp-test.c, I could trigger a scenario as bad as 1.5mi warps a minute in some systems. (to be fair, it wasn't that bad in most of them). Investigating further, I found out that such warps were caused by the very offset-based calculation pvclock is based on. This happens even on some machines that report constant_tsc in its tsc flags, specially on multi-socket ones. Two reads of the same kernel timestamp at approx the same time, will likely have tsc timestamped in different occasions too. This means the delta we calculate is unpredictable at best, and can probably be smaller in a cpu that is legitimately reading clock in a forward ocasion. Some adjustments on the host could make this window less likely to happen, but still, it pretty much poses as an intrinsic problem of the mechanism. A while ago, I though about using a shared variable anyway, to hold clock last state, but gave up due to the high contention locking was likely to introduce, possibly rendering the thing useless on big machines. I argue, however, that locking is not necessary. We do a read-and-return sequence in pvclock, and between read and return, the global value can have changed. However, it can only have changed by means of an addition of a positive value. So if we detected that our clock timestamp is less than the current global, we know that we need to return a higher one, even though it is not exactly the one we compared to. OTOH, if we detect we're greater than the current time source, we atomically replace the value with our new readings. This do causes contention on big boxes (but big here means *BIG*), but it seems like a good trade off, since it provide us with a time source guaranteed to be stable wrt time warps. After this ...
Hey, 2.6.32.16 fails to boot on my KVM domains using qemu-kvm 0.11.1. Bisecting between 2.6.32.14 which worked and .16 turned up this commit as the first culprit[0]. The host is still running 2.6.32.14 and has 8 cores on 2 CPUs. The single-cpu KVM domain hangs just after printing 'Write protecting the kernel read-only data: 9492k'[1]. On a successful boot this line would usually be followed by 'INIT: version 2.86 booting'. A 2.6.32.16 with this patch reverted boots fine. If there's any info you need please just ask. Cheers, Peter 0. http://asteria.noreply.org/~weasel/volatile/2010-07-07-x9KxN34l17c/fileztnjyZ 1. http://asteria.noreply.org/~weasel/volatile/2010-07-07-VTRuAQGOKlY/zoe-2.6.32.16.png 2. http://asteria.noreply.org/~weasel/volatile/2010-07-07-Fq0PVc1ecsc/config-2.6.32.16-ds... -- | .''`. ** Debian GNU/Linux ** Peter Palfrader | : :' : The universal http://www.palfrader.org/ | `. `' Operating System | `- http://www.debian.org/ --
if you boot with another clocksource, and then switch to kvmclock with the machine already running, do you see anything strange or suspicious? --
Booting with various clocksource=xxx kernel parameters does not change
the behaviour at all, i.e. the boot still hangs.
Cheers,
Peter
--
| .''`. ** Debian GNU/Linux **
Peter Palfrader | : :' : The universal
http://www.palfrader.org/ | `. `' Operating System
| `- http://www.debian.org/
--
And what if you provide -cpu qemu64,-kvmclock to qemu command line? -- Gleb. --
Adding that to the glob of options that already were there from libvirt didn't disable it, but using an LD_PRELOAD wrapper[1] to that purpose on the host when starting kvm did help. Now, with kvmclock no longer being available at all the system picks tsc and indeed boots 2.6.32.16 successfully. - Peter 1. http://people.debian.org/~paravoid/kvm-noclock-3.tar.gz -- | .''`. ** Debian GNU/Linux ** Peter Palfrader | : :' : The universal http://www.palfrader.org/ | `. `' Operating System | `- http://www.debian.org/ --
Strange. -kvmclock should have had the same effect. What qemu is it? -- Gleb. --
| QEMU PC emulator version 0.11.1 (qemu-kvm-0.11.1), Copyright (c) 2003-2008 Fabrice Bellard
from the debian package qemu-kvm (0.11.1+dfsg-1~bpo50+1)
/usr/bin/kvm -S -M pc-0.11 -enable-kvm -m 512 -smp 1 -name zoe -uuid 1885e784-c831-4ef8-9576-8eaa9abf3a8b -monitor unix:/var/lib/libvirt/qemu/zoe.monitor,server,nowait -boot c -drive file=/dev/vg_sookie_system/zoe-boot,if=ide,bus=0,unit=0,boot=on -drive file=/dev/vg_sookie_system/zoe-root,if=virtio -drive file=/dev/vg_sookie_system/zoe-swap,if=virtio -net nic,macaddr=00:16:36:40:00:0b,vlan=0,model=virtio,name=net0 -net tap,fd=25,vlan=0,name=hostnet0 -net nic,macaddr=00:16:36:40:05:0b,vlan=1,model=virtio,name=net1 -net tap,fd=28,vlan=1,name=hostnet1 -serial none -parallel none -usb -vnc 127.0.0.1:10 -vga cirrus -balloon virtio
And I had added "-cpu qemu64,-kvmclock" to that at either the front or the end.
Cheers,
[should we trim the CC list? If yes, to what?]
--
| .''`. ** Debian GNU/Linux **
Peter Palfrader | : :' : The universal
http://www.palfrader.org/ | `. `' Operating System
| `- http://www.debian.org/
--
wow, it is really weird then. that patch shouldn't affect anything outside the pvclock realm. --
Unless you added data which is mistakenly in read-only section, I can't see how it would affect anything either. Of course, you have changed the data and text size, it is possible this triggered another bug. What exact section does the per-cpu pvclock data fall into? It's read-only in the kernel, but writeable from the hypervisor. Also, did this patch arrive before or after the pvclock reboot bugfix? Because if the hypervisor is still writing the pvclock page, clocksource=xxx would not change that behavior without that fix. Zach --
Yes, but at least this patch alone, does not touch anything on that it is not explicitly marked read only in the kernel, afaik. Dunno in which order they got applied in stable. But from what I understood, this is not a reboot scenario. It happens on first boot, right? --
I'm unable to reproduce. Can you provide the stack trace where this hangs? Add CONFIG_DEBUG_INFO, and then - copy vmlinux somewhere on the host filesystem - start the guest - when it hangs, type 'gdbserver' in the qemu monitor - on the host, the 'gdb /path/to/vmlinux' - 'target remote :1234' - 'backtrace' -- error compiling committee.c: too many arguments to function --
Sorry for the delay. Here goes: | (gdb) target remote :1234 | Remote debugging using :1234 | [New Thread 1] | 0xffffffff81702314 in _spin_lock (lock=0xffffffff81ab9e30) | at /scratch/kernel/2.6.32.16/arch/x86/include/asm/spinlock.h:65 | 65 /scratch/kernel/2.6.32.16/arch/x86/include/asm/spinlock.h: No such file or directory. | in /scratch/kernel/2.6.32.16/arch/x86/include/asm/spinlock.h | (gdb) bt | #0 0xffffffff81702314 in _spin_lock (lock=0xffffffff81ab9e30) | at /scratch/kernel/2.6.32.16/arch/x86/include/asm/spinlock.h:65 | #1 0xffffffff8107b214 in vprintk ( | fmt=0xffffffff818a3ee8 "<1>BUG: unable to handle kernel ", | args=0xffff88001f8f2c38) at kernel/printk.c:705 | #2 0xffffffff816ff387 in printk (fmt=0xffffffff81ab9e30 "") | at kernel/printk.c:595 | #3 0xffffffff8105b94a in no_context (regs=0xffff88001f8f2e98, error_code=3, | address=18446744071586262656) at arch/x86/mm/fault.c:583 | #4 0xffffffff8105bb65 in __bad_area_nosemaphore (regs=0xffff88001f8f2e98, | error_code=3, address=18446744071586262656, si_code=196609) | at arch/x86/mm/fault.c:741 | #5 0xffffffff8105bc4e in bad_area_nosemaphore (regs=0xffffffff81ab9e30, | error_code=18446612132843695160, address=4294892530) | at arch/x86/mm/fault.c:748 | #6 0xffffffff8105c02f in do_page_fault (regs=0xffff88001f8f2e98, error_code=3) | at arch/x86/mm/fault.c:1061 | #7 0xffffffff81702765 in page_fault () | #8 0x000d3496569e13a8 in ?? () | #9 0x00000000b4b2c8e7 in ?? () | #10 0x00000003ede5a5a5 in ?? () | #11 0x0000000000000001 in per_cpu__irq_stack_union () | #12 0xffff880001c11e80 in ?? () | ---Type <return> to continue, or q <return> to quit--- | #13 0x0000000000000000 in ?? () | (gdb) | (gdb) | (gdb) | (gdb) bt full | #0 0xffffffff81702314 in _spin_lock (lock=0xffffffff81ab9e30) | at /scratch/kernel/2.6.32.16/arch/x86/include/asm/spinlock.h:65 | No locals. | #1 0xffffffff8107b214 in vprintk ( | fmt=0xffffffff818a3ee8 ...
So it looks like last_value was placed in a read only section. Please post your System.map somewhere. -- error compiling committee.c: too many arguments to function --
weasel@intrepid:~$ publish System.map http://asteria.noreply.org/~weasel/volatile/2010-07-13-mbm2xEdd8Q4/System.map weasel@intrepid:~$ grep -i last_value System.map ffffffff81712e80 r last_value ffffffff81b05240 b last_value.26163 Cheers, Peter -- | .''`. ** Debian GNU/Linux ** Peter Palfrader | : :' : The universal http://www.palfrader.org/ | `. `' Operating System | `- http://www.debian.org/ --
"r" = "read only" How does it look in 'nm arch/x86/kernel/pvclock.o'? --
The same:
[git|v2.6.32.16] weasel@thelma:/scratch/kernel/2.6.32.16$ nm arch/x86/kernel/pvclock.o
0000000000000000 r last_value
U native_read_tsc
0000000000000040 T pvclock_clocksource_read
0000000000000100 T pvclock_read_wallclock
0000000000000000 T pvclock_tsc_khz
U set_normalized_timespec
--
| .''`. ** Debian GNU/Linux **
Peter Palfrader | : :' : The universal
http://www.palfrader.org/ | `. `' Operating System
| `- http://www.debian.org/
--
Doesn't make any sense. Let's try to see if the toolchain is confused: - rename last_value to some random name - drop the 'static' qualifier - drop the '= ATOMIC64_INIT(0)' - all of the above (better start with the last). No need to rebuild everything, just look at the output of nm pvclock.o. My bet is that dropping 'static' will fix it. We may have the wrong constraints on atomic64_cmpxchg64(), so the compiler thinks we never change last_value. -- error compiling committee.c: too many arguments to function --
The constraints are there, but maybe the toolchain is confused. -- error compiling committee.c: too many arguments to function --
No they aren't, as Linus just pointed out. His patch should fix the problem. -- error compiling committee.c: too many arguments to function --
Linus's patch touches __xchg() whereas we're using __cmpxchg() in this
particular case I think.
At least, applying it to my 2.6.32.16 tree didn't help, last_value was
still read-only. Or I backported it wrong.
--
| .''`. ** Debian GNU/Linux **
Peter Palfrader | : :' : The universal
http://www.palfrader.org/ | `. `' Operating System
| `- http://www.debian.org/
--
No, you didn't back-port it wrong. I just didn't fix the right part. I
thought the PV code used xchg, not cmpxchg, so I only patched that.
But cmpxchg has the exact same issue.
Does this fix it?
Again: UNTESTED.
Linus
On Tue, Jul 13, 2010 at 10:50 AM, Linus Torvalds
Btw, this second patch was a bit more aggressive than the first one,
and actually removes the "memory" clobber entirely, and the fake cast
of the target type.
That shouldn't matter _except_ if people actually use cmpxchg to
implement their own locking, since now the compiler could potentially
move unrelated memory references around the lock. Of course, if you
use cmpxchg to implement your own locking, you're probably doing
something wrong anyway (ie you'll get the wrong memory barriers on
various architectures), so it should all be fine.
But I thought I'd mention it. And I don't really know how much gcc
moves memory accesses around a "asm volatile" - the gcc docs are
historically very vague ("volatile asms aren't moved around
'significantly'", whatever 'significant' means)
And btw, none of the above should be taken to mean that I have tested
the patch or found it to be otherwise good. It might be totally broken
for other reasons. Caveat emptor.
Linus
--
There are some places which rely on xchg/cmpxchg being a barrier in
arch-specific code. For example, the Xen code uses as part of the
"asm volatile"'s only real meaning is that it will not get elided if it
appears its output is unused (assuming it is reachable at all). I don't
think you can consider it having any meaningful effects on ordering.
J
--
Actually, I believe volatile operations (including asm volatile) are strictly ordered *with respect to other volatile operations*. As such I would think we'd want to keep the "memory" clobber here, to make it strictly ordered with regards to *all* memory operations. As for the concept in this patch, it's obviously the right thing, but as Linus said elsewhere it's incomplete. Conceptually: Acked-by: H. Peter Anvin <hpa@zytor.com> Unfortunately I can't write the productization patch right now since I have a plane to catch in about an hour, but if noone beats me to it I'll to it Thursday. -hpa --
The documentation makes no reference to that property; in fact it
suggests it is outright not true:
Note that even a volatile `asm' instruction can be moved relative to
other
code, including across jump instructions. For example, on many targets
there is a system register which can be set to control the rounding
mode of floating point operations. You might try setting it with a
volatile `asm', like this PowerPC example:
asm volatile("mtfsf 255,%0" : : "f" (fpenv));
sum = x + y;
This will not work reliably, as the compiler may move the addition
back before the volatile `asm'. To make it work you need to add an
artificial dependency to the `asm' referencing a variable in the
code you don't want moved, for example:
asm volatile ("mtfsf 255,%1" : "=X"(sum): "f"(fpenv));
sum = x + y;
Similarly, you can't expect a sequence of volatile `asm'
instructions to remain perfectly consecutive.
[...]
An `asm' instruction without any output operands will be treated
That would keep its overall effect consistent.
J
--
The gcc documentation wrt inline asm's is totally worthless. Don't
even bother quoting it - because the gcc people themselves have never
cared. If the docs ever end up not matching what they want to do, they
will just change the documentation.
In other words, at least historically the docs are not in any way
meaningful. They are not a "these are the semantics we guarantee",
they are just random noise. As I mentioned, the docs historically just
said something like "will not be moved significantly", and apparently
they've been changed to be something else.
The only thing that has ever been meaningful is "this works". And, of
course, that has changed over time too, including actual
recommendations on how to make something work (as mentioned, iirc "+"
was only valid on register constraints, and "+m" used to not be
allowed _or_ recommended, these days it's the only way to do certain
things).
It's irritating, because in other circumstances, gcc people take the
reverse approach, and consider paper documentation more important than
actual implementation issues. So sometimes they say "hey, this is the
spec", but when it comes to their own docs, the answer has
historically been "we'll just change the spec".
Linus
--
Sure, I completely agree. At the moment the docs say "asm volatile
guarantees nothing", and we can work with that. So long as we don't
expect asm volatile to mean anything more (ie, magic semantics involving
reordering), everyone is happy.
BTW, gcc 2.95's docs do mention "asm volatile" having an effect on
ordering, which is probably where the notion came from: "If you write an
`asm' instruction with no outputs, GNU CC [...] not delete the
instruction or move it outside of loops. [...] you should write the
`volatile' keyword to prevent future versions of GNU CC from moving the
instruction around within a core region". Lucky we never relied on
that, right? Right?
J
--
If gcc ever starts reordering volatile operations, including "asm volatile", the kernel will break, and will be unfixable. Just about every single driver will break. All over the kernel we're explicitly or implicitly making the assumption that volatile operations are strictly ordered by the compiler with respect to each other. -hpa --
Can you give an example? All the cases I've seen rely on the ordering
properties of "memory" clobbers, which is sound. (And volatile
variables are a completely unrelated issue, of course.)
J
--
Yes, it looks like they should have memory barriers if we want them to
be ordered with respect to normal writes; afaict "asm volatile" has
never had strict ordering wrt memory ops.
Anything else?
J
--
[Adding H.J. to the Cc: list] Noone has talked about strict ordering between volatiles and (non-volatile) memory ops in general. I have been talking about volatile to volatile ordering, and I thought I'd been very clear about that. H.J., we're having a debate about the actual semantics of "volatile", especially "asm volatile" in gcc. In particular, I believe that volatile operations should not be possible to reorder with regards to each other, and the kernel depends on that fact. -hpa P.S: gcc 4.4 seems to handle "const volatile" incorrectly, probably by applying CSE to those values. --
I think we should consider that deprecated and rely on dependencies and
clobbers.
J
--
I don't think that's practical in general. If the compiler is *that broken*, I don't see how it is usable at all. As Linus indicated, I don't think we can assume the gcc documentation to accurately reflect the intent of the gcc team, mostly because I think it often lags way behind what they're thinking. -hpa --
Well, over the years, gcc has explicitly changed the docs to weaken the
meaning of "asm volatile" from "not being moved ``significantly''" to
"no real guarantees about movement at all".
I don't see why its so broken. There are lots of mechanisms to control
asm ordering; we don't need "asm volatile" as well. But we do need it
to mean "still emit an apparently dead asm with no outputs (or unused
outputs)".
They can still be omitted if the whole basic block is dead code, and
they can be duplicated by things like inlining and loop unrolling. But
presumably they can never be evaluated more times than the source code
says they should be, so to that extent they're like volatile variables.
I think the use of the "volatile" keyword here is a red herring. Just
because the gcc devs decided to use it as a qualifier for asm statements
doesn't mean that one should read anything other than a vague, vague
relationship to volatile variables (and extra-vague given how weakly
I can get not trusting gcc to do what the documentation says it should
do, but relying on it to do things that the documentation definitely
says it doesn't seems foolhardy.
I'm finding it a bit surreal to be arguing on the side of "don't trust
gcc" vs you and Linus on the "gcc developers are arbitrary, capricious
and untrustworthy, but we can rely on this piece of undocumented
behaviour" side.
J
--
That makes no sense. According to that logic, "asm volatile" has no
semantic meaning at ALL. That's just crazy talk.
The sane compiler semantics for "asm volatile" is that it acts as a
volatile memory access. That's what the naming implies, and it has
valid semantics that also happen to match the historical semantics. It
means that it cannot be removed or duplicated, and it cannot be
re-ordered wrt other volatile accesses (whether "asm volatile" or a
traditional C volatile memory access).
I agree that we could just add memory clobbers to them all, but my
objection to that is that it just makes the whole keyword totally
pointless.
Linus
--
There are some discussions on: http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02001.html http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00001.html Are they related? -- H.J. --
Not directly as far as I can tell. The issue is if gcc can ever reorder, duplicate or elide a volatile operation (either asm volatile or a volatile-annotated memory reference.) In my (and Linus') opinion, this would be an incredibly serious bug. The gcc 4.4 issue, which is separate from the definition issue, is that it seems to assume that it can elide references to "const volatile" objects. "const volatile" should mean a value that could change at any time but which is a bug to write to -- for example a readonly hardware register. -hpa --
Are you asking for a bug report against the documentation? We're not sure what the semantics intended by the gcc team to be, which I guess is a documentation bug. -hpa --
Documentation bug is also a bug :-). -- H.J. --
The question is "what are the real ordering semantics of asm volatile"?
What ordering is enforced between other asm volatiles? What ordering is
enforced between asm volatiles and regular memory accesses? asm volatile
and other code?
The documentation discusses this to some extent, but mostly says "there
are no ordering guarantees". Older versions of gcc - 2.95, for example
- are more explicit, saying that "asm volatiles" won't be moved out of
their basic block (I think that's how I parse it, anyway).
Linux relies on "asm volatile" being ordered at least with respect to
other asm volatiles. Is this reasonable now? Will gcc break this at
some point in the future? If we can't rely on "asm volatile" ordering
semantics, what other mechanism can we use (for example, to order clts
with respect to FPU-using code)?
Thanks,
J
--
I looked over the text, and I think you're confusing static reordering
and duplication with execution reordering (gcc is indeed free to move
around and even replicate asm volatile statements). One thing is that
the docs makes it perfectly clear that asm volatile is not ordered with
respect to *non*-volatile operations (*all* the examples are
volatile-nonvolatile.) It says that asm volatile statements may not be
"consecutive", i.e. gcc may generate code in between them, but all of
this is well known and understood.
The other thing that the documentation *does* specifically make clear is
that an "asm volatile" has an implicit dependency on all memory (as an
input, as opposed to an output/clobber.)
I actually found the following statement in the gcc documentation,
although it is in the section on C++ (6.1 in my version); the text,
though, makes it clear that it applies to both C and C++:
Both the C and C++ standard have the concept of volatile objects. These
are normally accessed by pointers and used for accessing hardware. The
standards encourage compilers to refrain from optimizations concerning
accesses to volatile objects. The C standard leaves it implementation
defined as to what constitutes a volatile access. The C++ standard
omits to specify this, except to say that C++ should behave in a
similar manner to C with respect to volatiles, where possible. The
minimum either standard specifies is that at a sequence point all
previous accesses to volatile objects have stabilized and no subsequent
accesses have occurred. Thus an implementation is free to reorder and
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
combine volatile accesses which occur between sequence points, but
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
cannot do so for accesses across a sequence point. The use of
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
volatiles does not allow you to violate the restriction on updating
objects ...For those that are interested: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44943 -hpa --
1. set up a mapping 2. invlpg or set cr3 3. use the mapping Moving the invlpg will break your code. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
invlpg uses memory clobbers. All the crX ops seem to use a
__force_order variable to sequence them - but it looks like it's done
precisely backwards and it's barking mad to do allow write_crX to be
reordered with respect to memory ops.
Hm, looks like glommer added it surreptitiously while unifying
system_32.h and system_64.h (system_32.h relied on asm volatile not
being reordered; system_64.h used memory clobbers).
J
--
clts() has no memory clobber; it is used to serialize execution of code within kernel_fpu_begin() / kernel_fpu_end() blocks. If the code within is reordered before the clts(), we've corrupted guest FPU state. That's the kind of bug I think Linus is talking about. We've been expecting volatile to work that way for over a decade, by my recollection, and if it doesn't, there is going to be a lot of broken code. Shouldn't we at least get a compiler switch to force the volatile behavior? I'd suggest it default to conservative. --
Hmm, well, despite that not being quite correct (if guest has used FPU, we save it, which has a memory clobber), it seems to be the case that a reordering of the clts() among the other volatile asm statements would be a very bad thing - you'd get kernel FPU exceptions. And among asm volatiles, clts() is fairly unique in not having any clobbers or dependencies at all, so it could happen. Zach --
Hm, that's awkward - you'd really need some way of specifying ordering
with respect to the FPU state, and "memory" would only be an approximate
proxy of that. And not a good one, since gcc wouldn't regard
register-only fpu ops as being affected by a memory clobber.
J
--
At this point, it looks like there is no reason to be alarmed. The documentation actually contains a statement about volatiles not being mutually reordered across sequence points, and since asm is a statement (rather than an expression) it is always surrounded by sequence points. I have filed a gcc ticket to ask for clarification. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
invlpg, in the general case, definitely needs a memory clobber even if volatiles are ordered, since it needs to be ordered with regards to non-volatile memory operations. Note that memory clobbers don't by themselves enforce ordering since they don't prevent the ordering of memory clobbers with respect to each other. -hpa --
Yes. I'd say write_crX should need that too, since they can they can
have a variety global effects on how memory addressing works (from tlb
Hm. Well we do *definitely* rely on that. I guess technically we'd
need at least a memory input on the asm to be ordered with respect to a
memory clobber. But that doesn't seem to have been an issue (or perhaps
its mixed up with the secret semantics of asm volatile).
J
--
It appears to, thanks a lot.
[git|v2.6.32.16] weasel@thelma:/scratch/kernel/2.6.32.16$ nm arch/x86/kernel/pvclock.o
0000000000000000 b last_value
..
Without the cast gcc spews a fair amount of warnings. About four every
time it hits the include file.
Just for completeness' sake I attached the patch on top of 2.6.32.16
that I built with.
--
| .''`. ** Debian GNU/Linux **
Peter Palfrader | : :' : The universal
http://www.palfrader.org/ | `. `' Operating System
| `- http://www.debian.org/
0000000000000000 B some_random_name
make -f scripts/Makefile.build obj=arch/x86/kernel arch/x86/kernel/pvclock.o
gcc -Wp,-MD,arch/x86/kernel/.pvclock.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.3.2/include -Iinclude -I/scratch/kernel/2.6.32.16/arch/x86/include -include include/linux/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -fstack-protector -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(pvclock)" -D"KBUILD_MODNAME=KBUILD_STR(pvclock)" -c -o arch/x86/kernel/pvclock.o arch/x86/kernel/pvclock.c
[git|v2.6.32.16] weasel@thelma:/scratch/kernel/2.6.32.16$ gcc --version
gcc (Debian 4.3.2-1.1) 4.3.2
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Using -O0 also results in a writeable last_value (b).
--
| .''`. ** Debian GNU/Linux **
Peter Palfrader | : :' : The universal
http://www.palfrader.org/ | `. `' Operating System
| `- http://www.debian.org/
--
I bet it is the same. And I have a suspicion: because the only write
access to that variable is in an asm that uses the "memory" clobber to
say it wrote to it (rather than say it writes to it directly), and
because the variable is marked 'static', gcc decides that nothing ever
writes to it in that compilation unit, and it can be made read-only.
Look at our definition for "xchg()" in
arch/x86/include/asm/cmpxchg_64.h. It boils down to
asm volatile("xchgq %0,%1" \
: "=r" (__x) \
: "m" (*__xg(ptr)), "0" (__x) \
: "memory"); \
for the 8-byte case (which is obviously what atomic64_xchg() uses).
And the _reason_ we do that thing where we use a memory _input_ and
then a clobber is that older versions of gcc did not accept the thing
we _want_ to use, namely using "+m" to say that we actually change the
memory. So the above is "wrong", but has historical reasons - and
it's apparently never been changed.
However, the "+m" was fixed, and we use it elsewhere, so I think the
"m" plus memory clobber is now purely historical. Does a patch
something like the appended fix it? I also suspect we should look at
some other uses in this area. The atomic64_64.h file uses "=m" and
"m", which looks like another legacy thing (again, "+m" historically
wasn't allowed, and then later became the 'correct' way to do things).
NOTE NOTE NOTE! This is UNTESTED and INCOMPLETE. We should do cmpxchg
too, and the 32-bit versions. I'm adding Ingo and Peter to the cc.
Linus
and works; I also failed to reproduce with 2.6.32.16. So I expect some toolchain involvement. Peter, what gcc are you using? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
And yes, I do believe it's likely a newer gcc that triggers it too,
otherwise we'd have seen more reports of it.
Linus
--
Did anyone get around to doing a proper fix for this? I don't see one
in tip.git or linux-2.6.git. And it needs to get back into stable...
Thanks,
J
--
I'll go ahead and write it up. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
Okay, I've stared at this issue for most of the day, and I have a fix for the immediate issue. However, I also am convinced that the whole scheme with the __xg() macro is just broken. It was introduced in 1.3.11 when xchg and cmpxchg where neither "asm volatile" nor had "memory" clobbers as an apparent way to keep gcc from moving memory references around it, but since the array type is "unsigned long" rather than a char type, it probably doesn't even work right, and with asm volatile/memory clobber it should not be necessary. At this point I'd like to push an urgent patch to fix the immediate issue, and a cleanup patch getting rid of __xg() for .36. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
Commit-ID: 113fc5a6e8c2288619ff7e8187a6f556b7e0d372 Gitweb: http://git.kernel.org/tip/113fc5a6e8c2288619ff7e8187a6f556b7e0d372 Author: H. Peter Anvin <hpa@zytor.com> AuthorDate: Tue, 27 Jul 2010 17:01:49 -0700 Committer: H. Peter Anvin <hpa@zytor.com> CommitDate: Tue, 27 Jul 2010 17:14:02 -0700 x86: Add memory modify constraints to xchg() and cmpxchg() xchg() and cmpxchg() modify their memory operands, not merely read them. For some versions of gcc the "memory" clobber has apparently dealt with the situation, but not for all. Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Glauber Costa <glommer@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Peter Palfrader <peter@palfrader.org> Cc: Greg KH <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Zachary Amsden <zamsden@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: <stable@kernel.org> LKML-Reference: <4C4F7277.8050306@zytor.com> --- arch/x86/include/asm/cmpxchg_32.h | 68 ++++++++++++++++++------------------ arch/x86/include/asm/cmpxchg_64.h | 40 +++++++++++----------- 2 files changed, 54 insertions(+), 54 deletions(-) diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h index 8859e12..c1cf59d 100644 --- a/arch/x86/include/asm/cmpxchg_32.h +++ b/arch/x86/include/asm/cmpxchg_32.h @@ -27,20 +27,20 @@ struct __xchg_dummy { switch (size) { \ case 1: \ asm volatile("xchgb %b0,%1" \ - : "=q" (__x) \ - : "m" (*__xg(ptr)), "0" (__x) \ + : "=q" (__x), "+m" (*__xg(ptr)) \ + : "0" (__x) \ : "memory"); \ break; \ case 2: \ asm volatile("xchgw %w0,%1" \ - : "=r" (__x) \ - : "m" (*__xg(ptr)), "0" (__x) \ + : "=r" (__x), "+m" (*__xg(ptr)) \ + : "0" (__x) \ : "memory"); \ break; \ case 4: \ asm volatile("xchgl %0,%1" \ - : ...
On Tue, Jul 27, 2010 at 10:33 PM, tip-bot for H. Peter Anvin
Ack. I assume this doesn't really change the code generated? At least
not with a gcc that honors the whole memory clobber thing properly?
I also suspect that we can/should get rid of the __xg() thing - it was
there just to make sure gcc didn't see the memory read as a single
word and tried to optimize it. With the "+m" it probably doesn't
matter any more (don't know if it ever did)
Linus
--
I didn't do a binary A:B comparison, on the assumption it would preturb the code too much to be equal, but I guess it would be a good idea to at "+m" would still see as a single word, but the __xg() thing should be obsoleted by asm volatile + "memory", neither of which were in the 1.3.11 version. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
For what it's worth, it fairly heavily preturbs code around __set_64bit(), which implies it actually does something useful in that case. The rest of the code looks similar enough. -hpa --
Commit-ID: 4532b305e8f0c238dd73048068ff8a6dd1380291 Gitweb: http://git.kernel.org/tip/4532b305e8f0c238dd73048068ff8a6dd1380291 Author: H. Peter Anvin <hpa@linux.intel.com> AuthorDate: Wed, 28 Jul 2010 15:18:35 -0700 Committer: H. Peter Anvin <hpa@linux.intel.com> CommitDate: Wed, 28 Jul 2010 15:24:09 -0700 x86, asm: Clean up and simplify <asm/cmpxchg.h> Remove the __xg() hack to create a memory barrier near xchg and cmpxchg; it has been there since 1.3.11 but should not be necessary with "asm volatile" and a "memory" clobber, neither of which were there in the original implementation. However, we *should* make this a volatile reference. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> LKML-Reference: <AANLkTikAmaDPji-TVDarmG1yD=fwbffcsmEU=YEuP+8r@mail.gmail.com> --- arch/x86/include/asm/cmpxchg_32.h | 75 ++++++++++++++++++++---------------- arch/x86/include/asm/cmpxchg_64.h | 61 ++++++++++++++++++++--------- 2 files changed, 84 insertions(+), 52 deletions(-) diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h index 20955ea..f5bd1fd 100644 --- a/arch/x86/include/asm/cmpxchg_32.h +++ b/arch/x86/include/asm/cmpxchg_32.h @@ -11,38 +11,42 @@ extern void __xchg_wrong_size(void); /* - * Note: no "lock" prefix even on SMP: xchg always implies lock anyway - * Note 2: xchg has side effect, so that attribute volatile is necessary, - * but generally the primitive is invalid, *ptr is output argument. --ANK + * Note: no "lock" prefix even on SMP: xchg always implies lock anyway. + * Since this is generally used to protect other memory information, we + * use "asm volatile" and "memory" clobbers to prevent gcc from moving + * information around. */ - -struct __xchg_dummy { - unsigned long a[100]; -}; -#define __xg(x) ((struct __xchg_dummy *)(x)) - #define __xchg(x, ptr, size) \ ({ \ __typeof(*(ptr)) __x = (x); \ switch (size) { \ case 1: \ - asm volatile("xchgb ...
Commit-ID: 90c8f92f5c807807ca74d5f2f313794925174e6b Gitweb: http://git.kernel.org/tip/90c8f92f5c807807ca74d5f2f313794925174e6b Author: H. Peter Anvin <hpa@linux.intel.com> AuthorDate: Wed, 28 Jul 2010 16:53:49 -0700 Committer: H. Peter Anvin <hpa@linux.intel.com> CommitDate: Wed, 28 Jul 2010 16:53:49 -0700 x86, asm: Move cmpxchg emulation code to arch/x86/lib Move cmpxchg emulation code from arch/x86/kernel/cpu (which is otherwise CPU identification) to arch/x86/lib, where other emulation code lives already. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> LKML-Reference: <AANLkTikAmaDPji-TVDarmG1yD=fwbffcsmEU=YEuP+8r@mail.gmail.com> --- arch/x86/kernel/cpu/Makefile | 2 +- arch/x86/lib/Makefile | 1 + arch/x86/{kernel/cpu => lib}/cmpxchg.c | 0 3 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index 3a785da..c47c439 100644 --- a/arch/x86/kernel/cpu/Makefile +++ b/arch/x86/kernel/cpu/Makefile @@ -16,7 +16,7 @@ obj-y := intel_cacheinfo.o addon_cpuid_features.o obj-y += proc.o capflags.o powerflags.o common.o obj-y += vmware.o hypervisor.o sched.o mshyperv.o -obj-$(CONFIG_X86_32) += bugs.o cmpxchg.o +obj-$(CONFIG_X86_32) += bugs.o obj-$(CONFIG_X86_64) += bugs_64.o obj-$(CONFIG_CPU_SUP_INTEL) += intel.o diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index f871e04..e10cf07 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -30,6 +30,7 @@ ifeq ($(CONFIG_X86_32),y) lib-y += checksum_32.o lib-y += strstr_32.o lib-y += semaphore_32.o string_32.o + lib-y += cmpxchg.o ifneq ($(CONFIG_X86_CMPXCHG64),y) lib-y += cmpxchg8b_emu.o atomic64_386_32.o endif diff --git a/arch/x86/kernel/cpu/cmpxchg.c b/arch/x86/lib/cmpxchg.c similarity index 100% rename from arch/x86/kernel/cpu/cmpxchg.c rename to arch/x86/lib/cmpxchg.c --
Commit-ID: a378d9338e8dde78314b3a6ae003de351936c729 Gitweb: http://git.kernel.org/tip/a378d9338e8dde78314b3a6ae003de351936c729 Author: H. Peter Anvin <hpa@linux.intel.com> AuthorDate: Wed, 28 Jul 2010 17:05:11 -0700 Committer: H. Peter Anvin <hpa@linux.intel.com> CommitDate: Wed, 28 Jul 2010 17:05:11 -0700 x86, asm: Merge cmpxchg_486_u64() and cmpxchg8b_emu() We have two functions for doing exactly the same thing -- emulating cmpxchg8b on 486 and older hardware -- with different calling conventions, and yet doing the same thing. Drop the C version and use the assembly version, via alternatives, for both the local and non-local versions of cmpxchg8b. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> LKML-Reference: <AANLkTikAmaDPji-TVDarmG1yD=fwbffcsmEU=YEuP+8r@mail.gmail.com> --- arch/x86/include/asm/cmpxchg_32.h | 30 ++++++++++++++---------------- arch/x86/lib/cmpxchg.c | 18 ------------------ 2 files changed, 14 insertions(+), 34 deletions(-) diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h index f5bd1fd..284a6e8 100644 --- a/arch/x86/include/asm/cmpxchg_32.h +++ b/arch/x86/include/asm/cmpxchg_32.h @@ -246,8 +246,6 @@ static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old, * to simulate the cmpxchg8b on the 80386 and 80486 CPU. */ -extern unsigned long long cmpxchg_486_u64(volatile void *, u64, u64); - #define cmpxchg64(ptr, o, n) \ ({ \ __typeof__(*(ptr)) __ret; \ @@ -265,20 +263,20 @@ extern unsigned long long cmpxchg_486_u64(volatile void *, u64, u64); __ret; }) - -#define cmpxchg64_local(ptr, o, n) \ -({ \ - __typeof__(*(ptr)) __ret; \ - if (likely(boot_cpu_data.x86 > 4)) \ - __ret = (__typeof__(*(ptr)))__cmpxchg64_local((ptr), \ - (unsigned long long)(o), \ - (unsigned long long)(n)); \ - else \ - __ret = (__typeof__(*(ptr)))cmpxchg_486_u64((ptr), \ - (unsigned long long)(o), \ - (unsigned ...
Here's the 2.6.32 version for stable.
J
From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
Date: Tue, 27 Jul 2010 23:03:58 -0700
Subject: [PATCH] x86: Add memory modify constraints to xchg() and cmpxchg()
xchg() and cmpxchg() modify their memory operands, not merely read
them. For some versions of gcc the "memory" clobber has apparently
dealt with the situation, but not for all.
Also adds the missing 8-byte case for __sync_cmpxchg().
Based on HPA's patch.
Signed-off-by: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
Cc: "H. Peter Anvin"<hpa@zytor.com>
Cc: Stable Kernel<stable@kernel.org>
diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index ee1931b..5af5051 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -34,12 +34,12 @@ static inline void __set_64bit(unsigned long long *ptr,
unsigned int low, unsigned int high)
{
asm volatile("\n1:\t"
- "movl (%0), %%eax\n\t"
- "movl 4(%0), %%edx\n\t"
- LOCK_PREFIX "cmpxchg8b (%0)\n\t"
+ "movl (%1), %%eax\n\t"
+ "movl 4(%1), %%edx\n\t"
+ LOCK_PREFIX "cmpxchg8b %0\n\t"
"jnz 1b"
- : /* no outputs */
- : "D"(ptr),
+ : "=m"(*ptr)
+ : "D" (ptr),
"b"(low),
"c"(high)
: "ax", "dx", "memory");
@@ -82,20 +82,20 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
switch (size) {
case 1:
asm volatile("xchgb %b0,%1"
- : "=q" (x)
- : "m" (*__xg(ptr)), "0" (x)
+ : "=q" (x), "+m" (*__xg(ptr))
+ : "0" (x)
: "memory");
break;
case 2:
asm volatile("xchgw %w0,%1"
- : "=r" (x)
- : "m" (*__xg(ptr)), "0" (x)
+ : "=r" (x), "+m" (*__xg(ptr))
+ : "0" (x)
: "memory");
break;
case 4:
asm volatile("xchgl %0,%1"
- : "=r" (x)
- : "m" (*__xg(ptr)), "0" (x)
+ : "=r" (x), "+m" ...Thanks, I'll queue it up when it hits Linus's tree. greg k-h --
Hello, 2.6.32.21 does not boot on my HP Compaq nx7300. I bisected it to this patch: Here's a video of the crash: http://store.lisk.in/tmp/09092010019.mp4 With boot_delay=30, it sometimes just reboots and sometimes prints the trace. Without it, it always prints the trace. The crash occurs during the initialization of intel-agp (which explains why I can't reproduce it in qemu). More details in the video. The system boots and works just fine with 2.6.32.21 with the patch reverted. v2.6.36-rc3-185-gd56557a works fine as well, perhaps there were other fixes that didn't get to stable (e.g. 4532b30). I'm compiling the kernel with the latest gcc in Debian (-dumpversion says 4.4.5, dpkg reports 4:4.4.4-2). Shall I post dmesg/lspci as well? Since I have an easy workaround and other things to do, I won't spend time trying to backport those other fixes myself, but if you're going to fix this, I will help with testing. Regards, -- Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/ --
Could you give us your .config and also what version of gcc and binutils is on your system? -hpa --
Hello, http://store.lisk.in/tmp/.config.2.6.32.21-debug $ ld -V GNU ld (GNU Binutils for Debian) 2.20.1-system.20100303 (that's 2.20.1-12 in Debian testing) $ cc -dumpversion 4.4.5 (that's 4:4.4.4-2 in Debian testing) $ cat /proc/version Linux version 2.6.36-rc3-lis-00187-ga784c32 (tomi@notes.lisk.in) (gcc version 4.4.5 20100728 (prerelease) (Debian 4.4.4-8) ) #103 SMP Thu Sep 9 16:40:09 CEST 2010 Regards, -- Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/ --
I believe this needs 69309a05907546fb686b251d4ab041c26afe1e1d, which I think is being queued up for stable. Could you try applying 69309a05907546fb686b251d4ab041c26afe1e1d and see if it solves your problem. -hpa --
Hello, It does indeed. Thanks. -- Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/ --
Hm, that patch doesn't apply to the latest .32-stable tree. Can someone provide me with a backported version so I make sure I get it correct? thanks, greg k-h --
Hello, I'm attaching the one I used and tested. -- Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/
That worked, thanks, I've queued it up now. greg k-h --
Thanks a lot everyone. This patch resolves the issue for me on top
of 2.6.32.17.
(I had some whitespace issue with applying the patch, so I've attached
the cleaned up version in case it wasn't just me.)
Cheers,
Peter
--
| .''`. ** Debian GNU/Linux **
Peter Palfrader | : :' : The universal
http://www.palfrader.org/ | `. `' Operating System
| `- http://www.debian.org/
Thanks, I needed this version, it wasn't just you :) greg k-h --
gcc (Debian 4.3.2-1.1) 4.3.2
--
| .''`. ** Debian GNU/Linux **
Peter Palfrader | : :' : The universal
http://www.palfrader.org/ | `. `' Operating System
| `- http://www.debian.org/
--
The experimental Debian packages of 2.6.35-rc5 are built with gcc 4.4 and they don't have this bug - last_value is marked as 'b' in System.map, which I assume means BSS. So this may be specific to gcc 4.3. Ben. -- Ben Hutchings Once a job is fouled up, anything done to improve it makes it worse.
