I assume it's one or both of these loops in arch/x86/xen/time.c
do_stolen_accounting() that are being optimized into a divide which
generates a libgcc call:
while (stolen >= NS_PER_TICK) {
ticks++;
stolen -= NS_PER_TICK;
}
or
while (blocked >= NS_PER_TICK) {
ticks++;
blocked -= NS_PER_TICK;
}
I seem to recall in one previous case we added some dummy assembly code
to stop gcc from doing this. Not sure if that is a sustainable fix, though..
--
I think you refer to 38332cb9, "time: prevent the loop in
timespec_add_ns() from being optimised away".
while(unlikely(ns >= NSEC_PER_SEC)) {
+ /* The following asm() prevents the compiler from
+ * optimising this loop into a modulo operation. */
+ asm("" : "+r"(ns));
+
ns -= NSEC_PER_SEC;
a->tv_sec++;
It should be. The asm() arg tells GCC that the asm() could modify
"ns" in some way, so GCC cannot optimise away the loop, since it
doesn't have the required info about the induction variable to do
that.
Segher
--
Yep, it's guaranteed to work. But it's an ugly hack to work around an
over-enthusiastic compiler, and so is an inherent maintainability burden.
I think the correct fix here is to introduce an iter_div_rem() function
which contains this hack, so we can avoid scattering it all over the
place. I'll cook up a patch.
J
--
We have a few instances of the open-coded iterative div/mod loop, used
when we don't expcet the dividend to be much bigger than the divisor.
Unfortunately modern gcc's have the tendency to strength "reduce" this
into a full mod operation, which isn't necessarily any faster, and
even if it were, doesn't exist if gcc implements it in libgcc.
The workaround is to put a dummy asm statement in the loop to prevent
gcc from performing the transformation.
This patch creates a single implementation of this loop, and uses it
to replace the open-coded versions I know about.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Robert Hancock <hancockr@shaw.ca>
---
arch/x86/xen/time.c | 13 +++----------
include/linux/math64.h | 2 ++
include/linux/time.h | 11 ++---------
lib/div64.c | 23 +++++++++++++++++++++++
4 files changed, 30 insertions(+), 19 deletions(-)
===================================================================
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -12,6 +12,7 @@
#include <linux/clocksource.h>
#include <linux/clockchips.h>
#include <linux/kernel_stat.h>
+#include <linux/math64.h>
#include <asm/xen/hypervisor.h>
#include <asm/xen/hypercall.h>
@@ -150,11 +151,7 @@ static void do_stolen_accounting(void)
if (stolen < 0)
stolen = 0;
- ticks = 0;
- while (stolen >= NS_PER_TICK) {
- ticks++;
- stolen -= NS_PER_TICK;
- }
+ ticks = iter_div_u64_rem(stolen, NS_PER_TICK, &stolen);
__get_cpu_var(residual_stolen) = stolen;
account_steal_time(NULL, ticks);
@@ -166,11 +163,7 @@ static void do_stolen_accounting(void)
if (blocked < 0)
blocked = 0;
- ticks = 0;
- while (blocked >= NS_PER_TICK) {
- ticks++;
- blocked -= ...On Thu, 08 May 2008 16:16:41 +0100 I think it would be better to do s/unsigned/u32/ here. It's cosmetic, but all this sort of code is pretty formal about the sizes of the types which it uses, and it sure needs to be. --
Believe it or not, this patch causes one of my test machines to fail to find its disk. good dmesg: http://userweb.kernel.org/~akpm/dmesg-t61p.txt bad dmesg: http://userweb.kernel.org/~akpm/dmesg-t61p-dead.txt config: http://userweb.kernel.org/~akpm/config-t61p.txt --
Uh, OK. Do your other test machines work with it? Are there any
relevant config differences (x86-64 vs 32?).
Hm, it's not that it can't find its disk, I think it's this:
init[1]: segfault at ffffffffff7008d2 ip ffffffffff7008d2 sp 7fff86e67488 error 14
init[1]: segfault at ffffffffff7008d2 ip 311ac07628 sp 7fff86e66cb0 error 4 in libgcc_s-4.1.2-20070925.so.1[311ac00000+d000]
Is it that the vsyscall page is trying to call into the kernel?
notrace static noinline int do_realtime(struct timespec *ts)
{
unsigned long seq, ns;
do {
seq = read_seqbegin(&gtod->lock);
ts->tv_sec = gtod->wall_time_sec;
ts->tv_nsec = gtod->wall_time_nsec;
ns = vgetns();
} while (unlikely(read_seqretry(&gtod->lock, seq)));
timespec_add_ns(ts, ns);
return 0;
}
timespec_add_ns() used to be entirely inlined, but now it contains the
call to iter_div_u64_rem().
arch/x86/vdso/vclock_gettime.c has its own copies of other kernel
functions which it can't directly call. We could add
timespec_add_ns()/iter_div_u64_rem() to that list, though its pretty
hacky. Could we link lib/div64.o into the vdso?
J
--
You would need to annotate it and have a separate object file for the different sections. Also it would need to be compiled PIC. Inlining is better. -Andi --
BTW, I'm seeing the memcpy() in __vdso_gettimeofday() not being inlined.
J
--
Hmm works here. What compiler do you use? Normally gcc should recognize the memcpy is just two constant stores and always inline even with -Os. nm --dynamic arch/x86/vdso/vdso.so 0000000000000000 A LINUX_2.6 ffffffffff7007e0 T __vdso_clock_gettime ffffffffff700820 T __vdso_getcpu ffffffffff700750 T __vdso_gettimeofday ffffffffff7007e0 W clock_gettime ffffffffff700820 W getcpu ffffffffff700750 W gettimeofday Anyways if your compiler or config cannot get that right it would need to switch to __inline_memcpy(), but that would be slower or explicit copying field by field. If it's a common problem we could also implement a build time check, but normally such problems should be already caught in code review. -Andi --
It's:
jeremy@cosworth:~/hg/xen/paravirt/linux-x86_64$ gcc -v
Reading specs from /usr/lib/gcc/x86_64-linux/3.4.4/specs
Configured with: ../src/configure -v
--enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr
--libexecdir=/usr/lib --with-gxx-include-dir=/usr/include/c++/3.4
--enable-shared --with-system-zlib --enable-nls
--without-included-gettext --program-suffix=-3.4 --enable-__cxa_atexit
--enable-libstdcxx-allocator=mt --enable-clocale=gnu
--enable-libstdcxx-debug --enable-java-gc=boehm --enable-java-awt=gtk
--disable-werror x86_64-linux
Thread model: posix
gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)
I think this is still a supported compiler, isn't it?
J
--
Yes. Here's a patch. You probably didn't see problems because you either don't have a glibc that supports the vdso or none of your programs gets the timezone from gettimeofday() [that is very obscure obsolete functionality anyways, normally it should be gotten from the disk locales] -Andi
Patch looks fine to me. I just noticed the call to memcpy by
inspection; I haven't tried to run this.
J
--
Hi Jeremy, Thanks for the patch. Applying to 2.6.26-rc1 allowed me to compile with gcc-4.3. Feel free to add a Tested-by :-) Thanks, Christian. -- BOFH excuse #442: Trojan horse ran out of hay --
It's not a "dummy" asm, it actually does something: it tells the compiler that it has to iterate the loop exactly as written, and ...and now the "meat" of this function isn't inline anymore. If we cared about not doing a divide here, you'll have to explain why You changed "+r" to "+rm" here. Why? Also, "rm" is an x86-ism, and this is generic code (it does work here, but why is it better Does this need to be exported? Can I suggest an alternative approach? Define a macro (with a good, descriptive name!) for just the asm("" : "+r"(x)), and use that. Much smaller patch, much clearer code, and doesn't result in different (and worse) code generation, so it's a much safer change. Segher --
No, it has no function of its own. It's bullying gcc into not
It's consistent with the other functions defined here. I agree it isn't
On x86-32 it compiles to 26 instructions and 47 bytes of code.
Admittedly it might be smaller inline - or on a 64-bit machine - but I
seriously doubt its suffering a huge performance hit from being out of
line. These days the inline threshold is very small - probably under 10
instructions. A direct call/return is essentially free, since it can be
Because it didn't seem all that unlikely. Besides, it makes not one bit
"rm" isn't x86-specific. I just wanted to give the compiler the freedom
Uh, could you suggest a name? Something along the lines of
prevent_gcc_from_strength_reducing_this_subtraction_loop_into_a_modulo_operation_thanks_oh_and_remember_to_use_it_in_all_the_right_places()
springs to mind.
Rather than putting this unsightly (though with a smear of lipstick)
hack into every open-coded iterative div-mod loop, we may as well
implement it once and just look out for places which should be using it.
I don't think the "worse" code generation is much of an issue, since it
isn't used anywhere performance critical, and moving the code out of
line means there should be an overall reduction in code size.
J
--
