Re: undefined reference to __udivdi3 (gcc-4.3)

Previous thread: none

Next thread: Re: Suspend to memory is freezing my machine by Robert Hancock on Sunday, May 4, 2008 - 10:36 am. (7 messages)
From: Robert Hancock
Date: Sunday, May 4, 2008 - 10:35 am

I assume it's one or both of these loops in arch/x86/xen/time.c 
do_stolen_accounting() that are being optimized into a divide which 
generates a libgcc call:

	while (stolen >= NS_PER_TICK) {
		ticks++;
		stolen -= NS_PER_TICK;
	}

or

	while (blocked >= NS_PER_TICK) {
		ticks++;
		blocked -= NS_PER_TICK;
	}

I seem to recall in one previous case we added some dummy assembly code 
to stop gcc from doing this. Not sure if that is a sustainable fix, though..
--

From: Segher Boessenkool
Date: Sunday, May 4, 2008 - 3:19 pm

I think you refer to 38332cb9, "time: prevent the loop in
timespec_add_ns() from being optimised away".


         while(unlikely(ns >= NSEC_PER_SEC)) {
+               /* The following asm() prevents the compiler from
+                * optimising this loop into a modulo operation.  */
+               asm("" : "+r"(ns));
+
                 ns -= NSEC_PER_SEC;
                 a->tv_sec++;

It should be.  The asm() arg tells GCC that the asm() could modify
"ns" in some way, so GCC cannot optimise away the loop, since it
doesn't have the required info about the induction variable to do
that.


Segher

--

From: Jeremy Fitzhardinge
Date: Wednesday, May 7, 2008 - 2:29 am

Yep, it's guaranteed to work.  But it's an ugly hack to work around an 
over-enthusiastic compiler, and so is an inherent maintainability burden.

I think the correct fix here is to introduce an iter_div_rem() function 
which contains this hack, so we can avoid scattering it all over the 
place.  I'll cook up a patch.

    J
--

From: Jeremy Fitzhardinge
Date: Thursday, May 8, 2008 - 8:16 am

We have a few instances of the open-coded iterative div/mod loop, used
when we don't expcet the dividend to be much bigger than the divisor.
Unfortunately modern gcc's have the tendency to strength "reduce" this
into a full mod operation, which isn't necessarily any faster, and
even if it were, doesn't exist if gcc implements it in libgcc.

The workaround is to put a dummy asm statement in the loop to prevent
gcc from performing the transformation.

This patch creates a single implementation of this loop, and uses it
to replace the open-coded versions I know about.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Robert Hancock <hancockr@shaw.ca>
---
 arch/x86/xen/time.c    |   13 +++----------
 include/linux/math64.h |    2 ++
 include/linux/time.h   |   11 ++---------
 lib/div64.c            |   23 +++++++++++++++++++++++
 4 files changed, 30 insertions(+), 19 deletions(-)

===================================================================
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -12,6 +12,7 @@
 #include <linux/clocksource.h>
 #include <linux/clockchips.h>
 #include <linux/kernel_stat.h>
+#include <linux/math64.h>
 
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
@@ -150,11 +151,7 @@ static void do_stolen_accounting(void)
 	if (stolen < 0)
 		stolen = 0;
 
-	ticks = 0;
-	while (stolen >= NS_PER_TICK) {
-		ticks++;
-		stolen -= NS_PER_TICK;
-	}
+	ticks = iter_div_u64_rem(stolen, NS_PER_TICK, &stolen);
 	__get_cpu_var(residual_stolen) = stolen;
 	account_steal_time(NULL, ticks);
 
@@ -166,11 +163,7 @@ static void do_stolen_accounting(void)
 	if (blocked < 0)
 		blocked = 0;
 
-	ticks = 0;
-	while (blocked >= NS_PER_TICK) {
-		ticks++;
-		blocked -= ...
From: Andrew Morton
Date: Thursday, May 8, 2008 - 1:26 pm

On Thu, 08 May 2008 16:16:41 +0100


I think it would be better to do s/unsigned/u32/ here.  It's cosmetic, but
all this sort of code is pretty formal about the sizes of the types which
it uses, and it sure needs to be.

--

From: Jeremy Fitzhardinge
Date: Thursday, May 8, 2008 - 3:00 pm

OK.

    J

--

From: Andrew Morton
Date: Tuesday, May 13, 2008 - 11:46 pm

Believe it or not, this patch causes one of my test machines to fail to
find its disk.

good dmesg: http://userweb.kernel.org/~akpm/dmesg-t61p.txt
bad dmesg: http://userweb.kernel.org/~akpm/dmesg-t61p-dead.txt
config: http://userweb.kernel.org/~akpm/config-t61p.txt
--

From: Jeremy Fitzhardinge
Date: Wednesday, May 14, 2008 - 12:33 am

Uh, OK.  Do your other test machines work with it?  Are there any 
relevant config differences (x86-64 vs 32?).

Hm, it's not that it can't find its disk, I think it's this:

init[1]: segfault at ffffffffff7008d2 ip ffffffffff7008d2 sp 7fff86e67488 error 14
init[1]: segfault at ffffffffff7008d2 ip 311ac07628 sp 7fff86e66cb0 error 4 in libgcc_s-4.1.2-20070925.so.1[311ac00000+d000]


Is it that the vsyscall page is trying to call into the kernel?

    notrace static noinline int do_realtime(struct timespec *ts)
    {
    	unsigned long seq, ns;
    	do {
    		seq = read_seqbegin(&gtod->lock);
    		ts->tv_sec = gtod->wall_time_sec;
    		ts->tv_nsec = gtod->wall_time_nsec;
    		ns = vgetns();
    	} while (unlikely(read_seqretry(&gtod->lock, seq)));
    	timespec_add_ns(ts, ns);
    	return 0;
    }
      

timespec_add_ns() used to be entirely inlined, but now it contains the 
call to iter_div_u64_rem().

arch/x86/vdso/vclock_gettime.c has its own copies of other kernel 
functions which it can't directly call.  We could add 
timespec_add_ns()/iter_div_u64_rem() to that list, though its pretty 
hacky.  Could we link lib/div64.o into the vdso?

    J
--

From: Andi Kleen
Date: Wednesday, May 14, 2008 - 1:33 am

You would need to annotate it and have a separate object file for the
different sections. Also it would need to be compiled PIC.

Inlining is better.

-Andi

--

From: Jeremy Fitzhardinge
Date: Wednesday, May 14, 2008 - 2:55 am

BTW, I'm seeing the memcpy() in __vdso_gettimeofday() not being inlined.

    J
--

From: Andi Kleen
Date: Wednesday, May 14, 2008 - 3:50 am

Hmm works here. What compiler do you use? Normally gcc should 
recognize the memcpy is just two constant stores and always inline even with -Os.

nm --dynamic arch/x86/vdso/vdso.so
0000000000000000 A LINUX_2.6
ffffffffff7007e0 T __vdso_clock_gettime
ffffffffff700820 T __vdso_getcpu
ffffffffff700750 T __vdso_gettimeofday
ffffffffff7007e0 W clock_gettime
ffffffffff700820 W getcpu
ffffffffff700750 W gettimeofday

Anyways if your compiler or config cannot get that right it would need
to switch to __inline_memcpy(), but that would be slower or explicit
copying field by field.

If it's a common problem we could also implement a build time check,
but normally such problems should be already caught in code review.

-Andi

--

From: Jeremy Fitzhardinge
Date: Wednesday, May 14, 2008 - 3:52 am

It's:
jeremy@cosworth:~/hg/xen/paravirt/linux-x86_64$ gcc -v
Reading specs from /usr/lib/gcc/x86_64-linux/3.4.4/specs
Configured with: ../src/configure -v 
--enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr 
--libexecdir=/usr/lib --with-gxx-include-dir=/usr/include/c++/3.4 
--enable-shared --with-system-zlib --enable-nls 
--without-included-gettext --program-suffix=-3.4 --enable-__cxa_atexit 
--enable-libstdcxx-allocator=mt --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-java-gc=boehm --enable-java-awt=gtk 
--disable-werror x86_64-linux
Thread model: posix
gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)

I think this is still a supported compiler, isn't it?

    J
--

From: Andi Kleen
Date: Wednesday, May 14, 2008 - 4:21 am

Yes. Here's a patch. You probably didn't see problems because you either
don't have a glibc that supports the vdso or none of your programs
gets the timezone from gettimeofday() [that is very obscure obsolete
functionality anyways, normally it should be gotten from the disk locales]

-Andi




From: Jeremy Fitzhardinge
Date: Wednesday, May 14, 2008 - 5:58 am

Patch looks fine to me.  I just noticed the call to memcpy by 
inspection; I haven't tried to run this.

    J
--

From: Christian Kujau
Date: Friday, May 9, 2008 - 4:45 am

Hi Jeremy,


Thanks for the patch. Applying to 2.6.26-rc1 allowed me to compile with
gcc-4.3. Feel free to add a Tested-by :-)

Thanks,
Christian.
-- 
BOFH excuse #442:

Trojan horse ran out of hay

--

From: Segher Boessenkool
Date: Thursday, May 8, 2008 - 1:52 pm

It's not a "dummy" asm, it actually does something: it tells the
compiler that it has to iterate the loop exactly as written, and


...and now the "meat" of this function isn't inline anymore.  If we
cared about not doing a divide here, you'll have to explain why


You changed "+r" to "+rm" here.  Why?  Also, "rm" is an x86-ism,
and this is generic code (it does work here, but why is it better

Does this need to be exported?


Can I suggest an alternative approach?  Define a macro (with a
good, descriptive name!) for just the asm("" : "+r"(x)), and use
that.  Much smaller patch, much clearer code, and doesn't result
in different (and worse) code generation, so it's a much safer
change.


Segher

--

From: Jeremy Fitzhardinge
Date: Thursday, May 8, 2008 - 2:57 pm

No, it has no function of its own.  It's bullying gcc into not 

It's consistent with the other functions defined here.  I agree it isn't 

On x86-32 it compiles to 26 instructions and 47 bytes of code.  
Admittedly it might be smaller inline - or on a 64-bit machine - but I 
seriously doubt its suffering a huge performance hit from being out of 
line.  These days the inline threshold is very small - probably under 10 
instructions.  A direct call/return is essentially free, since it can be 

Because it didn't seem all that unlikely.  Besides, it makes not one bit 

"rm" isn't x86-specific.  I just wanted to give the compiler the freedom 


Uh, could you suggest a name?  Something along the lines of 
prevent_gcc_from_strength_reducing_this_subtraction_loop_into_a_modulo_operation_thanks_oh_and_remember_to_use_it_in_all_the_right_places() 
springs to mind.

Rather than putting this unsightly (though with a smear of lipstick) 
hack into every open-coded iterative div-mod loop, we may as well 
implement it once and just look out for places which should be using it.

I don't think the "worse" code generation is much of an issue, since it 
isn't used anywhere performance critical, and moving the code out of 
line means there should be an overall reduction in code size.

    J
--

Previous thread: none

Next thread: Re: Suspend to memory is freezing my machine by Robert Hancock on Sunday, May 4, 2008 - 10:36 am. (7 messages)