Re: [PATCH] x86: merge the simple bitops and move them to bitops.h

Previous thread: broken access to a USB HID device after suspend by Tino Keitel on Wednesday, March 12, 2008 - 12:34 pm. (17 messages)

Next thread: Re: linux+glibc memory allocator, poor performance by J.C. Pizarro on Wednesday, March 12, 2008 - 1:09 pm. (21 messages)
From: Alexander van Heukelum
Date: Wednesday, March 12, 2008 - 1:01 pm

x86: merge the simple bitops and move them to bitops.h

Some of those can be written in such a way that the same
inline assembly can be used to generate both 32 bit and
64 bit code.

For ffs and fls, x86_64 unconditionally used the cmov
instruction and i386 unconditionally used a conditional
branch over a mov instruction. In the current patch I
chose to select the version based on the availability
of the cmov instruction instead. A small detail here is
that x86_64 did not previously set CONFIG_X86_CMOV=y.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>

---

 arch/x86/Kconfig.cpu        |    2 +-
 include/asm-x86/bitops.h    |   90 +++++++++++++++++++++++++++++++++++++++++++
 include/asm-x86/bitops_32.h |   64 ------------------------------
 include/asm-x86/bitops_64.h |   76 ------------------------------------
 4 files changed, 91 insertions(+), 141 deletions(-)

Hi Ingo,

On a 64-bit machine the cmov instruction is always available. I made
that explicit by turning on CONFIG_X86_CMOV for all 64-bit builds.

This patch introduces a change in behaviour for 32-bit builds: if
the selected cpu has the cmov instruction, it will now be used by
the functions ffs and fls.

For the i386 defconfig the number of occurrences of cmov increases
from 4336 to 4429 and vmlinux text size increases 15 bytes. When
building for 486 no cmovs are generated, as expected.

Greetings,
	Alexander

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 31e92fb..fb7399b 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -399,7 +399,7 @@ config X86_TSC
 # generates cmov.
 config X86_CMOV
 	def_bool y
-	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7)
+	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || X86_64)
 
 config X86_MINIMUM_CPU_FAMILY
 	int
diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
index 1a23ce1..bbdecee 100644
--- ...
From: Jeremy Fitzhardinge
Date: Friday, March 14, 2008 - 11:07 am

Looks good in general.  What's left in bitops_{32,64}.h now?


This comment seems wrong.  My "man ffs" says that it returns 1-32 for 
non-zero inputs, and 0 for a zero input.  This function returns 0-31, or 
-1 for a zero input.

DESCRIPTION
       The ffs() function returns the position of the first (least significant)
       bit set in the word i.  The least significant bit is position 1 and  the
       most  significant  position  is,  for example, 32 or 64.   The functions
       ffsll() and ffsl() do the same but take arguments of possibly  different
       size.

RETURN VALUE
       These  functions  return  the  position of the first bit set, or 0 if no

The prevailing style in bitops.h is to use "bsf"/"cmovz" and let the 

And this comment is even more wrong, given that ffs and fls are 
completely different functions ;)

I know these are from the original, but its worth fixing given that 

    J
--

From: Alexander van Heukelum
Date: Friday, March 14, 2008 - 12:43 pm

On Fri, 14 Mar 2008 11:07:55 -0700, "Jeremy Fitzhardinge"

Thanks for taking a look!

bitops_{32,64}.h are getting pretty empty ;)

Both contain find_first_bit/find_first_zero_bit, i386 has them inlined,
x86_64 has an ugly define to select between small bitmaps (inlined) and
an out-of-line version. I think they should be unified much like how
find_next_bit and find_next_zero_bit work now (in x86#testing).

Both define fls64(), but i386 uses a generic one and x86_64 defines
one all by itself. The generic one is currently not suitable for
use by 64-bit archs... that can change.

x86_64 defines ARCH_HAS_FAST_MULTIPLIER, i386 not. This affects a
choice of generated code in the (generic) hweight function. It would
be nice if that could move to some other file.

x86_64 has a mysterious inline function set_bit_string, which is
only used by pci-calgary_64.c and pci-gart_64.c. Not sure what to



The comment rightly mentions that ffs differs in spirit from the rest
of the functions here. The type to be used here it always a 4-byte one
(int); most other functions operate on longs. I have not checked if the
explicit size is absolutely necessary, but I feel much safer with the

I'll see if I can come up with something better.

Greetings,
-- 
  Alexander van Heukelum
  heukelum@fastmail.fm

-- 
http://www.fastmail.fm - A no graphics, no pop-ups email service

--

From: Andi Kleen
Date: Friday, March 14, 2008 - 12:55 pm

The CMOV define should probably be dependent on what CPU the kernel
is tuned for. It was originally written for when x86-64 was only
K8 which has fast CMOV, but e.g. on P4 CMOV is actually deprecated

It is very unlikely a generic one will ever be able to compete

It depends on the CPU, but it can be probably safely set on pretty much

It's generic and could really live in linux/bitops.h

-Andi

--

From: Alexander van Heukelum
Date: Friday, March 14, 2008 - 2:33 pm

On Fri, 14 Mar 2008 20:55:20 +0100, "Andi Kleen" <andi@firstfloor.org>

Hi Andi,

I guess you are right. But there is quite a big number of different
types of P4. Let's see what the current situation is... defconfigs
(of current x86#testing+this patch/current linus) with
CONFIG_GENERIC_CPU=n:

Athlon: 4764 / 4667 occurences of cmovxx
Pentium-IV: 4079 / 3982 occurences of cmovxx
Pentium-M: 3939 / 3841 occurences of cmovxx
Core-2: 4335 / 4330 occurences of cmovxx

So it adds a few percent extra cmovxx's. The last one is fishy...

Generic is maybe not the right term: asm-generic/bitops/fls64.h has:

static inline int fls64(__u64 x)
{
        __u32 h = x >> 32;
        if (h)
                return fls(h) + 32;
        return fls(x);
}

I just wanted to move the 64-bit version to that header, with some

In fact I just found out that it only had an effect for 64 bit

It could. But it is a trivial (slow?) implementation. Probably fine
for the uses in pci-calgary_64.c and pci-gart_64.c (small ranges?),
but I would worry about people using it, thinking it was a near-
optimal implementation.

Greetings,
    Alexander
-- 
  Alexander van Heukelum
  heukelum@fastmail.fm

-- 
http://www.fastmail.fm - IMAP accessible web-mail

--

From: Andi Kleen
Date: Friday, March 14, 2008 - 2:42 pm

It is this way because the callers in 95+% of all cases only
set a single bit. For that case it is not slow.

-Andi
--

From: Alexander van Heukelum
Date: Friday, March 14, 2008 - 3:01 pm

On Fri, 14 Mar 2008 22:42:05 +0100, "Andi Kleen" <andi@firstfloor.org>

I agree that it should end up using bsr. It would look like this in
the end, I guess. Might be familiar.

#if BITS_PER_LONG == 32
static inline int fls64(__u64 x)
{
        __u32 h = x >> 32;
        if (h)
                return __fls(h) + 33;
        return fls(x);
}
#else
static inline int fls64(__u64 x)
{
        if (x == 0)
                return 0;
        return __fls(x) + 1;
}

This is the only reason that this define exists. With another

And my feeling is that this is exactly the reason why this is
not a good version for a generic implementation in bitops.h. But
I don't care much.

Greetings,
-- 
  Alexander van Heukelum
  heukelum@fastmail.fm

-- 
http://www.fastmail.fm - Accessible with your email software
                          or over the web

--

From: Andi Kleen
Date: Friday, March 14, 2008 - 3:18 pm

That would require a polymorphic macro __fls that adapts to 32bit and 64bit

AFAIK it only exists because some ancient sparc chips had incredibly

I bet most different approaches who might be slightly
faster for larger bit strings would make the one bit 
case slower.

-Andi

--

From: Alexander van Heukelum
Date: Saturday, March 15, 2008 - 10:54 am

On Fri, 14 Mar 2008 23:18:45 +0100, "Andi Kleen" <andi@firstfloor.org>

Hi Andi,

It's unsigned long __fls(unsigned long)... and this is only compiled
if unsigned long is as long as u64. Seems fine to me. Moreover, it

Good to know. And I realized that there is also the machines without
a hardware multiply instruction at all. So you are right. i386/x86_64

That is true, of course. But then the name of the function should
give a hint that it is optimized for short sequences.

Greetings,
    Alexander
-- 
  Alexander van Heukelum
  heukelum@fastmail.fm

-- 
http://www.fastmail.fm - Or how I learned to stop worrying and
                          love email again

--

From: Alexander van Heukelum
Date: Saturday, March 15, 2008 - 12:19 pm

K8, EFFICEON and CORE2 support the cmovxx instructions.

Instead of listing the cpu's that have support for the
cmovxx instructions, list the cpu's that don't.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---

A bit of playing resulted in:

CPUS="M386 M486 M586 M586TSC M586MMX M686 MPENTIUMII MPENTIUMIII"
CPUS=$CPUS" MPENTIUMM MPENTIUM4 MK6 MK7 MK8 MCRUSOE MEFFICEON"
CPUS=$CPUS" MWINCHIPC6 MWINCHIP2 MWINCHIP3D MGEODEGX1 MGEODE_LX"
CPUS=$CPUS" MCYRIXIII MVIAC3_2 MVIAC7 MPSC MCORE2"

for cpu in $CPUS
do
        echo "CONFIG_${cpu}=y" > testconfig
        make ARCH=i386 allnoconfig KCONFIG_ALLCONFIG=testconfig > /dev/null
        echo ${cpu} >> result
        grep X86_CMOV .config >> result
        echo >> result
done

I'm quite sure that K8, EFFICEON and CORE2 support HAVE_CMOV, but
they did not set X86_CMOV.

Greetings,
	Alexander van Heukelum

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 31e92fb..7a3a2d4 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -399,7 +399,7 @@ config X86_TSC
 # generates cmov.
 config X86_CMOV
 	def_bool y
-	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7)
+	depends on !(MCYRIXIII || MGEODE_LX || MGEODEGX1 || MWINCHIP3D || MWINCHIP2 || MWINCHIPC6 || MCRUSOE || MK6 || M586MMX || M586TSC || M586 || M486 || M386)
 
 config X86_MINIMUM_CPU_FAMILY
 	int

--

From: H. Peter Anvin
Date: Saturday, March 15, 2008 - 1:18 pm

Crusoe has CMOV as well.

	-hpa
--

From: Alexander van Heukelum
Date: Saturday, March 15, 2008 - 2:06 pm

On Sat, 15 Mar 2008 21:18:16 +0100, "H. Peter Anvin" <hpa@zytor.com>

Hi hpa,

I believe you... and the Makefile_32.cpu: it gets a -march=i686. I
also found out that I missed X86_ELAN, which is treated as a i486.

Also, MGEODE_LX does not get any optimization flags. I think that
is unintentional, but what should they be?

And MPSC (Intel P4 / older Netburst based Xeon) depends on X86_64.
I would be very surprised if that one could not run 32-bit code,
though. What flags should that one get?

Thanks for pointing out I missed Crusoe. As a punishment you
can answer my questions ;).

Greetings,
    Alexander
-- 
  Alexander van Heukelum
  heukelum@fastmail.fm

-- 
http://www.fastmail.fm - Send your email first class

--

From: Willy Tarreau
Date: Saturday, March 15, 2008 - 2:11 pm

in my experience (on user-space code), optimizing for "i586" gives
good results. However, the cache is small, so everything which can
reduce code size (especially loop/jump/function alignment) is worth
checking.

Regards,
Willy

--

From: Alexander van Heukelum
Date: Sunday, March 16, 2008 - 6:16 am

x86: K8, GEODE_LX, CRUSOE, EFFICEON and CORE2 support CMOV.

Instead of summing up the cpu's that have support for the
cmov instruction, sum up the cpu's that don't.

GEODE_LX has no entry in arch/x86/Makefile_32.cpu, so it
did not get any -march or -mtune arguments at all. I quite
arbitrarily added -march=i686 -mtune=pentium-mmx. This way
gcc is allowed to use cmov instructions.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---


Thanks for letting me know.

I chose i686 with pentium-mmx scheduling. This enables cmov, while
the scheduling is still pentium-like. If i586 is really better,
GEODE_LX should be listed as not using cmov at all (because the
compiler does not generate them).

Loop/jump/function alignment is automatically turned off if you
compile the kernel with CONFIG_CC_OPTIMIZE_FOR_SIZE=y so I think
it should not be set explicitly.

Qemu does not run GEODE_LX-kernels (no 3dNow).

Greetings,
	Alexander

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 31e92fb..38af3e1 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -399,7 +399,7 @@ config X86_TSC
 # generates cmov.
 config X86_CMOV
 	def_bool y
-	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7)
+	depends on !(MCYRIXIII || MGEODEGX1 || MWINCHIP3D || MWINCHIP2 || MWINCHIPC6 || MK6 || M586MMX || M586TSC || M586 || M486 || M386 || X86_ELAN)
 
 config X86_MINIMUM_CPU_FAMILY
 	int
diff --git a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
index e372b58..b5e2438 100644
--- a/arch/x86/Makefile_32.cpu
+++ b/arch/x86/Makefile_32.cpu
@@ -38,8 +38,9 @@ cflags-$(CONFIG_MCORE2)		+= -march=i686 $(call tune,core2)
 # AMD Elan support
 cflags-$(CONFIG_X86_ELAN)	+= -march=i486
 
-# Geode GX1 support
+# Geode GX1 and GX/LX support
 cflags-$(CONFIG_MGEODEGX1)	+= -march=pentium-mmx
+cflags-$(CONFIG_MGEODE_LX)	+= -march=i686 $(call tune,pentium-mmx)
 
 # add at the end to overwrite eventual tuning options from ...
From: Ingo Molnar
Date: Friday, March 21, 2008 - 5:38 am

thanks, applied.

	Ingo
--

From: Alexander van Heukelum
Date: Friday, March 14, 2008 - 1:35 pm

x86: merge the simple bitops and move them to bitops.h.

Some of those can be written in such a way that the same
inline assembly can be used to generate both 32 bit and
64 bit code.

For ffs and fls, x86_64 unconditionally used the cmov
instruction and i386 unconditionally used a conditional
branch over a mov instruction. In the current patch I
chose to select the version based on the availability
of the cmov instruction instead. A small detail here is
that x86_64 did not previously set CONFIG_X86_CMOV=y.

Improved comments for ffs and fls.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>

---

Hi,

Here is a new version with improved (I hope) comments for
the functions ffs and fls, suggested by Jeremy. I left
the explicit use of the -l forms in the inline assembly
of ffs and fls, because I think it is clearer that way.

Greetings,
	Alexander

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 31e92fb..fb7399b 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -399,7 +399,7 @@ config X86_TSC
 # generates cmov.
 config X86_CMOV
 	def_bool y
-	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7)
+	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || X86_64)
 
 config X86_MINIMUM_CPU_FAMILY
 	int
diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
index 1a23ce1..363a4ff 100644
--- a/include/asm-x86/bitops.h
+++ b/include/asm-x86/bitops.h
@@ -66,7 +66,6 @@ static inline void __set_bit(int nr, volatile void *addr)
 		     : "Ir" (nr) : "memory");
 }
 
-
 /**
  * clear_bit - Clears a bit in memory
  * @nr: Bit to clear
@@ -310,6 +309,104 @@ static int test_bit(int nr, const volatile unsigned long *addr);
 	 constant_test_bit((nr),(addr)) :	\
 	 variable_test_bit((nr),(addr)))
 
+/**
+ * __ffs - find first bit in word.
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check ...
From: Randy Dunlap
Date: Friday, March 14, 2008 - 4:30 pm

find first bit set in word

[The "first bit in word" could have any value in {0, 1}, so the






---
~Randy
--

From: Alexander van Heukelum
Date: Saturday, March 15, 2008 - 5:04 am

x86: merge the simple bitops and move them to bitops.h.

Some of those can be written in such a way that the same
inline assembly can be used to generate both 32 bit and
64 bit code.

For ffs and fls, x86_64 unconditionally used the cmov
instruction and i386 unconditionally used a conditional
branch over a mov instruction. In the current patch I
chose to select the version based on the availability
of the cmov instruction instead. A small detail here is
that x86_64 did not previously set CONFIG_X86_CMOV=y.

Improved comments for ffs, ffz, fls and variations.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>

---

Hi Randy, Ingo,

Version 3, for x86#testing. Compared to v1 there are only
changes in the comments as suggested by Jeremy and Randy.

Thanks to both for their comments.

Greetings,
	Alexander

 arch/x86/Kconfig.cpu        |    2 +-
 include/asm-x86/bitops.h    |   99 ++++++++++++++++++++++++++++++++++++++++++-
 include/asm-x86/bitops_32.h |   64 ----------------------------
 include/asm-x86/bitops_64.h |   76 ---------------------------------
 4 files changed, 99 insertions(+), 142 deletions(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 31e92fb..fb7399b 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -399,7 +399,7 @@ config X86_TSC
 # generates cmov.
 config X86_CMOV
 	def_bool y
-	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7)
+	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || X86_64)
 
 config X86_MINIMUM_CPU_FAMILY
 	int
diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
index 1a23ce1..923bdc2 100644
--- a/include/asm-x86/bitops.h
+++ b/include/asm-x86/bitops.h
@@ -66,7 +66,6 @@ static inline void __set_bit(int nr, volatile void *addr)
 		     : "Ir" (nr) : "memory");
 }
 
-
 /**
  * clear_bit - Clears a bit in memory
  * @nr: Bit to clear
@@ -310,6 +309,104 @@ static ...
From: Ingo Molnar
Date: Friday, March 21, 2008 - 5:35 am

thanks Alexander, applied.

	Ingo
--

From: Jeremy Fitzhardinge
Date: Friday, March 14, 2008 - 2:15 pm

Indeed I did.

    J
--

Previous thread: broken access to a USB HID device after suspend by Tino Keitel on Wednesday, March 12, 2008 - 12:34 pm. (17 messages)

Next thread: Re: linux+glibc memory allocator, poor performance by J.C. Pizarro on Wednesday, March 12, 2008 - 1:09 pm. (21 messages)