Generic versions of __find_first_bit and __find_first_zero_bit are introduced as simplified versions of __find_next_bit and __find_next_zero_bit. Their compilation and use are guarded by a new config variable GENERIC_FIND_FIRST_BIT. The generic versions of find_first_bit and find_first_zero_bit are implemented in terms of the newly introduced __find_first_bit and __find_first_zero_bit. This patch also converts i386 to the generic functions. The text size shrinks slightly due to uninlining of the find_*_bit functions. text data bss dec hex filename 4764939 480324 622592 5867855 59894f vmlinux (i386 defconfig before) 4764645 480324 622592 5867561 598829 vmlinux (i386 defconfig after) Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> --- Hi Ingo, Here is another step in the unification of the bitops for i386 and x86_64. This patch implements a minimal conversion to a generic implementation of find_first_bit/find_first_zero_bit for i386. The optimization for small bitmaps and the conversion of x86_64 will follow soon. Compiles and runs fine on i386 and x86_64 (current x86#testing). Greetings, Alexander arch/x86/Kconfig | 3 ++ include/asm-x86/bitops_32.h | 56 ----------------------------------------- include/linux/bitops.h | 34 +++++++++++++++++++++++++ lib/Makefile | 1 + lib/find_next_bit.c | 58 +++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 96 insertions(+), 56 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 6b3626d..fa7d16d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -80,6 +80,9 @@ config GENERIC_BUG def_bool y depends on BUG +config GENERIC_FIND_FIRST_BIT + def_bool X86_32 + config GENERIC_FIND_NEXT_BIT def_bool y diff --git a/include/asm-x86/bitops_32.h b/include/asm-x86/bitops_32.h index 3ed64b2..2e86302 100644 --- a/include/asm-x86/bitops_32.h +++ b/include/asm-x86/bitops_32.h @@ -4,62 +4...
thanks, applied. I guess we should keep the bitops.h portions alive for the moment though (surrounded by #ifndef GENERIC_FIND_FIRST_BIT), to make it easy for anyone to flip the Kconfig bit around via a oneliner patch and do some benchmarking? At least initially - i'm convinced that we want the generic versions in the long run. (especially if, as your patches do it, the generic code picks up the easy constant tests and optimizes them at build time) Ingo --
x86: generic versions of find_first_(zero_)bit, convert i386 Generic versions of __find_first_bit and __find_first_zero_bit are introduced as simplified versions of __find_next_bit and __find_next_zero_bit. Their compilation and use are guarded by a new config variable GENERIC_FIND_FIRST_BIT. The generic versions of find_first_bit and find_first_zero_bit are implemented in terms of the newly introduced __find_first_bit and __find_first_zero_bit. This patch does not remove the i386-specific implementation, but it does switch i386 to use the generic functions by setting GENERIC_FIND_FIRST_BIT=y for X86_32. Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> arch/x86/Kconfig | 3 ++ include/asm-x86/bitops_32.h | 2 + include/linux/bitops.h | 34 +++++++++++++++++++++++++ lib/Makefile | 1 + lib/find_next_bit.c | 58 +++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 98 insertions(+), 0 deletions(-) Keeping the old version alive for the moment is fine with me. Here is a replacement patch that does exactly that. Runs on i386 with this patch and also with GENERIC_FIND_FIRST_BIT=n. I'll finish the patches for X86_64 and the constant-size bitmap optimizations after work. Greetings, Alexander diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 6b3626d..fa7d16d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -80,6 +80,9 @@ config GENERIC_BUG def_bool y depends on BUG +config GENERIC_FIND_FIRST_BIT + def_bool X86_32 + config GENERIC_FIND_NEXT_BIT def_bool y diff --git a/include/asm-x86/bitops_32.h b/include/asm-x86/bitops_32.h index 3ed64b2..ba2c0de 100644 --- a/include/asm-x86/bitops_32.h +++ b/include/asm-x86/bitops_32.h @@ -5,6 +5,7 @@ * Copyright 1992, Linus Torvalds. */ +#ifndef CONFIG_GENERIC_FIND_FIRST_BIT /** * find_first_zero_bit - find the first zero bit in a memory region * @addr: The address to start the search at @@ -59,6 +60,7 @...
fwiw there's a way to do ffz / ntz which can do lg(n) conditional moves in parallel... i'm not sure what (non-x86) architectures this might be best on, but it might be a good choice for the generic code... although maybe the large number of constants required will be a burden on RISC processors. take a look at figure 5-17 here http://hackersdelight.org/revisions.pdf int ntz(unsigned x) { unsigned y, bz, b4, b3, b2, b1, b0; y = x & -x; // Isolate rightmost 1-bit. bz = y ? 0 : 1; // 1 if y = 0. b4 = (y & 0x0000FFFF) ? 0 : 16; b3 = (y & 0x00FF00FF) ? 0 : 8; b2 = (y & 0x0F0F0F0F) ? 0 : 4; b1 = (y & 0x33333333) ? 0 : 2; b0 = (y & 0x55555555) ? 0 : 1; return bz + b4 + b3 + b2 + b1 + b0; } -dean --
On Sun, 6 Apr 2008 10:03:43 -0700 (PDT), "dean gaudet" <dean@arctic.org> Hello Dean, The current generic implementation of ffz is O(lg(n)) already, but the version you suggest might indeed be a bit faster if the compiler recognises that is can use conditional moves and the architecture can handle large constants efficiently. On the other had, the bit-search functions tend to be avoided as much as possible, because they are often not implemented as a hardware instruction and even if they are implemented in hardware, they might be slow. The generic version is slow anyhow. That's why the bitmap searches first test if a word in the bitmap is all-0-bits/all-1-bits. The single-word version of ffz might even be better off if it was optimized for size instead of being fully Note: mask32 = ~0ul; mask16 = mask32 ^ (mask32 << 16), mask8 = ... Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - Or how I learned to stop worrying and love email again --
it's O(lg(n)) time... the operations all depend on each other. the implementation i pointed to is O(lg(n)) code space... and the time depends on how parallel the machine is, they're not dependent on each other. -dean --
On Sun, 6 Apr 2008 13:22:58 -0700 (PDT), "dean gaudet" <dean@arctic.org> Indeed. The worst dependencies are in the sum of all the partial results in this implementation. And addition is associative, so partial results can be written as ((a+b)+(c+d))+(e+f). Assuming perfect parallel execution this would lead to O(ln(ln(n))). Good. Care to implement ffs and __ffs like this? Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - The professional email service --
Hello all, I've implemented ffs (find first set bit) like it is shown in http://www.hackersdelight.org/ (see revisions, page 21). I would be interested to see how it compares with the generic implementation of __ffs in the linux kernel, in particular on architectures that do not have an obvious fast way of determining the first set bit of a word. I have included a simple benchmark program that should give a reasonable estimate of the performance ratio of the two implementations. Please test and report :). Is this implementation suitable to replace the current one? Greetings, Alexander P.S. Some results for some machines I could test on: ----------- On a 2.1 GHz Athlon XP: $ gcc -m32 -Os -fomit-frame-pointer ffs.c && time ./a.out Original: 396 tics, 756 tics New: 378 tics, 851 tics Assembly: 262 tics, 383 tics Empty loop: 128 tics, 203 tics real 0m33.862s user 0m33.026s sys 0m0.344s ----------- On a 2.33 GHz Xeon: $ gcc -m64 -Os ffs.c && time ./a.out Original: 277 tics, 324 tics New: 230 tics, 236 tics Assembly: 90 tics, 84 tics Empty loop: 90 tics, 82 tics real 0m14.490s user 0m14.270s sys 0m0.220s $ gcc -m32 -Os -fomit-frame-pointer ffs.c && time ./a.out Original: 303 tics, 449 tics New: 231 tics, 394 tics Assembly: 90 tics, 144 tics Empty loop: 102 tics, 116 tics real 0m18.521s user 0m18.380s sys 0m0.140s ----------- On an alpha EV6.7 (21264A) operating at 667 MHz: $ gcc -Os ffs.c && time ./a.out Original: 622 tics, 633 tics New: 431 tics, 465 tics Assembly: 235 tics, 210 tics Empty loop: 199 tics, 218 tics 50.358u 0.259s 1:14.28 68.1% 0+1k 2+0io 2pf+0w ----------- #include <stdio.h> #include <sys/times.h> #define LOOPS32 (1<<(30-5-1)) #define LOOPS64 (1<<(30-6-1)) #define A...
technically you can compute x4 with the original value prior to isolating the least-significant one-bit -- the compiler probably can't figure this i'm never sure if it's better to use | or + here... i bet it depends on what native operations the processor has... and depends on how ?: are implemented. -dean --
How about: u8 x; value &= -value; x = (value & 0x55555555) ? 0 : 1; x |= (value & 0x33333333) ? 0 : 2; x |= (value & 0x0f0f0f0f) ? 0 : 4; x |= (value & 0x00ff00ff) ? 0 : 8; x |= (value & 0x0000ffff) ? 0 : 16; return x; Harvey --
any reasonable compiler should figure out the two are the same... but i really prefer spelling out the lack of dependencies of the computations by breaking it out per-bit. -dean --
It seems gcc 4.3 (-Os or -O2) isn't a reasonable compiler.
I think this might be best:
int ffs32(unsigned int value)
{
int x;
value &= -value;
if (!(value & 0x55555555))
x = 1;
else
x = 0;
if (!(value & 0x33333333))
x |= 2;
if (!(value & 0x0f0f0f0f))
x |= 4;
if (!(value & 0x00ff00ff))
x |= 8;
if (!(value & 0x0000ffff))
x |= 16;
return x;
}
--That produces the shortest assembly for me, also uses the fewest registers. Harvey --
unfortunately it kind of defeats the purpose of the original code... which is high parallelism / no-dependencies. have you benchmarked it? -dean --
I modified Alexander's benchmark: http://lkml.org/lkml/2008/4/18/267 to include 32 and 64 bit variants called smallest. On an old ARM: $ gcc --version gcc (GCC) 3.4.6 $ cat /proc/cpuinfo Processor : Intel StrongARM-110 rev 4 (v4l) BogoMIPS : 262.14 Hardware : Rebel-NetWinder Revision : 57ff Serial : 000000000000185c $ gcc -Os -fomit-frame-pointer ffs.c $ ./a.out Original: 3180 tics, 8379 tics New: 4280 tics, 8890 tics Smallest: 4027 tics, 7835 tics Empty loop: 1543 tics, 2260 tics $ gcc -O2 -fomit-frame-pointer ffs.c $ ./a.out Original: 3161 tics, 7843 tics New: 4778 tics, 8783 tics Smallest: 4408 tics, 7149 tics Empty loop: 1515 tics, 2140 tics $ gcc -O3 -fomit-frame-pointer ffs.c $ ./a.out Original: 3078 tics, 7692 tics New: 4714 tics, 8671 tics Smallest: 4344 tics, 7117 tics Empty loop: 1444 tics, 2024 tics On my old laptop: $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz stepping : 4 cpu MHz : 1200.000 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm up bogomips : 2400.03 clflush size : 64 $ gcc --version gcc (GCC) 4.3.0 $ gcc -Os -fomit-frame-pointer ffs.c $ ./a.out Original: 901 tics, 1426 tics New: 862 tics, 1244 tics Smallest: 911 tics, 1331 tics Assembly: 208 tics, 402 tics Empty loop: 208 tics, 304 tics $ gcc -O2 -fomit-frame-pointer ffs.c $ ./a.out Original: 918 tics, 1386 tics New: 872 tics, 1193 tics Smallest: 906 tics, 1309 tics ...
... I am curious, why not take the code already in glibc ffs() for ARM ? That is, if the ffs() is all that important detail in kernel ? /Matti Aarnio --
Here's test results with the glibc ffs implementation. (small const is still using slower add rather than or) $ gcc -Os -fomit-frame-pointer ffs.c $ ./a.out Original: 3155 tics, 8331 tics New: 4211 tics, 8793 tics Smallest: 4019 tics, 7754 tics Small const: 3552 tics, 6308 tics glibc: 2816 tics, 6911 tics Empty loop: 1516 tics, 2244 tics $ gcc -O2 -fomit-frame-pointer ffs.c $ ./a.out Original: 3155 tics, 7828 tics New: 4792 tics, 8825 tics Smallest: 4401 tics, 7155 tics Small const: 3539 tics, 5805 tics glibc: 2720 tics, 7061 tics Empty loop: 1516 tics, 2148 tics $ gcc -O3 -fomit-frame-pointer ffs.c $ ./a.out Original: 3080 tics, 7706 tics New: 4721 tics, 8663 tics Smallest: 4334 tics, 7116 tics Small const: 3466 tics, 5672 tics glibc: 2649 tics, 6939 tics Empty loop: 1444 tics, 2012 tics --
On Sat, 19 Apr 2008 20:06:57 -0700, "Joe Perches" <joe@perches.com> Hi, The glibc version is based on a table-lookup. This makes it behave differently in hot and cold cache situations. That's fine if __ffs is used in tight loops, but in the kernel such use of __ffs is avoided because it might be slow. I added it to the benchmark, but it would need testing for the cold cache case too. As for the importance of __ffs in the kernel: as far as I know the hot-spots in the kernel using __ffs are the schedular (sched_find_first_bit) and the cpu mask walking code (for_each_cpu_mask). Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - A no graphics, no pop-ups email service --
Perhaps those hot-spots would benefit from more broadly accelerable algorithms. ARM architecture v5 introduced a CLZ instruction -- Count Leading Zeroes. Well, gcc's __builtin_ffs() for ARM Arch5 and up (including XScale) does things in a bit more interesting way: http://mail-index.netbsd.org/port-arm/2002/08/20/0001.html $ cat try.c int foo(int i) { return __builtin_ffs(i); } $ arm-gp2x-linux-gcc -S -O -march=armv5 try.c $ more try.s .file "try.c" .text .align 2 .global foo .type foo, %function foo: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. @ lr needed for prologue rsb r3, r0, #0 and r3, r3, r0 clz r3, r3 rsb r0, r3, #32 bx lr .size foo, .-foo /Matti Aarnio --
On Sun, 20 Apr 2008 15:31:46 +0300, "Matti Aarnio" That would be a possibility too ;). The advantages of bitmaps Yeah, if such an instruction exist, the arch should provide optimized versions for the bit functions. The interest here was to provide a (beter) generic version in pure C for architectures without such instructions. Inline assembly versions are indeed provided in asm-arm/bitops.h for ARM5+. Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - I mean, what is it about a decent email service? --
i'm guessing the 32-bit constants suck :( the code could be modified to use 16-bit constants only -- it would add some dependent operations though (to move the hot bit into the low 16-bits). --
On Fri, 18 Apr 2008 21:13:47 -0700 (PDT), "dean gaudet"
That would look like this (although I chose to reduce to less than 128,
due to completely irrelevant x86 considerations ;) ).
static ATTR int __ffs32_smallconstant(unsigned int value)
{
int x0, x1, x2, x3, x4;
unsigned int t2, t4;
value &= -value;
t2 = value | (value >> 16);
t4 = t2 | (t2 >> 8);
x4 = (value << 16) ? 0 : 16;
x3 = (t2 << 24) ? 0 : 8;
x2 = (t4 & 0x0f) ? 0 : 4;
x1 = (t4 & 0x33) ? 0 : 2;
x0 = (t4 & 0x55) ? 0 : 1;
return x4 | x3 | x2 | x1 | x0;
}
I've added that to the benchmark, which you can now find here:
http://heukelum.fastmail.fm/ffs/. Testing the same with
"return x4 + x3 + x2 + x1 + x0;" as the last line would be
interesting too.
Greetings,
Thanks for testing, Harvey!
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - IMAP accessible web-mail
--retested on arm: $ gcc -Os -fomit-frame-pointer ffs.c $ ./a.out Original: 3170 tics, 8326 tics New: 4214 tics, 8793 tics Smallest: 4023 tics, 7733 tics Small const: 3442 tics, 6188 tics Empty loop: 1517 tics, 2243 tics $ gcc -O2 -fomit-frame-pointer ffs.c $ ./a.out Original: 3172 tics, 7832 tics New: 4805 tics, 8790 tics Smallest: 4405 tics, 7154 tics Small const: 3442 tics, 5612 tics Empty loop: 1516 tics, 2145 tics $ gcc -O3 -fomit-frame-pointer ffs.c $ ./a.out Original: 3080 tics, 7709 tics New: 4723 tics, 8656 tics Smallest: 4333 tics, 7121 tics Small const: 3379 tics, 5483 tics Adding is slower: $ gcc -Os -fomit-frame-pointer ffs.c $ ./a.out Original: 3152 tics, 8310 tics New: 4214 tics, 8789 tics Smallest: 4024 tics, 7737 tics Small const: 3538 tics, 6295 tics Empty loop: 1517 tics, 2243 tics $ gcc -O2 -fomit-frame-pointer ffs.c $ ./a.out Original: 3184 tics, 7849 tics New: 4790 tics, 8814 tics Smallest: 4406 tics, 7161 tics Small const: 3538 tics, 5806 tics Empty loop: 1521 tics, 2153 tics $ gcc -O3 -fomit-frame-pointer ffs.c $ ./a.out Original: 3091 tics, 7694 tics New: 4718 tics, 8656 tics Smallest: 4333 tics, 7124 tics Small const: 3467 tics, 5687 tics Empty loop: 1445 tics, 2066 tics --
On Sat, 19 Apr 2008 11:17:01 -0700, "Joe Perches" <joe@perches.com> Thanks! Added the version you sent to the program and added the results of the ARM processor to the page. -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - And now for something completely different --
dean gaudet writes: > On Fri, 18 Apr 2008, Joe Perches wrote: > > > On Fri, 2008-04-18 at 18:11 -0700, dean gaudet wrote: > > > have you benchmarked it? > > > > I modified Alexander's benchmark: > > http://lkml.org/lkml/2008/4/18/267 > > to include 32 and 64 bit variants called smallest. > > > > On an old ARM: > > i'm guessing the 32-bit constants suck :( > > the code could be modified to use 16-bit constants only 16-bit immediates don't help, as ARMs express immediate operands in arithmetic instructions as 8-bit values plus a 4-bit rotation count (which is multiplied by 2). Very new ARMs can construct full 16-bit immediates, but that still takes an additional instruction and an additional register. --
yep, i'd suggest we gravitate towards such a no-dependencies implementation for the generic code - especially modern CPUs would be able to execute them rather efficiently. Ingo --
[PATCH] x86: switch x86_64 to generic find_first_bit Switch x86_64 to generic find_first_bit. The x86_64-specific implementation is not removed. Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> --- arch/x86/Kconfig | 2 +- arch/x86/lib/bitops_64.c | 2 ++ include/asm-x86/bitops_64.h | 2 ++ 3 files changed, 5 insertions(+), 1 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index fa7d16d..e37e286 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -81,7 +81,7 @@ config GENERIC_BUG depends on BUG config GENERIC_FIND_FIRST_BIT - def_bool X86_32 + def_bool y config GENERIC_FIND_NEXT_BIT def_bool y diff --git a/arch/x86/lib/bitops_64.c b/arch/x86/lib/bitops_64.c index 0eeb704..568467d 100644 --- a/arch/x86/lib/bitops_64.c +++ b/arch/x86/lib/bitops_64.c @@ -1,3 +1,4 @@ +#ifndef CONFIG_GENERIC_FIND_FIRST_BIT #include <linux/bitops.h> #undef find_first_zero_bit @@ -105,3 +106,4 @@ long find_first_bit(const unsigned long * addr, unsigned long size) EXPORT_SYMBOL(find_first_bit); EXPORT_SYMBOL(find_first_zero_bit); +#endif diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h index d133520..4081d7e 100644 --- a/include/asm-x86/bitops_64.h +++ b/include/asm-x86/bitops_64.h @@ -5,6 +5,7 @@ * Copyright 1992, Linus Torvalds. */ +#ifndef CONFIG_GENERIC_FIND_FIRST_BIT extern long find_first_zero_bit(const unsigned long *addr, unsigned long size); extern long find_first_bit(const unsigned long *addr, unsigned long size); @@ -24,6 +25,7 @@ static inline long __scanbit(unsigned long val, unsigned long max) ((__builtin_constant_p((size)) && (size) <= BITS_PER_LONG \ ? (__scanbit(~*(unsigned long *)(addr), (size))) \ : find_first_zero_bit((addr), (size)))) +#endif static inline void set_bit_string(unsigned long *bitmap, unsigned long i, int len) -- 1.5.2.5 --
[PATCH] x86: optimize find_first_bit for small bitmaps
Avoid a call to find_first_bit if the bitmap size is know at
compile time and small enough to fit in a single long integer.
Modeled after an optimization in the original x86_64-specific
code.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
include/linux/bitops.h | 29 +++++++++++++++++++++++++++++
1 files changed, 29 insertions(+), 0 deletions(-)
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 355d67b..48bde60 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -127,6 +127,20 @@ extern unsigned long __find_first_bit(const unsigned long *addr,
static __always_inline unsigned long
find_first_bit(const unsigned long *addr, unsigned long size)
{
+ /* Avoid a function call if the bitmap size is a constant */
+ /* and not bigger than BITS_PER_LONG. */
+
+ /* insert a sentinel so that __ffs returns size if there */
+ /* are no set bits in the bitmap */
+ if (__builtin_constant_p(size) && (size < BITS_PER_LONG))
+ return __ffs((*addr) | (1ul << size));
+
+ /* the result of __ffs(0) is undefined, so it needs to be */
+ /* handled separately */
+ if (__builtin_constant_p(size) && (size == BITS_PER_LONG))
+ return ((*addr) == 0) ? BITS_PER_LONG : __ffs(*addr);
+
+ /* size is not constant or too big */
return __find_first_bit(addr, size);
}
@@ -143,6 +157,21 @@ extern unsigned long __find_first_zero_bit(const unsigned long *addr,
static __always_inline unsigned long
find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
+ /* Avoid a function call if the bitmap size is a constant */
+ /* and not bigger than BITS_PER_LONG. */
+
+ /* insert a sentinel so that __ffs returns size if there */
+ /* are no set bits in the bitmap */
+ if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
+ return __ffs(~(*addr) | (1ul << size));
+ }
+
+ /* the result of __ffs(0) is undefined, so it needs to ...[PATCH] x86: remove x86-specific implementations of find_first_bit x86 has been switched to the generic versions of find_first_bit and find_first_zero_bit, but the original versions were retained. This patch just removes the now unused x86-specific versions. Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> --- arch/x86/lib/Makefile | 1 - arch/x86/lib/bitops_64.c | 109 ------------------------------------------- include/asm-x86/bitops_32.h | 58 ----------------------- include/asm-x86/bitops_64.h | 23 --------- 4 files changed, 0 insertions(+), 191 deletions(-) delete mode 100644 arch/x86/lib/bitops_64.c This patch should of course wait until it is decided if the generic version is good enough to replace the x86-specific ones. I hacked together a benchmark that people can run on different machines. It can be downloaded from: http://heukelum.fastmail.fm/find_first_bit Greetings, Alexander diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index 4360932..76f60f5 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -21,7 +21,6 @@ else lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o lib-y += thunk_64.o clear_page_64.o copy_page_64.o - lib-y += bitops_64.o lib-y += memmove_64.o memset_64.o lib-y += copy_user_64.o rwlock_64.o copy_user_nocache_64.o endif diff --git a/arch/x86/lib/bitops_64.c b/arch/x86/lib/bitops_64.c deleted file mode 100644 index 568467d..0000000 --- a/arch/x86/lib/bitops_64.c +++ /dev/null @@ -1,109 +0,0 @@ -#ifndef CONFIG_GENERIC_FIND_FIRST_BIT -#include <linux/bitops.h> - -#undef find_first_zero_bit -#undef find_first_bit - -static inline long -__find_first_zero_bit(const unsigned long * addr, unsigned long size) -{ - long d0, d1, d2; - long res; - - /* - * We must test the size in words, not in bits, because - * otherwise incoming sizes in the range -63..-1 will not run - * any scasq instructions, and ...
thanks Alexander, i've added your patches to x86.git/latest. Ingo --
I now put a much more sound benchmarking program there. And its results for P-IV-like Xeon, Athlon XP and Opteron for the current and generic implementations. Greetings, Alexander --
On Mon, 31 Mar 2008 19:15:06 +0200 Size isn't everything, what is the performance difference? --
Hi, Performance should not change too much. Uninlining of the functions has some impact, of course, but this should be visible only for small bitmap sizes. Measuring the performance impact by doing artificial benchmarks is a bit problematic too, because it is hard to guess what patterns are important. Anyhow, I hacked together a program (in userspace) that searches for a bit in a bitmap. In pseudo code: bitmap <- [0...] for bitmapsize=1 to 512 for bitposition=0 to bitmapsize-1 find_first_bit in bitmap bitmap[bitposition] <- 1 find_first_bit in bitmap bitmap[bitposition] <- 0 After each find_first_bit, the result is checked against the expected result. A similar test is done for searching zero bits. The two tests are performed 1000 times in a loop. On a 2.4GHz (P-IV-type) Xeon, I get the following results: $ gcc -DNEW -fomit-frame-pointer -Os find_first_bit.c && time ./a.out real 0m15.006s $ nm -nStd 0000000134513492 0000000000000065 T find_first_bit 0000000134513557 0000000000000062 T find_first_zero_bit 0000000134513619 0000000000000190 T testzerobit 0000000134513809 0000000000000187 T testonebit 0000000134513996 0000000000000045 T main and $ gcc -fomit-frame-pointer -Os find_first_bit.c && time ./a.out real 0m17.617s 0000000134513492 0000000000000293 T testzerobit 0000000134513785 0000000000000240 T testonebit 0000000134514025 0000000000000045 T main So in this particular case, on this particular machine, with this particular mix of searches, the new code is somewhat faster, even though it is out-of-line. A similar, but more convincing, change was seen when the assembly versions for find_next_bit and find_next_zero_bit where replaced by the generic ones. Greetings, Alexander --
I don't think that's a useful benchmark because you're testing mostly values that a real kernel will never use. -Andi --
| Francois Romieu | Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out" |
| Greg Kroah-Hartman | [PATCH 040/196] kobject: add kobject_add_ng function |
| Dave Airlie | [git pull] drm patches for 2.6.27 final |
| john stultz | [PATCH] correct inconsistent ntp interval/tick_length usage |
| Krzysztof Halasa | Re: [PATCH v2] Re: WAN: new PPP code for generic HDLC |
| David Miller | Re: [PATCH] Expose netdevice dev_id through sysfs |
| Dave Jones | odd RTL8139 quirk. |
| Auke Kok | [PATCH 4/8] e1000e: lower ring minimum size to 64 |
git: | |
| Miklos Vajna | [rfc] git submodules howto |
| Andrew Morton | Untracked working tree files |
| Ben Collins | Re: [kernel.org users] [RFD] On deprecating "git-foo" for builtins |
| Jon Smirl | ! [rejected] master -> master (non-fast forward) |
| rancor | How to copy/pipe console buffert to file? |
| Pieter Verberne | File collision while using pkg_add |
| Greg Thomas | Re: Is it possible to fix a stale NFS hadle without rebooting? |
| Didier Wiroth | win32-codecs, avi and amd64 question |
| Netfilter kernel module | 9 hours ago | Linux kernel |
| serial driver xmit problem | 12 hours ago | Linux kernel |
| Why Windows is better than Linux | 12 hours ago | Linux general |
| How can I see my kernel messages in vt12? | 19 hours ago | Linux kernel |
| Grub | 1 day ago | Linux general |
| vmalloc_fault handling in x86_64 | 1 day ago | Linux kernel |
| epoll_wait()ing on epoll FD | 1 day ago | Linux kernel |
| Framebuffer in x86_64 causes problems to multiseat | 1 day ago | Linux kernel |
| Difference between 2.4 and 2.6 regarding thread creation | 1 day ago | Linux general |
| Compiling gfs2 on kernel 2.6.27 | 2 days ago | Linux kernel |
