No need for a sentinal if we explicitly catch the no bits/all bits set
cases, make it clear they are special cases returning size/BITS_PER_LONG.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
---
include/linux/bitops.h | 93 ++++++++++++++++++++----------------------------
1 files changed, 39 insertions(+), 54 deletions(-)
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 48bde60..d9eb58a 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -127,18 +127,17 @@ extern unsigned long __find_first_bit(const unsigned long *addr,
static __always_inline unsigned long
find_first_bit(const unsigned long *addr, unsigned long size)
{
- /* Avoid a function call if the bitmap size is a constant */
- /* and not bigger than BITS_PER_LONG. */
-
- /* insert a sentinel so that __ffs returns size if there */
- /* are no set bits in the bitmap */
- if (__builtin_constant_p(size) && (size < BITS_PER_LONG))
- return __ffs((*addr) | (1ul << size));
-
- /* the result of __ffs(0) is undefined, so it needs to be */
- /* handled separately */
- if (__builtin_constant_p(size) && (size == BITS_PER_LONG))
- return ((*addr) == 0) ? BITS_PER_LONG : __ffs(*addr);
+ /*
+ * Avoid a function call if the bitmap size is a constant and not
+ * bigger than BITS_PER_LONG. Ensure we return size if there are
+ * no set bits.
+ */
+ if (__builtin_constant_p(size) && (size <= BITS_PER_LONG)) {
+ if (*addr == 0)
+ return (size < BITS_PER_LONG) ? size : BITS_PER_LONG;
+ else
+ return __ffs(*addr);
+ }
/* size is not constant or too big */
return __find_first_bit(addr, size);
@@ -157,20 +156,17 @@ extern unsigned long __find_first_zero_bit(const unsigned long *addr,
static __always_inline unsigned long
find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
- /* Avoid a function call if the bitmap size is a constant */
- /* and not bigger than BITS_PER_LONG. */
-
- /* insert a sentinel so that __ffs returns size if there ...These things need to be re-done anyway. David already complained that the thing inlines to something much much too big. Let's just make it not be an inline at all. Linus --
Oh, I didn't realize, I only did this because sparse started spewing out lots of: include/linux/bitops.h:166:32: warning: shift too big (65536) for type unsigned long due to shift by size there, and again on line 202...I just wanted something that sparse wouldn't warn about and was a little easier to understand to boot. Cheers, Harvey --
That's a sparse problem, really. I wonder if we simply should introduce a new node type: EXPR_WARN. So that expand would generate those from things like division by zero/overflow/bad shift *and* emitting an insn for those would generate a stored warning. Objections? --
Well, even though it is a sparse problem, I think my revised version was None here Harvey --
Well, that's really just a sparse bug/misfeature that didn't matter before. It happens because the warning is done as part of constant expression evaluation, but then the expression isn't actually *used*, so when we optimize it away - at a later date - it's too late to undo the warning. I don't rightly know how to fix it. We do want to do the constant evaluation early (since all the optimizations that may then make it a non-issue depends on constants being constants!), but in order to not output the warning we'd have to turn the constant into a "constant with warning 'xyz' if used". Which we have no support for in sparse yet. Linus --
It's not even a constant, really... What we need is * don't fold further if that sucker would be evaluated (i.e. 0 && <...> is folded, 0 & <...> is not) * don't consider an integer constant expression with value 0 for purposes of null pointer constant handling * emit corresponding insn in linearize_expression() when we run into such sucker. * somewhere around the call of vrfy_flow() walk through insns in remaining bb (we'd done dead code elimination already) and spew postponed warning on each such insn. Probably. Assuming that I'm not entirely confused about what's going on in that area - which is quite possible. Folks, could somebody familiar with the backend comment on the last part? --
The delta between the inlined version including the constant
optimizations and the original non inlined version is exactly 1400
Which makes the whole constant optimization moot.
How about making the optimization controlled by a config switch so
arch maintainers can decide whether they want to enable the constant
optimization or unconditionally call the lib function ?
See patch below. It gives back the 1400 bytes on SPARC64 and other
platforms that have no instruction for find bit.
Thanks,
tglx
---------->
Subject: bitops: optional bitops mapsize optimization on config switch
From: Thomas Gleixner <tglx@linutronix.de>
Date: Mon, 28 Apr 2008 00:22:06 +0200
The mapsize optimizations which were moved from x86 to the generic
code in commit 64970b68d2b3ed32b964b0b30b1b98518fde388e increased the
binary size on non x86 architectures.
Make the optimization depend on a config switch so architecture
maintainers can decide.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/Kconfig.cpu | 1 +
include/linux/bitops.h | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)
Index: linux-2.6/arch/x86/Kconfig.cpu
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig.cpu
+++ linux-2.6/arch/x86/Kconfig.cpu
@@ -282,6 +282,7 @@ config X86_CPU
def_bool y
select GENERIC_FIND_FIRST_BIT
select GENERIC_FIND_NEXT_BIT
+ select GENERIC_FIND_BIT_MAPSIZE_OPTIMIZE
config X86_GENERIC
bool "Generic x86 support"
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -190,6 +190,7 @@ static __always_inline unsigned long
find_next_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
+#ifdef CONFIG_GENERIC_FIND_BIT_MAPSIZE_OPTIMIZE
unsigned long value;
/* Avoid a function call if the ...Hi, I, personally, see this patch as a stopgap measure. The real problem is that the generic __ffs is inlined and ends up generating a lot of instructions. Until the generic implementation of __ffs is fixed not to be an enormous inline function, this seems to be a reasonable thing to do, though it's a shame that the optimization is now default- off for architectures with a good implementation of __ffs. Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - mmm... Fastmail... --
Yup, it hasn't been there before and we explicitely enable it on x86. If there are other archs which have a good one it's a nobrainer to enable it. Thanks, tglx --
No. This is just making that damn header line look worse and worse. Is there a _reason_ to optimize this stupid function this way? Linus --
On architectures which have a find bit instruction we can replace the call to the library function by a single instruction when the size of the bitmap is less/equal bits per long and when the bitnr is a constant. Thanks, tglx --
That's not what I asked. I asked whether there is a *reason* to optimize this and cause all these stupid problems. Is there a real hot path anywhere that actually uses this and depends on it? Linus --
Sorry, misunderstood your question. I checked the use cases and have not found a single one for find_next_(zero_)bit() which makes use of this micro optimization. In fact the .text section of vmlinux of the "optimized" and the straight function call are identical. So the effect of those micro optimizations is exactly zero. find_first_(zero_)bit() has a couple of places where the optimization hits, but the code size reduction is mere 21 bytes and the use cases are not in real hot pathes AFAICT. I doubt that that is worth the trouble and we should just remove those inlines alltogether. This was discussed before, but Andi objected to remove those micro optimizations and nobody had time to actually verify the real benefits. I'll whip up a patch. Thanks, tglx --
Thanks for checking. In general, I think we should act the other way: we should *assume* that inlines and complex macros in header files are not worth it and we'd be better off with just plain function calls, unless it's really obvious that the inline function is always worth it. We have tons of good examples of inline functions in headers that really *really* want to be that way: <linux/list.h> is generally a collection of things that clearly expand to somthing that is (a) easily critical and (b) inlining things generally really causes a smaller code footprint than having a "call" instruction and all the argument setup - regardless of how many times it is done. So we do have cases where the inlines are obviously worth it. But in general, I think we should try to move things from the header files into *.c files unless there is a really clear reason for keeping it that way. (Some inline functions do depend on things like constant arguments goign away, with kmalloc() being a good example of some really nasty header file tricks that are likely worth it despite their extreme ugliness) Linus --
there's another benefit, and in asm-x86 we prefer to move inlines to .c files even in borderline cases because it simplifies the type dependencies: not having to fully define all types at the function prototype site avoids include file dependency hell. Putting things like a task struct dereference into a lowlevel inline file easily causes dependency problems that causes people to use macros instead - which have their own set of readability and side-effect problems. a third argument is that inlines seldom get smaller. So if they are borderline and we move them into a .c, and later on the function gets larger, no harm is done. But if we keep the inline in a .h in the borderline case and we grow the inline later on, the whole kernel bloats in a multiplied way, without any apparent direct feedback to the developer that something wrong and harmful just happened. Ingo --
The mapsize optimizations which were moved from x86 to the generic
code in commit 64970b68d2b3ed32b964b0b30b1b98518fde388e increased the
binary size on non x86 architectures.
Looking into the real effects of the "optimizations" it turned out
that they are not used in find_next_bit() and find_next_zero_bit().
The ones in find_first_bit() and find_first_zero_bit() are used in a
couple of places but none of them is a real hot path.
Remove the "optimizations" all together and call the library functions
unconditionally.
Boot-tested on x86 and compile tested on every cross compiler I have.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
include/linux/bitops.h | 115 +++++--------------------------------------------
lib/find_next_bit.c | 22 ++++-----
2 files changed, 22 insertions(+), 115 deletions(-)
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -114,8 +114,6 @@ static inline unsigned fls_long(unsigned
#ifdef __KERNEL__
#ifdef CONFIG_GENERIC_FIND_FIRST_BIT
-extern unsigned long __find_first_bit(const unsigned long *addr,
- unsigned long size);
/**
* find_first_bit - find the first set bit in a memory region
@@ -124,28 +122,8 @@ extern unsigned long __find_first_bit(co
*
* Returns the bit number of the first set bit.
*/
-static __always_inline unsigned long
-find_first_bit(const unsigned long *addr, unsigned long size)
-{
- /* Avoid a function call if the bitmap size is a constant */
- /* and not bigger than BITS_PER_LONG. */
-
- /* insert a sentinel so that __ffs returns size if there */
- /* are no set bits in the bitmap */
- if (__builtin_constant_p(size) && (size < BITS_PER_LONG))
- return __ffs((*addr) | (1ul << size));
-
- /* the result of __ffs(0) is undefined, so it needs to be */
- /* handled separately */
- if (__builtin_constant_p(size) && (size == ...From: Thomas Gleixner <tglx@linutronix.de> Acked-by: David S. Miller <davem@davemloft.net> --
From: David Miller <davem@davemloft.net>
Ironically, I just bisected sparc64 bootup failures to the following
changeset. It's very late here, and I haven't looked into the
details, but it seems to wedge in free_area_init_nodes() on a non-NUMA
system with CONFIG_NUMA disabled.
Isn't it funny that this optimization not only was useless, but also
broke things. :-/
64970b68d2b3ed32b964b0b30b1b98518fde388e is first bad commit
commit 64970b68d2b3ed32b964b0b30b1b98518fde388e
Author: Alexander van Heukelum <heukelum@mailshack.com>
Date: Tue Mar 11 16:17:19 2008 +0100
x86, generic: optimize find_next_(zero_)bit for small constant-size bitmaps
This moves an optimization for searching constant-sized small
bitmaps form x86_64-specific to generic code.
On an i386 defconfig (the x86#testing one), the size of vmlinux hardly
changes with this applied. I have observed only four places where this
optimization avoids a call into find_next_bit:
In the functions return_unused_surplus_pages, alloc_fresh_huge_page,
and adjust_pool_surplus, this patch avoids a call for a 1-bit bitmap.
In __next_cpu a call is avoided for a 32-bit bitmap. That's it.
On x86_64, 52 locations are optimized with a minimal increase in
code size:
Current #testing defconfig:
146 x bsf, 27 x find_next_*bit
text data bss dec hex filename
5392637 846592 724424 6963653 6a41c5 vmlinux
After removing the x86_64 specific optimization for find_next_*bit:
94 x bsf, 79 x find_next_*bit
text data bss dec hex filename
5392358 846592 724424 6963374 6a40ae vmlinux
After this patch (making the optimization generic):
146 x bsf, 27 x find_next_*bit
text data bss dec hex filename
5392396 846592 724424 6963412 6a40d4 vmlinux
[ tglx@linutronix.de: build fixes ]
--
hm. I guess the next thing you plan to check is whether Thomas's patch yeah and sorry :-/ It would still be nice to understand why this happened, it looks like a weird (and unexpected) interaction. Ingo --
From: Ingo Molnar <mingo@elte.hu> Yes, I just needed some sleep first, which I just obtained :-) --
The question is, whether it broke things or just unearthed some bug hidden elsewhere. Thanks, tglx --
From: Thomas Gleixner <tglx@linutronix.de> So I did a quick test, just #if 0'ing out the optimization inline portions of the find_first_bit() code in linux/bitops.h, and forcing it to always unconditionally call __find_first_bit() fixes the regression. Given that others who tested could not find one case where the optimization cases actually applied, and it's breaking things for me, my theory is that it's triggering for some obscure case on sparc64 and thus showing a bug in these optimizations since in practice I'm the only person to actually test this new code. --
From: David Miller <davem@davemloft.net> Ok, I think I see the problem. The core issue is that (X << N) is undefined when N is >= the word size, but that's exactly what the find_next_bit() inline optimizations do. This optimization code will trigger on 64-bit if NR_CPUS is set to 64, and you actually have 64 cpus. It should also occur on 32-bit if NR_CPUS=32 and you have 32 cpus. The bogus expansion occurs in lib/cpumask.c:__next_cpu() After processing cpu 63, we'll use offset==64 and thus try to make the undefined shift I described above, causing the caller's cpumask iteration loop to run forever. This code was really not tested very well at all. --
A long time ago during profiling I found that in the scheduler back then for_each_cpu() with find_next_bit() was somewhat hot. But I'm not sure it still would be in the new scheduler anyways and the test case a little dumb anyways (overscheduling user space) Still I think for_each_cpu() should be reasonably stream lined code. -Andi --
On Sun, 27 Apr 2008 13:26:34 -0700 (PDT), "Linus Torvalds" Hello, I think that's the wrong way to go about it. The problem that David Miller pointed out was that the small-bitmap optimization caused an inliningdisaster. Apparently, the generic version of __ffs is very big on sparc64. In my opinion the implementation of __ffs should be made out of line. If you move the wrapper of find_first_bit etc out of line, all possibilities of optimization will be destroyed. Better to remove them completely, then. Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - Accessible with your email software or over the web --
Hi, Ingo just pointed me to this thread. On Sun, 27 Apr 2008 13:19:51 -0700, "Harvey Harrison" Ehm... assume size=16 and *addr=0x80000000 -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - mmm... Fastmail... --
