Re: [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()

Previous thread: FCNTL Performance problem by Rob Donovan on Sunday, August 8, 2010 - 12:26 pm. (4 messages)

Next thread: [PATCH] AFS: Implement an autocell mount capability by David Howells on Sunday, August 8, 2010 - 12:30 pm. (2 messages)
From: Michal Nazarewicz
Date: Sunday, August 8, 2010 - 12:29 pm

The put_dec_trunc() and put_dec_full() functions were based on
a code optimised for processors with 8-bit ALU but since we
don't need to limit ourselves to such small ALUs, the code was
optimised and used capacities of an 16-bit ALU anyway.

This patch goes further and uses the full capacity of a 32-bit
ALU and instead of splitting the number into nibbles and
operating on them it performs the obvious algorithm for base
conversion (except it uses optimised code for dividing by
ten).

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 lib/vsprintf.c |  143 +++++++++++++++++++++++++++----------------------------
 1 files changed, 70 insertions(+), 73 deletions(-)

Compared to v1 only commit message and comments were changed.


I did some benchmark on the following processors:

ARM     : ARMv7 Processor rev 2 (v7l)                   (32-bit)
Atom    : Intel(R) Atom(TM) CPU N270 @ 1.60GHz          (32-bit)
Core    : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz   (64-bit)
Phenom  : AMD Phenom(tm) II X3 710 Processor            (64-bit)

Here are the results (normalised to the fastest/smallest):

                    :        ARM       Atom       Core     Phenom
-- Speed --------------------------------------------------------
orig_put_dec_full   :   1.054570   1.356214   1.732636   1.725760  Original
mod1_put_dec_full   :   1.000000   1.017216   1.255518   1.116559
mod3_put_dec_full   :   1.018222   1.000000   1.000000   1.000000  Proposed

orig_put_dec_trunc  :   1.137903   1.216017   1.850478   1.662370  Original
mod1_put_dec_trunc  :   1.000000   1.078154   1.355635   1.400637
mod3_put_dec_trunc  :   1.025989   1.000000   1.000000   1.000000  Proposed
-- Size ---------------------------------------------------------
orig_put_dec_full   :   1.212766   1.310345   1.355372   1.355372  Original
mod1_put_dec_full   :   1.021277   1.000000   1.000000   1.000000
mod3_put_dec_full   :   1.000000   1.172414   1.049587   1.049587  Proposed

orig_put_dec_trunc  :   1.363636   ...
From: Michal Nazarewicz
Date: Sunday, August 8, 2010 - 12:29 pm

Existing put_dec() function uses a do_div() function for
dividing the 64-bit argument.  On 32-bit machines this may be
a costly operation.  This patch, replaces the put_dec()
function on 32-bit processors to one that performs no 64-bit
divisions.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 lib/vsprintf.c |  114 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 108 insertions(+), 6 deletions(-)

Compared to previous version: the code is used only if:
1. if long long is 64-bit (ie. ULLONG_MAX == 2**64-1), and
2. user did not select optimisation for size with Kconfig.


I did some benchmark on the following processors:

ARM     : ARMv7 Processor rev 2 (v7l)                   (32-bit)
Atom    : Intel(R) Atom(TM) CPU N270 @ 1.60GHz          (32-bit)

(I'm skipping 64-bit machines as this patch is intended only for
32-bit).

Here are the results (normalised to the fastest/smallest):

                    :        ARM       Atom
-- Speed ----------------------------------
orig_put_dec        :   9.333822   2.083110  Original
mod1_put_dec        :   9.282045   1.904564
mod2_put_dec        :   9.260409   1.910302
mod3_put_dec        :   9.320053   1.905689  Proposed by previous patch
mod4_put_dec        :   9.297146   1.933971
mod5_put_dec        :  13.034318   2.434942
mod6_put_dec        :   1.000000   1.000000  Proposed by this patch
mod7_put_dec        :   1.009574   1.014147
mod8_put_dec        :   7.226004   1.953460
-- Size -----------------------------------
orig_put_dec        :   1.000000   1.000000  Original
mod1_put_dec        :   1.000000   1.000000
mod2_put_dec        :   1.361111   1.403226
mod3_put_dec        :   1.000000   1.000000  Proposed by previous patch
mod4_put_dec        :   1.361111   1.403226
mod5_put_dec        :   1.000000   1.000000
mod6_put_dec        :   2.555556   3.508065  Proposed by this patch
mod7_put_dec        :   2.833333   3.911290
mod8_put_dec        :   2.027778   2.258065


Source of the ...
From: Michal Nazarewicz
Date: Sunday, August 8, 2010 - 12:29 pm

This commit adds a test application for the put_dec() and
family of functions that are used by the previous two commits.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 tools/put-dec/Makefile       |   14 +
 tools/put-dec/put-dec-test.c |  942 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 956 insertions(+), 0 deletions(-)
 create mode 100644 tools/put-dec/Makefile
 create mode 100644 tools/put-dec/put-dec-test.c

diff --git a/tools/put-dec/Makefile b/tools/put-dec/Makefile
new file mode 100644
index 0000000..77400c9
--- /dev/null
+++ b/tools/put-dec/Makefile
@@ -0,0 +1,14 @@
+put-dec-test: put-dec-test.c
+	exec $(CC) -Wall -Wextra -O2 -o $@ $<
+
+put-dec-test.s: put-dec-test.c
+	exec $(CC) -Wall -Wextra -O2 -S -o $@ $<
+
+put-dec-test-s: put-dec-test.c
+	exec $(CC) -Wall -Wextra -Os -o $@ $<
+
+put-dec-test-s.s: put-dec-test.c
+	exec $(CC) -Wall -Wextra -Os -S -o $@ $<
+
+clean:
+	rm -f -- put-dec-test
diff --git a/tools/put-dec/put-dec-test.c b/tools/put-dec/put-dec-test.c
new file mode 100644
index 0000000..1860ae3
--- /dev/null
+++ b/tools/put-dec/put-dec-test.c
@@ -0,0 +1,942 @@
+/*
+ * put-dec-test.c -- Variaus put_dec*() functions implementation
+ *                   testing and benchmarking tool
+ * Written by                  Michal Nazarewicz <mina86@mina86.com>
+ * with helpful suggestions by Denys Vlasenko <vda.linux@googlemail.com>
+ */
+#define _BSD_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+#if CHAR_BIT != 8
+#  error This code assumes CHAR_BIT == 8
+#endif
+
+
+
+#  define do_div(n, base) ({			\
+		uint32_t __base = (base);	\
+		uint32_t __rem = (n) % __base;	\
+		(n) /= ...
From: Denys Vlasenko
Date: Monday, August 9, 2010 - 9:15 pm

I measured the size and it does not seem to make sense
to exclude it on -Os. On x86:

put_dec_full change: 0x93 -> 0x47 bytes
put_dec      change: 0x12c -> 0x137 bytes

IOW, there is net code size reduction (compared to current kernel,
it may be a slight growth compared to patch 1).



Re speed: on Phenom II in 32-bit mode, I see ~x3.3 speedup
on conversions involving large integers (might be skewed
by gcc's full-blown 64-bit division in "old" code - kernel's

I tested [0, 100 million] and [2^64-100 million, 2^64-1] ranges.

I think it's better to say "if BITS_PER_LONG > 32 and ULLONG_MAX > 2^64-1",
since it expresses your intent better. Also, add comments explaining
what case you optimize for:

#if BITS_PER_LONG > 32 || ULLONG_MAX > 18446744073709551615ULL

/* Generic code */
...

#else /* BITS_PER_LONG <= 32 && ULLONG_MAX <= 2^64-1 */

/* Optimized code for arches with 64-bit long longs */


I experimented with moving division up, before put_dec_full4:
q   = d1 / 10000;
buf = put_dec_full4(buf, d1 % 10000);
but gcc appears to be smart emough to do this transformation
itself. But you may still do it for older (dumber) gcc's.

-- 
vda
--

From: Michał Nazarewicz
Date: Tuesday, August 10, 2010 - 12:42 am

Hmm...  I think those are new results, but I might have messed something



I wasn't sure where would be a better place to put this line.  I'll
follow your advice on this one then.

-- 
Best regards,                                        _     _
| Humble Liege of Serenely Enlightened Majesty of  o' \,=./ `o
| Computer Science,  Michał "mina86" Nazarewicz       (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--
--

From: Denys Vlasenko
Date: Tuesday, August 10, 2010 - 9:10 am

Thinko. I meant "or", not "and"...

-- 
vda
--

From: Denys Vlasenko
Date: Monday, August 9, 2010 - 8:17 pm

In my testing on Phenom II the speed gain is smaller,


-- 
vda
--

From: Michał Nazarewicz
Date: Tuesday, August 10, 2010 - 12:39 am

No.  81920 is a 17 bit number and when we multiply it by 0xcccd we lose
the most significant bit.  Therefore we cannot use the '(x * 0xcccd) >>

-- 
Best regards,                                        _     _
| Humble Liege of Serenely Enlightened Majesty of  o' \,=./ `o
| Computer Science,  Michał "mina86" Nazarewicz       (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--
--

From: Denys Vlasenko
Date: Tuesday, August 10, 2010 - 9:08 am

No. All x up to (exclusive) 81920 can be multiplied by 0xcccd
and result still fits into 32 bits. Proof:

# printf "%x\n" $((81919 * 0xcccd))
ffff7333

-- 
vda
--

From: Michal Nazarewicz
Date: Tuesday, August 10, 2010 - 3:42 pm

Turns out something else was a problem ((x * 13) >> 7 works for x <
69).  I'll update comments in the next version.

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--
--

Previous thread: FCNTL Performance problem by Rob Donovan on Sunday, August 8, 2010 - 12:26 pm. (4 messages)

Next thread: [PATCH] AFS: Implement an autocell mount capability by David Howells on Sunday, August 8, 2010 - 12:30 pm. (2 messages)