This patch ports arch/x86/lib/memcpy_64.S to "perf bench mem".
When PERF_BENCH is defined at the preprocessor level, memcpy_64.S is
preprocessed into a form that the benchmark sources under tools/perf
can include.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/lib/memcpy_64.S | 30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 75ef61e..72c6dfe 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -1,10 +1,23 @@
 /* Copyright 2002 Andi Kleen */
+/*
+ * perf bench adoption by Hitoshi Mitake
+ * PERF_BENCH means that this file is included from
+ * the source files under tools/perf/ for benchmark programs.
+ *
+ * You don't have to care about PERF_BENCH when
+ * you are working on the kernel.
+ */
+
+#ifndef PERF_BENCH
+
 #include <linux/linkage.h>
 #include <asm/cpufeature.h>
 #include <asm/dwarf2.h>
+#endif /* PERF_BENCH */
+
 /*
  * memcpy - Copy a memory block.
  *
@@ -23,8 +36,13 @@
  * This gets patched over the unrolled variant (below) via the
  * alternative instructions framework:
  */
+#ifndef PERF_BENCH
 .section .altinstr_replacement, "ax", @progbits
 .Lmemcpy_c:
+#else
+ .globl memcpy_x86_64_rep
+memcpy_x86_64_rep:
+#endif
 movq %rdi, %rax
 movl %edx, %ecx
@@ -34,12 +52,19 @@
 movl %edx, %ecx
 rep movsb
 ret
+#ifndef PERF_BENCH
 .Lmemcpy_e:
 .previous
+#endif
+#ifndef PERF_BENCH
 ENTRY(__memcpy)
 ENTRY(memcpy)
...
This patch adds a new file, mem-memcpy-x86-64-asm.S, for x86-64
specific memcpy() benchmarking. The newly added benchmarks are:

	x86-64-rep:      memcpy() implemented with the rep instruction
	x86-64-unrolled: unrolled memcpy()

The original idea of including kernel source files for benchmarking
was suggested by Ingo Molnar. This is more effective than write-once
programs for quantitative evaluation of small in-kernel leaf functions
that are called very frequently, because perf bench lives in the kernel
source tree and is easy to run on various hardware, especially new CPU
models.

The same approach can also be used for other kernel functions, e.g.
checksum functions.

Example of usage on Core i3 M330:

| % ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
|
| 578.732506 MB/Sec
| % ./perf bench mem memcpy -l 500MB -r x86-64-rep
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
|
| 738.184980 MB/Sec
| % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
|
| 767.483269 MB/Sec

This clearly shows that the unrolled memcpy() is more efficient than
the rep version and glibc's one :)

# checkpatch.pl warns about two externs in bench/mem-memcpy.c
# added by this patch. But I think it is not a problem.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 tools/perf/Makefile | 8 ++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S | 4 ++++
...
Hey, really cool output :-)

You should put these:

+#ifdef ARCH_X86_64
+extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
+extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
+#endif

into a .h file - a new one if needed.

That will make both checkpatch and me happier ;-)

Thanks,

Ingo
--
By Ma Ling's patched version, do you mean memcpy() with the patch at
this URL applied?
http://marc.info/?l=linux-kernel&m=128652296500989&w=2
(It seems that this patch was written by Miao Xie.)

OK, I'll separate these files.

BTW, I found a really interesting evaluation result.
The current results of "perf bench mem memcpy" include the overhead of
page faults, because the measured memcpy() is the first access to the
allocated memory area.

I tested another version of perf bench mem memcpy, which runs memcpy()
once before the measured memcpy() to remove the overhead that comes
from page faults.

This is the result:

% ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...

4.608340 GB/Sec

% ./perf bench mem memcpy -l 500MB
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...

4.856442 GB/Sec

% ./perf bench mem memcpy -l 500MB -r x86-64-rep
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...

6.024445 GB/Sec

The ranking of the scores is reversed!
I cannot explain the cause of this result, and it is a really
interesting phenomenon.

So I'd like to add a new command line option, like "--pre-page-faults",
to perf bench mem memcpy, which does a memcpy() before the measured
memcpy().

What do you think about this idea?

Thanks,
--
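
The page-fault effect described in this message can be reproduced
outside perf with a small stand-alone sketch. This is not the perf
bench implementation: bench_memcpy() and now_sec() are hypothetical
names, and the only idea taken from the message is "run one untimed
memcpy() before the measured one".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Seconds from a monotonic clock, as a double. */
static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Hypothetical helper, not the perf bench code: time one memcpy() of
 * len bytes between two freshly allocated buffers and return the
 * throughput in bytes per second (negative on failure).
 *
 * With prefault != 0 an untimed copy runs first, so the timed copy no
 * longer takes the page faults of the first access to the buffers.
 */
double bench_memcpy(size_t len, int prefault)
{
	/* Call through a volatile pointer so the copies are not optimized away. */
	void *(*volatile copy)(void *, const void *, size_t) = memcpy;
	void *src, *dst;
	double start, elapsed;

	assert(len > 0);
	src = calloc(1, len);
	dst = calloc(1, len);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}

	if (prefault)
		copy(dst, src, len);	/* untimed: touches every page once */

	start = now_sec();
	copy(dst, src, len);		/* the measured copy */
	elapsed = now_sec() - start;

	free(src);
	free(dst);
	return elapsed > 0.0 ? (double)len / elapsed : -1.0;
}
```

Comparing bench_memcpy(len, 0) with bench_memcpy(len, 1) for a large
len should show the kind of gap reported above, although the actual
numbers depend on the machine, the allocator and the kernel.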
Interesting indeed, and it would be nice to analyse that! (It should be possible, using various PMU metrics in a clever way, to figure out what's happening inside the

Agreed. (Maybe name it --prefault, as 'prefaulting' is the term we generally use for things like this.)

An even better solution would be to output _both_ results by default, so that people can see both characteristics at a glance?

Thanks,

Ingo
--
Outputting both the prefaulted and the non-prefaulted results will be
useful, but it might not be good for use from scripts.
So I'll implement the --prefault option first. If there is a request
for outputting both, I'll consider modifying the default output.

# Please wait for the results of Miao Xie's patch;
# benchmarking memcpy() on unaligned memory areas is
# a little difficult.

Thanks,
Hitoshi
--
Ok - it should definitely be easily scriptable.

The default can be to have both flags enabled and both results written to the output. People will try 'perf bench x86' to see performance at a glance - so printing all the tests we have is a good idea.

Thanks,

Ingo
--
OK, I added --no-prefault and --only-prefault to perf bench mem memcpy.
As you said, printing both of them is convenient.

I'll send the updated patch later.

Thanks,
--
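
The relationship between the two flags and the default output can be
sketched as follows. This is an illustration only: select_runs() is a
hypothetical helper, and rejecting the combination of both flags is an
assumption of the sketch, not something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: map the two flags onto the set of runs to
 * perform. With neither flag set, both runs are selected.
 *
 * Returns 0 on success, -1 if both flags are given (treating that as
 * an error is an assumption of this sketch).
 */
int select_runs(bool only_prefault, bool no_prefault,
		bool *run_plain, bool *run_prefault)
{
	assert(run_plain != NULL && run_prefault != NULL);

	if (only_prefault && no_prefault)
		return -1;

	*run_plain = !only_prefault;	/* skipped by --only-prefault */
	*run_prefault = !no_prefault;	/* skipped by --no-prefault */
	return 0;
}
```

A script that wants a single number would pass exactly one of the two
flags; an interactive user passes neither and sees both results.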
After applying this patch, perf bench mem memcpy prints
both the prefaulted and the non-prefaulted scores of memcpy().
The new options --no-prefault and --only-prefault are added
to print a single result, mainly for use from scripts.
Example of usage:
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 634.969014 MB/Sec
| 4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 642.725568 MB/Sec
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
tools/perf/bench/mem-memcpy.c | 215 +++++++++++++++++++++++++++++------------
1 files changed, 152 insertions(+), 63 deletions(-)
diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index be31ddb..61b6ead 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -25,7 +25,8 @@ static const char *length_str = "1MB";
static const char *routine = "default";
static bool use_clock;
static int clock_fd;
-static bool prefault;
+static bool only_prefault;
+static bool no_prefault;
static const struct option options[] = {
OPT_STRING('l', "length", &length_str, "1MB",
@@ -35,15 +36,19 @@ static const struct option options[] = {
"Specify routine to copy"),
...

Ok. Mind resending the whole series once all review feedback has been incorporated?

Thanks,

Ingo
--
Really sorry for my late reply..

On 11/18/10 16:58, Ingo Molnar wrote:
>
> * Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> wrote:
>
>> After applying this patch, perf bench mem memcpy prints
>> both of prefaulted and without prefaulted score of memcpy().
>>
>> New options --no-prefault and --only-prefault are added
>> for printing single result, mainly for scripting usage.
>
> Ok. Mind resending the whole series once all review feedback has been incorporated?
>

OK, I'll send the patch series for prefaulting and for porting
memcpy_64.S to perf bench later.

This series does some dirty things, especially in the perf Makefile
and in defining ENTRY(). So I'd like to hear your comments.
Could you review these?

And I have another problem. I cannot reference the rep-prefix-based
memcpy by name because its symbol is ".Lmemcpy_c". It seems that a
symbol whose name starts with "." cannot be seen from other object
files. So I have to find a way to reference the rep memcpy...

Thanks,
Hitoshi
--
After applying this patch, perf bench mem memcpy prints
both the prefaulted and the non-prefaulted scores of memcpy().

The new options --no-prefault and --only-prefault are added
to print a single result, mainly for use from scripts.

Example of usage:

| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 634.969014 MB/Sec
| 4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
---
 tools/perf/bench/mem-memcpy.c | 219 ++++++++++++++++++++++++++++++-----------
 1 files changed, 162 insertions(+), 57 deletions(-)
diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 38dae74..db82021 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -12,6 +12,7 @@
 #include "../util/parse-options.h"
 #include "../util/header.h"
 #include "bench.h"
+#include "mem-memcpy-arch.h"
 #include <stdio.h>
 #include <stdlib.h>
@@ -23,8 +24,10 @@
 static const char *length_str = "1MB";
 static const char *routine = "default";
-static bool use_clock = false;
+static bool use_clock;
 static ...
Commit-ID: 49ce8fc651794878189fd5f273228832cdfb5be9
Gitweb: http://git.kernel.org/tip/49ce8fc651794878189fd5f273228832cdfb5be9
Author: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
AuthorDate: Thu, 25 Nov 2010 16:04:52 +0900
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 26 Nov 2010 08:15:57 +0100

perf bench: Print both of prefaulted and no prefaulted results by default

After applying this patch, perf bench mem memcpy prints
both the prefaulted and the non-prefaulted scores of memcpy().

The new options --no-prefault and --only-prefault are added
to print a single result, mainly for use from scripts.

Usage example:

| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 634.969014 MB/Sec
| 4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: h.mitake@gmail.com
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1290668693-27068-1-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/bench/mem-memcpy.c | 219 ++++++++++++++++++++++++++++++-----------
 1 files changed, 162 insertions(+), 57 deletions(-)
diff --git a/tools/perf/bench/mem-memcpy.c ...
Looks interesting. It would also be interesting to be able to place probes that would wake it up too, so that unmodified binaries could have something similar.

Other kinds of triggers could hook on syscalls and start monitoring when some expression matches, like connecting to host 1.2.3.4, and stop when the socket is closed, i.e. monitor a connection lifetime, etc.

I think it is worth pursuing and encourage you to work on it :-)

- Arnaldo
--
Sounds to me like you want something like a library with self-monitoring stuff. --
Yeah, that could be a way, an LD_PRELOAD thingy that would intercept library calls, setup counters, start a monitoring thread, etc. Along the lines of: http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob;f=libautocork.c This one just intercepts calls, but the __init function could do the rest. To make it easier we could move the counter setup we have in record/top to a library, etc. - Arnaldo --
Nah, I was more thinking of something along the lines of libPAPI and libpfmon. A library that contains the needed building blocks for apps to profile themselves. --
Ok, you mean for the case where you can modify the app, I was thinking about when you can't.

In both cases it's good to move the counter creation and similar routines from record/top to a lib, which could then be used in the way you mention, and in the way I mention too. Two different use cases :-)

- Arnaldo
--
Thanks for your comments, Arnaldo, Peter.
I implemented the basic feature of my proposal,
and found that having perf stat and the benchmarking programs communicate
via a socket is really dirty. As you said, a unified form,
interception for unmodified binaries and a library for modifiable binaries,
would be ideal for fine-grained monitoring.
But I believe that measuring the performance of some sorts of programs,
such as in-kernel routines, requires more fine-grained perf stat monitoring,
so I'll look for a unified way.
Anyway, I'll send my proof of concept patch later.
Thanks,
Hitoshi
--
This patch adds a new option, "--wait-on", to perf stat.
Currently perf stat can monitor
1) the lifetime of the program specified as a command line argument, or
2) the lifetime of perf stat itself. The target process is specified by pid,
and the end of monitoring is triggered by a signal.
1) is too coarse-grained, and with 2) it is difficult to delimit the range to monitor.
This patch makes it possible to wait before calling sys_perf_event_open().
The monitored process can wake up perf stat via a unix domain socket
and terminate monitoring via a signal.
The new option --wait-on takes a string, the path of the unix domain socket.
perf stat reads a pid from the socket and uses it as target_pid. The monitored
program should write its own pid to the socket.
perf stat replies with its own pid. The monitored program
should send SIGINT to perf stat using this pid; monitoring then terminates.
I feel the current implementation is really dirty. As Arnaldo and Peter suggested,
a more unified way, like interception or a self-monitoring library, would be ideal.
This is a proof of concept version. I'd like to hear your comments.
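
The monitored-program side of the handshake described above might look
roughly like the following sketch. It is not taken from the patch: the
function names connect_wait_socket(), exchange_pids() and
stop_monitoring() are hypothetical, and sending each pid as a raw pid_t
over a SOCK_STREAM socket is an assumption about the wire format.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Connect to the unix domain socket at path; returns an fd or -1. */
int connect_wait_socket(const char *path)
{
	struct sockaddr_un addr;
	int fd;

	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Send our own pid over fd and read the peer's pid back.
 * Assumption of this sketch: pids travel as raw pid_t values.
 * Returns 0 on success, -1 on a short or failed read/write.
 */
int exchange_pids(int fd, pid_t *peer)
{
	pid_t self = getpid();

	assert(peer != NULL);
	if (write(fd, &self, sizeof(self)) != (ssize_t)sizeof(self))
		return -1;
	if (read(fd, peer, sizeof(*peer)) != (ssize_t)sizeof(*peer))
		return -1;
	return 0;
}

/* Ask the monitoring side to stop by sending it SIGINT. */
int stop_monitoring(pid_t peer)
{
	return kill(peer, SIGINT);
}
```

A monitored program would call connect_wait_socket() and
exchange_pids() just before the region of interest, and
stop_monitoring() with the returned pid right after it.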
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
---
tools/perf/builtin-stat.c | 63 ++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 60 insertions(+), 3 deletions(-)
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7ff746d..4cc10a1 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -51,6 +51,8 @@
#include <sys/prctl.h>
#include <math.h>
#include <locale.h>
+#include <sys/socket.h>
+#include <sys/un.h>
#define DEFAULT_SEPARATOR " "
@@ -90,11 +92,15 @@ ...

This patch makes perf bench mem memcpy use the new feature of perf stat.

The new option --wake-up takes the path name of a unix domain socket.
If --only-prefault or --no-prefault is specified, perf bench writes its
own pid to this socket before the actual memcpy() to be monitored, and
reads the pid of perf stat from it.
The pid of perf stat is used to signal perf stat to terminate monitoring.

With this feature, detailed performance monitoring of the prefaulted
(or only the non-prefaulted) memcpy() becomes possible.

Example of use, non prefaulted version:

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf stat -w /tmp/perf-stat-wait
| After execution, perf stat waits the pid...
| Performance counter stats for process id '27109':
|
|     440.534943  task-clock-msecs   #      0.997 CPUs
|             44  context-switches   #      0.000 M/sec
|              5  CPU-migrations     #      0.000 M/sec
|        256,002  page-faults        #      0.581 M/sec
|    934,443,072  cycles             #   2121.155 M/sec
|    780,408,435  instructions       #      0.835 IPC
|    111,756,558  branches           #    253.684 M/sec
|        392,170  branch-misses      #      0.351 %
|      8,611,308  cache-references   #     19.547 M/sec
|      8,533,588  cache-misses       #     19.371 M/sec
|
|    0.441803031  seconds time elapsed

in another shell,

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf bench mem memcpy -l 500MB --no-prefault -w /tmp/perf-stat-wait
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 1.105722 GB/Sec

Example of use, prefaulted version:

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf stat -w /tmp/perf-stat-wait
| Performance counter stats for process id '27112':
|
|     105.001542  task-clock-msecs   #      0.997 CPUs
|             11  context-switches   #      0.000 M/sec
|              0  CPU-migrations ...
This patch ports arch/x86/lib/memcpy_64.S to perf bench mem memcpy
for benchmarking memcpy() in userland in a tricky and dirty way.

util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
util/include/linux/linkage.h are dummy headers (that do a little work,
e.g. defining ENTRY()) so that memcpy_64.S can be included without
modifying it.

This makes checkpatch.pl angry like this:

#177: FILE: tools/perf/util/include/linux/linkage.h:7:
+#define ENTRY(name) \
+ .globl name; \
+ name:

WARNING: labels should not be indented
#179: FILE: tools/perf/util/include/linux/linkage.h:9:
+ name:

because checkpatch.pl treats this file as a C file. But I think this
can be forgiven because the original include/linux/linkage.h does a
similar thing.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
---
 tools/perf/Makefile | 11 +++++++++++
 tools/perf/bench/mem-memcpy-arch.h | 12 ++++++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h | 4 ++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S | 2 ++
 tools/perf/util/include/asm/cpufeature.h | 9 +++++++++
 tools/perf/util/include/asm/dwarf2.h | 11 +++++++++++
 tools/perf/util/include/linux/linkage.h | 13 +++++++++++++
 7 files changed, 62 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/bench/mem-memcpy-arch.h
 create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm-def.h
 create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm.S
...
Commit-ID: ea7872b9d6a81101f6ba0ec141544a62fea35876
Gitweb: http://git.kernel.org/tip/ea7872b9d6a81101f6ba0ec141544a62fea35876
Author: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
AuthorDate: Thu, 25 Nov 2010 16:04:53 +0900
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 26 Nov 2010 08:15:57 +0100

perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'

This patch ports arch/x86/lib/memcpy_64.S to perf bench mem
memcpy for benchmarking memcpy() in userland in a tricky and
dirty way.

util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
util/include/linux/linkage.h are mostly dummy files with small
wrappers, so that we are able to include memcpy_64.S
unmodified.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: h.mitake@gmail.com
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1290668693-27068-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/Makefile | 11 +++++++++++
 tools/perf/bench/mem-memcpy-arch.h | 12 ++++++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h | 4 ++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S | 2 ++
 tools/perf/util/include/asm/cpufeature.h | 9 +++++++++
 tools/perf/util/include/asm/dwarf2.h | 11 +++++++++++
 tools/perf/util/include/linux/linkage.h | 13 +++++++++++++
 7 files changed, 62 insertions(+), 0 deletions(-)
diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 74b684d..e0db197 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -185,7 +185,10 @@ ifeq ...
On 2010-11-26 19:31, tip-bot for Hitoshi Mitake wrote:
> Commit-ID: ea7872b9d6a81101f6ba0ec141544a62fea35876
> Gitweb: http://git.kernel.org/tip/ea7872b9d6a81101f6ba0ec141544a62fea35876
> Author: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
> AuthorDate: Thu, 25 Nov 2010 16:04:53 +0900
> Committer: Ingo Molnar <mingo@elte.hu>
> CommitDate: Fri, 26 Nov 2010 08:15:57 +0100
>
> perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'
>
> This patch ports arch/x86/lib/memcpy_64.S to perf bench mem
> memcpy for benchmarking memcpy() in userland with tricky and
> dirty way.
>
> util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
> util/include/linux/linkage.h are mostly dummy files with small
> wrappers, so that we are able to include memcpy_64.S
> unmodified.
>
> Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
> Cc: h.mitake@gmail.com
> Cc: Miao Xie <miaox@cn.fujitsu.com>
> Cc: Ma Ling <ling.ma@intel.com>
> Cc: Zhao Yakui <yakui.zhao@intel.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Andi Kleen <andi@firstfloor.org>
> LKML-Reference: <1290668693-27068-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
> tools/perf/Makefile | 11 +++++++++++
> tools/perf/bench/mem-memcpy-arch.h | 12 ++++++++++++
> tools/perf/bench/mem-memcpy-x86-64-asm-def.h | 4 ++++
> tools/perf/bench/mem-memcpy-x86-64-asm.S | 2 ++
> tools/perf/util/include/asm/cpufeature.h | 9 +++++++++
> tools/perf/util/include/asm/dwarf2.h | 11 +++++++++++
> tools/perf/util/include/linux/linkage.h | 13 +++++++++++++
> 7 files changed, 62 insertions(+), 0 deletions(-)
>
> ...
I don't like littering the actual kernel code with tools/perf/ ifdeffery.. --
Yeah - could we somehow accept that file into a perf build as-is? Thanks, Ingo --
I agree with your idea, but Ma Ling said this way may cause the i-cache miss problem.
http://marc.info/?l=linux-kernel&m=128746120107953&w=2
(The size of the i-cache is 32K, the size of memcpy() in my patch is 560 bytes, and
the size of the last version in tip tree is 400 bytes.)

But I have not tested it, so I don't know the real result. Maybe we should

They are Core2 Duo E7300 (core name: Wolfdale) and Xeon X5260 (core name: Wolfdale-DP).
The following is the detailed information of these two CPUs:

Core2 Duo E7300:
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz
stepping : 6
cpu MHz : 1603.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm dts
bogomips : 5319.70
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

Xeon X5260:
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5260 @ 3.33GHz
stepping : 6
cpu MHz : 1999.000
cache size : 6144 KB
physical id : 3
siblings : 2
core id : 1
cpu cores : 2
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dts tpr_shadow vnmi flexpriority
bogomips : 6649.07
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power ...
I compared memcpy()'s icache miss behaviour using my new --wait-on patch
( https://patchwork.kernel.org/patch/408801/ ).

And the result is:

default of tip tree
% sudo ./perf stat -w /tmp/perf-stat-wait -e L1-icache-load-misses

 Performance counter stats for process id '12559':

         64,328  L1-icache-load-misses

    0.106513157  seconds time elapsed

Miao Xie's memcpy()
% sudo ./perf stat -w /tmp/perf-stat-wait -e L1-icache-misses

 Performance counter stats for process id '13159':

         64,559  L1-icache-load-misses

    0.107057925  seconds time elapsed

It seems that there is no serious icache miss problem.

# I tested perf bench mem memcpy on a Core i3 M 330 processor.

But I don't understand the cache characteristics of Intel processors well.
Thanks for your information!

Thanks,
Hitoshi
--
On Sat, Oct 30, 2010 at 06:08, Arnaldo Carvalho de Melo

OK, it seems that I have to consider a better solution.

Could you tell me about the past problem for reference?
Your experience would be useful for this case.

--
Hitoshi Mitake
h.mitake@gmail.com
--
