perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default

From: Hitoshi Mitake
Date: Friday, October 29, 2010 - 9:01 am

This patch ports arch/x86/lib/memcpy_64.S to "perf bench mem".
When PERF_BENCH is defined for the preprocessor,
memcpy_64.S is preprocessed into a form that the benchmark
sources under tools/perf can include.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/lib/memcpy_64.S |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 75ef61e..72c6dfe 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -1,10 +1,23 @@
 /* Copyright 2002 Andi Kleen */
 
+/*
+ * perf bench adoption by Hitoshi Mitake
+ * PERF_BENCH means that this file is included from
+ * the source files under tools/perf/ for benchmark programs.
+ *
+ * You don't have to care about PERF_BENCH when
+ * you are working on the kernel.
+ */
+
+#ifndef PERF_BENCH
+
 #include <linux/linkage.h>
 
 #include <asm/cpufeature.h>
 #include <asm/dwarf2.h>
 
+#endif /* PERF_BENCH */
+
 /*
  * memcpy - Copy a memory block.
  *
@@ -23,8 +36,13 @@
  * This gets patched over the unrolled variant (below) via the
  * alternative instructions framework:
  */
+#ifndef PERF_BENCH
 	.section .altinstr_replacement, "ax", @progbits
 .Lmemcpy_c:
+#else
+	.globl memcpy_x86_64_rep
+memcpy_x86_64_rep:
+#endif
 	movq %rdi, %rax
 
 	movl %edx, %ecx
@@ -34,12 +52,19 @@
 	movl %edx, %ecx
 	rep movsb
 	ret
+#ifndef PERF_BENCH
 .Lmemcpy_e:
 	.previous
+#endif
 
+#ifndef PERF_BENCH
 ENTRY(__memcpy)
 ENTRY(memcpy)
 ...
From: Hitoshi Mitake
Date: Friday, October 29, 2010 - 9:01 am

This patch adds a new file, mem-memcpy-x86-64-asm.S,
for x86-64 specific memcpy() benchmarking.
The newly added benchmarks are:
 x86-64-rep:      memcpy() implemented with rep instruction
 x86-64-unrolled: unrolled memcpy()
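
As an illustration only (this is not the code from the patch), selecting a routine by the name given to -r can be sketched as a table lookup. The stub functions, the struct and find_routine() are hypothetical; the stubs stand in for the assembly routines so that the sketch builds on any architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Signature shared by every routine under test (same as memcpy()). */
typedef void *(*memcpy_fn)(void *dst, const void *src, size_t len);

/*
 * Hypothetical stand-ins: in the patch these would be the assembly
 * routines taken from memcpy_64.S. Plain C is used here instead.
 */
static void *stub_rep(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

static void *stub_unrolled(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

struct routine {
	const char *name;	/* value accepted by -r */
	memcpy_fn   fn;		/* function to benchmark */
};

static const struct routine routines[] = {
	{ "default",         memcpy },
	{ "x86-64-rep",      stub_rep },
	{ "x86-64-unrolled", stub_unrolled },
};

/* Return the routine registered under name, or NULL if it is unknown. */
static memcpy_fn find_routine(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(routines) / sizeof(routines[0]); i++)
		if (strcmp(routines[i].name, name) == 0)
			return routines[i].fn;
	return NULL;
}
```

A caller would look up the name once and then time calls through the returned pointer, so adding a benchmark only means adding one table entry.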

The original idea of including kernel source files
for benchmarking was suggested by Ingo Molnar.
This is more effective than write-once programs for quantitative
evaluation of small, frequently called in-kernel leaf functions,
because perf bench lives in the kernel source tree and is easy to run
on various hardware, especially new CPU models.

The same approach can also be used for other kernel functions, e.g. checksum functions.

Example of usage on Core i3 M330:

| % ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
|
|      578.732506 MB/Sec
| % ./perf bench mem memcpy -l 500MB -r x86-64-rep
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
|
|      738.184980 MB/Sec
| % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
|
|      767.483269 MB/Sec

This clearly shows that the unrolled memcpy() is more efficient
than the rep version and glibc's one :)

# checkpatch.pl warns about two externs in bench/mem-memcpy.c
# added by this patch, but I think this is not a problem.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 tools/perf/Makefile                      |    8 ++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S |    4 ++++
 ...
From: Ingo Molnar
Date: Saturday, October 30, 2010 - 12:23 pm

Hey, really cool output :-)


You should put these:

 +#ifdef ARCH_X86_64
 +extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
 +extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
 +#endif

into a .h file - a new one if needed.

That will make both checkpatch and me happier ;-)

Thanks,

	Ingo
--

From: Hitoshi Mitake
Date: Sunday, October 31, 2010 - 10:36 pm

Does "Ma Ling's patched version" mean the memcpy with the patch
at the following URL applied?

http://marc.info/?l=linux-kernel&m=128652296500989&w=2

(It seems that this patch was written by Miao Xie.)


OK, I'll separate these files.

BTW, I found a really interesting evaluation result.
The current results of "perf bench mem memcpy" include
the overhead of page faults, because the measured memcpy()
is the first access to the allocated memory area.

I tested another version of perf bench mem memcpy,
which does a memcpy() before the measured memcpy() to remove
the overhead that comes from page faults.

And this is the result:

% ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...

        4.608340 GB/Sec

% ./perf bench mem memcpy -l 500MB
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...

        4.856442 GB/Sec

% ./perf bench mem memcpy -l 500MB -r x86-64-rep
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...

        6.024445 GB/Sec

The relative order of the scores is reversed!
I cannot explain the cause of this result, and
it is a really interesting phenomenon.

So I'd like to add a new command line option,
like "--pre-page-faults", to perf bench mem memcpy,
for doing a memcpy() before the measured memcpy().
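
The idea can be sketched in C roughly as follows. This is a minimal illustration and not the perf bench source: time_memcpy() and elapsed_sec() are hypothetical names, and it assumes that freshly calloc()ed buffers of a large size have not been touched yet, so their first access takes the page faults.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Elapsed wall-clock seconds between two gettimeofday() samples. */
static double elapsed_sec(const struct timeval *start,
			  const struct timeval *end)
{
	return (double)(end->tv_sec - start->tv_sec) +
	       (double)(end->tv_usec - start->tv_usec) / 1e6;
}

/*
 * Time one memcpy() of len bytes between two fresh buffers.
 * With prefault != 0, an untimed memcpy() touches every page first,
 * so the page faults are taken before the measurement starts.
 * Returns the elapsed seconds, or -1.0 if allocation fails.
 */
static double time_memcpy(size_t len, int prefault)
{
	struct timeval start, end;
	volatile char sink;
	char *src, *dst;

	assert(len > 0);
	src = calloc(1, len);
	dst = calloc(1, len);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}

	if (prefault)
		memcpy(dst, src, len);	/* untimed: takes the page faults */

	gettimeofday(&start, NULL);
	memcpy(dst, src, len);		/* the measured copy */
	gettimeofday(&end, NULL);

	sink = dst[len - 1];	/* keep the compiler from dropping the copy */
	(void)sink;
	free(src);
	free(dst);
	return elapsed_sec(&start, &end);
}
```

Dividing len by the returned time for prefault = 0 and prefault = 1 gives the two kinds of score discussed here. For small lengths the allocator may hand back memory that was already touched, so a difference is only expected for large copies.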

What do you think about this idea?

Thanks,
--

From: Ingo Molnar
Date: Monday, November 1, 2010 - 2:02 am

Interesting indeed, and it would be nice to analyse that! (It should be possible, 
using various PMU metrics in a clever way, to figure out what's happening inside the 

Agreed. (Maybe name it --prefault, as 'prefaulting' is the term we generally use for 
things like this.)

An even better solution would be to output _both_ results by default, so that people 
can see both characteristics at a glance?

Thanks,

	Ingo
--

From: Hitoshi Mitake
Date: Friday, November 5, 2010 - 10:05 am

Outputting both the prefaulted and non-prefaulted results will be useful,
but this might not be good for use from scripts.
So I'll implement the --prefault option first. If there is a request
for outputting both, I'll consider modifying the default output.

# Please wait for the results of Miao Xie's patch;
# benchmarking memcpy() of unaligned memory areas is
# a little difficult

Thanks,
	Hitoshi
--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 2:12 am

Ok - it should definitely be easily scriptable. The default can have both flags
enabled and both results written to the output.

People will try 'perf bench x86' to see performance at a glance - so printing all 
the tests we have is a good idea.

Thanks,

	Ingo
--

From: Hitoshi Mitake
Date: Friday, November 12, 2010 - 8:01 am

OK, I added --no-prefault and --only-prefault to perf bench mem memcpy.
As you said, printing both of them is convenient.

I'll send the updated patch later.

Thanks,
--

From: Hitoshi Mitake
Date: Friday, November 12, 2010 - 8:02 am

After applying this patch, perf bench mem memcpy prints
both the prefaulted and non-prefaulted scores of memcpy().

New options --no-prefault and --only-prefault are added
for printing a single result, mainly for scripting usage.

Example of usage:
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|      634.969014 MB/Sec
|        4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|        4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|      642.725568 MB/Sec
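
How the two flags map to the printed results can be sketched as below. This is an illustration, not the code of the patch: select_runs() and the RUN_* masks are hypothetical names, and rejecting the combination of both flags is an assumption that the text above does not state.

```c
#include <assert.h>
#include <stdbool.h>

#define RUN_NO_PREFAULT	(1 << 0)	/* print the score without prefault */
#define RUN_PREFAULT	(1 << 1)	/* print the score with prefault */

/*
 * Decide which measurements to run from the two option flags.
 * Returns a mask of RUN_* bits, or -1 when both flags are given
 * (assumed here to be an invalid combination).
 */
static int select_runs(bool only_prefault, bool no_prefault)
{
	if (only_prefault && no_prefault)
		return -1;
	if (only_prefault)
		return RUN_PREFAULT;
	if (no_prefault)
		return RUN_NO_PREFAULT;
	return RUN_NO_PREFAULT | RUN_PREFAULT;	/* default: print both */
}
```

With neither flag set, both bits are returned, which corresponds to the two-line default output in the example above.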

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 tools/perf/bench/mem-memcpy.c |  215 +++++++++++++++++++++++++++++------------
 1 files changed, 152 insertions(+), 63 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index be31ddb..61b6ead 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -25,7 +25,8 @@ static const char	*length_str	= "1MB";
 static const char	*routine	= "default";
 static bool		use_clock;
 static int		clock_fd;
-static bool		prefault;
+static bool		only_prefault;
+static bool		no_prefault;
 
 static const struct option options[] = {
 	OPT_STRING('l', "length", &length_str, "1MB",
@@ -35,15 +36,19 @@ static const struct option options[] = {
 		    "Specify routine to copy"),
 ...
From: Ingo Molnar
Date: Thursday, November 18, 2010 - 12:58 am

Ok. Mind resending the whole series once all review feedback has been incorporated?

Thanks,

	Ingo
--

From: Hitoshi Mitake
Date: Thursday, November 25, 2010 - 12:04 am

Really sorry for my late reply..

On 11/18/10 16:58, Ingo Molnar wrote:
 >
 > * Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>  wrote:
 >
 >> After applying this patch, perf bench mem memcpy prints
 >> both the prefaulted and non-prefaulted scores of memcpy().
 >>
 >> New options --no-prefault and --only-prefault are added
 >> for printing single result, mainly for scripting usage.
 >
 > Ok. Mind resending the whole series once all review feedback has been
 > incorporated?
 >

OK, I'll send the patch series for prefaulting and
porting memcpy_64.S to perf bench later.
This series does some dirty things, especially in the perf Makefile
and in defining ENTRY(), so I'd like to hear your comments.
Could you review these?

And I have another problem. I cannot see the name of the
rep-prefix based memcpy, because its symbol is ".Lmemcpy_c".
It seems that a symbol name starting with "." cannot be seen
from other object files. So I have to look for a way to
find the name of the rep memcpy...

Thanks,
	Hitoshi
--

From: Hitoshi Mitake
Date: Thursday, November 25, 2010 - 12:04 am

After applying this patch, perf bench mem memcpy prints
both the prefaulted and non-prefaulted scores of memcpy().

New options --no-prefault and --only-prefault are added
to print a single result, mainly for scripting usage.

Example of usage:
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|      634.969014 MB/Sec
|        4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|        4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|      642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
---
 tools/perf/bench/mem-memcpy.c |  219 ++++++++++++++++++++++++++++++-----------
 1 files changed, 162 insertions(+), 57 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 38dae74..db82021 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -12,6 +12,7 @@
 #include "../util/parse-options.h"
 #include "../util/header.h"
 #include "bench.h"
+#include "mem-memcpy-arch.h"
 
 #include <stdio.h>
 #include <stdlib.h>
@@ -23,8 +24,10 @@
 
 static const char	*length_str	= "1MB";
 static const char	*routine	= "default";
-static bool		use_clock	= false;
+static bool		use_clock;
 static ...
From: tip-bot for Hitoshi Mitake
Date: Friday, November 26, 2010 - 3:30 am

Commit-ID:  49ce8fc651794878189fd5f273228832cdfb5be9
Gitweb:     http://git.kernel.org/tip/49ce8fc651794878189fd5f273228832cdfb5be9
Author:     Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
AuthorDate: Thu, 25 Nov 2010 16:04:52 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 26 Nov 2010 08:15:57 +0100

perf bench: Print both of prefaulted and no prefaulted results by default

After applying this patch, perf bench mem memcpy prints
both the prefaulted and non-prefaulted scores of memcpy().

New options --no-prefault and --only-prefault are added
to print a single result, mainly for scripting usage.

Usage example:

 | mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
 | # Running mem/memcpy benchmark...
 | # Copying 500MB Bytes ...
 |
 |      634.969014 MB/Sec
 |        4.828062 GB/Sec (with prefault)
 | mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
 | # Running mem/memcpy benchmark...
 | # Copying 500MB Bytes ...
 |
 |        4.705192 GB/Sec (with prefault)
 | mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
 | # Running mem/memcpy benchmark...
 | # Copying 500MB Bytes ...
 |
 |      642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: h.mitake@gmail.com
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1290668693-27068-1-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/bench/mem-memcpy.c |  219 ++++++++++++++++++++++++++++++-----------
 1 files changed, 162 insertions(+), 57 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c ...
From: Arnaldo Carvalho de Melo
Date: Sunday, December 12, 2010 - 6:46 am

Looks interesting. It would also be interesting to be able to place
probes that wake it up too, so that unmodified binaries can have
something similar.

Other kinds of triggers may be to hook on syscalls and when some
expression matches, like connecting to host 1.2.3.4, start monitoring,
stop when the socket is closed, i.e. monitor a connection lifetime, etc.

I think it is worth pursuing and encourage you to work on it :-)

- Arnaldo
--


Sounds to me like you want something like a library with self-monitoring
stuff.
--


Yeah, that could be a way, an LD_PRELOAD thingy that would intercept
library calls, setup counters, start a monitoring thread, etc.

Along the lines of:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob;f=libautocork.c

This one just intercepts calls, but the __init function could do the
rest.

To make it easier we could move the counter setup we have in record/top
to a library, etc.

- Arnaldo
--


Nah, I was more thinking of something along the lines of libPAPI and
libpfmon. A library that contains the needed building blocks for apps to
profile themselves.


--


Ok, you mean for the case where you can modify the app, I was thinking
about when you can't.

In both cases it's good to move the counter creation, etc. routines from
record/top to a lib, that then could be used in the way you mention, and
in the way I mention too. Two different usecases :-)

- Arnaldo
--


Thanks for your comments, Arnaldo, Peter.

I implemented the basic feature of my proposal,
and found that communication between perf stat and benchmarking programs
via a socket is really dirty. As you said, a unified form,
interception for unmodified binaries and a library for modifiable binaries,
would be ideal for fine-grained monitoring.

But I believe that measuring the performance of some kinds of programs,
like in-kernel routines, requires finer-grained perf stat monitoring,
so I'll look for a unified way.

Anyway, I'll send my proof of concept patch later.

Thanks,
         Hitoshi

--

From: Hitoshi Mitake
Date: Monday, December 13, 2010 - 10:46 pm

This patch adds a new option, "--wait-on", to perf stat.

Currently perf stat can monitor
 1) the lifetime of a program specified as a command line argument, or
 2) the lifetime of perf stat itself. The target process is specified by pid,
    and the end of monitoring is triggered by a signal.
1) is too coarse-grained, and 2) makes it difficult to delimit the range to monitor.

This patch makes it possible to wait before sys_perf_event_open().
The monitored process can wake up perf stat via a unix domain socket,
and terminate monitoring via a signal.

The new option --wait-on takes a string, the path of the unix domain socket.
perf stat reads the pid from the socket and uses it as target_pid. The monitored
program should write its own pid to the socket.
perf stat replies with its own pid to the monitored program. The monitored program
should then send SIGINT to perf stat using this pid, which terminates monitoring.
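
The monitored program's side of this handshake can be sketched as follows. This is only an illustration under assumptions, not the code of the patch: the function names are hypothetical, the socket is assumed to be a SOCK_STREAM socket that perf stat listens on, and the wire format is assumed to be one raw pid_t in each direction.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Connect to the unix domain socket at path; returns an fd or -1. */
static int connect_wait_socket(const char *path)
{
	struct sockaddr_un addr;
	int fd;

	assert(path != NULL);
	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);	/* length checked above */

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Send our own pid and read the monitor's pid back.
 * Assumed wire format: one raw pid_t in each direction.
 * Returns 0 on success, -1 on a short or failed transfer.
 */
static int exchange_pids(int fd, pid_t self, pid_t *monitor)
{
	assert(monitor != NULL);
	if (write(fd, &self, sizeof(self)) != (ssize_t)sizeof(self))
		return -1;
	if (read(fd, monitor, sizeof(*monitor)) != (ssize_t)sizeof(*monitor))
		return -1;
	return 0;
}

/* Ask the monitor to stop counting by sending it SIGINT. */
static int stop_monitoring(pid_t monitor)
{
	return kill(monitor, SIGINT);
}
```

A monitored program would call connect_wait_socket() and exchange_pids() just before the region of interest, and stop_monitoring() right after it.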

I feel the current implementation is really dirty. As Arnaldo and Peter suggested,
a more unified way, like interception or a self-monitoring library, would be ideal.
This is the proof of concept version. I'd like to hear your comments.

Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
---
 tools/perf/builtin-stat.c |   63 ++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7ff746d..4cc10a1 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -51,6 +51,8 @@
 #include <sys/prctl.h>
 #include <math.h>
 #include <locale.h>
+#include <sys/socket.h>
+#include <sys/un.h>
 
 #define DEFAULT_SEPARATOR	" "
 
@@ -90,11 +92,15 @@ ...
From: Hitoshi Mitake
Date: Monday, December 13, 2010 - 10:46 pm

This patch makes perf bench mem memcpy use the new feature of perf stat.

The new option --wake-up requires the path name of a unix domain socket.
If --only-prefault or --no-prefault is specified, perf bench writes its own pid
to this socket before the actual memcpy() to be monitored, and reads the pid
of perf stat from it. The pid of perf stat is used for signaling perf stat
to terminate monitoring.

With this feature, detailed performance monitoring of the prefaulted
(or non-prefaulted only) memcpy() becomes possible.

Example of use, non prefaulted version:
| mitake@x201i:~/linux/.../tools/perf% sudo ./perf stat -w /tmp/perf-stat-wait
|

After it starts, perf stat waits for the pid...

|  Performance counter stats for process id '27109':
|
|         440.534943 task-clock-msecs         #      0.997 CPUs
|                 44 context-switches         #      0.000 M/sec
|                  5 CPU-migrations           #      0.000 M/sec
|            256,002 page-faults              #      0.581 M/sec
|        934,443,072 cycles                   #   2121.155 M/sec
|        780,408,435 instructions             #      0.835 IPC
|        111,756,558 branches                 #    253.684 M/sec
|            392,170 branch-misses            #      0.351 %
|          8,611,308 cache-references         #     19.547 M/sec
|          8,533,588 cache-misses             #     19.371 M/sec
|
|         0.441803031  seconds time elapsed

In another shell:

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf bench mem memcpy -l 500MB --no-prefault -w /tmp/perf-stat-wait
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|        1.105722 GB/Sec
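
As a rough cross-check of the page-faults counter above, the expected number of faults for an untouched copy can be estimated. This sketch assumes 4 KiB pages, that "500MB" means 500 * 1024 * 1024 bytes, and that every page of both the source and the destination buffer faults exactly once; expected_page_faults() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Estimate the page faults taken when nbuffers untouched buffers of
 * len bytes each are accessed once (one fault per page, rounded up).
 */
static size_t expected_page_faults(size_t len, size_t page_size,
				   size_t nbuffers)
{
	assert(page_size > 0);
	return nbuffers * ((len + page_size - 1) / page_size);
}
```

Under these assumptions, expected_page_faults(500UL * 1024 * 1024, 4096, 2) gives 256,000, which is close to the 256,002 page-faults that perf stat reports for the non-prefaulted run.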

Example of use, prefaulted version:

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf stat -w /tmp/perf-stat-wait
| Performance counter stats for process id '27112':
|
|         105.001542 task-clock-msecs         #      0.997 CPUs
|                 11 context-switches         #      0.000 M/sec
|                  0 CPU-migrations          ...
From: Hitoshi Mitake
Date: Thursday, November 25, 2010 - 12:04 am

This patch ports arch/x86/lib/memcpy_64.S to perf bench mem memcpy
for benchmarking memcpy() in userland, in a tricky and dirty way.

util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
util/include/linux/linkage.h are dummies (but do a little work, e.g. defining
ENTRY()) so that memcpy_64.S can be included without modifying it.

This makes checkpatch.pl angry like this:
#177: FILE: tools/perf/util/include/linux/linkage.h:7:
+#define ENTRY(name)                            \
+       .globl name;                            \
+       name:

WARNING: labels should not be indented
#179: FILE: tools/perf/util/include/linux/linkage.h:9:
+       name:

because checkpatch.pl treats this file as if it were written in C.
But I think this can be forgiven, because the original include/linux/linkage.h
does a similar thing.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
---
 tools/perf/Makefile                          |   11 +++++++++++
 tools/perf/bench/mem-memcpy-arch.h           |   12 ++++++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |    4 ++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S     |    2 ++
 tools/perf/util/include/asm/cpufeature.h     |    9 +++++++++
 tools/perf/util/include/asm/dwarf2.h         |   11 +++++++++++
 tools/perf/util/include/linux/linkage.h      |   13 +++++++++++++
 7 files changed, 62 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/bench/mem-memcpy-arch.h
 create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm-def.h
 create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm.S
 ...

Commit-ID:  ea7872b9d6a81101f6ba0ec141544a62fea35876
Gitweb:     http://git.kernel.org/tip/ea7872b9d6a81101f6ba0ec141544a62fea35876
Author:     Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
AuthorDate: Thu, 25 Nov 2010 16:04:53 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 26 Nov 2010 08:15:57 +0100

perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'

This patch ports arch/x86/lib/memcpy_64.S to perf bench mem
memcpy for benchmarking memcpy() in userland in a tricky and
dirty way.

util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
util/include/linux/linkage.h are mostly dummy files with small
wrappers, so that we are able to include memcpy_64.S
unmodified.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: h.mitake@gmail.com
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1290668693-27068-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/Makefile                          |   11 +++++++++++
 tools/perf/bench/mem-memcpy-arch.h           |   12 ++++++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |    4 ++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S     |    2 ++
 tools/perf/util/include/asm/cpufeature.h     |    9 +++++++++
 tools/perf/util/include/asm/dwarf2.h         |   11 +++++++++++
 tools/perf/util/include/linux/linkage.h      |   13 +++++++++++++
 7 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 74b684d..e0db197 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -185,7 +185,10 @@ ifeq ...

On November 26, 2010 19:31, tip-bot for Hitoshi Mitake wrote:
 > Commit-ID:  ea7872b9d6a81101f6ba0ec141544a62fea35876
 > Gitweb:     http://git.kernel.org/tip/ea7872b9d6a81101f6ba0ec141544a62fea35876
 > Author:     Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>
 > AuthorDate: Thu, 25 Nov 2010 16:04:53 +0900
 > Committer:  Ingo Molnar<mingo@elte.hu>
 > CommitDate: Fri, 26 Nov 2010 08:15:57 +0100
 >
 > perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'
 >
 > This patch ports arch/x86/lib/memcpy_64.S to perf bench mem
 > memcpy for benchmarking memcpy() in userland in a tricky and
 > dirty way.
 >
 > util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
 > util/include/linux/linkage.h are mostly dummy files with small
 > wrappers, so that we are able to include memcpy_64.S
 > unmodified.
 >
 > Signed-off-by: Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>
 > Cc: h.mitake@gmail.com
 > Cc: Miao Xie<miaox@cn.fujitsu.com>
 > Cc: Ma Ling<ling.ma@intel.com>
 > Cc: Zhao Yakui<yakui.zhao@intel.com>
 > Cc: Peter Zijlstra<a.p.zijlstra@chello.nl>
 > Cc: Arnaldo Carvalho de Melo<acme@redhat.com>
 > Cc: Paul Mackerras<paulus@samba.org>
 > Cc: Frederic Weisbecker<fweisbec@gmail.com>
 > Cc: Steven Rostedt<rostedt@goodmis.org>
 > Cc: Andi Kleen<andi@firstfloor.org>
 > LKML-Reference:<1290668693-27068-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
 > Signed-off-by: Ingo Molnar<mingo@elte.hu>
 > ---
 >   tools/perf/Makefile                          |   11 +++++++++++
 >   tools/perf/bench/mem-memcpy-arch.h           |   12 ++++++++++++
 >   tools/perf/bench/mem-memcpy-x86-64-asm-def.h |    4 ++++
 >   tools/perf/bench/mem-memcpy-x86-64-asm.S     |    2 ++
 >   tools/perf/util/include/asm/cpufeature.h     |    9 +++++++++
 >   tools/perf/util/include/asm/dwarf2.h         |   11 +++++++++++
 >   tools/perf/util/include/linux/linkage.h      |   13 +++++++++++++
 >   7 files changed, 62 insertions(+), 0 deletions(-)
 >
 > ...
From: Peter Zijlstra
Date: Friday, October 29, 2010 - 12:49 pm

I don't like littering the actual kernel code with tools/perf/
ifdeffery..
--

From: Ingo Molnar
Date: Saturday, October 30, 2010 - 12:21 pm

Yeah - could we somehow accept that file into a perf build as-is?

Thanks,

	Ingo
--

From: Miao Xie
Date: Sunday, December 19, 2010 - 11:30 pm

I agree with your idea, but Ma Ling said this way may cause an i-cache
miss problem.
   http://marc.info/?l=linux-kernel&m=128746120107953&w=2
(The size of the i-cache is 32 KB, the size of memcpy() in my patch is 560 bytes,
and the size of the latest version in the tip tree is 400 bytes.)

But I have not tested it, so I don't know the real result. Maybe we should

They are a Core2 Duo E7300 (core name: Wolfdale) and a Xeon X5260 (core name: Wolfdale-DP).

The following is the detailed information for these two CPUs:
Core2 Duo E7300:
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Duo CPU     E7300  @ 2.66GHz
stepping	: 6
cpu MHz		: 1603.000
cache size	: 3072 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm dts
bogomips	: 5319.70
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

Xeon X5260:
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           X5260  @ 3.33GHz
stepping	: 6
cpu MHz		: 1999.000
cache size	: 6144 KB
physical id	: 3
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dts tpr_shadow vnmi flexpriority
bogomips	: 6649.07
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power ...
From: Hitoshi Mitake
Date: Monday, December 20, 2010 - 8:34 am

I compared memcpy()'s icache miss behaviour using my new
--wait-on patch ( https://patchwork.kernel.org/patch/408801/ ).
The results are:

Default memcpy() of the tip tree

% sudo ./perf stat -w /tmp/perf-stat-wait -e L1-icache-load-misses

 Performance counter stats for process id '12559':

            64,328 L1-icache-load-misses

        0.106513157  seconds time elapsed

Miao Xie's memcpy()

% sudo ./perf stat -w /tmp/perf-stat-wait -e L1-icache-misses

 Performance counter stats for process id '13159':

            64,559 L1-icache-load-misses

        0.107057925  seconds time elapsed

It seems that there is no serious increase in icache misses.
# I tested perf bench mem memcpy on a Core i3 M 330 processor.

But I don't understand the cache characteristics of Intel processors well.

Thanks for your information!

Thanks,
        Hitoshi
--

From: Hitoshi Mitake
Date: Friday, November 5, 2010 - 10:10 am

On Sat, Oct 30, 2010 at 06:08, Arnaldo Carvalho de Melo

OK, it seems that I have to consider a better solution.
Could you tell me about the past problem for reference?
Your experience would be useful in this case.

-- 
Hitoshi Mitake
h.mitake@gmail.com
--
