Re: [PATCH 1/5] kmemtrace: Core implementation.

Previous thread: resume from s2ram regression (bisected to ftrace...) by Marcin Slusarz on Sunday, August 10, 2008 - 10:08 am. (6 messages)

Next thread: Boot problem by mengualjeanphi on Sunday, August 10, 2008 - 10:25 am. (1 message)
From: Eduard - Gabriel Munteanu
Date: Sunday, August 10, 2008 - 10:14 am

Hi everybody,

As usual, the kmemtrace userspace repo is located at
git://repo.or.cz/kmemtrace-user.git

It's not updated now, but I will rebase it. So re-clone it, don't just
git-rebase it. The changes were too extensive and I'd like to keep the
revision history clean.

Changes in kmemtrace:
- new ABI, supports variable sized packets and it's much shorter (it has
specific fields for allocations)
- we'll use splice() in userspace
- replaced timestamps with sequence numbers, since timestamps don't have a good
enough resolution (though they could be added as an additional feature)
- used relay_reserve() as Mathieu Desnoyers suggested
- moved additional docs into a different commit and documented the replacement
of inline with __always_inline in those commits

Please have a look and let me know what you think.

Eduard - Gabriel Munteanu (5):
  kmemtrace: Core implementation.
  kmemtrace: Additional documentation.
  kmemtrace: SLAB hooks.
  kmemtrace: SLUB hooks.
  kmemtrace: SLOB hooks.

 Documentation/ABI/testing/debugfs-kmemtrace |   71 ++++++
 Documentation/kernel-parameters.txt         |   10 +
 Documentation/vm/kmemtrace.txt              |  126 ++++++++++
 MAINTAINERS                                 |    6 +
 include/linux/kmemtrace.h                   |   85 +++++++
 include/linux/slab_def.h                    |   68 +++++-
 include/linux/slob_def.h                    |    9 +-
 include/linux/slub_def.h                    |   53 ++++-
 init/main.c                                 |    2 +
 lib/Kconfig.debug                           |   28 +++
 mm/Makefile                                 |    2 +-
 mm/kmemtrace.c                              |  335 +++++++++++++++++++++++++++
 mm/slab.c                                   |   71 +++++-
 mm/slob.c                                   |   37 +++-
 mm/slub.c                                   |   66 +++++-
 15 files changed, 933 insertions(+), 36 deletions(-)
 create mode 100644 ...
From: Eduard - Gabriel Munteanu
Date: Sunday, August 10, 2008 - 10:14 am

kmemtrace provides tracing for slab allocator functions, such as kmalloc,
kfree, kmem_cache_alloc, kmem_cache_free etc.. Collected data is then fed
to the userspace application in order to analyse allocation hotspots,
internal fragmentation and so on, making it possible to see how well an
allocator performs, as well as debug and profile kernel code.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 Documentation/kernel-parameters.txt |   10 +
 MAINTAINERS                         |    6 +
 include/linux/kmemtrace.h           |   85 +++++++++
 init/main.c                         |    2 +
 lib/Kconfig.debug                   |   28 +++
 mm/Makefile                         |    2 +-
 mm/kmemtrace.c                      |  335 +++++++++++++++++++++++++++++++++++
 7 files changed, 467 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/kmemtrace.h
 create mode 100644 mm/kmemtrace.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b52f47d..446a257 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -49,6 +49,7 @@ parameter is applicable:
 	ISAPNP	ISA PnP code is enabled.
 	ISDN	Appropriate ISDN support is enabled.
 	JOY	Appropriate joystick support is enabled.
+	KMEMTRACE kmemtrace is enabled.
 	LIBATA  Libata driver is enabled
 	LP	Printer support is enabled.
 	LOOP	Loopback device support is enabled.
@@ -941,6 +942,15 @@ and is between 256 and 4096 characters. It is defined in the file
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.
 
+	kmemtrace.enable=	[KNL,KMEMTRACE] Format: { yes | no }
+				Controls whether kmemtrace is enabled
+				at boot-time.
+
+	kmemtrace.subbufs=n	[KNL,KMEMTRACE] Overrides the number of
+			subbufs kmemtrace's relay channel has. Set this
+			higher than default (KMEMTRACE_N_SUBBUFS in code) if
+			you experience buffer overruns.
+
 	movablecore=nn[KMG]	[KNL,X86-32,IA-64,PPC,X86-64] This ...
From: Eduard - Gabriel Munteanu
Date: Sunday, August 10, 2008 - 10:14 am

Documented kmemtrace's ABI, purpose and design. Also includes a short
usage guide, FAQ, as well as a link to the userspace application's Git
repository, which is currently hosted at repo.or.cz.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 Documentation/ABI/testing/debugfs-kmemtrace |   71 +++++++++++++++
 Documentation/vm/kmemtrace.txt              |  126 +++++++++++++++++++++++++++
 2 files changed, 197 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/testing/debugfs-kmemtrace
 create mode 100644 Documentation/vm/kmemtrace.txt

diff --git a/Documentation/ABI/testing/debugfs-kmemtrace b/Documentation/ABI/testing/debugfs-kmemtrace
new file mode 100644
index 0000000..a5ff9a6
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-kmemtrace
@@ -0,0 +1,71 @@
+What:		/sys/kernel/debug/kmemtrace/
+Date:		July 2008
+Contact:	Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
+Description:
+
+In kmemtrace-enabled kernels, the following files are created:
+
+/sys/kernel/debug/kmemtrace/
+	cpu<n>		(0400)	Per-CPU tracing data, see below. (binary)
+	total_overruns	(0400)	Total number of bytes which were dropped from
+				cpu<n> files because of full buffer condition,
+				non-binary. (text)
+	abi_version	(0400)	Kernel's kmemtrace ABI version. (text)
+
+Each per-CPU file should be read according to the relay interface. That is,
+the reader should set affinity to that specific CPU and, as currently done by
+the userspace application (though there are other methods), use poll() with
+an infinite timeout before every read(). Otherwise, erroneous data may be
+read. The binary data has the following _core_ format:
+
+	Event ID	(1 byte)	Unsigned integer, one of:
+		0 - represents an allocation (KMEMTRACE_EVENT_ALLOC)
+		1 - represents a freeing of previously allocated memory
+		    (KMEMTRACE_EVENT_FREE)
+	Type ID		(1 byte)	Unsigned integer, one of:
+		0 - this is a kmalloc() / kfree()
+		1 - this is a kmem_cache_alloc() / ...
From: Eduard - Gabriel Munteanu
Date: Sunday, August 10, 2008 - 10:14 am

This adds hooks for the SLAB allocator, to allow tracing with kmemtrace.

We also convert some inline functions to __always_inline to make sure
_RET_IP_, which expands to __builtin_return_address(0), always works
as expected.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/linux/slab_def.h |   68 ++++++++++++++++++++++++++++++++++++++------
 mm/slab.c                |   71 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 123 insertions(+), 16 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 39c3a5e..7555ce9 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -14,6 +14,7 @@
 #include <asm/page.h>		/* kmalloc_sizes.h needs PAGE_SIZE */
 #include <asm/cache.h>		/* kmalloc_sizes.h needs L1_CACHE_BYTES */
 #include <linux/compiler.h>
+#include <linux/kmemtrace.h>
 
 /* Size description struct for general caches. */
 struct cache_sizes {
@@ -28,8 +29,26 @@ extern struct cache_sizes malloc_sizes[];
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
-static inline void *kmalloc(size_t size, gfp_t flags)
+#ifdef CONFIG_KMEMTRACE
+extern void *kmem_cache_alloc_notrace(struct kmem_cache *cachep, gfp_t flags);
+extern size_t slab_buffer_size(struct kmem_cache *cachep);
+#else
+static __always_inline void *
+kmem_cache_alloc_notrace(struct kmem_cache *cachep, gfp_t flags)
 {
+	return kmem_cache_alloc(cachep, flags);
+}
+static inline size_t slab_buffer_size(struct kmem_cache *cachep)
+{
+	return 0;
+}
+#endif
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *cachep;
+	void *ret;
+
 	if (__builtin_constant_p(size)) {
 		int i = 0;
 
@@ -50,10 +69,17 @@ static inline void *kmalloc(size_t size, gfp_t flags)
 found:
 #ifdef CONFIG_ZONE_DMA
 		if (flags & GFP_DMA)
-			return kmem_cache_alloc(malloc_sizes[i].cs_dmacachep,
-						flags);
+			cachep = ...
From: Eduard - Gabriel Munteanu
Date: Sunday, August 10, 2008 - 10:14 am

This adds hooks for the SLUB allocator, to allow tracing with kmemtrace.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/linux/slub_def.h |   53 ++++++++++++++++++++++++++++++++++--
 mm/slub.c                |   66 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d117ea2..d77012a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -10,6 +10,7 @@
 #include <linux/gfp.h>
 #include <linux/workqueue.h>
 #include <linux/kobject.h>
+#include <linux/kmemtrace.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -203,13 +204,31 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
+#ifdef CONFIG_KMEMTRACE
+extern void *kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags);
+#else
+static __always_inline void *
+kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags)
+{
+	return kmem_cache_alloc(s, gfpflags);
+}
+#endif
+
 static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
 {
-	return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
+	unsigned int order = get_order(size);
+	void *ret = (void *) __get_free_pages(flags, order);
+
+	kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, _THIS_IP_, ret,
+			     size, PAGE_SIZE << order, flags);
+
+	return ret;
 }
 
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
+	void *ret;
+
 	if (__builtin_constant_p(size)) {
 		if (size > PAGE_SIZE)
 			return kmalloc_large(size, flags);
@@ -220,7 +239,13 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
 			if (!s)
 				return ZERO_SIZE_PTR;
 
-			return kmem_cache_alloc(s, flags);
+			ret = kmem_cache_alloc_notrace(s, flags);
+
+			kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC,
+					     ...
From: Eduard - Gabriel Munteanu
Date: Sunday, August 10, 2008 - 10:14 am

This adds hooks for the SLOB allocator, to allow tracing with kmemtrace.

We also convert some inline functions to __always_inline to make sure
_RET_IP_, which expands to __builtin_return_address(0), always works
as expected.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/linux/slob_def.h |    9 +++++----
 mm/slob.c                |   37 +++++++++++++++++++++++++++++++------
 2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
index 59a3fa4..0ec00b3 100644
--- a/include/linux/slob_def.h
+++ b/include/linux/slob_def.h
@@ -3,14 +3,15 @@
 
 void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
 
-static inline void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static __always_inline void *kmem_cache_alloc(struct kmem_cache *cachep,
+					      gfp_t flags)
 {
 	return kmem_cache_alloc_node(cachep, flags, -1);
 }
 
 void *__kmalloc_node(size_t size, gfp_t flags, int node);
 
-static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	return __kmalloc_node(size, flags, node);
 }
@@ -23,12 +24,12 @@ static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * kmalloc is the normal method of allocating memory
  * in the kernel.
  */
-static inline void *kmalloc(size_t size, gfp_t flags)
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	return __kmalloc_node(size, flags, -1);
 }
 
-static inline void *__kmalloc(size_t size, gfp_t flags)
+static __always_inline void *__kmalloc(size_t size, gfp_t flags)
 {
 	return kmalloc(size, flags);
 }
diff --git a/mm/slob.c b/mm/slob.c
index a3ad667..23375ed 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -65,6 +65,7 @@
 #include <linux/module.h>
 #include <linux/rcupdate.h>
 #include <linux/list.h>
+#include <linux/kmemtrace.h>
 #include <asm/atomic.h>
 
 /*
@@ -463,27 ...
From: Pekka Enberg
Date: Sunday, August 10, 2008 - 10:48 am

On Sun, Aug 10, 2008 at 8:14 PM, Eduard - Gabriel Munteanu

I think Matt acked this already but as you dropped the tags, I'll ask
once more before I merge this.
--

From: Matt Mackall
Date: Sunday, August 10, 2008 - 4:18 pm

Yeah, that's fine.

Acked-by: Matt Mackall <mpm@selenic.com>

-- 
Mathematics is the supreme nostalgia of our time.

--

From: Pekka Enberg
Date: Monday, August 11, 2008 - 11:46 pm

Applied, thanks!
--

From: Christoph Lameter
Date: Monday, August 11, 2008 - 7:04 am

You could simplify the stuff in slub.h if you would fall back to the uninlined
functions in the case that kmemtrace is enabled. IMHO adding additional inline

_RET_IP == __builtin_return_address(0) right? Put that into a local variable?
At least we need consistent usage within one function. Maybe convert



And again.


And another one.


--

From: Pekka Enberg
Date: Monday, August 11, 2008 - 7:09 am

So, if CONFIG_KMEMTRACE is enabled, make the inlined version go away
completely? I'm okay with that though I wonder if that means we now take
a performance hit when CONFIG_KMEMTRACE is enabled but tracing is

I think we should just convert SLUB to use _RET_IP_ everywhere. Eduard,
care to make a patch and send it and rebase this on top of that?

--

From: Christoph Lameter
Date: Monday, August 11, 2008 - 7:13 am

We already take a performance hit because of the additional function calls.

With the above approach the kernel binary will grow significantly because you
are now inserting an additional function call at all call sites.
--

From: Pekka Enberg
Date: Monday, August 11, 2008 - 7:16 am

The function call is supposed to go away when we convert kmemtrace to
use Mathieu's markers but I suppose even then we have a problem with
inlining?

--

From: Christoph Lameter
Date: Monday, August 11, 2008 - 7:21 am

The function calls are overwritten with NOPs? Or how does that work?


--

From: Pekka Enberg
Date: Monday, August 11, 2008 - 7:22 am

I have no idea. Mathieu, Eduard?

--

From: Eduard - Gabriel Munteanu
Date: Tuesday, August 12, 2008 - 8:29 am

Yes, the code is patched at runtime. But AFAIK markers already provide
this stuff (called "immediate values"). Mathieu's tracepoints also do
it. But it's not available on all arches. x86 and x86-64 work as far as
I remember.

--

From: Mathieu Desnoyers
Date: Tuesday, August 12, 2008 - 8:43 am

The markers present in mainline kernel does not use immediate values.
However, immediate values in tip does implement a load
immediate/test/branch for x86, x86_64 and powerpc. I also have support
for sparc64 in my lttng tree.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Matt Mackall
Date: Tuesday, August 12, 2008 - 7:09 pm

Did we ever see size(1) numbers for kernels with and without this
support? I'm still a bit worried about adding branches to such a popular
inline. Simply multiplying the branch test by the number of locations is
pretty substantial, never mind the unlikely part of the branch.

-- 
Mathematics is the supreme nostalgia of our time.

--

From: Steven Rostedt
Date: Monday, August 11, 2008 - 7:36 am

I believe in the latest version they are just a variable test. But when 
Mathieu's immediate code makes it in (which it is in linux-tip), we will 
be overwriting the conditionals with nops (Mathieu, correct me if I'm 
wrong here).

But the calls themselves are done in the unlikely branch. This is 
important, as Mathieu stated in previous thread. The reason is that all 
the stack setup for the function call is also in the unlikely branch, and 
the normal fast path does not take a hit for the function call setup.

-- Steve

--

From: Mathieu Desnoyers
Date: Monday, August 11, 2008 - 11:28 am

The current immediate values in tip does a load immediate, test, branch,
which removes the cost of the memory load. We will try to get gcc
support to be able to declare patchable static jump sites, which could
be patched with NOPs when disabled. But that will probably not happen
"now".


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Monday, August 11, 2008 - 7:30 am

The kmemtrace_mark_alloc_node itself is an inline function, which calls 
another inline function "trace_mark" which is designed to test a 
read_mostly variable, and will do an "unlikely" jmp if the variable is 
set (which it is when tracing is enabled), to the actual function call.

There should be no extra function calls when this is configured on but 
tracing disabled. We try very hard to keep the speed of the tracer as 
close to a non tracing kernel as possible when tracing is disabled.

-- Steve

--

From: Christoph Lameter
Date: Monday, August 11, 2008 - 7:37 am

Makes sense. But then we have even more code bloat because of the tests that
are inserted in all call sites of kmalloc.



--

From: Frank Ch. Eigler
Date: Monday, August 11, 2008 - 8:34 am

Are you talking about the tests that implement checking whether a
marker is active or not?  Those checks are already efficient, and will
get more so with the "immediate values" optimization in or near the
tree.

- FChE
--

From: Christoph Lameter
Date: Monday, August 11, 2008 - 8:48 am

AFAICT: Each test also adds an out of line call to the tracing facility.
--

From: Steven Rostedt
Date: Monday, August 11, 2008 - 8:54 am

Frank,

Christoph is correct. He's not bringing up the issue of efficiency, but 
the issue of bloat.

The marker code will be added to everyplace that calls kmalloc. Which can 
be quite a lot.  I'd be interested in seeing the size of the .text section 
with and without this patch added and makers configure in.

-- Steve

--

From: Frank Ch. Eigler
Date: Monday, August 11, 2008 - 8:57 am

Hi -


Yes, but that call is normally placed out of the cache-hot path with unlikely().

- FChE
--

From: Mathieu Desnoyers
Date: Monday, August 11, 2008 - 11:29 am

The long-term goal is to turn the tests into NOPs, but only once we get
gcc support.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Eduard - Gabriel Munteanu
Date: Tuesday, August 12, 2008 - 8:25 am

Oh, good. I'm also thinking to add a macro that expands to simple inline when

Sure. Will get back soon.

--

From: Pekka Enberg
Date: Monday, August 11, 2008 - 11:46 pm

Applied, thanks!
--

From: Pekka Enberg
Date: Monday, August 11, 2008 - 11:46 pm

Applied, thanks!
--

From: Randy Dunlap
Date: Monday, August 18, 2008 - 12:57 pm

Why is this "- 4 bytes"?  Is there an implied alignment byte somewhere in the


...


---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/
--

From: Pekka Enberg
Date: Monday, August 11, 2008 - 11:46 pm

Applied, thanks!
--

Previous thread: resume from s2ram regression (bisected to ftrace...) by Marcin Slusarz on Sunday, August 10, 2008 - 10:08 am. (6 messages)

Next thread: Boot problem by mengualjeanphi on Sunday, August 10, 2008 - 10:25 am. (1 message)