Re: [RFC PATCH 1/4] kmemtrace: Core implementation.

Previous thread: measuring memory pressure, expose freepages and zone watermarks under /proc perhaps? by Buddy Lumpkin on Tuesday, July 22, 2008 - 2:36 pm. (1 message)

Next thread: Re: pv_ops - 2.6.26 - unable to handle kernel paging request by Jeremy Fitzhardinge on Tuesday, July 22, 2008 - 2:46 pm. (1 message)
To: <penberg@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>
Date: Tuesday, July 22, 2008 - 2:36 pm

Hi everyone,

I hopefully fixed all your previous objections. I have also set up a git tree
for anyone who'd like to try kmemtrace (gitweb URL):

http://repo.or.cz/w/linux-2.6/kmemtrace.git

Comment on the patchset and please try running kmemtrace if possible. Check
the docs for information on how to get the userspace tool and set it up.

Important: the kmemtrace-user repo went stable and I'll not alter the revision
history anymore. BTW, don't be scared if you see many errors being reported by
kmemtrace-report, this is a known issue (I could use some advice on this if
you know what's going on).

Changes since last submission:
1. fixed allocator tracing
2. wrote more documentation
3. reworked the ABI and documented it in Documentation/ABI; we don't include
kernel headers in userspace anymore
4. added support for disabling kmemtrace at boot-time
5. added provisions for disabling kmemtrace at runtime
6. changed slab allocators to use __always_inline instead of plain inline,
so that we're sure the return address is valid
7. removed some useless cast, as pointed out by Pekka Enberg

Since the changes were quite extensive, I chose not to preserve any tags such
as "Reviewed-by".

I'm waiting for your input on this.

Thanks,
Eduard

P.S.: Pekka, I followed your advice on adding a field containing the struct
size (managed to make room for it without adding to the current struct size).
This allows us to do crazy stuff in the future, like exporting the whole
stack trace on every allocation. Not sure how useful this is right now, but
let's keep the ABI extensible.

Eduard - Gabriel Munteanu (4):
kmemtrace: Core implementation.
kmemtrace: SLAB hooks.
kmemtrace: SLUB hooks.
kmemtrace: SLOB hooks.

Documentation/ABI/testing/debugfs-kmemtrace | 58 +++++++
Documentation/kernel-parameters.txt | 10 +
Documentation/vm/kmemtrace.txt | 126 ++++++++++++++
MAINTAINERS | 6 +
include/linux/kmemtrace.h ...

To: <penberg@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>
Date: Tuesday, July 22, 2008 - 2:36 pm

kmemtrace provides tracing for slab allocator functions, such as kmalloc,
kfree, kmem_cache_alloc, kmem_cache_free etc.. Collected data is then fed
to the userspace application in order to analyse allocation hotspots,
internal fragmentation and so on, making it possible to see how well an
allocator performs, as well as debug and profile kernel code.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
Documentation/ABI/testing/debugfs-kmemtrace | 58 +++++++
Documentation/kernel-parameters.txt | 10 +
Documentation/vm/kmemtrace.txt | 126 ++++++++++++++
MAINTAINERS | 6 +
include/linux/kmemtrace.h | 110 ++++++++++++
init/main.c | 2 +
lib/Kconfig.debug | 28 +++
mm/Makefile | 2 +-
mm/kmemtrace.c | 244 +++++++++++++++++++++++++++
9 files changed, 585 insertions(+), 1 deletions(-)
create mode 100644 Documentation/ABI/testing/debugfs-kmemtrace
create mode 100644 Documentation/vm/kmemtrace.txt
create mode 100644 include/linux/kmemtrace.h
create mode 100644 mm/kmemtrace.c

diff --git a/Documentation/ABI/testing/debugfs-kmemtrace b/Documentation/ABI/testing/debugfs-kmemtrace
new file mode 100644
index 0000000..466c2bb
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-kmemtrace
@@ -0,0 +1,58 @@
+What: /sys/kernel/debug/kmemtrace/
+Date: July 2008
+Contact: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
+Description:
+
+In kmemtrace-enabled kernels, the following files are created:
+
+/sys/kernel/debug/kmemtrace/
+ cpu<n> (0400) Per-CPU tracing data, see below. (binary)
+ total_overruns (0400) Total number of bytes which were dropped from
+ cpu<n> files because of full buffer condition,
+ non-binary. (text)
+ abi_version (0400) Kernel's kmemtrace ABI version. (text)
+
+Each per-CPU file should be re...

To: Eduard - Gabriel Munteanu <eduard.munteanu@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>, <mathieu.desnoyers@...>
Date: Monday, July 28, 2008 - 5:24 am

Hi,

[I'm cc'ing Mathieu if he wants to comment on this.]

--

To: Pekka Enberg <penberg@...>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@...>, <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>, Steven Rostedt <rostedt@...>, Thomas Gleixner <tglx@...>
Date: Monday, July 28, 2008 - 12:29 pm

Hmm ? why record an invalid event ?? I see it's not used in the code, is

If only valid for event ID 1 and only in NUMA case, please don't waste

8 bytes for GFP flags ?? Whoah, that's a lot of one-hot bits ! :) I knew

Using a magic number in the trace header lets you deal with
cross-endianness.

Saving the type sizes in the trace header lets you deal with different

LTTng will be faster though : per-cpu atomic ops instead of interrupt

Automatic description of markers and dynamic assignation of IDs to

Change the documentation to prefix a root command line by "#" instead of

What in the world can be causing that ? Shouldn't it be fixed ? It might
be due to unexpected allocator behavior, non-instrumented alloc/free

See below for detail, but this event record is way too big and not

Isn't this overridable by a command line param ? Shouldn't it be called

Hrm, I'd leave that as a kernel command line option, not config option.

Argh, and you do a supplementary copy here. You could simply alias the
buffers and write directly to them after reserving the correct amount of

ktime_get is monotonic, but with potentially coarse granularity. I see
that you use ktime_to_ns here, which gives you a resolution of 1 timer
tick in the case where the TSCs are not synchronized. While it should be
"good enough" for the scheduler, I doubt it's enough for a tracer.

It also takes the xtime seqlock, which adds a potentially big delay to
the tracing code (if you read the clock while the writer lock is taken).

Also, when NTP modifies the clock, although it stays monotonic, the rate
at which it increments can dramatically change. I doubt you want to use

I think the standard is to use =0, =1 here, not =yes, =no ?

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>, Steven Rostedt <rostedt@...>, Thomas Gleixner <tglx@...>
Date: Monday, July 28, 2008 - 1:35 pm

The relay interface is really inconsistent and produces erroneous output
if it's not used in a specific way. It's nice to be able to catch these

Pekka suggested we use types that have constant size on every arch. I

This could change too, but if the number of GFP flags is too close to
32, I'd rather keep the ABI stable, providing for a larger number of GFP

Hmm, I'll look at lttng's code and see what exactly you are talking

ASCII/text? :)
I'm not sure what you meant, but non-binary traces would result in huge

Sure, cross-endianness is not even currently implemented in the

I'm not sure how one could record a timestamp orderly into relay buffers

Dynamic assignation makes it hard to preserve ABI compatibility with the
userspace app. And Pekka suggested it's important to preserve it in
order to allow distros to include kmemtrace.

Yes, it's probably better. I just wanted to avoid a user taking that as

Of course it will be fixed. But this FAQ entry also serves as a warning
that future allocator patches could introduce untraced functions.

Okay, will rebase my patches on a tracepoints-enabled branch. How close

Not quite true. I saw a few kernel subsystems that provide compile-time opt=
ions

What would you suggest instead?

Please keep in mind timer resolution is not that critical, we're not
benchmarking the allocators cycle-wise; instead we're merely looking at
allocation lifetimes, fragmentation, patterns etc.. Timestamps are most

68

Thanks for your comments.

Cheers,
Eduard

To: Eduard - Gabriel Munteanu <eduard.munteanu@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>, Steven Rostedt <rostedt@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, July 29, 2008 - 4:25 am

Hi Eduard-Gabriel,

Mathieu does have a good point of optimizing the memory use of an
individual event so I'm okay with that. But we really don't want to
force people the analyze the dump on same architecture where we captured
it. So as long as that is taken care of, I'm happy.

Pekka

--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Eduard - Gabriel Munteanu <eduard.munteanu@...>, <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, Steven Rostedt <rostedt@...>, Thomas Gleixner <tglx@...>, Michael Kerrisk <michael.kerrisk@...>
Date: Monday, July 28, 2008 - 1:09 pm

We've also a new Documentation/ maintainer these days (in addition to
Randy) who should get cc:ed:

--
Mathematics is the supreme nostalgia of our time.

--

To: <penberg@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>
Date: Tuesday, July 22, 2008 - 2:36 pm

This adds hooks for the SLAB allocator, to allow tracing with kmemtrace.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
include/linux/slab_def.h | 68 ++++++++++++++++++++++++++++++++++++++------
mm/slab.c | 71 +++++++++++++++++++++++++++++++++++++++++----
2 files changed, 123 insertions(+), 16 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 39c3a5e..7555ce9 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -14,6 +14,7 @@
#include <asm/page.h> /* kmalloc_sizes.h needs PAGE_SIZE */
#include <asm/cache.h> /* kmalloc_sizes.h needs L1_CACHE_BYTES */
#include <linux/compiler.h>
+#include <linux/kmemtrace.h>

/* Size description struct for general caches. */
struct cache_sizes {
@@ -28,8 +29,26 @@ extern struct cache_sizes malloc_sizes[];
void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
void *__kmalloc(size_t size, gfp_t flags);

-static inline void *kmalloc(size_t size, gfp_t flags)
+#ifdef CONFIG_KMEMTRACE
+extern void *kmem_cache_alloc_notrace(struct kmem_cache *cachep, gfp_t flags);
+extern size_t slab_buffer_size(struct kmem_cache *cachep);
+#else
+static __always_inline void *
+kmem_cache_alloc_notrace(struct kmem_cache *cachep, gfp_t flags)
{
+ return kmem_cache_alloc(cachep, flags);
+}
+static inline size_t slab_buffer_size(struct kmem_cache *cachep)
+{
+ return 0;
+}
+#endif
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+ struct kmem_cache *cachep;
+ void *ret;
+
if (__builtin_constant_p(size)) {
int i = 0;

@@ -50,10 +69,17 @@ static inline void *kmalloc(size_t size, gfp_t flags)
found:
#ifdef CONFIG_ZONE_DMA
if (flags & GFP_DMA)
- return kmem_cache_alloc(malloc_sizes[i].cs_dmacachep,
- flags);
+ cachep = malloc_sizes[i].cs_dmacachep;
+ else
#endif
- return kmem_cache_alloc(malloc_sizes[i].cs_cachep, flags);
+ cachep = malloc_sizes[i].cs_cachep;
+
...

To: Eduard - Gabriel Munteanu <eduard.munteanu@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>, <vegard.nossum@...>
Date: Monday, July 28, 2008 - 5:37 am

You're introducing this to work around the fact that struct kmem_cache
is defined in mm/slab.c? Please note that for kmemcheck we already moved
struct kmem_cache to a header file so we should be able to use the same

--

To: <penberg@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>
Date: Tuesday, July 22, 2008 - 2:36 pm

This adds hooks for the SLUB allocator, to allow tracing with kmemtrace.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
include/linux/slub_def.h | 53 ++++++++++++++++++++++++++++++++++--
mm/slub.c | 66 +++++++++++++++++++++++++++++++++++++++++----
2 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d117ea2..d77012a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -10,6 +10,7 @@
#include <linux/gfp.h>
#include <linux/workqueue.h>
#include <linux/kobject.h>
+#include <linux/kmemtrace.h>

enum stat_item {
ALLOC_FASTPATH, /* Allocation from cpu slab */
@@ -203,13 +204,31 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
void *__kmalloc(size_t size, gfp_t flags);

+#ifdef CONFIG_KMEMTRACE
+extern void *kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags);
+#else
+static __always_inline void *
+kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags)
+{
+ return kmem_cache_alloc(s, gfpflags);
+}
+#endif
+
static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
{
- return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
+ unsigned int order = get_order(size);
+ void *ret = (void *) __get_free_pages(flags, order);
+
+ kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, _THIS_IP_, ret,
+ size, PAGE_SIZE << order, flags);
+
+ return ret;
}

static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
+ void *ret;
+
if (__builtin_constant_p(size)) {
if (size > PAGE_SIZE)
return kmalloc_large(size, flags);
@@ -220,7 +239,13 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
if (!s)
return ZERO_SIZE_PTR;

- return kmem_cache_alloc(s, flags);
+ ret = kmem_cache_alloc_notrace(s, flags);
+
+ kmemtrace_mark_alloc(KME...

To: Eduard - Gabriel Munteanu <eduard.munteanu@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>
Date: Monday, July 28, 2008 - 5:40 am

--

To: <penberg@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>
Date: Tuesday, July 22, 2008 - 2:36 pm

This adds hooks for the SLOB allocator, to allow tracing with kmemtrace.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
include/linux/slob_def.h | 9 +++++----
mm/slob.c | 37 +++++++++++++++++++++++++++++++------
2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
index 59a3fa4..0ec00b3 100644
--- a/include/linux/slob_def.h
+++ b/include/linux/slob_def.h
@@ -3,14 +3,15 @@

void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);

-static inline void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static __always_inline void *kmem_cache_alloc(struct kmem_cache *cachep,
+ gfp_t flags)
{
return kmem_cache_alloc_node(cachep, flags, -1);
}

void *__kmalloc_node(size_t size, gfp_t flags, int node);

-static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
{
return __kmalloc_node(size, flags, node);
}
@@ -23,12 +24,12 @@ static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
* kmalloc is the normal method of allocating memory
* in the kernel.
*/
-static inline void *kmalloc(size_t size, gfp_t flags)
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
return __kmalloc_node(size, flags, -1);
}

-static inline void *__kmalloc(size_t size, gfp_t flags)
+static __always_inline void *__kmalloc(size_t size, gfp_t flags)
{
return kmalloc(size, flags);
}
diff --git a/mm/slob.c b/mm/slob.c
index a3ad667..23375ed 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -65,6 +65,7 @@
#include <linux/module.h>
#include <linux/rcupdate.h>
#include <linux/list.h>
+#include <linux/kmemtrace.h>
#include <asm/atomic.h>

/*
@@ -463,27 +464,38 @@ void *__kmalloc_node(size_t size, gfp_t gfp, int node)
{
unsigned int *m;
int align = max(ARCH_KMALLOC_MINA...

To: Eduard - Gabriel Munteanu <eduard.munteanu@...>
Cc: <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>, <mpm@...>
Date: Monday, July 28, 2008 - 5:41 am

The patch description could be little less terse and maybe explain the
__always_inline changes but:

--

To: Eduard - Gabriel Munteanu <eduard.munteanu@...>
Cc: <penberg@...>, <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>
Date: Tuesday, July 22, 2008 - 4:53 pm

Why is this needed? builtin_return?

--
Mathematics is the supreme nostalgia of our time.

--

To: Matt Mackall <mpm@...>
Cc: <penberg@...>, <cl@...>, <linux-mm@...>, <linux-kernel@...>, <rdunlap@...>
Date: Tuesday, July 22, 2008 - 5:07 pm

If we don't use __always_inline, we can't be sure whether it's inlined
or not. And we don't know if we need _THIS_IP_ or _RET_IP_ (equivalent
to __builtin_return_address(0)). Simple, plain 'inline' does not
guarantee GCC will inline that function, nor does it warn us if it is
not inlined.

--

Previous thread: measuring memory pressure, expose freepages and zone watermarks under /proc perhaps? by Buddy Lumpkin on Tuesday, July 22, 2008 - 2:36 pm. (1 message)

Next thread: Re: pv_ops - 2.6.26 - unable to handle kernel paging request by Jeremy Fitzhardinge on Tuesday, July 22, 2008 - 2:46 pm. (1 message)