Re: LatencyTOP infrastructure patch

Previous thread: [PATCH] [0/7] Add noclflush option and related cleanup by Andi Kleen on Friday, January 18, 2008 - 1:27 pm. (14 messages)

Next thread: CONFIG_PREEMPT and real time kthreads by Miles Bintz on Friday, January 18, 2008 - 2:03 pm. (1 message)
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 1:36 pm

The Intel Open Source Technology Center is pleased to announce the
release of version 0.1 of LatencyTOP, a tool for developers to visualize
system latencies.

http://www.latencytop.org

Slow servers, Skipping audio, Jerky video --everyone knows the symptoms
of latency. But to know what's really going on in the system, what's causing
the latency, and how to fix it... those are difficult questions without
good answers right now.

LatencyTOP is a Linux tool for software developers (both kernel and
userspace), aimed at identifying where system latency occurs, and what
kind of operation/action is causing the latency to happen. By identifying
this, developers can then change the code to avoid the worst latency
hiccups.

There are many types and causes of latency, and LatencyTOP focus on type
that causes audio skipping and desktop stutters. Specifically, LatencyTOP
focuses on the cases where the applications want to run and execute useful
code, but there's some resource that's not currently available (and the
kernel then blocks the process). This is done both on a system level and
on a per process level, so that you can see what's happening to the system,
and which process is suffering and/or causing the delays.

Rather than spending a thousand words on describing what LatencyTOP is,
I'll let the numbers speak..

Here's some output of LatencyTOP, collected for a make -j4 of a kernel
on a quad core system:

Cause Maximum Average
process fork 1097.7 msec 2.5 msec
Reading from file 1097.0 msec 0.1 msec
updating atime 850.4 msec 60.1 msec
Locking buffer head 433.1 msec 94.3 msec
Writing to file 381.8 msec 0.6 msec
Synchronous bufferhead read 318.5 msec 16.3 msec
Waiting for buffer IO 298.8 msec 7.8 msec

This sho...

To: Arjan van de Ven <arjan@...>
Cc: <kosaki.motohiro@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Sunday, January 20, 2008 - 8:52 pm

Interesting.

at least, I will use for fixed page lock contension :)

if possible, I want some /proc/lock_stat feature.
i.e. holding time(avg, max), contention target ip...

and, this patche become more powerful if spinlock change sleepable patch applied.
(may be, rt guy have it)

- kosaki

--

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Saturday, January 19, 2008 - 1:07 am

Just for completeness -- Linux already had a way to profile latencies
since quite some time. It's little known unfortunately and doesn't
work for modules since it's a special mode in the old non modular kernel
profiler.

You enable CONFIG_SCHEDSTATS and boot with profile=sleep and then you can
use the readprofile command to read the data. Information can be reset with
echo > /proc/profile

There's also a profile=sched to profile the scheduler which works even
without CONFIG_SCHEDSTATS

Latencytop will be probably a little more user friendly though.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Saturday, January 19, 2008 - 1:16 am

yes indeed; I sort of use the same infrastructure inside the scheduler; the biggest
reason I felt I had to do something different was that I wanted to do per process
data collection, so that you can see for a specific process what was going on.
--

To: Arjan van de Ven <arjan@...>
Cc: Andi Kleen <andi@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Saturday, January 19, 2008 - 1:27 am

Wouldn't it have been easier then to just extend the sleep profiler to
oprofile? oprofile already has pid filters and can do per process
profiling.

On the other hand I'm not fully sure only doing per pid profiling
is that useful. After all often latencies come from asynchronous
threads (like kblockd). So a system level view is probably better
anyways.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Saturday, January 19, 2008 - 1:25 am

another thing that the current profiling can't do, is to show what the system is doing
when it hits the latency.. so someone calling fsync() will show up in the waiting for
IO function, but not that it was due to an fsync().
--

To: Arjan van de Ven <arjan@...>
Cc: Andi Kleen <andi@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Saturday, January 19, 2008 - 1:33 am

Hmm so how about extending oprofile to always log the syscall number
in the event logs (can be gotten from top of stack). I think given
that you could reconstruct that data in the userland at least
for single threads (not for work done on behalf of them in other
threads; but I'm not sure you tried to solve that problem at all)

The advantage is that it would be an generic mechanism that would work
for all types of profiling.

--

To: Andi Kleen <andi@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Saturday, January 19, 2008 - 1:34 am

syscall nr and pid at minimum then.
Still doesn't work for modules either.

what it ends up doing is using an entirely different interface for basically the
same code / operation inside the kernel.
The current interface code is maybe 80 lines of /proc code... and very simple to
use (unlike the oprofile interface)
--

To: Arjan van de Ven <arjan@...>
Cc: Andi Kleen <andi@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Saturday, January 19, 2008 - 2:08 am

> syscall nr and pid at minimum then.

oprofile already supports logging the pid I believe. Otherwise

Well rather it uses an existing framework for something that fits
it well.

Also the way I proposed is very cheap and would be possible

The oprofile interface is per CPU (so you wouldn't need to reinvent
that to fix your locking) and if you add the syscall
logging feature to it it would apply to all profile events
e.g. you could then do things like "matching cache misses to syscalls"

-andi

--

To: Andi Kleen <andi@...>
Cc: Arjan van de Ven <arjan@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Saturday, January 19, 2008 - 1:37 am

Ok to handle exceptions like page faults this way you would need to save
the vector somewhere on entry, but that shouldn't be very costly
or difficult and could probably even be done unconditionally.

-Andi
--

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 2:27 pm

I really like this patch :-)! Just a little note, why don't make
the parameter 'char *reason' as simple integer (reason_t)?
Making it as integer will automatically drop the strncmp()
and speeding up all the things. Could be also interesting to
define _externally_ the mapping of the reason so the
userspace tool could handle it easily.

Roberto Fichera.

--

To: Roberto Fichera <kernel@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 2:35 pm

I thought about that, but the strncmp is still somewhat needed to deal with the argument
(the instrumentation above doesn't use that, but the mutex one does for example)
If we get rid of the argument entirely it'd be easier
(but then we get "blocking on mutex" rather than "blocking on inode->mutex")
--

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 2:46 pm

Roberto Fichera.

--

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 1:42 pm

A few more high level annotations for LatencyTOP;
split into a separate patch since they're at a higher level
than the first part (and thus hide some lower level details
potentially, but could also obsolete several other low level
instrumentations)
---
fs/read_write.c | 7 ++++++-
fs/readdir.c | 4 +++-
lib/kernel_lock.c | 3 +++
3 files changed, 12 insertions(+), 2 deletions(-)

Index: linux-2.6.24-rc7/fs/read_write.c
===================================================================
--- linux-2.6.24-rc7.orig/fs/read_write.c
+++ linux-2.6.24-rc7/fs/read_write.c
@@ -255,6 +255,7 @@ EXPORT_SYMBOL(do_sync_read);

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
+ struct latency_entry reason;
ssize_t ret;

if (!(file->f_mode & FMODE_READ))
@@ -264,6 +265,7 @@ ssize_t vfs_read(struct file *file, char
if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
return -EFAULT;

+ set_latency_reason("Reading from file", &reason);
ret = rw_verify_area(READ, file, pos, count);
if (ret >= 0) {
count = ret;
@@ -281,6 +283,7 @@ ssize_t vfs_read(struct file *file, char
}
}

+ restore_latency_reason(&reason);
return ret;
}

@@ -314,6 +317,7 @@ EXPORT_SYMBOL(do_sync_write);
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
ssize_t ret;
+ struct latency_entry reason;

if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
@@ -322,6 +326,7 @@ ssize_t vfs_write(struct file *file, con
if (unlikely(!access_ok(VERIFY_READ, buf, count)))
return -EFAULT;

+ set_latency_reason("Writing to file", &reason);
ret = rw_verify_area(WRITE, file, pos, count);
if (ret >= 0) {
count = ret;
@@ -338,7 +343,7 @@ ssize_t vfs_write(struct file *file, con
inc_syscw(current);
}
}
-
+ restore_latency_reason(&reason);
return ret;
}

Index: linux-2.6.24-rc7/fs/readdir.c
============================================...

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 1:40 pm

This patch contains the first set of instrumentations for latencytop;
each instrumentation needs to be evaluated for usefulness before this
can go into mainline; posting here just to show how the infrastructure
can be used

---
arch/x86/mm/fault_64.c | 4 ++++
block/ll_rw_blk.c | 3 +++
drivers/scsi/hosts.c | 1 -
drivers/scsi/scsi_error.c | 6 +++++-
drivers/scsi/scsi_lib.c | 4 ++++
drivers/scsi/sd.c | 5 +++++
drivers/scsi/sr.c | 12 +++++++++++-
fs/buffer.c | 22 +++++++++++++++++++++-
fs/eventpoll.c | 5 +++++
fs/ext3/inode.c | 30 ++++++++++++++++++++++++++++--
fs/inode.c | 13 +++++++++++++
fs/jbd/checkpoint.c | 6 ++++++
fs/jbd/commit.c | 5 +++++
fs/jbd/journal.c | 5 +++++
fs/jbd/transaction.c | 13 +++++++++++++
fs/pipe.c | 5 +++++
fs/select.c | 3 +++
fs/sync.c | 19 +++++++++++++++++--
kernel/fork.c | 4 ++++
kernel/futex.c | 8 +++++++-
kernel/mutex.c | 13 +++++++++++++
kernel/rwsem.c | 12 +++++++++++-
mm/filemap.c | 6 ++++++
mm/slub.c | 7 ++++++-
24 files changed, 200 insertions(+), 11 deletions(-)

Index: linux-2.6.24-rc7/arch/x86/mm/fault_64.c
===================================================================
--- linux-2.6.24-rc7.orig/arch/x86/mm/fault_64.c
+++ linux-2.6.24-rc7/arch/x86/mm/fault_64.c
@@ -303,6 +303,7 @@ asmlinkage void __kprobes do_page_fault(
int write, fault;
unsigned long flags;
siginfo_t info;
+ struct latency_entry reason;

/*
* We can fault from pretty much anywhere, with unknown IRQ state.
@@ -441,7 +442,10 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
+
+ set_latency_reason("page fault", &reason);
fault = handle_mm_fault(mm, vma, address, write);
+ restor...

To: Arjan van de Ven <arjan@...>
Cc: Arjan van de Ven <arjan@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 6:26 pm

Can you suggest of some reason why all this instrumentation could
not be in the form of standard markers (perhaps conditionally
compiled out if necessary)?

- FChE
--

To: Frank Ch. Eigler <fche@...>
Cc: Arjan van de Ven <arjan@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 6:33 pm

On Fri, 18 Jan 2008 17:26:09 -0500

sure. Every instrumentation you see is of the nested kind (since the lowest level
of nesting is already automatic via wchan).

If markers can provide me the following semantics, I'd be MORE than happy to use markers:

If the code path is like this

set_latency_reason("Reading file");
....
....
set_latency_reason("Allocating memory");
kmalloc() <--- blocks for 100 msec
restore_latency_reason()
....
restore_latency_reason();

via several layers of functions, the requirement is that the 100 msec is accounted
to "reading file" and not "Allocating memory". But if some other codepath hits the allocating memory function,
without an outer "set_latency_reason", it should be accounted to "Allocating memory".

If markers can provide that semantics ... you sold me.

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: Frank Ch. Eigler <fche@...>, Arjan van de Ven <arjan@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 7:11 pm

Hi -

Further to what acme said, markers are semantics-free. Callback
functions that implement your entry & exit semantics can be attached
at run time, at your pleasure. (So can systemtap probes, for that
matter.) The main difference would be that these callback functions
would have manage the per-thread LIFO data structures themselves,
instead of allocating backpointers on the kernel stack. (Bonus marks
for not modifying task_struct. :-)

- FChE
--

To: Frank Ch. Eigler <fche@...>
Cc: Arjan van de Ven <arjan@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 7:20 pm

modifying task struct to have storage space is no big deal...
--

To: Arjan van de Ven <arjan@...>
Cc: Frank Ch. Eigler <fche@...>, Arjan van de Ven <arjan@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 6:47 pm

My limited understanding of markers is that you can attach something to
them that later work out any semantics you may want to associate with
them.

So I guess that as soon as the outermost marker is reached you can
just bump a counter and when the (several) inner markers are reached and
the counter is not zero you just bump it and go on with life, leaving
the latency measurement being done to the "owner" (outermost) marker.

Frank, am I talking bullshit?

- Arnaldo
--

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 1:39 pm

This patch is the core LatencyTOP kernel infrastructure;
it measures latencies in the scheduler and tracks it system
wide and per process
---
fs/proc/base.c | 61 +++++++++
include/linux/latencytop.h | 62 +++++++++
include/linux/sched.h | 6
kernel/Makefile | 1
kernel/fork.c | 1
kernel/latencytop.c | 287 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_fair.c | 8 +
lib/Kconfig.debug | 11 +
8 files changed, 436 insertions(+), 1 deletion(-)

Index: linux-2.6.24-rc7/include/linux/latencytop.h
===================================================================
--- /dev/null
+++ linux-2.6.24-rc7/include/linux/latencytop.h
@@ -0,0 +1,62 @@
+/*
+ * latencytop.h: Infrastructure for displaying latency
+ *
+ * (C) Copyright 2008 Intel Corporation
+ * Author: Arjan van de Ven <arjan@linux.intel.com>
+ *
+ */
+
+#ifndef _INCLUDE_GUARD_LATENCYTOP_H_
+#define _INCLUDE_GUARD_LATENCYTOP_H_
+
+#ifdef CONFIG_LATENCYTOP
+
+#define LT_ARGUMENTSIZE 24
+#define LT_SAVECOUNT 32
+
+struct latency_entry {
+ char *reason;
+ char argument[LT_ARGUMENTSIZE];
+ int maxtime; /* ignore latencies above this time as
+ "user requested" to deal with select() */
+};
+
+struct latency_record {
+ char *reason;
+ char argument[LT_ARGUMENTSIZE];
+ unsigned int count;
+ unsigned long time;
+ unsigned long max;
+};
+
+
+struct task_struct;
+
+struct latency_entry *set_latency_reason(char *reason,
+ struct latency_entry *entry);
+struct latency_entry *set_latency_reason_user(char *reason,
+ struct latency_entry *entry, unsigned int max);
+struct latency_entry *set_latency_reason_param(char *reason,
+ const char *param, struct latency_entry *entry);
+void restore_latency_reason(struct latency_entry *entry);
+void account_scheduler_latency(struct task_struct *task, int usecs, int inter);
+
+void clear_all_latency_tracing(struct task_struct *p);
+#else
+struct latency...

To: Arjan van de Ven <arjan@...>
Cc: <kosaki.motohiro@...>, Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Monday, January 21, 2008 - 4:06 am

Why do you ignore kernel thread?

may be, some network filesystem use kernel thread for local I/O.
What do you think it?

- kosaki

--

To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Monday, January 21, 2008 - 12:16 pm

kernel threads (by themselves) don't tend to result in user visible latency...
at least that's my assumption so far... if you think I'm wrong... I'm open to be shown
that we should count them.
--

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Sunday, January 20, 2008 - 12:47 pm

Hello Arjan,

a few comments on the current locking scheme.

There is a single global lock and the locked section in
account_scheduler_latency() seems to be quite long:

(at worst case)

- get_wchan() ;
- account_global_scheduler_latency() which does up to MAXLR (128)
loops x2 strcmp() operations;
- up to LT_SAVECOUNT (32) x2 strcmp() operations by
account_scheduler_latency() itself.

That may induce a high latency for paths which call set_latency_reason_*().

Looking at the code, it looks like what really needs to be protected
is 'task->latency_reason'.

task->latency_record[] and a global latency_record[] are printed out
without any synchronization with account_scheduler_latency(). Is it
your intention?
That can be ok if we want to trade some preciseness for lower lock-contention.

If so,

- account_scheduler_latency() might take a snapshot of
'tsk->latency_reason' with the 'latency_lock' being held and do the
rest in a lock-less way.
Note, we have only a single writer to 'tsk->latency_record[]' at any
time due to the rq->lock to which 'tsk' belongs to being held ;

- yeah, this way the global 'latency_record[]' needs some sort of
protection as we may have concurrent writers here.

what do you think?

provided we hit a number of consequent "latencies" with explicitly
unspecified 'tsk->latency_reason', they all end up recorded in a
single 'tsk->latency_record' with "Unknown reason" and the _same_
'argument' which is the result of get_wchan() for the very _first_
"latency" in a row.

I think, it would make sense to record them separately with their

for (i = 0; i < LT_SAVECOUNT; i++) {

--
Best regards,
Dmitry Adamushko
--

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Sunday, January 20, 2008 - 1:18 pm

heh... now having read the first message in this series ("[Announce]
Development release 0.1 of the LatencyTOP tool"), I finally see that
"fine-grained locking" is still on your todo list :-)

--
Best regards,
Dmitry Adamushko
--

To: Arjan van de Ven <arjan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Friday, January 18, 2008 - 3:02 pm

[cut]

Just curious...

My biggest latency problems (as observed by me, the user) happen when a
program needs memory, or launching a new program, and the kernel begins
forces dirty memory to disk. This results in an unholy seek storm of
mixed writes (flushing, maybe a little swap) and reads (new program
loading). Streaming audio/video almost always starts freezing up during
this as well.

I don't suppose LatencyTop would help track anything down in that case,
would it?
--=20
Zan Lynx <zlynx@acm.org>

Previous thread: [PATCH] [0/7] Add noclflush option and related cleanup by Andi Kleen on Friday, January 18, 2008 - 1:27 pm. (14 messages)

Next thread: CONFIG_PREEMPT and real time kthreads by Miles Bintz on Friday, January 18, 2008 - 2:03 pm. (1 message)