[patch 3/3] LatencyTOP instrumentations part 2

Previous thread: [PATCH] [0/7] Add noclflush option and related cleanup by Andi Kleen on Friday, January 18, 2008 - 10:27 am. (12 messages)

Next thread: CONFIG_PREEMPT and real time kthreads by Miles Bintz on Friday, January 18, 2008 - 11:03 am. (1 message)
From: Arjan van de Ven
Date: Friday, January 18, 2008 - 10:36 am

The Intel Open Source Technology Center is pleased to announce the
release of version 0.1 of LatencyTOP, a tool for developers to visualize
system latencies.

http://www.latencytop.org

Slow servers, Skipping audio, Jerky video --everyone knows the symptoms
of latency. But to know what's really going on in the system, what's causing
the latency, and how to fix it... those are difficult questions without
good answers right now.

LatencyTOP is a Linux tool for software developers (both kernel and
userspace), aimed at identifying where system latency occurs, and what
kind of operation/action is causing the latency to happen. By identifying
this, developers can then change the code to avoid the worst latency
hiccups.

There are many types and causes of latency, and LatencyTOP focus on type
that causes audio skipping and desktop stutters. Specifically, LatencyTOP
focuses on the cases where the applications want to run and execute useful
code, but there's some resource that's not currently available (and the
kernel then blocks the process). This is done both on a system level and
on a per process level, so that you can see what's happening to the system,
and which process is suffering and/or causing the delays.

Rather than spending a thousand words on describing what LatencyTOP is,
I'll let the numbers speak..

Here's some output of LatencyTOP, collected for a make -j4 of a kernel
on a quad core system:

Cause                                        Maximum          Average
process fork                               1097.7 msec         2.5 msec
Reading from file                          1097.0 msec         0.1 msec
updating atime                             850.4 msec         60.1 msec
Locking buffer head                        433.1 msec         94.3 msec
Writing to file                            381.8 msec          0.6 msec
Synchronous bufferhead read                318.5 msec         16.3 msec
Waiting for buffer IO                      298.8 msec          7.8 msec


This ...
From: Zan Lynx
Date: Friday, January 18, 2008 - 12:02 pm

[cut]

Just curious...

My biggest latency problems (as observed by me, the user) happen when a
program needs memory, or launching a new program, and the kernel begins
forces dirty memory to disk.  This results in an unholy seek storm of
mixed writes (flushing, maybe a little swap) and reads (new program
loading).  Streaming audio/video almost always starts freezing up during
this as well.

I don't suppose LatencyTop would help track anything down in that case,
would it?
--=20
Zan Lynx <zlynx@acm.org>
From: Arjan van de Ven
Date: Friday, January 18, 2008 - 10:39 am

This patch is the core LatencyTOP kernel infrastructure;
it measures latencies in the scheduler and tracks it system
wide and per process
---
 fs/proc/base.c             |   61 +++++++++
 include/linux/latencytop.h |   62 +++++++++
 include/linux/sched.h      |    6 
 kernel/Makefile            |    1 
 kernel/fork.c              |    1 
 kernel/latencytop.c        |  287 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c        |    8 +
 lib/Kconfig.debug          |   11 +
 8 files changed, 436 insertions(+), 1 deletion(-)

Index: linux-2.6.24-rc7/include/linux/latencytop.h
===================================================================
--- /dev/null
+++ linux-2.6.24-rc7/include/linux/latencytop.h
@@ -0,0 +1,62 @@
+/*
+ * latencytop.h: Infrastructure for displaying latency
+ *
+ * (C) Copyright 2008 Intel Corporation
+ * Author: Arjan van de Ven <arjan@linux.intel.com>
+ *
+ */
+
+#ifndef _INCLUDE_GUARD_LATENCYTOP_H_
+#define _INCLUDE_GUARD_LATENCYTOP_H_
+
+#ifdef CONFIG_LATENCYTOP
+
+#define LT_ARGUMENTSIZE 24
+#define LT_SAVECOUNT 32
+
+struct latency_entry {
+	char	*reason;
+	char	argument[LT_ARGUMENTSIZE];
+	int	maxtime;         /* ignore latencies above this time as
+				    "user requested" to deal with select() */
+};
+
+struct latency_record {
+	char 	*reason;
+	char	argument[LT_ARGUMENTSIZE];
+	unsigned int count;
+	unsigned long time;
+	unsigned long max;
+};
+
+
+struct task_struct;
+
+struct latency_entry *set_latency_reason(char *reason,
+			struct latency_entry *entry);
+struct latency_entry *set_latency_reason_user(char *reason,
+			struct latency_entry *entry, unsigned int max);
+struct latency_entry *set_latency_reason_param(char *reason,
+			const char *param, struct latency_entry *entry);
+void restore_latency_reason(struct latency_entry *entry);
+void account_scheduler_latency(struct task_struct *task, int usecs, int inter);
+
+void clear_all_latency_tracing(struct task_struct *p);
+#else
+struct ...
From: Dmitry Adamushko
Date: Sunday, January 20, 2008 - 9:47 am

Hello Arjan,

a few comments on the current locking scheme.

There is a single global lock and the locked section in
account_scheduler_latency() seems to be quite long:

(at worst case)

- get_wchan() ;
- account_global_scheduler_latency() which does up to MAXLR (128)
loops x2 strcmp() operations;
- up to LT_SAVECOUNT (32) x2 strcmp() operations by
account_scheduler_latency() itself.

That may induce a high latency for paths which call set_latency_reason_*().

Looking at the code, it looks like what really needs to be protected
is 'task->latency_reason'.

task->latency_record[] and a global latency_record[] are printed out
without any synchronization with account_scheduler_latency(). Is it
your intention?
That can be ok if we want to trade some preciseness for lower lock-contention.

If so,

- account_scheduler_latency() might take a snapshot of
'tsk->latency_reason' with the 'latency_lock' being held and do the
rest in a lock-less way.
Note, we have only a single writer to 'tsk->latency_record[]' at any
time due to the rq->lock to which 'tsk' belongs to being held ;

- yeah, this way the global 'latency_record[]' needs some sort of
protection as we may have concurrent writers here.

what do you think?


provided we hit a number of consequent "latencies" with explicitly
unspecified 'tsk->latency_reason', they all end up recorded in a
single 'tsk->latency_record' with "Unknown reason" and the _same_
'argument' which is the result of get_wchan() for the very _first_
"latency" in a row.

I think, it would make sense to record them separately with their

for (i = 0; i < LT_SAVECOUNT; i++) {


-- 
Best regards,
Dmitry Adamushko
--

From: Dmitry Adamushko
Date: Sunday, January 20, 2008 - 10:18 am

heh... now having read the first message in this series ("[Announce]
Development release 0.1 of the LatencyTOP tool"), I finally see that
"fine-grained locking" is still on your todo list :-)


-- 
Best regards,
Dmitry Adamushko
--

From: KOSAKI Motohiro
Date: Monday, January 21, 2008 - 1:06 am

Why do you ignore kernel thread?

may be, some network filesystem use kernel thread for local I/O.
What do you think it?


- kosaki

--

From: Frank Ch. Eigler
Date: Friday, January 18, 2008 - 3:26 pm

Can you suggest of some reason why all this instrumentation could
not be in the form of standard markers (perhaps conditionally
compiled out if necessary)?

- FChE
--

From: Arnaldo Carvalho de Melo
Date: Friday, January 18, 2008 - 3:47 pm

My limited understanding of markers is that you can attach something to
them that later work out any semantics you may want to associate with
them.

So I guess that as soon as the outermost marker is reached you can
just bump a counter and when the (several) inner markers are reached and
the counter is not zero you just bump it and go on with life, leaving
the latency measurement being done to the "owner" (outermost) marker.

Frank, am I talking bullshit?

- Arnaldo
--

From: Frank Ch. Eigler
Date: Friday, January 18, 2008 - 4:11 pm

Hi -


Further to what acme said, markers are semantics-free.  Callback
functions that implement your entry & exit semantics can be attached
at run time, at your pleasure.  (So can systemtap probes, for that
matter.)  The main difference would be that these callback functions
would have manage the per-thread LIFO data structures themselves,
instead of allocating backpointers on the kernel stack.  (Bonus marks
for not modifying task_struct. :-)

- FChE
--

From: Arjan van de Ven
Date: Friday, January 18, 2008 - 4:20 pm

modifying task struct to have storage space is no big deal...
--

From: Arjan van de Ven
Date: Friday, January 18, 2008 - 10:42 am

A few more high level annotations for LatencyTOP;
split into a separate patch since they're at a higher level
than the first part (and thus hide some lower level details
potentially, but could also obsolete several other low level
instrumentations)
---
 fs/read_write.c   |    7 ++++++-
 fs/readdir.c      |    4 +++-
 lib/kernel_lock.c |    3 +++
 3 files changed, 12 insertions(+), 2 deletions(-)

Index: linux-2.6.24-rc7/fs/read_write.c
===================================================================
--- linux-2.6.24-rc7.orig/fs/read_write.c
+++ linux-2.6.24-rc7/fs/read_write.c
@@ -255,6 +255,7 @@ EXPORT_SYMBOL(do_sync_read);
 
 ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
 {
+	struct latency_entry reason;
 	ssize_t ret;
 
 	if (!(file->f_mode & FMODE_READ))
@@ -264,6 +265,7 @@ ssize_t vfs_read(struct file *file, char
 	if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
 		return -EFAULT;
 
+	set_latency_reason("Reading from file", &reason);
 	ret = rw_verify_area(READ, file, pos, count);
 	if (ret >= 0) {
 		count = ret;
@@ -281,6 +283,7 @@ ssize_t vfs_read(struct file *file, char
 		}
 	}
 
+	restore_latency_reason(&reason);
 	return ret;
 }
 
@@ -314,6 +317,7 @@ EXPORT_SYMBOL(do_sync_write);
 ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
 {
 	ssize_t ret;
+	struct latency_entry reason;
 
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
@@ -322,6 +326,7 @@ ssize_t vfs_write(struct file *file, con
 	if (unlikely(!access_ok(VERIFY_READ, buf, count)))
 		return -EFAULT;
 
+	set_latency_reason("Writing to file", &reason);
 	ret = rw_verify_area(WRITE, file, pos, count);
 	if (ret >= 0) {
 		count = ret;
@@ -338,7 +343,7 @@ ssize_t vfs_write(struct file *file, con
 			inc_syscw(current);
 		}
 	}
-
+	restore_latency_reason(&reason);
 	return ret;
 }
 
Index: linux-2.6.24-rc7/fs/readdir.c
===================================================================
--- ...
From: Arjan van de Ven
Date: Friday, January 18, 2008 - 11:35 am

I thought about that, but the strncmp is still somewhat needed to deal with the argument
(the instrumentation above doesn't use that, but the mutex one does for example)
If we get rid of the argument entirely it'd be easier
(but then we get "blocking on mutex" rather than "blocking on inode->mutex")
--

From: Roberto Fichera
Date: Friday, January 18, 2008 - 11:46 am

Roberto Fichera. 

--

From: Arjan van de Ven
Date: Friday, January 18, 2008 - 10:16 pm

yes indeed; I sort of use the same infrastructure inside the scheduler; the biggest
reason I felt I had to do something different was that I wanted to do per process
data collection, so that you can see for a specific process what was going on.
--

From: Andi Kleen
Date: Friday, January 18, 2008 - 10:27 pm

Wouldn't it have been easier then to just extend the sleep profiler to
oprofile? oprofile already has pid filters and can do per process 
profiling.

On the other hand I'm not fully sure only doing per pid profiling
is that useful. After all often latencies come from asynchronous
threads (like kblockd). So a system level view is probably better
anyways. 

-Andi
--

From: Arjan van de Ven
Date: Friday, January 18, 2008 - 10:25 pm

another thing that the current profiling can't do, is to show what the system is doing
when it hits the latency.. so someone calling fsync() will show up in the waiting for
IO function, but not that it was due to an fsync().
--

From: Andi Kleen
Date: Friday, January 18, 2008 - 10:33 pm

Hmm so how about extending oprofile to always log the syscall number
in the event logs (can be gotten from top of stack). I think given
that you could reconstruct that data in the userland at least
for single threads (not for work done on behalf of them in other
threads; but I'm not sure you tried to solve that problem at all)

The advantage is that it would be an generic mechanism that would work
for all types of profiling.

--

From: Andi Kleen
Date: Friday, January 18, 2008 - 10:37 pm

Ok to handle exceptions like page faults this way you would need to save
the vector somewhere on entry, but that shouldn't be very costly 
or difficult and could probably even be done unconditionally.

-Andi
--

From: Arjan van de Ven
Date: Friday, January 18, 2008 - 10:34 pm

syscall nr and pid at minimum then.
Still doesn't work for modules either.

what it ends up doing is using an entirely different interface for basically the
same code / operation inside the kernel.
The current interface code is maybe 80 lines of /proc code... and very simple to
use (unlike the oprofile interface)
--

From: Andi Kleen
Date: Friday, January 18, 2008 - 11:08 pm

> syscall nr and pid at minimum then.

oprofile already supports logging the pid I believe. Otherwise


Well rather it uses an existing framework for something that fits
it well.

Also the way I proposed is very cheap and would be possible

The oprofile interface is per CPU (so you wouldn't need to reinvent
that to fix your locking) and if you add the syscall
logging feature to it it would apply to all profile events
e.g. you could then do things like "matching cache misses to syscalls" 

-andi

--

From: KOSAKI Motohiro
Date: Sunday, January 20, 2008 - 5:52 pm

Interesting.

at least, I will use for fixed page lock contension :)

if possible, I want some /proc/lock_stat feature.
i.e. holding time(avg, max), contention target ip...

and, this patche become more powerful if spinlock change sleepable patch applied.
(may be, rt guy have it)


- kosaki


--

Previous thread: [PATCH] [0/7] Add noclflush option and related cleanup by Andi Kleen on Friday, January 18, 2008 - 10:27 am. (12 messages)

Next thread: CONFIG_PREEMPT and real time kthreads by Miles Bintz on Friday, January 18, 2008 - 11:03 am. (1 message)