The Intel Open Source Technology Center is pleased to announce the release of version 0.1 of LatencyTOP, a tool for developers to visualize system latencies. http://www.latencytop.org Slow servers, Skipping audio, Jerky video --everyone knows the symptoms of latency. But to know what's really going on in the system, what's causing the latency, and how to fix it... those are difficult questions without good answers right now. LatencyTOP is a Linux tool for software developers (both kernel and userspace), aimed at identifying where system latency occurs, and what kind of operation/action is causing the latency to happen. By identifying this, developers can then change the code to avoid the worst latency hiccups. There are many types and causes of latency, and LatencyTOP focus on type that causes audio skipping and desktop stutters. Specifically, LatencyTOP focuses on the cases where the applications want to run and execute useful code, but there's some resource that's not currently available (and the kernel then blocks the process). This is done both on a system level and on a per process level, so that you can see what's happening to the system, and which process is suffering and/or causing the delays. Rather than spending a thousand words on describing what LatencyTOP is, I'll let the numbers speak.. Here's some output of LatencyTOP, collected for a make -j4 of a kernel on a quad core system: Cause Maximum Average process fork 1097.7 msec 2.5 msec Reading from file 1097.0 msec 0.1 msec updating atime 850.4 msec 60.1 msec Locking buffer head 433.1 msec 94.3 msec Writing to file 381.8 msec 0.6 msec Synchronous bufferhead read 318.5 msec 16.3 msec Waiting for buffer IO 298.8 msec 7.8 msec This ...
[cut] Just curious... My biggest latency problems (as observed by me, the user) happen when a program needs memory, or launching a new program, and the kernel begins forces dirty memory to disk. This results in an unholy seek storm of mixed writes (flushing, maybe a little swap) and reads (new program loading). Streaming audio/video almost always starts freezing up during this as well. I don't suppose LatencyTop would help track anything down in that case, would it? --=20 Zan Lynx <zlynx@acm.org>
This patch is the core LatencyTOP kernel infrastructure;
it measures latencies in the scheduler and tracks it system
wide and per process
---
fs/proc/base.c | 61 +++++++++
include/linux/latencytop.h | 62 +++++++++
include/linux/sched.h | 6
kernel/Makefile | 1
kernel/fork.c | 1
kernel/latencytop.c | 287 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_fair.c | 8 +
lib/Kconfig.debug | 11 +
8 files changed, 436 insertions(+), 1 deletion(-)
Index: linux-2.6.24-rc7/include/linux/latencytop.h
===================================================================
--- /dev/null
+++ linux-2.6.24-rc7/include/linux/latencytop.h
@@ -0,0 +1,62 @@
+/*
+ * latencytop.h: Infrastructure for displaying latency
+ *
+ * (C) Copyright 2008 Intel Corporation
+ * Author: Arjan van de Ven <arjan@linux.intel.com>
+ *
+ */
+
+#ifndef _INCLUDE_GUARD_LATENCYTOP_H_
+#define _INCLUDE_GUARD_LATENCYTOP_H_
+
+#ifdef CONFIG_LATENCYTOP
+
+#define LT_ARGUMENTSIZE 24
+#define LT_SAVECOUNT 32
+
+struct latency_entry {
+ char *reason;
+ char argument[LT_ARGUMENTSIZE];
+ int maxtime; /* ignore latencies above this time as
+ "user requested" to deal with select() */
+};
+
+struct latency_record {
+ char *reason;
+ char argument[LT_ARGUMENTSIZE];
+ unsigned int count;
+ unsigned long time;
+ unsigned long max;
+};
+
+
+struct task_struct;
+
+struct latency_entry *set_latency_reason(char *reason,
+ struct latency_entry *entry);
+struct latency_entry *set_latency_reason_user(char *reason,
+ struct latency_entry *entry, unsigned int max);
+struct latency_entry *set_latency_reason_param(char *reason,
+ const char *param, struct latency_entry *entry);
+void restore_latency_reason(struct latency_entry *entry);
+void account_scheduler_latency(struct task_struct *task, int usecs, int inter);
+
+void clear_all_latency_tracing(struct task_struct *p);
+#else
+struct ...Hello Arjan,
a few comments on the current locking scheme.
There is a single global lock and the locked section in
account_scheduler_latency() seems to be quite long:
(at worst case)
- get_wchan() ;
- account_global_scheduler_latency() which does up to MAXLR (128)
loops x2 strcmp() operations;
- up to LT_SAVECOUNT (32) x2 strcmp() operations by
account_scheduler_latency() itself.
That may induce a high latency for paths which call set_latency_reason_*().
Looking at the code, it looks like what really needs to be protected
is 'task->latency_reason'.
task->latency_record[] and a global latency_record[] are printed out
without any synchronization with account_scheduler_latency(). Is it
your intention?
That can be ok if we want to trade some preciseness for lower lock-contention.
If so,
- account_scheduler_latency() might take a snapshot of
'tsk->latency_reason' with the 'latency_lock' being held and do the
rest in a lock-less way.
Note, we have only a single writer to 'tsk->latency_record[]' at any
time due to the rq->lock to which 'tsk' belongs to being held ;
- yeah, this way the global 'latency_record[]' needs some sort of
protection as we may have concurrent writers here.
what do you think?
provided we hit a number of consequent "latencies" with explicitly
unspecified 'tsk->latency_reason', they all end up recorded in a
single 'tsk->latency_record' with "Unknown reason" and the _same_
'argument' which is the result of get_wchan() for the very _first_
"latency" in a row.
I think, it would make sense to record them separately with their
for (i = 0; i < LT_SAVECOUNT; i++) {
--
Best regards,
Dmitry Adamushko
--
heh... now having read the first message in this series ("[Announce] Development release 0.1 of the LatencyTOP tool"), I finally see that "fine-grained locking" is still on your todo list :-) -- Best regards, Dmitry Adamushko --
Why do you ignore kernel thread? may be, some network filesystem use kernel thread for local I/O. What do you think it? - kosaki --
Can you suggest of some reason why all this instrumentation could not be in the form of standard markers (perhaps conditionally compiled out if necessary)? - FChE --
My limited understanding of markers is that you can attach something to them that later work out any semantics you may want to associate with them. So I guess that as soon as the outermost marker is reached you can just bump a counter and when the (several) inner markers are reached and the counter is not zero you just bump it and go on with life, leaving the latency measurement being done to the "owner" (outermost) marker. Frank, am I talking bullshit? - Arnaldo --
Hi - Further to what acme said, markers are semantics-free. Callback functions that implement your entry & exit semantics can be attached at run time, at your pleasure. (So can systemtap probes, for that matter.) The main difference would be that these callback functions would have manage the per-thread LIFO data structures themselves, instead of allocating backpointers on the kernel stack. (Bonus marks for not modifying task_struct. :-) - FChE --
modifying task struct to have storage space is no big deal... --
A few more high level annotations for LatencyTOP;
split into a separate patch since they're at a higher level
than the first part (and thus hide some lower level details
potentially, but could also obsolete several other low level
instrumentations)
---
fs/read_write.c | 7 ++++++-
fs/readdir.c | 4 +++-
lib/kernel_lock.c | 3 +++
3 files changed, 12 insertions(+), 2 deletions(-)
Index: linux-2.6.24-rc7/fs/read_write.c
===================================================================
--- linux-2.6.24-rc7.orig/fs/read_write.c
+++ linux-2.6.24-rc7/fs/read_write.c
@@ -255,6 +255,7 @@ EXPORT_SYMBOL(do_sync_read);
ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
+ struct latency_entry reason;
ssize_t ret;
if (!(file->f_mode & FMODE_READ))
@@ -264,6 +265,7 @@ ssize_t vfs_read(struct file *file, char
if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
return -EFAULT;
+ set_latency_reason("Reading from file", &reason);
ret = rw_verify_area(READ, file, pos, count);
if (ret >= 0) {
count = ret;
@@ -281,6 +283,7 @@ ssize_t vfs_read(struct file *file, char
}
}
+ restore_latency_reason(&reason);
return ret;
}
@@ -314,6 +317,7 @@ EXPORT_SYMBOL(do_sync_write);
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
ssize_t ret;
+ struct latency_entry reason;
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
@@ -322,6 +326,7 @@ ssize_t vfs_write(struct file *file, con
if (unlikely(!access_ok(VERIFY_READ, buf, count)))
return -EFAULT;
+ set_latency_reason("Writing to file", &reason);
ret = rw_verify_area(WRITE, file, pos, count);
if (ret >= 0) {
count = ret;
@@ -338,7 +343,7 @@ ssize_t vfs_write(struct file *file, con
inc_syscw(current);
}
}
-
+ restore_latency_reason(&reason);
return ret;
}
Index: linux-2.6.24-rc7/fs/readdir.c
===================================================================
--- ...I thought about that, but the strncmp is still somewhat needed to deal with the argument (the instrumentation above doesn't use that, but the mutex one does for example) If we get rid of the argument entirely it'd be easier (but then we get "blocking on mutex" rather than "blocking on inode->mutex") --
yes indeed; I sort of use the same infrastructure inside the scheduler; the biggest reason I felt I had to do something different was that I wanted to do per process data collection, so that you can see for a specific process what was going on. --
Wouldn't it have been easier then to just extend the sleep profiler to oprofile? oprofile already has pid filters and can do per process profiling. On the other hand I'm not fully sure only doing per pid profiling is that useful. After all often latencies come from asynchronous threads (like kblockd). So a system level view is probably better anyways. -Andi --
another thing that the current profiling can't do, is to show what the system is doing when it hits the latency.. so someone calling fsync() will show up in the waiting for IO function, but not that it was due to an fsync(). --
Hmm so how about extending oprofile to always log the syscall number in the event logs (can be gotten from top of stack). I think given that you could reconstruct that data in the userland at least for single threads (not for work done on behalf of them in other threads; but I'm not sure you tried to solve that problem at all) The advantage is that it would be an generic mechanism that would work for all types of profiling. --
Ok to handle exceptions like page faults this way you would need to save the vector somewhere on entry, but that shouldn't be very costly or difficult and could probably even be done unconditionally. -Andi --
syscall nr and pid at minimum then. Still doesn't work for modules either. what it ends up doing is using an entirely different interface for basically the same code / operation inside the kernel. The current interface code is maybe 80 lines of /proc code... and very simple to use (unlike the oprofile interface) --
> syscall nr and pid at minimum then. oprofile already supports logging the pid I believe. Otherwise Well rather it uses an existing framework for something that fits it well. Also the way I proposed is very cheap and would be possible The oprofile interface is per CPU (so you wouldn't need to reinvent that to fix your locking) and if you add the syscall logging feature to it it would apply to all profile events e.g. you could then do things like "matching cache misses to syscalls" -andi --
Interesting. at least, I will use for fixed page lock contension :) if possible, I want some /proc/lock_stat feature. i.e. holding time(avg, max), contention target ip... and, this patche become more powerful if spinlock change sleepable patch applied. (may be, rt guy have it) - kosaki --
