Hello, I am looking for some tool/kernel machanism that enables me to track every schedule change on the CFS scheduler of the RT-kernel for some period of time. Thus a tool that gives me an overview after some time which task got "scheduled in/out" at "which timestamp" and at "which CPU". ( I need the raw data) In a far past (on Montavista kernels) I used LTT for this to log for some time with only the SCHED_CHANGE filter and text output. But, I cannot find any LTT(ng) support for any RT-kernel, and support for the new CFS (like in 2.6.22.1-rt4) is even harder. So I was wondering if anybody knows some tool/kernel mechanism which can do this? If not, I will build a kernel extension for it myself (new extension to 'latency_trace' ?) In that case can anybody with in depth CFS scheduler knowledge please point me which hooks are safe to use for this? I need it to get a detailed insight in my RT-system with its RT and non-RT applications. Thus to know when a certain task gets scheduled (and to calculate its per thread min/max/avg latencies), which task preempts another task, and to get a complete overview of what the RT system (and scheduler) is doing during time etc. Kind Regards, Remy Böhmer -
systemtap has been able to do such things for me in the past... -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org -
Hi,
Was trying to capture similar data as mentioned by Remy using Systemtap.
The tapset/systemtap script that I used is :
probe kernel.function("balance_rt_tasks").inline {
printf("%s (pid: %d, tid: %d argstr: %s ) \n", execname(),
pid(), tid(), argstr);
}
The probe point did get triggered, and soon after that I had the
following in dmesg, leading to system hang...
BUG: scheduling while atomic: softirq-rcu/3/0x00000004/52, CPU#3
Call Trace:
<#DB> [<ffffffff81033555>] __schedule_bug+0x4b/0x4f
[<ffffffff8128b414>] __sched_text_start+0xcc/0xaaa
[<ffffffff8100b574>] dump_trace+0x248/0x25d
[<ffffffff81068334>] print_traces+0x9/0xb
[<ffffffff8100b5e5>] show_trace+0x5c/0x64
[<ffffffff8128c1c2>] schedule+0xe4/0x104
[<ffffffff8128d10c>] rt_spin_lock_slowlock+0xfc/0x19e
[<ffffffff8128d9de>] __rt_spin_lock+0x1f/0x21
[<ffffffff8128d9e9>] rt_spin_lock+0x9/0xb
[<ffffffff8128b511>] __sched_text_start+0x1c9/0xaaa
[<ffffffff8128f8ee>] kprobe_handler+0x1b3/0x1f5
[<ffffffff8128f96b>] kprobe_exceptions_notify+0x3b/0x7f
[<ffffffff81290604>] notifier_call_chain+0x33/0x5b
[<ffffffff810461d5>] __raw_notifier_call_chain+0x9/0xb
[<ffffffff810461e6>] raw_notifier_call_chain+0xf/0x11
[<ffffffff8105098b>] notify_die+0x2e/0x33
[<ffffffff8128ef6d>] do_int3+0x30/0x8d
[<ffffffff8128e8a3>] int3+0x93/0xb0
[<ffffffff8128b512>] __sched_text_start+0x1ca/0xaaa
<<EOE>> [<ffffffff8107b585>] __free_pages+0x18/0x21
[<ffffffff8107b5e3>] free_pages+0x55/0x5a
[<ffffffff8109945d>] kmem_freepages+0x112/0x11b
[<ffffffff8128c1c2>] schedule+0xe4/0x104
[<ffffffff8103edf5>] ksoftirqd+0xbc/0x26f
[<ffffffff8103ed39>] ksoftirqd+0x0/0x26f
[<ffffffff8103ed39>] ksoftirqd+0x0/0x26f
[<ffffffff8104c917>] kthread+0x49/0x76
[<ffffffff8100af18>] child_rip+0xa/0x12
[<ffffffff8128be67>] thread_return+0x75/0x1d5
[<ffffffff8104c8ce>] kthread+0x0/0x76
[<ffffffff8100af0e>] child_rip+0x0/0x12
Looks like printing the data in the tapset resulted in some lock
issues. The above script is ...I'd suggest to not put a probe into a preempt-off section - put it to the beginning and to the end of schedule() to capture context-switches. _stp_print_flush() is in the systemtap-generated module, right? Maybe the problem is resolved by changing that spinlock to use raw_spinlock_t / DEFINE_RAW_SPIN_LOCK. Ingo -
Yes, _stp_print_flush is in the systemtap-generated kprobe module. Placing the probe at the beginning of schedule() also has the same -- Regards, Ankita Garg (ankita@in.ibm.com) Linux Technology Center IBM India Systems & Technology Labs, Bangalore, India -
could you send us that module source ST generates? Perhaps there are preempt_disable() (or local_irq_disable()) calls in it too. Ingo -
Attaching the generated module as it is huge...
Looks like SystemTap makes use of relayfs for printing the buffer
contents. The _stp_print_flush() (line 269 in the attached module) implemented
in the systemtap library function as calling the following:
static int _stp_relay_write (const void *data, unsigned length)
{
unsigned long flags;
struct rchan_buf *buf;
if (unlikely(length == 0))
return 0;
local_irq_save(flags);
buf = _stp_chan->buf[smp_processor_id()];
if (unlikely(buf->offset + length > _stp_chan->subbuf_size))
length = relay_switch_subbuf(buf, length);
memcpy(buf->data + buf->offset, data, length);
buf->offset += length;
local_irq_restore(flags);
if (unlikely(length == 0))
return -1;
return length;
}
The above does a local_irq_save().
--
Regards,
Ankita Garg (ankita@in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India
oh, what a fine piece of s^H^H :-/ Who in their right mind calls this
from _tracing_ code:
smp_mb();
if (waitqueue_active(&buf->read_wait))
/*
* Calling wake_up_interruptible() from here
* will deadlock if we happen to be logging
* from the scheduler (trying to re-grab
* rq->lock), so defer it.
*/
__mod_timer(&buf->timer, jiffies + 1);
and the comment is utter rubbish: __mod_timer() can lock up just as
much. Just use an adaptive-polling method to drive the draining of the
relay buffer, instead of mucking with timers from within the tracing
code. Whoever implemented this has absolutely zero clue i have to say
...
the smp_mb() is rubbish too.
could you try the patch below, does it fix the problem?
Ingo
------------------------------------->
Subject: relay: fix timer madness
From: Ingo Molnar <mingo@elte.hu>
remove timer calls (!!!) from deep within the tracing infrastructure.
This was totally bogus code that can cause lockups and worse.
Poll the buffer every 2 jiffies for now.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/relay.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
Index: linux-rt-rebase.q/kernel/relay.c
===================================================================
--- linux-rt-rebase.q.orig/kernel/relay.c
+++ linux-rt-rebase.q/kernel/relay.c
@@ -319,6 +319,10 @@ static void wakeup_readers(unsigned long
{
struct rchan_buf *buf = (struct rchan_buf *)data;
wake_up_interruptible(&buf->read_wait);
+ /*
+ * Stupid polling for now:
+ */
+ mod_timer(&buf->timer, jiffies + 1);
}
/**
@@ -336,6 +340,7 @@ static void __relay_reset(struct rchan_b
init_waitqueue_head(&buf->read_wait);
kref_init(&buf->kref);
setup_timer(&buf->timer, wakeup_readers, (unsigned ...Hi Ingo, I did not have this problem with LTTng, since I only touch atomic variables in tracing code. However, I iterate on a rcu list of active traces in a timer interrupt to see if subbuffers must be read and, if yes, I send a wakeup to my user-space daemon. I had to change the protection around this rcu list read from preempt disable to mutex lock in this timer because try_to_wakeup is called in it, which takes a spinlock. Note: I don't use relay code at my tracing site. I guess they might have to switch to such an asynchronous delivery system if they want to do this properly. Simply put, your polling solution is exactly what I do, but I check a flag set by the writer instead of waking up the readers unconditionally. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
Ingo's solution could call waitqueue_active() inside wakeup_readers() to -
Yes and no.. What would be part of it, but also checking if there are buffers ready to be read would reduce the unnecessary polling even more. Just checking for waiting readers would be rather inefficient in a scenario of infrequent events where the readers are always waiting for data; they would be awakened at every timer tick. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
Hi Ingo, -- Regards, Ankita Garg (ankita@in.ibm.com) Linux Technology Center IBM India Systems & Technology Labs, Bangalore, India -
See below,
The problem is also in _stp_print_flush, not *only* in relay code:
void _stp_print_flush (void)
...
spin_lock(&_stp_print_lock);
...
spin_unlock(&_stp_print_lock);
Those will turn into mutexes with -rt.
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
Indeed, plus systemtap-generated locking code uses rwlocks, local_irq_save/restore or preempt_disable, in various places. Could someone point to a place that spells out what would be more appropriate way of ensuring atomicity while being compatible with -rt? - FChE -
AFAIK, for your needs either: - Use atomic ops - Use per-cpu data with preempt disabling/irq disabling - Use the original "real" spin locks/rwlocks (raw_*). - Don't play with timers or wakeups, since this kernel code uses the "standard" spin locks (sleepable in -rt). You just don't want to sleep in the tracing code. Make sure that the sub-buffer switch code respects that too: it is the most tricky and yet less executed part of the tracing code, so it's easy for bugs to slip there and yet be undetected for a while. Since you will likely disable preemption, make sure your tracing code executes in a deterministic time. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
Hi - It was unclear from the OLS paper whether the spin_lock_irq* family of We will review that part of the related code. - FChE -
By making the locks of the raw_* type, spin_lock_irq* functions would -- Regards, Ankita Garg (ankita@in.ibm.com) Linux Technology Center IBM India Systems & Technology Labs, Bangalore, India -
Yes, you have them to move them too: linux/spinlock.h: #ifdef CONFIG_PREEMPT_RT # define _spin_lock(l) rt_spin_lock(l) # define _spin_lock_nested(l, s) rt_spin_lock_nested(l, s) # define _spin_lock_bh(l) rt_spin_lock(l) # define _spin_lock_irq(l) rt_spin_lock(l) # define _spin_unlock(l) rt_spin_unlock(l) # define _spin_unlock_no_resched(l) rt_spin_unlock(l) # define _spin_unlock_bh(l) rt_spin_unlock(l) # define _spin_unlock_irq(l) rt_spin_unlock(l) # define _spin_unlock_irqrestore(l, f) rt_spin_unlock(l) .. And below, both spin_lock and spin_lock_irqsave use PICK_OP to turn into their _spin_lock_* equivalent, which are both mapped to rt_spin_lock in -rt. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
https://ols2006.108.redhat.com/2007/Reprints/rostedt-Reprint.pdf And his slides too, haven't checked if they are already only at the OLS site. - Arnaldo -
Hello Arjan, Thanks for this suggestion. But I looked at Systemtap before, and as I remember, it is very flexible, but it only traces function calls. (or am I missing something?) I also need this scheduler-tracer, besides for fully fledged X86 systems, also for ARM based boards also with a very minimal userland, and limited flash. So, it has to be lightweight also. I can only use cross-compilers to build the tools... (something Systemtap is not good at as I remember...) So, I am missing something here, or it is unfortunately not what I am looking for... Kind Regards, Remy -
(please dont top-post, ever. See: systemtap it should be able to trace variables as well. (as long as those variables are not eliminated by gcc) Ingo -
Hi Remy, I think Thomas may have worked on a port of recent LTTng to the -RT kernel. If not, I think you needs are quite easy to fulfill with a small amount of work. Basically, you could take a recent copy of LTTng, which comes in a set of logically separated patches: http://ltt.polymtl.ca/lttng/patch-2.6.22-rc4-mm2-lttng-0.9.10.tar.bz2 Follow the Compatibility List to see which other packages you need (ltt-control and lttv) : http://ltt.polymtl.ca/svn/ltt/branches/poly/doc/developer/lttng-lttv-compatibility.html Then, when applying the LTTng patches to the -RT kernel, a few things must be taken care of: 1 - "instrumentation" patches : you won't care about most of the rejects, since you only care about the schedule() calls. 2 - You'll have to see how schedule() is instrumented and port that to the CFS. 3 - Make sure the Linux Kernel Markers (linux/marker.h) _really_ call preempt_disable()/preempt_enable() (is it a underscored function in -rt ?). It is used for correct teardown of probe handlers with synchronize_sched(). Probe sites execute very quickly, do it won't add significant latency to your system. 4 - Ping me for problematic rejects. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
Hi Remy, Due to popular demand, I just ported LTTng 0.9.10 to 2.6.22.1-rt4. It may deserve some more testing, but it runs fine on my x86_32. Please note that I ported my last stable LTTng version, so some code is not as shiny as the new one (work in progress), but it works. Please refer to http://ltt.polymtl.ca, mostly the QUICKSTART GUIDE and the compatibility list. On X86, everything should work fine. Just make sure that your architecture has a stable and coherent TSC across CPUs. See ltt/ltt-test-tsc.c (it printk a message at boot time) to see if yours is ok. See my documentation on my website on how to tweak different Intel/AMD if needed. For your ARM needs, you may have to tweak include/asm-arm/ltt.h so it reads a time base register (or the equivalent). Right now, it defaults to use a logical clock, which is clearly not what you need. I guess you'll want to disable the "userspace tracing", since I did not port ltt-usertrace, so it's not very userful then, and also disable the "locking instrumentation", which generates a _lot_ of output. I know Thomas and Steven were also interested in this port. Direct link to the patches: http://ltt.polymtl.ca/lttng/patch-2.6.22.1-rt4-lttng-0.9.10.tar.gz Mathieu P.S. : To give you an idea of the output: (you know the CPU id with processes_1 or processes_0) ./lttv -m textDump -e "event.name=sched_schedule" -t /tmp/trace2 (System at 250HZ with PREEMPT_RT) A little legend: group.event : time since boot, in s.ns (tracefile name, ending with cpu ID), Thread ID ("pid"), Thread group ID (posix pid), process name, thread name (need userspace instrumentation), Parent PID, Userspace function (need userspace instrumentation), execution mode (SYSCALL/USERMODE/TRAP/IRQ...) { event specific data } The event specific (raw) data for sched_schedule is: PID out PID in previous state : State of the process we are scheduling out : -1 unrunnable, 0 runnable, >0 ...
Hi Remy, On Mon, 16 Jul 2007 21:46:46 +0200 "Remy Bohmer" <l.pinguin@gmail.com> wrot= Have a look at Steven Rostedt's logdev, it's very lightweight and does what you want. I'm currently using it doing the same thing you want to achieve. http://rostedt.homelinux.com/logdev/ I ported it over to 2.6.21.5-rt20 (patches attached). The way I set it up is: - the patches apply in the following order: . relay-update.patch . logdev-0.5.4.patch . logdev-markers-0.5.4.patch . logdev-trace-ctx-switch.patch . logdev-trace-sys_clock_getres.patch - be sure to enable debugfs in the kernel - the userspace tool, you can find at http://rostedt.homelinux.com/logdev/ I modified the dump_log.h and log_marker.c to suit my needs (attached) = and match the logdev-trace-ctx-switch.patch and logdev-trace-sys_clock_getr= es.patch - start tracing with 'echo 1 > /mnt/debugfs/logdev/mark' stop tracing with 'echo 0 > /mnt/debugfs/logdev/mark' dump trace buffers with 'logdev/logread > dump.log' The clock_getres trace is for me to mark some transitions in my test program so that I can discard non interesting traces. Hope this helps, feel free to ask for more details. S=C3=A9bastien.
