Re: [Question] Hooks for scheduler tracing (CFS)

Previous thread: [PATCH] Fix memory leak in dm_create_persistent() when starting metadata update thread fails. by Jesper Juhl on Monday, July 16, 2007 - 12:41 pm. (1 message)

Next thread: [PATCH] Remove pointless conditional in drivers/usb/serial/io_ti.c::edge_shutdown() by Jesper Juhl on Monday, July 16, 2007 - 1:17 pm. (1 message)
From: Remy Bohmer
Date: Monday, July 16, 2007 - 12:46 pm

Hello,

I am looking for some tool/kernel machanism that enables me to track
every schedule change on the CFS scheduler of the RT-kernel for some
period of time.
Thus a tool that gives me an overview after some time which task got
"scheduled in/out" at "which timestamp" and at "which CPU". ( I need
the raw data)

In a far past (on Montavista kernels) I used LTT for this to log for
some time with only the SCHED_CHANGE filter and text output. But, I
cannot find any LTT(ng) support for any RT-kernel, and support for the
new CFS (like in 2.6.22.1-rt4) is even harder.

So I was wondering if anybody knows some tool/kernel mechanism which
can do this?
If not, I will build a kernel extension for it myself (new extension
to 'latency_trace' ?)
In that case can anybody with in depth CFS scheduler knowledge please
point me which hooks are safe to use for this?

I need it to get a detailed insight in my RT-system with its RT and
non-RT applications.
Thus to know when a certain task gets scheduled (and to calculate its
per thread min/max/avg latencies), which task preempts another task,
and to get a complete overview of what the RT system (and scheduler)
is doing during time etc.


Kind Regards,

Remy Böhmer
-

From: Arjan van de Ven
Date: Monday, July 16, 2007 - 12:52 pm

systemtap has been able to do such things for me in the past...

-- 
if you want to mail me at work (you don't), use arjan (at)
linux.intel.com
Test the interaction between Linux and your BIOS via
http://www.linuxfirmwarekit.org

-

From: Ankita Garg
Date: Thursday, July 26, 2007 - 12:28 am

Hi,


Was trying to capture similar data as mentioned by Remy using Systemtap.
The tapset/systemtap script that I used is :

probe kernel.function("balance_rt_tasks").inline {
        printf("%s (pid: %d, tid: %d argstr: %s ) \n", execname(),
pid(), tid(), argstr);
}

The probe point did get triggered, and soon after that I had the
following in dmesg, leading to system hang...

BUG: scheduling while atomic: softirq-rcu/3/0x00000004/52, CPU#3

Call Trace:
 <#DB>  [<ffffffff81033555>] __schedule_bug+0x4b/0x4f
 [<ffffffff8128b414>] __sched_text_start+0xcc/0xaaa
 [<ffffffff8100b574>] dump_trace+0x248/0x25d
 [<ffffffff81068334>] print_traces+0x9/0xb
 [<ffffffff8100b5e5>] show_trace+0x5c/0x64
 [<ffffffff8128c1c2>] schedule+0xe4/0x104
 [<ffffffff8128d10c>] rt_spin_lock_slowlock+0xfc/0x19e
 [<ffffffff8128d9de>] __rt_spin_lock+0x1f/0x21
 [<ffffffff8128d9e9>] rt_spin_lock+0x9/0xb
 [<ffffffff8128b511>] __sched_text_start+0x1c9/0xaaa
 [<ffffffff8128f8ee>] kprobe_handler+0x1b3/0x1f5
 [<ffffffff8128f96b>] kprobe_exceptions_notify+0x3b/0x7f
 [<ffffffff81290604>] notifier_call_chain+0x33/0x5b
 [<ffffffff810461d5>] __raw_notifier_call_chain+0x9/0xb
 [<ffffffff810461e6>] raw_notifier_call_chain+0xf/0x11
 [<ffffffff8105098b>] notify_die+0x2e/0x33
 [<ffffffff8128ef6d>] do_int3+0x30/0x8d
 [<ffffffff8128e8a3>] int3+0x93/0xb0
 [<ffffffff8128b512>] __sched_text_start+0x1ca/0xaaa
 <<EOE>>  [<ffffffff8107b585>] __free_pages+0x18/0x21
 [<ffffffff8107b5e3>] free_pages+0x55/0x5a
 [<ffffffff8109945d>] kmem_freepages+0x112/0x11b
 [<ffffffff8128c1c2>] schedule+0xe4/0x104
 [<ffffffff8103edf5>] ksoftirqd+0xbc/0x26f
 [<ffffffff8103ed39>] ksoftirqd+0x0/0x26f
 [<ffffffff8103ed39>] ksoftirqd+0x0/0x26f
 [<ffffffff8104c917>] kthread+0x49/0x76
 [<ffffffff8100af18>] child_rip+0xa/0x12
 [<ffffffff8128be67>] thread_return+0x75/0x1d5
 [<ffffffff8104c8ce>] kthread+0x0/0x76
 [<ffffffff8100af0e>] child_rip+0x0/0x12

Looks like printing the data in the tapset resulted in some lock
issues. The above script is ...
From: Ingo Molnar
Date: Thursday, July 26, 2007 - 12:35 am

I'd suggest to not put a probe into a preempt-off section - put it to 
the beginning and to the end of schedule() to capture context-switches. 
_stp_print_flush() is in the systemtap-generated module, right? Maybe 
the problem is resolved by changing that spinlock to use raw_spinlock_t 
/ DEFINE_RAW_SPIN_LOCK.


	Ingo
-

From: Ankita Garg
Date: Thursday, July 26, 2007 - 12:49 am

Yes, _stp_print_flush is in the systemtap-generated kprobe module.
Placing the probe at the beginning of schedule() also has the same

-- 
Regards,
Ankita Garg (ankita@in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs, 
Bangalore, India   
-

From: Ingo Molnar
Date: Thursday, July 26, 2007 - 12:53 am

could you send us that module source ST generates? Perhaps there are 
preempt_disable() (or local_irq_disable()) calls in it too.

	Ingo
-

From: Ankita Garg
Date: Thursday, July 26, 2007 - 2:59 am

Attaching the generated module as it is huge...

Looks like SystemTap makes use of relayfs for printing the buffer
contents. The _stp_print_flush() (line 269 in the attached module) implemented 
in the systemtap library function as calling the following:

static int _stp_relay_write (const void *data, unsigned length)
{
        unsigned long flags;
        struct rchan_buf *buf;

        if (unlikely(length == 0))
                return 0;

        local_irq_save(flags);
        buf = _stp_chan->buf[smp_processor_id()];
        if (unlikely(buf->offset + length > _stp_chan->subbuf_size))
                length = relay_switch_subbuf(buf, length);
        memcpy(buf->data + buf->offset, data, length);
        buf->offset += length;
        local_irq_restore(flags);

        if (unlikely(length == 0))
                return -1;

        return length;
}

The above does a local_irq_save().

-- 
Regards,
Ankita Garg (ankita@in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs, 
Bangalore, India   
From: Ingo Molnar
Date: Thursday, July 26, 2007 - 4:05 am

oh, what a fine piece of s^H^H :-/ Who in their right mind calls this 
from _tracing_ code:

                smp_mb();
                if (waitqueue_active(&buf->read_wait))
                        /*
                         * Calling wake_up_interruptible() from here
                         * will deadlock if we happen to be logging
                         * from the scheduler (trying to re-grab
                         * rq->lock), so defer it.
                         */
                        __mod_timer(&buf->timer, jiffies + 1);

and the comment is utter rubbish: __mod_timer() can lock up just as 
much. Just use an adaptive-polling method to drive the draining of the 
relay buffer, instead of mucking with timers from within the tracing 
code. Whoever implemented this has absolutely zero clue i have to say 
...

the smp_mb() is rubbish too.

could you try the patch below, does it fix the problem?

	Ingo

------------------------------------->
Subject: relay: fix timer madness
From: Ingo Molnar <mingo@elte.hu>

remove timer calls (!!!) from deep within the tracing infrastructure.
This was totally bogus code that can cause lockups and worse.
Poll the buffer every 2 jiffies for now.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/relay.c |   14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

Index: linux-rt-rebase.q/kernel/relay.c
===================================================================
--- linux-rt-rebase.q.orig/kernel/relay.c
+++ linux-rt-rebase.q/kernel/relay.c
@@ -319,6 +319,10 @@ static void wakeup_readers(unsigned long
 {
 	struct rchan_buf *buf = (struct rchan_buf *)data;
 	wake_up_interruptible(&buf->read_wait);
+	/*
+	 * Stupid polling for now:
+	 */
+	mod_timer(&buf->timer, jiffies + 1);
 }
 
 /**
@@ -336,6 +340,7 @@ static void __relay_reset(struct rchan_b
 		init_waitqueue_head(&buf->read_wait);
 		kref_init(&buf->kref);
 		setup_timer(&buf->timer, wakeup_readers, (unsigned ...
From: Mathieu Desnoyers
Date: Thursday, July 26, 2007 - 6:06 am

Hi Ingo,

I did not have this problem with LTTng, since I only touch atomic
variables in tracing code. However, I iterate on a rcu list of active
traces in a timer interrupt to see if subbuffers must be read and, if
yes, I send a wakeup to my user-space daemon. I had to change the
protection around this rcu list read from preempt disable to mutex lock
in this timer because try_to_wakeup is called in it, which takes a
spinlock.

Note: I don't use relay code at my tracing site.

I guess they might have to switch to such an asynchronous delivery
system if they want to do this properly. Simply put, your polling
solution is exactly what I do, but I check a flag set by the writer
instead of waking up the readers unconditionally.


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: David J. Wilder
Date: Thursday, July 26, 2007 - 10:45 am

Ingo's solution could call waitqueue_active() inside wakeup_readers() to 

-

From: Mathieu Desnoyers
Date: Thursday, July 26, 2007 - 11:30 am

Yes and no.. What would be part of it, but also checking if there are
buffers ready to be read would reduce the unnecessary polling even more.
Just checking for waiting readers would be rather inefficient in a
scenario of infrequent events where the readers are always waiting for
data; they would be awakened at every timer tick.


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Ankita Garg
Date: Thursday, July 26, 2007 - 6:20 am

Hi Ingo,



-- 
Regards,
Ankita Garg (ankita@in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs, 
Bangalore, India   
-

From: Mathieu Desnoyers
Date: Thursday, July 26, 2007 - 6:31 am

See below,


The problem is also in _stp_print_flush, not *only* in relay code:

void _stp_print_flush (void)
...
                spin_lock(&_stp_print_lock);
                ...
                spin_unlock(&_stp_print_lock);

Those will turn into mutexes with -rt.


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Frank Ch. Eigler
Date: Thursday, July 26, 2007 - 7:47 am

Indeed, plus systemtap-generated locking code uses rwlocks,
local_irq_save/restore or preempt_disable, in various places.  Could
someone point to a place that spells out what would be more
appropriate way of ensuring atomicity while being compatible with -rt?

- FChE
-

From: Mathieu Desnoyers
Date: Thursday, July 26, 2007 - 8:02 am

AFAIK, for your needs either:
- Use atomic ops
- Use per-cpu data with preempt disabling/irq disabling
- Use the original "real" spin locks/rwlocks (raw_*).
- Don't play with timers or wakeups, since this kernel code uses the
  "standard" spin locks (sleepable in -rt).

You just don't want to sleep in the tracing code.

Make sure that the sub-buffer switch code respects that too: it is the
most tricky and yet less executed part of the tracing code, so it's easy
for bugs to slip there and yet be undetected for a while.

Since you will likely disable preemption, make sure your tracing code
executes in a deterministic time.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Frank Ch. Eigler
Date: Thursday, July 26, 2007 - 9:22 am

Hi -




It was unclear from the OLS paper whether the spin_lock_irq* family of


We will review that part of the related code.

- FChE
-

From: Ankita Garg
Date: Thursday, July 26, 2007 - 9:32 am

By making the locks of the raw_* type, spin_lock_irq* functions would

-- 
Regards,
Ankita Garg (ankita@in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs, 
Bangalore, India   
-

From: Mathieu Desnoyers
Date: Thursday, July 26, 2007 - 11:25 am

Yes, you have them to move them too:

linux/spinlock.h:
#ifdef CONFIG_PREEMPT_RT
# define _spin_lock(l)                  rt_spin_lock(l)
# define _spin_lock_nested(l, s)        rt_spin_lock_nested(l, s)
# define _spin_lock_bh(l)               rt_spin_lock(l)
# define _spin_lock_irq(l)              rt_spin_lock(l)
# define _spin_unlock(l)                rt_spin_unlock(l)
# define _spin_unlock_no_resched(l)     rt_spin_unlock(l)
# define _spin_unlock_bh(l)             rt_spin_unlock(l)
# define _spin_unlock_irq(l)            rt_spin_unlock(l)
# define _spin_unlock_irqrestore(l, f)  rt_spin_unlock(l)
..

And below, both spin_lock and spin_lock_irqsave use PICK_OP to turn into
their _spin_lock_* equivalent, which are both mapped to rt_spin_lock in
-rt.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Arnaldo Carvalho de Melo
Date: Thursday, July 26, 2007 - 8:17 am

https://ols2006.108.redhat.com/2007/Reprints/rostedt-Reprint.pdf

And his slides too, haven't checked if they are already only at the OLS 
site.

- Arnaldo
-

From: Remy Bohmer
Date: Monday, July 16, 2007 - 1:17 pm

Hello Arjan,

Thanks for this suggestion.
But I looked at Systemtap before, and as I remember, it is very
flexible, but it only traces function calls. (or am I missing
something?)
I also need this scheduler-tracer, besides for fully fledged X86
systems, also for ARM based boards also with a very minimal userland,
and limited flash. So, it has to be lightweight also. I can only use
cross-compilers to build the tools... (something Systemtap is not good
at as I remember...)

So, I am missing something here, or it is unfortunately not what I am
looking for...


Kind Regards,

Remy


-

From: Ingo Molnar
Date: Monday, July 16, 2007 - 2:12 pm

(please dont top-post, ever. See:

systemtap it should be able to trace variables as well. (as long as 
those variables are not eliminated by gcc)

	Ingo
-

From: Mathieu Desnoyers
Date: Monday, July 16, 2007 - 1:07 pm

Hi Remy,

I think Thomas may have worked on a port of recent LTTng to the -RT
kernel.

If not, I think you needs are quite easy to fulfill with a small amount
of work.

Basically, you could take a recent copy of LTTng, which comes in a set
of logically separated patches:

http://ltt.polymtl.ca/lttng/patch-2.6.22-rc4-mm2-lttng-0.9.10.tar.bz2

Follow the Compatibility List to see which other packages you need
(ltt-control and lttv) :
http://ltt.polymtl.ca/svn/ltt/branches/poly/doc/developer/lttng-lttv-compatibility.html

Then, when applying the LTTng patches to the -RT kernel, a few things
must be taken care of:

1 - "instrumentation" patches : you won't care about most of the
rejects, since you only care about the schedule() calls.

2 - You'll have to see how schedule() is instrumented and port that to
the CFS.

3 - Make sure the Linux Kernel Markers (linux/marker.h) _really_ call
preempt_disable()/preempt_enable() (is it a underscored function in
-rt ?). It is used for correct teardown of probe handlers with
synchronize_sched(). Probe sites execute very quickly, do it won't add
significant latency to your system.

4 - Ping me for problematic rejects.


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Mathieu Desnoyers
Date: Monday, July 16, 2007 - 4:03 pm

Hi Remy,

Due to popular demand, I just ported LTTng 0.9.10 to 2.6.22.1-rt4. It
may deserve some more testing, but it runs fine on my x86_32. Please
note that I ported my last stable LTTng version, so some code is not as
shiny as the new one (work in progress), but it works. Please refer to
http://ltt.polymtl.ca, mostly the QUICKSTART GUIDE and the compatibility
list.

On X86, everything should work fine. Just make sure that your
architecture has a stable and coherent TSC across CPUs. See
ltt/ltt-test-tsc.c (it printk a message at boot time) to see if yours is
ok. See my documentation on my website on how to tweak different
Intel/AMD if needed.

For your ARM needs, you may have to tweak include/asm-arm/ltt.h so it
reads a time base register (or the equivalent). Right now, it defaults
to use a logical clock, which is clearly not what you need.

I guess you'll want to disable the "userspace tracing", since I did not
port ltt-usertrace, so it's not very userful then, and also disable the
"locking instrumentation", which generates a _lot_ of output.

I know Thomas and Steven were also interested in this port.

Direct link to the patches:
http://ltt.polymtl.ca/lttng/patch-2.6.22.1-rt4-lttng-0.9.10.tar.gz

Mathieu

P.S. : To give you an idea of the output: (you know the CPU id with
processes_1 or processes_0)

./lttv -m textDump -e "event.name=sched_schedule" -t /tmp/trace2
(System at 250HZ with PREEMPT_RT)

A little legend:

group.event          : time since boot, in s.ns (tracefile name, ending
with cpu ID), Thread ID ("pid"), Thread group ID (posix pid), process
name, thread name (need userspace instrumentation), Parent PID,
Userspace function (need userspace instrumentation), execution mode
(SYSCALL/USERMODE/TRAP/IRQ...) { event specific data }

The event specific (raw) data for sched_schedule is:
PID out
PID in
previous state : State of the process we are scheduling out : -1
                 unrunnable, 0 runnable, >0 ...
From: Sébastien Dugué
Date: Tuesday, July 17, 2007 - 12:23 am

Hi Remy,

On Mon, 16 Jul 2007 21:46:46 +0200 "Remy Bohmer" <l.pinguin@gmail.com> wrot=

  Have a look at Steven Rostedt's logdev, it's very lightweight
and does what you want. I'm currently using it doing the same
thing you want to achieve.

  http://rostedt.homelinux.com/logdev/

  I ported it over to 2.6.21.5-rt20 (patches attached). The way I set it up
is:

  - the patches apply in the following order:

	. relay-update.patch
	. logdev-0.5.4.patch
	. logdev-markers-0.5.4.patch
	. logdev-trace-ctx-switch.patch
	. logdev-trace-sys_clock_getres.patch

  - be sure to enable debugfs in the kernel

  - the userspace tool, you can find at http://rostedt.homelinux.com/logdev/
    I modified the dump_log.h and log_marker.c to suit my needs (attached) =
and
    match the logdev-trace-ctx-switch.patch and logdev-trace-sys_clock_getr=
es.patch

  - start tracing with 'echo 1 > /mnt/debugfs/logdev/mark'
    stop tracing with 'echo 0 > /mnt/debugfs/logdev/mark'
    dump trace buffers with 'logdev/logread > dump.log'

  The clock_getres trace is for me to mark some transitions in my
test program so that I can discard non interesting traces.

  Hope this helps, feel free to ask for more details.

  S=C3=A9bastien.








Previous thread: [PATCH] Fix memory leak in dm_create_persistent() when starting metadata update thread fails. by Jesper Juhl on Monday, July 16, 2007 - 12:41 pm. (1 message)

Next thread: [PATCH] Remove pointless conditional in drivers/usb/serial/io_ti.c::edge_shutdown() by Jesper Juhl on Monday, July 16, 2007 - 1:17 pm. (1 message)