These two files appear to be identical. Is this intentional?
Anyway after following your instructions a putting together a small
script to dice the output, I collated the 10 switches which took the
longest:
# Top ten longest switches
# Rel TS Process Abs TS
0.122161 hald-3423 1867.821170 ***
0.039438 <idle>-0 1867.379054
0.036318 hald-3423 1867.669009
0.031362 <idle>-0 1868.002762
0.030000 hald-3423 1867.699009
0.028933 <idle>-0 1867.529238
0.028539 <idle>-0 1867.228861
0.028196 <idle>-0 1867.128731
0.027763 <idle>-0 1868.101449
0.027513 <idle>-0 1867.028606
# tracer: sched_switch from around longest switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [00] 1867.608017: 0:140:R + 3:115:S
<idle>-0 [00] 1867.608038: 0:140:R + 3423:120:D
<idle>-0 [00] 1867.608045: 0:140:R ==> 3:115:R
ksoftirqd/0-3 [00] 1867.608048: 3:115:S ==> 3423:120:R
hald-3423 [00] 1867.629350: 3423:120:R + 6096:120:S
hald-3423 [00] 1867.632691: 3423:120:R + 3827:120:S
hald-3423 [00] 1867.669009: 3423:120:R + 3998:120:S
hald-3423 [00] 1867.699009: 3423:120:R + 6097:120:S
***hald-3423 [00] 1867.821170: 3423:120:R ==> 6096:120:R
rhythmbox-6096 [00] 1867.821219: 6096:120:S ==> 6097:120:R
rhythmbox-6097 [00] 1867.821262: 6097:120:R + 3826:120:S
rhythmbox-6097 [00] 1867.821289: 6097:120:S ==> 3826:120:R
pulseaudio-3826 [00] 1867.821332: 3826:120:R + 6097:120:S
pulseaudio-3826 [00] 1867.821374: 3826:120:S ==> 6097:120:R
rhythmbox-6097 [00] 1867.821380: 6097:120:S ==> 3998:120:R
rhythmbox-3998 [00] 1867.821709: 3998:120:S ==> 3827:120:R
I actually have both of these enabled but there's still no wakeup tracer
(as mentioned in the git kernel documentation ...Its actually function tracer output I'm interested in.. that shows what all its doing to make it take 120+ms. I thought we had a wakeup latency tracer exacty like we have preempt and Yeah, SMM/SMI is nasty stuff :-( --
well, since they went away after you enabled CONFIG_PREEMPT=y, they are definitely in-kernel latencies, not any external SMM latencies. I.e. they are inherently fixable. Could you enable: CONFIG_DYNAMIC_FTRACE=y CONFIG_FTRACE_MCOUNT_RECORD=y that should make the traces a lot more verbose - every kernel function executed in the latency path will be logged. That way we'll be able to say which one takes that long. note, you might have to increase /debug/tracing/trace_entries to get a long enough trace to capture the relevant portion of the latency. Ingo --
Also note, to modify trace_entries, you must be in the none (nop?) tracer, otherwise the size will not be effected. If you find the trace is also too big, you can echo a list of functions into: /debug/tracing/set_ftrace_notrace to not trace those functions. using '>' will remove any existing function in that file, but using '>>' will append functions to the file. For a list of functions that you can add, see: /debug/tracing/available_filter_functions -- Steve --
I do not appear to have the CONFIG_FTRACE_MCOUNT_RECORD option in 2.6.27rc7. Is it an option that is only in -tip ? -- Sitsofe | http://sucs.org/~sits/ --
yeah - it's a new ftrace feature queued up for v2.6.28. Ingo --
I've been struggling to boot -tip/master - currently it blows up just after printing SLUB information saying: BUG: unable to handle kernel NULL pointer dereference at 00000004 IP: [<c0120078>] account_system_time+0x48/0x120 *pde = 00000000 Thread overran stack, or stack corrupted Oops: 0002 [#1] PREEMPT -- Sitsofe | http://sucs.org/~sits/ --
OK this BUG seems to have gone away in the past few hours. Ingo when you were asking for the ftrace report I presume that would be for a non preempt kernel (as a preemptive one only showed a very worst latency of 19657 us in one exceptional case)? -- Sitsofe | http://sucs.org/~sits/ --
With a non preempt kernel I found the logs would simply become too big without filtering. On Peter's suggestion I used the following to remove the most frequently called functions: echo acpi_os_release_object > set_ftrace_notrace && echo kmem_cache_* >> set_ftrace_notrace && echo acpi_ut_* >> set_ftrace_notrace By doing counts across multiple runs I would say that the most frequently called functions are the following (in most frequently called order). The counts are definitely approximate but are reasonable relative to each other. 475325 acpi_os_release_object (acpi_ut_delete_generic_state) 406895 kmem_cache_free (acpi_os_release_object) 402838 kmem_cache_alloc (acpi_ut_create_generic_state) 132968 acpi_ut_update_ref_count (acpi_ut_update_object_reference) 131041 acpi_ut_pop_generic_state (acpi_ut_update_object_reference) 131036 acpi_ut_delete_generic_state (acpi_ut_update_object_reference) 131025 acpi_ut_create_generic_state (acpi_ut_create_update_state) 131023 acpi_ut_create_update_state_and_push (acpi_ut_update_object_reference) 131020 acpi_ut_create_update_state (acpi_ut_create_update_state_and_push) 131018 acpi_ut_push_generic_state (acpi_ut_create_update_state_and_push) 60147 acpi_ns_get_next_node (acpi_ns_delete_namespace_by_owner) 28974 acpi_ns_get_next_valid_node (acpi_ns_get_next_node) Logs with the filtering on can be seen here (15Mbytes decompressed each): http://sucs.org/~sits/test/eeepc-debug/20080923/latency_trace.gz http://sucs.org/~sits/test/eeepc-debug/20080923/trace.txt.gz It looks like lots of acpi state is created and deleted... -- Sitsofe | http://sucs.org/~sits/ --
yeah. The latency starts here:
cat-5901 0dNh. 1us : default_wake_function (__wake_up_common)
cat-5901 0.Nh. 2us : kill_fasync (snd_pcm_period_elapsed)
[...]
and ends here:
[...]
cat-5901 0.N.. 270501us+: mutex_lock (acpi_ec_transaction)
cat-5901 0d... 270507us : __cond_resched (_cond_resched)
270 _milliseconds_ later. That's excessive.
The main overhead is starting here:
cat-5901 0.N.. 167us : acpi_ds_result_push (acpi_ds_exec_end_op)
lots of ACPI code executed ...
does it get better if you have CONFIG_PREEMPT_VOLUNTARY=y enabled? That
_should_ break up this section neatly. If it doesnt then please add a
might_sleep() check to kernel/kernel/semaphore.c's down_timeout()
function - that is called a number of times in this trace.
Ingo
--
I added might_sleep() to the start of down_timeout() but it neither printed anything to dmesg nor changed the latency issue... Strangely stalling only seems to turn up in linux-tip kernels with very little debugging options set within them. I have a linux-tip that has lots of extra debugging options set and this problem doesn't show up... I've also asked about this on the ACPI mailing list and had a huge amount of help from finding a good point for a cond_resched - http://marc.info/?l=linux-acpi&m=122236450105747&w=2 . Certain things (like games) still stutter when the battery is read but it's enough to play back sound in rhythmbox without stuttering. Any ideas why the issue would go away with a debug kernel though? -- Sitsofe | http://sucs.org/~sits/ --
hm, that's weird indeed. You could try to trace the full battery readout itself, via a script like this: cat /debug/tracing/trace > /dev/null # drain trace cat /proc/acpi/whatever cat /debug/tracing/trace > trace.txt this way you should be seeing a full trace of the whole thing. (with the 'ftrace' tracer plugin set in current_tracer) then, assuming the trace.txt is complete, you can check where it reschedules and why. Search for 'schedule' calls. Compare the debug versus non-debug traces. You can even turn on stack backtrace tracing feature, via: echo stacktrace > /debug/tracing/iter_ctrl this adds extra trace entries that show the call path of every reschedule. [this attribute is default-disabled.] Ingo --
(Something bad seems to be happening with my kernels as I have been finding strange problems like network-manager detecting the wifi as a wired interface unless I did a make clean on my kernel sources before compiling). The stalling issue seems to keep coming and going and is currently showing up in both debug and non debug kernels. There are traces when looking at battery information on both AC (where the problem is never there) and battery (where the problem always is there) here: http://sucs.org/~sits/test/eeepc-debug/20080929/trace-on-ac.txt.bz2 http://sucs.org/~sits/test/eeepc-debug/20080929/trace-on-battery.txt.bz2 (15Mbytes uncompressed) I wasn't quite sure how to tell when a reschedule was done. I simply did a grep schedule on the file and most of it seemed reasonable. One part that caught my eye was the following: speaker-test-3891 [000] 1392.751699: __switch_to <-schedule <idle>-0 [000] 1392.752070: account_scheduler_latency <-enqueue_task_fair <idle>-0 [000] 1392.752091: __switch_to <-schedule cat-7717 [000] 1392.752092: del_timer <-schedule_timeout cat-7717 [000] 1392.772263: account_scheduler_latency <-enqueue_task_fair cat-7717 [000] 1392.773224: __tasklet_schedule <-kbd_event cat-7717 [000] 1392.773225: schedule_console_callback <-kbd_event cat-7717 [000] 1392.773225: schedule_work <-schedule_console_callback cat-7717 [000] 1392.773226: queue_work <-schedule_work cat-7717 [000] 1392.773232: account_scheduler_latency <-enqueue_task_fair cat-7717 [000] 1392.773240: account_scheduler_latency <-enqueue_task_fair cat-7717 [000] 1392.773251: schedule_delayed_work <-put_queue cat-7717 [000] 1392.773251: queue_delayed_work <-schedule_delayed_work cat-7717 [000] 1392.773258: schedule_console_callback <-kbd_event ...
that's about 60 msecs. ath5k_intr IRQ handling overhead? IRQ handlers are non-preemptible. (even under CONFIG_PREEMPT=y) Ingo --
What about the piece leading up to that:
cat-7717 [000] 1392.752092: del_timer <-schedule_timeout
cat-7717 [000] 1392.772263: account_scheduler_latency
<-enqueue_task_fair
cat-7717 [000] 1392.773224: __tasklet_schedule <-kbd_event
cat-7717 [000] 1392.773225: schedule_console_callback
<-kbd_event
cat-7717 [000] 1392.773225: schedule_work
<-schedule_console_callback
cat-7717 [000] 1392.773226: queue_work <-schedule_work
cat-7717 [000] 1392.773232: account_scheduler_latency
<-enqueue_task_fair
cat-7717 [000] 1392.773240: account_scheduler_latency
<-enqueue_task_fair
cat-7717 [000] 1392.773251: schedule_delayed_work
<-put_queue
cat-7717 [000] 1392.773251: queue_delayed_work
<-schedule_delayed_work
cat-7717 [000] 1392.773258: schedule_console_callback
<-kbd_event
cat-7717 [000] 1392.773258: schedule_work
<-schedule_console_callback
cat-7717 [000] 1392.773259: queue_work <-schedule_work
cat-7717 [000] 1392.773267: schedule_console_callback
<-kbd_event
cat-7717 [000] 1392.773268: schedule_work
<-schedule_console_callback
cat-7717 [000] 1392.773268: queue_work <-schedule_work
cat-7717 [000] 1392.807931: account_scheduler_latency
<-enqueue_task_fair
cat-7717 [000] 1392.839464: __tasklet_schedule <-ath5k_intr
Here's a snippet from
http://sucs.org/~sits/test/eeepc-debug/20080929/trace-on-battery.txt.bz2 :
cat-7717 [000] 1392.752092: del_timer <-schedule_timeout
cat-7717 [000] 1392.752092: acpi_ec_check_status
<-acpi_ec_wait
cat-7717 [000] 1392.752093: constant_test_bit
<-acpi_ec_check_status
cat-7717 [000] 1392.752095: mutex_unlock
<-acpi_ec_transaction
cat-7717 [000] 1392.752096: ...After weeks of head scratching as to why these latencies didn't occur using the xandros 2.6.21.4 kernel (but keeping the same userspace) when my own kernels would always show this problem I finally found the answer after reading http://tservice.net.ru/~s0mbre/blog/devel/other/2008_10_01 on kernel planet - SLUB can cause regressions compared to SLAB. Switching from SLUB to SLAB made the problem more or less disappear (which I guess makes sense given the large number of kmem_* calls that are made when reading the battery). My understanding with the memory allocators is that SLOB is the smallest and simplest. Its forté is that it uses very little memory which makes it ideal for embedded systems and is the easiest allocator to understand. The downside is that it might tend towards memory fragmentation the longer a system runs and is apparently a little bit slower than SLAB. The SLAB allocator is something that the linux kernel has had for many years and was a good performer until the pressures to perform with large SMP systems started to come in to play (at which point lots of memory would be used up on fixed structures - http://lwn.net/Articles/229984/ ). SLUB is the newest allocator, scales up well and has finer grained diagnostics that SLAB. Prior to today my understanding was that SLUB would be able to replace SLAB in all scenarios and perform just as well or better. However now I'm not so sure (SLAB appears to be less latent than SLUB). Other than SLUB's newness are there cases where SLAB should be chosen instead of SLUB (e.g. uniprocessor desktop systems with hundreds of megabytes of RAM)? Will SLAB exist as an alternative to SLUB for certain setups? -- Sitsofe | http://sucs.org/~sits --
