Re: How how latent should non-preemptive scheduling be?

Previous thread: [PATCH 0/5] bio-cgroup: Introduction by Ryo Tsuruta on Friday, September 19, 2008 - 4:01 am. (7 messages)

Next thread: Re: [PATCH 4/4 v2] Add documentation for hard disk shock protection interface by Elias Oltmanns on Friday, September 19, 2008 - 5:08 am. (1 message)
From: Sitsofe Wheeler
Date: Friday, September 19, 2008 - 4:54 am

These two files appear to be identical. Is this intentional?

Anyway after following your instructions a putting together a small 
script to dice the output, I collated the 10 switches which took the 
longest:

# Top ten longest switches
# Rel TS Process   Abs TS
0.122161 hald-3423 1867.821170 ***
0.039438 <idle>-0  1867.379054
0.036318 hald-3423 1867.669009
0.031362 <idle>-0  1868.002762
0.030000 hald-3423 1867.699009
0.028933 <idle>-0  1867.529238
0.028539 <idle>-0  1867.228861
0.028196 <idle>-0  1867.128731
0.027763 <idle>-0  1868.101449
0.027513 <idle>-0  1867.028606

# tracer: sched_switch from around longest switch
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
           <idle>-0     [00]  1867.608017:      0:140:R   +     3:115:S
           <idle>-0     [00]  1867.608038:      0:140:R   +  3423:120:D
           <idle>-0     [00]  1867.608045:      0:140:R ==>     3:115:R
      ksoftirqd/0-3     [00]  1867.608048:      3:115:S ==>  3423:120:R
             hald-3423  [00]  1867.629350:   3423:120:R   +  6096:120:S
             hald-3423  [00]  1867.632691:   3423:120:R   +  3827:120:S
             hald-3423  [00]  1867.669009:   3423:120:R   +  3998:120:S
             hald-3423  [00]  1867.699009:   3423:120:R   +  6097:120:S
          ***hald-3423  [00]  1867.821170:   3423:120:R ==>  6096:120:R
        rhythmbox-6096  [00]  1867.821219:   6096:120:S ==>  6097:120:R
        rhythmbox-6097  [00]  1867.821262:   6097:120:R   +  3826:120:S
        rhythmbox-6097  [00]  1867.821289:   6097:120:S ==>  3826:120:R
       pulseaudio-3826  [00]  1867.821332:   3826:120:R   +  6097:120:S
       pulseaudio-3826  [00]  1867.821374:   3826:120:S ==>  6097:120:R
        rhythmbox-6097  [00]  1867.821380:   6097:120:S ==>  3998:120:R
        rhythmbox-3998  [00]  1867.821709:   3998:120:S ==>  3827:120:R

I actually have both of these enabled but there's still no wakeup tracer 
(as mentioned in the git kernel documentation ...
From: Peter Zijlstra
Date: Friday, September 19, 2008 - 7:20 am

Its actually function tracer output I'm interested in.. that shows what
all its doing to make it take 120+ms.

I thought we had a wakeup latency tracer exacty like we have preempt and

Yeah, SMM/SMI is nasty stuff :-(

--

From: Ingo Molnar
Date: Monday, September 22, 2008 - 4:57 am

well, since they went away after you enabled CONFIG_PREEMPT=y, they are 
definitely in-kernel latencies, not any external SMM latencies.

I.e. they are inherently fixable. Could you enable:

  CONFIG_DYNAMIC_FTRACE=y
  CONFIG_FTRACE_MCOUNT_RECORD=y

that should make the traces a lot more verbose - every kernel function 
executed in the latency path will be logged. That way we'll be able to 
say which one takes that long.

note, you might have to increase /debug/tracing/trace_entries to get a 
long enough trace to capture the relevant portion of the latency.

	Ingo
--

From: Steven Rostedt
Date: Monday, September 22, 2008 - 5:07 am

Also note, to modify trace_entries, you must be in the none (nop?) tracer, 
otherwise the size will not be effected.

If you find the trace is also too big, you can echo a list of functions 
into:

  /debug/tracing/set_ftrace_notrace

to not trace those functions. using '>' will remove any existing function 
in that file, but using '>>' will append functions to the file.

For a list of functions that you can add, see:

  /debug/tracing/available_filter_functions

-- Steve

--

From: Sitsofe Wheeler
Date: Monday, September 22, 2008 - 11:33 pm

I do not appear to have the CONFIG_FTRACE_MCOUNT_RECORD option in 
2.6.27rc7. Is it an option that is only in -tip ?

-- 
Sitsofe | http://sucs.org/~sits/
--

From: Ingo Molnar
Date: Tuesday, September 23, 2008 - 4:53 am

yeah - it's a new ftrace feature queued up for v2.6.28.

	Ingo
--

From: Sitsofe Wheeler
Date: Tuesday, September 23, 2008 - 9:30 am

I've been struggling to boot -tip/master - currently it blows up just 
after printing SLUB information saying:

BUG: unable to handle kernel NULL pointer dereference at 00000004
IP: [<c0120078>] account_system_time+0x48/0x120
*pde = 00000000
Thread overran stack, or stack corrupted
Oops: 0002 [#1] PREEMPT

-- 
Sitsofe | http://sucs.org/~sits/
--

From: Sitsofe Wheeler
Date: Tuesday, September 23, 2008 - 12:39 pm

OK this BUG seems to have gone away in the past few hours.

Ingo when you were asking for the ftrace report I presume that would be 
for a non preempt kernel (as a preemptive one only showed a very worst 
latency of 19657 us in one exceptional case)?

-- 
Sitsofe | http://sucs.org/~sits/

--

From: Sitsofe Wheeler
Date: Tuesday, September 23, 2008 - 3:01 pm

With a non preempt kernel I found the logs would simply become too big 
without filtering. On Peter's suggestion I used the following to remove 
the most frequently called functions:

echo acpi_os_release_object > set_ftrace_notrace && echo kmem_cache_* >> 
set_ftrace_notrace && echo acpi_ut_* >> set_ftrace_notrace

By doing counts across multiple runs I would say that the most 
frequently called functions are the following (in most frequently called 
order). The counts are definitely approximate but are reasonable 
relative to each other.

  475325 acpi_os_release_object (acpi_ut_delete_generic_state)
  406895 kmem_cache_free (acpi_os_release_object)
  402838 kmem_cache_alloc (acpi_ut_create_generic_state)
  132968 acpi_ut_update_ref_count (acpi_ut_update_object_reference)
  131041 acpi_ut_pop_generic_state (acpi_ut_update_object_reference)
  131036 acpi_ut_delete_generic_state (acpi_ut_update_object_reference)
  131025 acpi_ut_create_generic_state (acpi_ut_create_update_state)
  131023 acpi_ut_create_update_state_and_push 
(acpi_ut_update_object_reference)
  131020 acpi_ut_create_update_state (acpi_ut_create_update_state_and_push)
  131018 acpi_ut_push_generic_state (acpi_ut_create_update_state_and_push)
   60147 acpi_ns_get_next_node (acpi_ns_delete_namespace_by_owner)
   28974 acpi_ns_get_next_valid_node (acpi_ns_get_next_node)

Logs with the filtering on can be seen here (15Mbytes decompressed each):
http://sucs.org/~sits/test/eeepc-debug/20080923/latency_trace.gz
http://sucs.org/~sits/test/eeepc-debug/20080923/trace.txt.gz

It looks like lots of acpi state is created and deleted...

-- 
Sitsofe | http://sucs.org/~sits/
--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 1:48 pm

yeah. The latency starts here:

     cat-5901    0dNh.    1us : default_wake_function (__wake_up_common)
     cat-5901    0.Nh.    2us : kill_fasync (snd_pcm_period_elapsed)
     [...]

and ends here:

     [...]
     cat-5901    0.N.. 270501us+: mutex_lock (acpi_ec_transaction)
     cat-5901    0d... 270507us : __cond_resched (_cond_resched)

270 _milliseconds_ later. That's excessive.

The main overhead is starting here:

     cat-5901    0.N..  167us : acpi_ds_result_push (acpi_ds_exec_end_op)

lots of ACPI code executed ...

does it get better if you have CONFIG_PREEMPT_VOLUNTARY=y enabled? That 
_should_ break up this section neatly. If it doesnt then please add a 
might_sleep() check to kernel/kernel/semaphore.c's down_timeout() 
function - that is called a number of times in this trace.

	Ingo
--

From: Sitsofe Wheeler
Date: Sunday, September 28, 2008 - 1:56 pm

I added might_sleep() to the start of down_timeout() but it neither 
printed anything to dmesg nor changed the latency issue...

Strangely stalling only seems to turn up in linux-tip kernels with very 
little debugging options set within them. I have a linux-tip that has 
lots of extra debugging options set and this problem doesn't show up...

I've also asked about this on the ACPI mailing list and had a huge 
amount of help from finding a good point for a cond_resched - 
http://marc.info/?l=linux-acpi&m=122236450105747&w=2 . Certain things 
(like games) still stutter when the battery is read but it's enough to 
play back sound in rhythmbox without stuttering.

Any ideas why the issue would go away with a debug kernel though?

-- 
Sitsofe | http://sucs.org/~sits/
--

From: Ingo Molnar
Date: Monday, September 29, 2008 - 1:37 am

hm, that's weird indeed. You could try to trace the full battery readout 
itself, via a script like this:

  cat /debug/tracing/trace > /dev/null        # drain trace
  cat /proc/acpi/whatever  
  cat /debug/tracing/trace > trace.txt

this way you should be seeing a full trace of the whole thing. (with the 
'ftrace' tracer plugin set in current_tracer)

then, assuming the trace.txt is complete, you can check where it 
reschedules and why. Search for 'schedule' calls. Compare the debug 
versus non-debug traces.

You can even turn on stack backtrace tracing feature, via:

  echo stacktrace > /debug/tracing/iter_ctrl

this adds extra trace entries that show the call path of every 
reschedule. [this attribute is default-disabled.]

	Ingo
--

From: Sitsofe Wheeler
Date: Monday, September 29, 2008 - 4:11 pm

(Something bad seems to be happening with my kernels as I have been 
finding strange problems like network-manager detecting the wifi as a 
wired interface unless I did a make clean on my kernel sources before 
compiling). The stalling issue seems to keep coming and going and is 
currently showing up in both debug and non debug kernels.

There are traces when looking at battery information on both AC (where 
the problem is never there) and battery (where the problem always is 
there) here:
http://sucs.org/~sits/test/eeepc-debug/20080929/trace-on-ac.txt.bz2
http://sucs.org/~sits/test/eeepc-debug/20080929/trace-on-battery.txt.bz2 
(15Mbytes uncompressed)

I wasn't quite sure how to tell when a reschedule was done. I simply did 
a grep schedule on the file and most of it seemed reasonable. One part 
that caught my eye was the following:

     speaker-test-3891  [000]  1392.751699: __switch_to <-schedule
           <idle>-0     [000]  1392.752070: account_scheduler_latency 
<-enqueue_task_fair
           <idle>-0     [000]  1392.752091: __switch_to <-schedule
              cat-7717  [000]  1392.752092: del_timer <-schedule_timeout
              cat-7717  [000]  1392.772263: account_scheduler_latency 
<-enqueue_task_fair
              cat-7717  [000]  1392.773224: __tasklet_schedule <-kbd_event
              cat-7717  [000]  1392.773225: schedule_console_callback 
<-kbd_event
              cat-7717  [000]  1392.773225: schedule_work 
<-schedule_console_callback
              cat-7717  [000]  1392.773226: queue_work <-schedule_work
              cat-7717  [000]  1392.773232: account_scheduler_latency 
<-enqueue_task_fair
              cat-7717  [000]  1392.773240: account_scheduler_latency 
<-enqueue_task_fair
              cat-7717  [000]  1392.773251: schedule_delayed_work 
<-put_queue
              cat-7717  [000]  1392.773251: queue_delayed_work 
<-schedule_delayed_work
              cat-7717  [000]  1392.773258: schedule_console_callback 
<-kbd_event
              ...
From: Ingo Molnar
Date: Tuesday, September 30, 2008 - 4:22 am

that's about 60 msecs. ath5k_intr IRQ handling overhead? IRQ handlers 
are non-preemptible. (even under CONFIG_PREEMPT=y)

	Ingo
--

From: Sitsofe Wheeler
Date: Tuesday, September 30, 2008 - 6:18 am

What about the piece leading up to that:

              cat-7717  [000]  1392.752092: del_timer <-schedule_timeout
              cat-7717  [000]  1392.772263: account_scheduler_latency 
<-enqueue_task_fair
              cat-7717  [000]  1392.773224: __tasklet_schedule <-kbd_event
              cat-7717  [000]  1392.773225: schedule_console_callback 
<-kbd_event
              cat-7717  [000]  1392.773225: schedule_work 
<-schedule_console_callback
              cat-7717  [000]  1392.773226: queue_work <-schedule_work
              cat-7717  [000]  1392.773232: account_scheduler_latency 
<-enqueue_task_fair
              cat-7717  [000]  1392.773240: account_scheduler_latency 
<-enqueue_task_fair
              cat-7717  [000]  1392.773251: schedule_delayed_work 
<-put_queue
              cat-7717  [000]  1392.773251: queue_delayed_work 
<-schedule_delayed_work
              cat-7717  [000]  1392.773258: schedule_console_callback 
<-kbd_event
              cat-7717  [000]  1392.773258: schedule_work 
<-schedule_console_callback
              cat-7717  [000]  1392.773259: queue_work <-schedule_work
              cat-7717  [000]  1392.773267: schedule_console_callback 
<-kbd_event
              cat-7717  [000]  1392.773268: schedule_work 
<-schedule_console_callback
              cat-7717  [000]  1392.773268: queue_work <-schedule_work
              cat-7717  [000]  1392.807931: account_scheduler_latency 
<-enqueue_task_fair
              cat-7717  [000]  1392.839464: __tasklet_schedule <-ath5k_intr

Here's a snippet from 
http://sucs.org/~sits/test/eeepc-debug/20080929/trace-on-battery.txt.bz2 :
              cat-7717  [000]  1392.752092: del_timer <-schedule_timeout
              cat-7717  [000]  1392.752092: acpi_ec_check_status 
<-acpi_ec_wait
              cat-7717  [000]  1392.752093: constant_test_bit 
<-acpi_ec_check_status
              cat-7717  [000]  1392.752095: mutex_unlock 
<-acpi_ec_transaction
              cat-7717  [000]  1392.752096: ...

After weeks of head scratching as to why these latencies didn't occur
using the xandros 2.6.21.4 kernel (but keeping the same userspace) when
my own kernels would always show this problem I finally found the answer
after reading
http://tservice.net.ru/~s0mbre/blog/devel/other/2008_10_01 on kernel
planet - SLUB can cause regressions compared to SLAB. Switching from 
SLUB to SLAB made the problem more or less disappear (which I guess 
makes sense given the large number of kmem_* calls that are made when 
reading the battery).

My understanding with the memory allocators is that SLOB is the smallest 
and simplest. Its forté is that it uses very little memory which makes 
it ideal for embedded systems and is the easiest allocator to 
understand. The downside is that it might tend towards memory 
fragmentation the longer a system runs and is apparently a little bit 
slower than SLAB. The SLAB allocator is something that the linux kernel 
has had for many years and was a good performer until the pressures to 
perform with large SMP systems started to come in to play (at which 
point lots of memory would be used up on fixed structures - 
http://lwn.net/Articles/229984/ ). SLUB is the newest allocator, scales 
up well and has finer grained diagnostics that SLAB.

Prior to today my understanding was that SLUB would be able to replace 
SLAB in all scenarios and perform just as well or better. However now 
I'm not so sure (SLAB appears to be less latent than SLUB). Other than 
SLUB's newness are there cases where SLAB should be chosen instead of 
SLUB (e.g. uniprocessor desktop systems with hundreds of megabytes of 
RAM)? Will SLAB exist as an alternative to SLUB for certain setups?

-- 
Sitsofe | http://sucs.org/~sits
--

Previous thread: [PATCH 0/5] bio-cgroup: Introduction by Ryo Tsuruta on Friday, September 19, 2008 - 4:01 am. (7 messages)

Next thread: Re: [PATCH 4/4 v2] Add documentation for hard disk shock protection interface by Elias Oltmanns on Friday, September 19, 2008 - 5:08 am. (1 message)