sched_mc_power_savings broken with CGROUPS+CPUSETS

Previous thread: [PATCH 2.6.27-rc5] pci: fix resource_size_t printk bugs by Mikael Pettersson on Friday, August 29, 2008 - 5:35 am. (1 message)

Next thread: [patch 0/3] [RFC] zlib crypto module by Geert Uytterhoeven on Friday, August 29, 2008 - 6:41 am. (2 messages)
From: Vaidyanathan Srinivasan
Date: Friday, August 29, 2008 - 6:15 am

Hi,

sched_mc_power_savings seems to be broken with CGROUPS+CPUSETS.
When CONFIG_CPUSETS=y the attached BUG_ON() is being hit.

I added a BUG_ON to check if SD_POWERSAVINGS_BALANCE is set at
SD_LV_CPU whenever sched_mc_power_savings is set.

This BUG is hit when config CONFIG_CPUSETS (depends on CONFIG_CGROUPS)
is just compiled in while this is never hit when they are compiled
out.  The fact that SD_POWERSAVINGS_BALANCE being cleared even when
sched_mc_power_savings = 1 completely breaks the
sched_mc_power_savings heuristics.

To recreate the problem, 
Have sched_mc power savings enabled CONFIG_SCHED_MC=y
Add this BUG_ON()

echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

Try these these on a multi core x86 box.

sched_mc_power_savings seems to be broken from 2.6.26-rc1, but
I do not have a confirmation that the root cause is same in all
successive versions. sched_mc_power_savings works perfect in 
2.6.25.

Please help me root cause the issue.  Please point me to changes that
may potential cause this bug.

Thanks,
Vaidy

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

Index: linux-2.6.27-rc5/kernel/sched.c
===================================================================
--- linux-2.6.27-rc5.orig/kernel/sched.c	2008-08-29 17:48:56.000000000 +0000
+++ linux-2.6.27-rc5/kernel/sched.c	2008-08-29 18:16:09.000000000 +0000
@@ -3375,6 +3375,8 @@
 
 out_balanced:
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+	BUG_ON(sched_mc_power_savings && sd->level == SD_LV_CPU && 
+		!(sd->flags & SD_POWERSAVINGS_BALANCE)); 
 	if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
 		goto ret;
 
--

From: Peter Zijlstra
Date: Friday, August 29, 2008 - 6:23 am

I'm still greatly mistified by all that power savings code.

Its hard to read and utterly hard to comprehend - I've been about to rip
the whole stuff out on several occasions. But so far tried to carefully
thread around it maintaining its operation even though not fully
understood.

Someone with clue - preferably the authors of the code in question -
should enlighten us with a patch that adds some comments as to the
intent of said lines of code.

--

From: Vaidyanathan Srinivasan
Date: Friday, August 29, 2008 - 7:05 am

I will try to add my notes around it.  I have been playing around with
sched_mc to make it work for different workloads.

--Vaidy

--

From: Max Krasnyansky
Date: Friday, August 29, 2008 - 1:29 pm

I do not fully understand how balancing is affected by the MC stuff but I can
explain how the mc power saving settings are applied to the domains and the
overall mechanism for that.

As I explained in the previous reply I missed the fact the logic that avoids
redundant rebuilds in partition_sched_domains() will prevent
arch_reinit_sched_domains() from doing the actual rebuild and hence will not
apply the SD_POWERSAVINGS_BALANCE until something changes in cpuset setup.

btw I can certainly attest to the fact that powersaving code is very hard to
read and comprehend :)

Max
--

From: Peter Zijlstra
Date: Saturday, August 30, 2008 - 4:26 am

I don't think iterating the domains and setting the flag is sufficient.
Look at this crap (found in arch/x86/kernel/smpboot.c):

cpumask_t cpu_coregroup_map(int cpu)
{
        struct cpuinfo_x86 *c = &cpu_data(cpu);
        /*
         * For perf, we return last level cache shared map.
         * And for power savings, we return cpu_core_map
         */
        if (sched_mc_power_savings || sched_smt_power_savings)
                return per_cpu(cpu_core_map, cpu);
        else
                return c->llc_shared_map;
}

which means we'll actually end up building different domain/group

Yeah - I was primarity hinting at the sched_group and find_*_group()
fudge, esp find_busiest_group() is an utter nightmare.

I'm still struggeling to understand _why_ we need those group things to
begin with, why aren't the child domains good enough?



--

From: Vaidyanathan Srinivasan
Date: Saturday, August 30, 2008 - 1:42 pm

* Peter Zijlstra <peterz@infradead.org> [2008-08-30 13:26:53]:


The above code helps a quad-core CPU to be treated as two dual core
for performance when sched_mc_power_savings=0 and they will be treated
as one quad core package if sched_mc_power_savings=1 since the power
control (voltage control) is per quad core socket.

On a dual socket machine with two quad core cpus, 

sched_mc_power_savings=0 will build:

CPU0 attaching sched-domain:
 domain 0: span 0,2 level MC
  groups: 0 2
  domain 1: span 0-7 level CPU
   groups: 0,2 1,5 3-4 6-7

while sched_mc_power_savings=1 will build:

CPU0 attaching sched-domain:
 domain 0: span 0,2-4 level MC
  groups: 0 2 3 4
  domain 1: span 0-7 level CPU
   groups: 0,2-4 1,5-7

Last level cache (llc_shared_map) is used to build this map
differently based on power savings settings.

Do you think such detailed documentation around this code will help?

--Vaidy

[snipped]

--

From: Peter Zijlstra
Date: Saturday, August 30, 2008 - 2:43 pm

Same for my dual-core Opteron 12xx, due to this code it normally

I realized what its good for, I'm just not sure I agree with it. I'm
feeling there is something wrong with this, just can't quite put my
finger on it.

I'm just feeling the domain structure should be invariant to such things
- its the same hardware after all, whether we schedule to optimize for
power or performance.

--

From: Max Krasnyansky
Date: Friday, August 29, 2008 - 1:17 pm

That's my fault. I redid domain rebuild/hotplug handling awhile ago and missed
the fact that partition_sched_domains() is trying to avoid unnecessary domain
rebuilds. Primary issue at the time was circular locking issues and the the
testing that I was doing then was exercising a bunch of different scenarios
(ie cpu up/down, cpuset create/destroy, mcpowersave on/off) and I did not
notice that mcpowersave did not actually trigger domain rebuilds.

Anyway, I sent you guys a patch that should fix this issue.
Please confirm.

Max
--

From: Vaidyanathan Srinivasan
Date: Saturday, August 30, 2008 - 1:02 pm

Hi Max,

Thank you very much for identifying the source of the problem.  I will
test your patch and report.

My BUG_ON() may still fail during the window of change.  I will watch
out for that.

--Vaidy
--

Previous thread: [PATCH 2.6.27-rc5] pci: fix resource_size_t printk bugs by Mikael Pettersson on Friday, August 29, 2008 - 5:35 am. (1 message)

Next thread: [patch 0/3] [RFC] zlib crypto module by Geert Uytterhoeven on Friday, August 29, 2008 - 6:41 am. (2 messages)