CFS Cleanups

Submitted by Jeremy
on August 9, 2007 - 4:23pm

Ingo Molnar pushed a series of patches to his Completely Fair Scheduler code upstream that were merged into the mainline kernel. He explained the reason for so many small patches, "the main reason is the safe and gradual elimination of a widely used 64-bit function argument: the 64-bit 'now' timestamp." In addition to the many small patches removing the 64-bit "now" timestamp, he noted:

"These changes are necessary for 3 reasons: firstly they address the 'there's too much 64-bit stuff in the scheduler' observation. Secondly, it's not directly visible but these changes also act as a correctness fix for an obscure (and minor) but not-too-pretty-to-fix accounting bug: idle_balance() had its own internal notion of 'now', separate from that of schedule(). Thirdly, this debloats sched.o quite significantly."


From: Ingo Molnar [email blocked]
To:	Linus Torvalds [email blocked]
Cc:	Andrew Morton [email blocked], linux-kernel@vger.kernel.org
Subject: [git pull request] scheduler updates
Date:	Wed, 8 Aug 2007 22:30:08 +0200


Linus, please pull the latest scheduler git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

the high commit count is scary, but it's all low-risk items: the main 
reason is the safe and gradual elimination of a widely used 64-bit 
function argument: the 64-bit "now" timestamp. About 40 of those commits 
are identity transformations that prepare the real change in a safe way, 
and the rest is obvious and safe as well. Besides the obvious and nice 
cleanup factor, these changes are necessary for 3 reasons: firstly they 
address the "there's too much 64-bit stuff in the scheduler" 
observation. Secondly, it's not directly visible but these changes also 
act as a correctness fix for an obscure (and minor) but 
not-too-pretty-to-fix accounting bug: idle_balance() had its own 
internal notion of 'now', separate from that of schedule(). Thirdly, 
this debloats sched.o quite significantly:

on 32-bit (smp, nondebug), it's almost 1k less code:

   text    data     bss     dec     hex filename
  34869    3066      20   37955    9443 sched.o.before
  33972    3066      24   37062    90c6 sched.o.after

but even on 64-bit platforms it's noticeable:

   text    data     bss     dec     hex filename
  28652    4162      24   32838    8046 sched.o.before
  28064    4162      24   32250    7dfa sched.o.after

and that's a speedup as well, because these parameters were passed all 
around the fastpath.

It was the safest to do it this way (considering that we are post -rc2 
already), together in one commit these changes would have been much less 
obvious to validate and apply. (It's of course all fully bisectable and 
every step builds and boots fine.)

besides this elimination of the 64-bit timestamp parameter passing 
between (almost all) scheduler functions, there are 8 other fixes that 
are not identity transformations:

 - Peter Williams reviewed the smpnice load-balancer and noticed a few 
   leftover items that are unnecessary now (i have re-tested 
   load-balancing behavior and it's all still fine)

 - binary sysctl cleanup from Alexey Dobriyan

 - two small accounting fixes

 - reniced tasks fixes: a key calculation fix (i re-checked key nice
   workloads and this has no real impact [other than improving them 
   slightly] - the other side of the branch fixed up the effects of this 
   - otherwise we'd have noticed this sooner), and two rounding 
   precision improvements that act against error accumulation.

 - sleeper_bonus should be batched by sched_granularity and not by 
   stat_granularity. (this has almost no effect in practice, but a 
   speedup that pushes the only 64-bit division in CFS into a slowpath.)

then are are also two non-code documentation updates and minor cleanups 
and uninlining.

Nevertheless, to be safe i have also done over 200 'make randconfig; 
make -j bzImage' build tests:

   #define UTS_VERSION "#231 SMP Wed Aug 8 21:34:24 CEST 2007"

all of which passed fine. Booted (and extensively tested) on x86-32 and 
x64-32 as well, both UP and SMP - UP, 2-way to 8-way systems.

	Ingo

------------------>

Alexey Dobriyan (1):
      sched: remove binary sysctls from kernel.sched_domain

Josh Triplett (1):
      sched: mark print_cfs_stats static

Peter Williams (2):
      sched: simplify move_tasks()
      sched: fix bug in balance_tasks()

Thomas Voegtle (1):
      sched: mention CONFIG_SCHED_DEBUG in documentation

Ulrich Drepper (1):
      sched: clean up sched_getaffinity()

Ingo Molnar (55):
      sched: batch sleeper bonus
      sched: reorder update_cpu_load(rq) with the ->task_tick() call
      sched: uninline rq_clock()
      sched: schedule() speedup
      sched: clean up delta_mine
      sched: delta_exec accounting fix
      sched: document nice levels
      sched: add [__]update_rq_clock(rq)
      sched: eliminate rq_clock() use
      sched: remove rq_clock()
      sched: eliminate __rq_clock() use
      sched: remove __rq_clock()
      sched: remove 'now' use from assignments
      sched: remove the 'u64 now' parameter from print_cfs_rq()
      sched: remove the 'u64 now' parameter from update_curr()
      sched: remove the 'u64 now' parameter from update_stats_wait_start()
      sched: remove the 'u64 now' parameter from update_stats_enqueue()
      sched: remove the 'u64 now' parameter from __update_stats_wait_end()
      sched: remove the 'u64 now' parameter from update_stats_wait_end()
      sched: remove the 'u64 now' parameter from update_stats_curr_start()
      sched: remove the 'u64 now' parameter from update_stats_dequeue()
      sched: remove the 'u64 now' parameter from update_stats_curr_end()
      sched: remove the 'u64 now' parameter from __enqueue_sleeper()
      sched: remove the 'u64 now' parameter from enqueue_sleeper()
      sched: remove the 'u64 now' parameter from enqueue_entity()
      sched: remove the 'u64 now' parameter from dequeue_entity()
      sched: remove the 'u64 now' parameter from set_next_entity()
      sched: remove the 'u64 now' parameter from pick_next_entity()
      sched: remove the 'u64 now' parameter from put_prev_entity()
      sched: remove the 'u64 now' parameter from update_curr_rt()
      sched: remove the 'u64 now' parameter from ->enqueue_task()
      sched: remove the 'u64 now' parameter from ->dequeue_task()
      sched: remove the 'u64 now' parameter from ->pick_next_task()
      sched: remove the 'u64 now' parameter from pick_next_task()
      sched: remove the 'u64 now' parameter from ->put_prev_task()
      sched: remove the 'u64 now' parameter from ->task_new()
      sched: remove the 'u64 now' parameter from update_curr_load()
      sched: remove the 'u64 now' parameter from inc_load()
      sched: remove the 'u64 now' parameter from dec_load()
      sched: remove the 'u64 now' parameter from inc_nr_running()
      sched: remove the 'u64 now' parameter from dec_nr_running()
      sched: remove the 'u64 now' parameter from enqueue_task()
      sched: remove the 'u64 now' parameter from dequeue_task()
      sched: remove the 'u64 now' parameter from deactivate_task()
      sched: remove the 'u64 now' local variables
      sched debug: remove the 'u64 now' parameter from print_task()/_rq()
      sched: move the __update_rq_clock() call to scheduler_tick()
      sched: remove __update_rq_clock() call from entity_tick()
      sched: clean up set_curr_task_fair()
      sched: optimize activate_task()
      sched: optimize update_rq_clock() calls in the load-balancer
      sched: make the multiplication table more accurate
      sched: round a bit better
      sched: fix update_stats_enqueue() reniced codepath
      sched: refine negative nice level granularity

 Documentation/sched-design-CFS.txt  |    2 
 Documentation/sched-nice-design.txt |  108 +++++++++++
 include/linux/sched.h               |   20 --
 kernel/sched.c                      |  339 ++++++++++++++++++------------------
 kernel/sched_debug.c                |   16 -
 kernel/sched_fair.c                 |  212 ++++++++++------------
 kernel/sched_idletask.c             |   10 -
 kernel/sched_rt.c                   |   48 +----
 8 files changed, 421 insertions(+), 334 deletions(-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email blocked]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Related Links:

Don't use unsigned type, please.

Anonymous (not verified)
on
August 10, 2007 - 5:15am

Don't use unsigned type, please.

The sign's extension is important to good maintainance of the same source code in both 32-bit and 64-bit platforms.

To avoid many copies of variants of functions, i recommend to understand it:

A) int x = -1;
It's 32-bit in 32-bit platform and 32-bit in 64-bit platform.

B) long y = -1L;
It's 32-bit in 32-bit platform and 64-bit in 64-bit platform.

C) long long z = -1LL;
It's 64-bit in 32-bit platform and 64-bit in 64-bit platform.

Re: Don't use unsigned type, please.

Anonymous (not verified)
on
August 10, 2007 - 5:25am

B) long y = -1L;
It's 32-bit in 32-bit platform and 64-bit in 64-bit platform.

The B) option is the most appropiate for maximum performance.

Its main reason is that with the B) option, the C/C++ compiler generates the minimum number of code's instructions of the corresponding architecture.

Smaller code => Less instructions to execute => Faster code.

Lucky! ;)

The CFS changes described

Anonymous (not verified)
on
August 10, 2007 - 5:38am

The CFS changes described in the article were not about the width of a variable, they were about the elimination of a 64-bit parameter from lots of function calls. That helps both 32-bit and 64-bit architectures, regardless of the width of the variable.

Why not?

pomac (not verified)
on
August 10, 2007 - 5:33am

If you know that you have it under control, then why not use unsigned?

A huge benefit with unsigned is that the compiler will do alot more to speed up the maths.

Re: why not use unsigned?

Anonymous (not verified)
on
August 10, 2007 - 6:38am

why not use unsigned?

Unsigned datatype complicates much in the castings of uint32 -> ulong64 that adds more code to control this transition depending in the functionality of the code.

The castings of int32 -> long64 are simple that the microprocessor realizes it in one instruction using its own extension of sign.

See this example of casting that uses unsigned:

unsigned long long aaa64;
unsigned int bbb32 = 0xFFFFFFFF;

aaa64 = (unsigned long long)bbb32 | 0xFFFFFFFF00000000; // complicated!!!

and this example of casting that uses signed:

long long aaa64;
int bbb32 = -1; // -1 and 0xFFFFFFFF are equivalent.

aaa64 = (long long)bbb32;
// easy!!! -1 from int (0xFFFFFFFF) to -1 from longlong (0xFFFFFFFFFFFFFFFF)

Lucky! ;)

Signed arithmetics is

Anonymous (not verified)
on
August 10, 2007 - 6:59am

Signed arithmetics is inherently more complex if there is division (or multiplication) in the picture. Look at gcc assembly output.

If used judiciously, unsigned can be a very rewarding data type, especially for something as performance-critical as a scheduler - but signed data types indeed have their advantages too. CFS uses both signed and unsigned, 32-bit and 64-bit variables, depending on which is the most appropriate for the given task.

No problem!

Anonymous (not verified)
on
August 10, 2007 - 7:09am

Imagine that in the source code there are:

0.01% of slowest divisions.
0.05% of little slow multiplications.
99.94% of fast adds, subtracts, shifts, rotations, etc.

The divisions and multiplications are not used much in the kernel's source code.

Take a look at the CFS code

Anonymous (not verified)
on
August 10, 2007 - 7:10am

Take a look at the CFS code again. Isnt that supposed to be the topic here? :-)

Shifts?

pomac (not verified)
on
August 10, 2007 - 8:20am

You only do shifts on unsigned datatypes.
Since if you don't then you have to manually handle the sign bit.

Use unsigned and gcc will automagically do shifts where it can.

You're missing something.

Mr_Z
on
August 10, 2007 - 9:00am

You can replace both signed and unsigned divides by powers of 2 with shifts. (It's a little trickier with signed types, but not dramatically so.) For both signed and unsigned types, other divides need to be implemented either with a divide instruction, multiply by the reciprocal, or a call out to a support function. The book "Hacker's Delight" (which I heartily recommend) actually covers a lot of this and more. Computer arithmetic is full of trickery under the hood. :-)

Anyway to divide by a power of two with a signed type, you need multiple instructions. This can still be a win on architectures that have a barrel shifter, but no divide instruction. The key is to handle the "round towards zero" behavior correctly. For example (-1)/2 == 0, but (-1) >> 1 == -1. The division gets broken up into multiple steps:

    tmp1 = (signed) dividend >> (log2(divisor) - 1);  /* Create rounding value from sign bits via sign extension */
    tmp2 = (unsigned) tmp1   >> (32 - log2(divisor)); /* Move these sign bits into portion that will shift away */
    tmp3 = dividend + tmp2;                           /* Round towards zero */
    quotient = tmp3 >> log2(divisor);                 /* Final result */

This presumes "divisor" is known at compile time, so log2(divisor) is a known constant.
--
Program Intellivision and play Space Patrol!

I know =)

pomac (not verified)
on
August 10, 2007 - 6:18pm

Yeah i know, but when it comes to gcc it won't auto expand signed divisions (yet) afair. As to in the kernel, i'd think they'd avoid it.

But yes, i know =), and thanks for the book recomondation, ordering it now actually =)

You remember wrong. On my

Anonymous (not verified)
on
August 11, 2007 - 5:39am

You remember wrong. On my gcc (gcc version 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)) this program:

int div2(int i) { return i / 2; }
int div4(int i) { return i / 4; }
int div8(int i) { return i / 8; }
int div16(int i) { return i / 16; }
int div32(int i) { return i / 32; }
unsigned int div2u(unsigned int i) { return i / 2; }
unsigned int div4u(unsigned int i) { return i / 4; }
unsigned int div8u(unsigned int i) { return i / 8; }
unsigned int div16u(unsigned int i) { return i / 16; }
unsigned int div32u(unsigned int i) { return i / 32; }

compiled with "gcc -S -O3" gives (relevant parts):

div2:
srawi 3,3,1
addze 3,3
blr
div4:
srawi 3,3,2
addze 3,3
blr
div8:
srawi 3,3,3
addze 3,3
blr
div16:
srawi 3,3,4
addze 3,3
blr
div2u:
srwi 3,3,1
blr
div4u:
srwi 3,3,2
blr
div8u:
srwi 3,3,3
blr
div16u:
srwi 3,3,4
blr
div32u:
srwi 3,3,5
blr

Morale: always check your facts before posting (especially when it only takes 5 minutes to do so).

You're lucky srawi is

Anonymous (not verified)
on
August 12, 2007 - 12:47pm

You're lucky srawi is available on the PowerPC because normally, gcc certainly *cannot* replace a (power-of-two) div by a shift in the general case where the numerator is signed. Unless it can prove that the input is positive, it has to do the div. Why? Think what happens when you do -1/2 (result is zero) versus (-1)>> 1 (oops, result still -1). I tend to explicitly use shifts whenever possible (i.e. when I don't care about the rounding) instead of relying on the compiler to substitute a shift.

You missed my comment above.

Mr_Z
on
August 12, 2007 - 7:24pm

You can do a div-by-power-of-2 on signed numbers with proper rounding on negative values with just ordinary shifts. On modern CPUs with a DIV instruction, though, it can be cheaper to just do the DIV. If the CPU doesn't have a DIV instruction, though, you're better off with the shifts than a call to the "div" support routine in libgcc.

--
Program Intellivision and play Space Patrol!

Of course Power(PC) is a

Anonymous (not verified)
on
August 12, 2007 - 10:50pm

Of course Power(PC) is a saner platform than x86. ;-)

But as opposed to what GP said, gcc also emits shifts for signed power-of-2 divides on x86, very similar to what venerable Mr_Z posted above:

div2:
movl %edi, %eax
shrl $31, %eax
addl %edi, %eax
sarl %eax
ret
div4:
leal 3(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $2, %edi
movl %edi, %eax
ret
div8:
leal 7(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $3, %edi
movl %edi, %eax
ret
div16:
leal 15(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $4, %edi
movl %edi, %eax
ret
div32:
leal 31(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $5, %edi
movl %edi, %eax
ret

shift in x86

Anonymous (not verified)
on
August 11, 2007 - 2:16am

Hello.
I don't know if this is related to your comment but x86 instruction set has the following shift instructionz :
1- SAR : shift arithmetic right. The vacated upper bits are filled with the sign(most significat bit). So this is a shift suitable for signed integers.
2- SHR : shift logical right. The vacated upper bits are filled with zeros.
3- SAR and SHL : Shift arithmetic left and Shift logicla left. The two instructions are identical for signed and unsigned instrction : lower bits are filled with zeros.

Wrong.

Anonymous (not verified)
on
August 10, 2007 - 7:06am

Unsigned types are zero extended, so 0xFFFFFFFF becomes 0x00000000FFFFFFFF and that is as easy as sign extending for signed types.

Huh? You make no sense.

Mr_Z
on
August 10, 2007 - 8:04am

From unsigned int to unsigned long long, you zero extend because the number is unsigned. I can make the same argument you're making, in reverse, to show how silly it is: If I put an unsigned number in a signed type, I have to do "foo & 0x00000000FFFFFFFFULL" to mask away those pesky unwanted sign bits after I copy the int to a long long. That means the signed type is inferior, right? No. It just means I'm using the wrong type for the job.

Unsigned types are important for certain jobs. For one thing, signed integer overflow is implementation defined in C, whereas unsigned integer overflow is well defined. On some CPUs, 0x7FFFFFFF + 1 = 0x7FFFFFFF for a signed 32-bit type, and it could result in a trap on others. (Granted, this isn't true for most mainstream CPUs, but it's worth noting.) 0xFFFFFFFF + 1 is always 0 for an unsigned 32 bit type though. This becomes very important, especially for measuring time intervals when looking only at the lower 32 bits of the current time. (Gee, why would a scheduler want to do that?)

Oh, I should add, on x86-64 CPUs, 32-bit arithmetic results are implicitly zero extended to 64 bits in the corresponding 64-bit register. i.e. a 32-bit result written to EAX results in a zero written to the upper 32-bits of RAX.

--
Program Intellivision and play Space Patrol!

Re: Huh? You make no sense.

Anonymous (not verified)
on
August 10, 2007 - 9:23am

Oh, I should add, on x86-64 CPUs, 32-bit arithmetic results are implicitly zero extended to 64 bits in the corresponding 64-bit register. i.e. a 32-bit result written to EAX results in a zero written to the upper 32-bits of RAX.

You're WRONG!

http://www.x86-64.org/documentation/assembly.html says:
Implicit zero extend
Results of 32-bit operations are implicitly zero extended to 64-bit values.

The 32-bit operations use 32-bit registers, not 64-bit registers, but the upper half of 64-bit registers is filled with zeros in 32-bit operations.

The 64-bit operations use 64-bit registers and can use both operations: sign extended or zero extended.

http://www.x86-64.org/documentation/assembly.html says:
Immediates
Immediate values inside instructions remain 32 bits and their value is sign extended to 64 bits before calculation.

The inmediate values are of 32 bits because smaller values for instructions are better and ALWAYS use sign extended operation :)

Huh?

Mr_Z
on
August 10, 2007 - 9:28am

I said: "32-bit arithmetic results are implicitly zero extended to 64 bits in the corresponding 64-bit register."

You said I was wrong, quoting this section of the x86-64 documentation: "Implicit zero extend: Results of 32-bit operations are implicitly zero extended to 64-bit values."

Can you tell me how those two statements differ? One says 32-bit results are implicitly zero extended, and the other says 32-bit results are implicitly zero extended. Maybe there's some difference between implicitly zero extending and zero extending implicitly that I'm unaware of, but as far as I can tell, they say the same thing.

--
Program Intellivision and play Space Patrol!

It depends

ak (not verified)
on
August 10, 2007 - 1:41pm

It depends -- addresses are sign extended (this makes the kernel model possible which uses negative
32bit addresses to run in the -2GB space area of the 48bit address room)
And there are explicit instructions to do sign extension from 32bit to 64bit.
But in general you're both right.

You're conflating two ideas.

Mr_Z
on
August 10, 2007 - 4:30pm

My understanding is that address arithmetic is always 64 bit, and so sign extension isn't a worry there. There is sign extension on 32-bit displacements though before the computation. But, since the pointers are 64-bit to begin with, it's meaningless to talk about sign extending the result, since the result is 64 bit also. All that is irrelevant to the memory map issue you raised.

What I think you're thinking of is on current implementations, the architecture requires bits 47 through 63 of the address to be the same. If a computed address differs in those bits, it triggers a fault. The ALUs do not do this sign extension for you. This detail is left to your memory manager to get right.

--
Program Intellivision and play Space Patrol!

resoponse

Anonymous (not verified)
on
August 13, 2007 - 2:31am

yes I totally agree. I also tried it and it worked right away. Thanks for the good piece of information

You are wrong on x86.

Anonymous (not verified)
on
August 10, 2007 - 11:22am

I compared gcc's assembler output for 32-bit to 64-bit conversion for both unsigned and signed.

Both needed exactly one instruction. Signed used 'ctld'. Unsigned used 'xorl'.

Not universally true

Mr_Z
on
August 10, 2007 - 10:52am

Your statements on 64-bit platforms are true on LP64 platforms. There are ILP64 platforms for which int is 64 bit.

LP64 means "long" and "pointer" are 64 bit. ILP64 means "int", "long" and "pointer" are 64 bit. This site goes into greater detail.

--
Program Intellivision and play Space Patrol!

Wow, what utter gibberish.

Zombywuf (not verified)
on
August 13, 2007 - 6:02am

Wow, what utter gibberish.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.